Staff

Faculty

Roxana Geambasu (Instructor)
OH by appointment
Sambit Sahu (Instructor)
OH by appointment
Eugene Wu (Instructor)
5-6pm Tues in 421 Mudd

Instructional Assistants
OH in TA/CA lounge

Chris Rusnak Sat 3-4PM
Chenqin Xu Thurs 4-5PM
Qiao Zhang Fri 4-5PM

Information

Tues, 7-9:30PM
309 Havemeyer Hall
Policies
Piazza
Project

Prerequisites

Required: Python
Recommended: Scala
Optional: Java

Assignments

See Courseworks as well. See assignment policies. Assignments are due at 11:59PM EST on the due date.

Data Analysis. Due 2/11
Entity Resolution contest. TBA
Graph analysis. TBA

Grading

Homeworks: 60%
Test/Quiz: 30%
Class participation: 10%
Project: 0-40% extra credit

Textbooks

Helpful but not required:

Learning Spark.
ISBN-10: 1449358624
Advanced Analytics with Spark: Patterns for Learning from Data at Scale.
ISBN-10: 1491912766

Class Overview

An introduction to large-scale distributed systems with an emphasis on big-data processing and storage infrastructures. Topics include fundamental tradeoffs in distributed systems, techniques for exploiting parallelism, big-data computation and storage models, the design and implementation of various well-known distributed systems infrastructures, and concrete exposure to programming big-data applications on top of popular, open-source infrastructures for data processing and storage.

The course is co-taught by Roxana Geambasu (Associate Prof. in CS), Sambit Sahu (IBM TJ Watson researcher and CS affiliated faculty), and Eugene Wu (Assistant Prof. in CS). Geambasu will teach fundamental concepts of distributed systems, along with the tradeoffs that arise. Sahu will teach various distributed computation models, along with concrete examples of open-source big-data technologies and how they can be programmed. Wu will teach concepts of data modeling, storage, and visualization, along with the tradeoffs they raise.

Updates

Extended extra credit project report deadline to 11:59PM 5/2
Added student-written lecture notes for:
- relational model. Direct your thanks to Kathy Lin (kl2615), Xiaohui Guo (xg2225), Aria Kumar (sk4345)
- query processing + optimization. Direct your thanks to Yijia Chen (yc3425), Haotian Zeng (hz2494), Yiwen Zhang (yz3310)
Added non-graded quiz solutions
Added link to non-graded quiz

Schedule

1/16: Introduction (all)
- Why not single machine?
- Big-data challenges, datacenter structure, typical use cases and their requirements.
- Course overview.
1/23: Data Models and Cleaning (Wu)
- Why the relational data model? Why schemas? The ins and outs.
- Optional Readings: Will introduce these in class
  - What goes around comes around
  - Unified Logging@Twitter
- Non-graded Quiz
1/30: Cleaning and Integration (Wu)
- Optional Readings: Will introduce these in class
  - Truth finding on the deep web
  - Data Wrangler
2/06: Classic Query Processing and Fast Query Processing (Wu)
- Optional Readings: Will introduce these in class
2/13: Potpourri (Wu)
- Graph analysis/Scalable vis/ML
- Optional Readings: Will introduce these in class
2/20: Scaling and fault tolerance: challenges and techniques (Geambasu)
- Failure models in large-scale distributed systems, consistency and coherence challenges, scaling challenges
- Sharding and replication as the key techniques for scaling and fault tolerance.
- Assigned reading: RFC 667, 1975
2/27: Distributed transactions on sharded databases (Geambasu)
- Two-phase locking, write-ahead logs
- Two-phase commit
3/06: Replication architectures and protocols (Geambasu)
- Primary-secondary architectures, chain replication, leader election protocols
- Consistency and the ordering of events in a distributed system. Importance of time.
3/13: NO CLASS. Spring Break!
3/20: The design and implementation of a scalable, fault-tolerant storage system: Google’s Spanner (Geambasu)
- Reading/video: Spanner
3/27: Distributed computing models (Sahu)
- HDFS, Hadoop basics
- MapReduce Programming Model
- AWS EMR and MapReduce
4/03: Batch processing (Sahu)
- MapReduce programming with more complex examples
- HBase, Hive
4/10: Iterative and Stream Processing (Sahu)
- Intro to Spark, Big Data Architecture and Spark
- Spark programming
4/17: FINAL QUIZ
4/24: Advanced Spark Programming (Sahu)
- Advanced Spark Programming
- Spark SQL, Spark Streaming, Machine Learning with Spark
- End-to-end Intelligent System design with Spark