Note: This entire website is still under construction. Anything can change until the beginning of the semester, so check again soon.

Class Overview





An introduction to large-scale distributed systems with an emphasis on big-data processing and storage infrastructures. Topics include fundamental tradeoffs in distributed systems, techniques for exploiting parallelism, big-data computation and storage models, the design and implementation of various well-known distributed systems infrastructures, and concrete exposure to programming big-data applications on top of popular, open-source infrastructures for data processing and storage.

The course is co-taught by Sambit Sahu (IBM TJ Watson researcher and CS-affiliated faculty) and Eugene Wu (Assistant Prof. in CS). Sahu will teach fundamental concepts of distributed systems and the tradeoffs that arise, as well as various distributed computation models, with concrete examples of open-source big-data technologies and how they can be programmed. Wu will teach concepts of data modeling, storage, and visualization, along with the tradeoffs they raise.


For assignments, you are allowed 5 penalty-free late days to use throughout the semester. One late day equals one 24-hour period after the due date of the assignment. Once you have used your late days, there will be a 20% penalty for each day an assignment is late. You do not need to explicitly declare the use of late days; we will assign them to you in a way that is optimal for your grade when different assignments are worth different numbers of points. Late days may not be used for the final project.
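To illustrate what "optimal for your grade" can mean, here is a minimal sketch of one way late days could be assigned. It is not the staff's actual grading script. It assumes that the 20% penalty is taken off the points you earned on that assignment for each late day not covered by a free late day, and that the score cannot drop below zero. The names `best_allocation`, `FREE_LATE_DAYS`, and `PENALTY_PER_DAY` are hypothetical.

```python
from itertools import product

FREE_LATE_DAYS = 5       # from the policy above
PENALTY_PER_DAY = 0.20   # ASSUMPTION: fraction of earned points lost per uncovered late day


def best_allocation(assignments, free_days=FREE_LATE_DAYS):
    """Try every way of spending free late days and keep the best total.

    assignments: list of (earned_points, days_late) tuples.
    Returns (best_total_points, days_used_per_assignment).
    """
    best_total, best_alloc = float("-inf"), None
    # An assignment never needs more free days than it was late.
    choices = [range(days_late + 1) for _, days_late in assignments]
    for alloc in product(*choices):
        if sum(alloc) > free_days:
            continue  # over budget
        total = 0.0
        for (earned, days_late), used in zip(assignments, alloc):
            uncovered = days_late - used
            # ASSUMPTION: the penalty is linear and the score is floored at zero.
            total += earned * max(0.0, 1.0 - PENALTY_PER_DAY * uncovered)
        if total > best_total:
            best_total, best_alloc = total, alloc
    return best_total, best_alloc


if __name__ == "__main__":
    # Two assignments, each three days late: one earned 100 points, one earned 50.
    print(best_allocation([(100, 3), (50, 3)]))
```

Under these assumptions, the example spends three free days on the 100-point assignment and the remaining two on the 50-point one, since an uncovered late day costs more points on the larger assignment.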


The grading formula is:

A project can be done in lieu of any 3 assignments. If you do both the project and all assignments, then the project is treated as extra credit equivalent to three assignments. Extra credit is added after any curves, so it does not hurt students who choose not to do the project.



We’re recommending the following two reference texts, which cover some of the big-data technologies with which students will be interacting in the course:

  1. Learning Spark. Publisher: O’Reilly Media; 1st edition (February 27, 2015). Language: English. ISBN-10: 1449358624. ISBN-13: 978-1449358624.

  2. Advanced Analytics with Spark: Patterns for Learning from Data at Scale. Publisher: O’Reilly Media; 1st edition (April 20, 2015). Language: English. ISBN-10: 1491912766. ISBN-13: 978-1491912768.