COLUMBIA UNIVERSITY DSI W4121

Staff

Information

Prerequisites

Assignments

See Courseworks. See assignment policies. Assignments are due at 11:59 PM EST on the due date.

Grading

Textbooks

Helpful but not required:

  1. Learning Spark.
    ISBN-10: 1449358624
  2. Advanced Analytics with Spark: Patterns for Learning from Data at Scale.
    ISBN-10: 1491912766

Class Overview

An introduction to large-scale distributed systems with an emphasis on big-data processing and storage infrastructures. Topics include fundamental tradeoffs in distributed systems, techniques for exploiting parallelism, big-data computation and storage models, the design and implementation of various well-known distributed systems infrastructures, and concrete exposure to programming big-data applications on top of popular, open-source infrastructures for data processing and storage.

The course is co-taught by Roxana Geambasu (Associate Prof. in CS), Sambit Sahu (IBM TJ Watson researcher and CS-affiliated faculty), and Eugene Wu (Assistant Prof. in CS). Geambasu will teach fundamental concepts of distributed systems, along with the tradeoffs that arise. Sahu will teach various distributed computation models, along with concrete examples of open-source big-data technologies and how they can be programmed. Wu will teach concepts of data modeling, storage, and visualization, along with the tradeoffs they raise.

Schedule