Summer 2026

DS5110: Big Data Systems

Scalable data systems, parallel analytics, cloud infrastructure, and modern AI systems/infrastructure

Latest Announcement

Announcements

May 4, 2026

Lecture materials will be posted here

Slides, readings, demos, and lab links will be added to the Materials tab as they become available.

Overview

Systems for modern data science and AI

Welcome to DS5110: Big Data Systems. Scalable big data systems are a central part of modern data science. This course covers the design and use of parallel dataflow systems (MapReduce/Hadoop and Spark), scalable and parallel Python analytics frameworks, machine learning systems (Ray), and cloud data systems (cloud storage, large ML infrastructure). A major component of the course is hands-on programming using scalable analytics tools and cloud resources on Amazon Web Services (AWS) or Google Cloud.

Course Topics

  • Computer systems foundations and distributed computing
  • Google File System, HDFS, and MapReduce
  • Apache Spark and parallel dataflow systems
  • Parallel Python analytics and Ray
  • Cloud computing, storage, and serverless systems
  • Large-scale AI and LLM systems

Syllabus

Course policies and expectations

Calendar

May 18 to June 12, Monday through Friday

Date Topic Materials Notes

Lecture Materials

Slides, readings, and demos

Assignments

Coursework and deadlines

Staff

Instructor and course support