DS5110, Spring’25: Big Data Systems
Week 3 Announcement
- Lecture 5’s slides are posted.
- Assignment 1 is out!
Overview
Welcome to Big Data Systems. Scalable big data systems are a central part of modern data science. This course covers the design and use of parallel dataflow systems (MapReduce/Hadoop and Spark), scalable and parallel Python analytics frameworks, machine learning systems (Ray), and cloud data systems (cloud storage, large ML infrastructure). A major component of the course is hands-on programming with scalable analytics tools and cloud resources on Amazon Web Services (AWS) or Google Cloud.
Lecture Info
- Instructor: Yue Cheng
- Meeting time: TuTh 2:00 pm - 3:15 pm
- Location: Data Science Building Room 305
Topics (tentative)
- Basics of computer and data systems; principles of parallel and distributed computing
- Google’s big data infrastructure (MapReduce, Google File System)
- Apache Spark
- Parallel Python analytics
- Machine learning systems (Ray, LLMs)
- Cloud computing
- Serverless computing
- Large-scale cloud storage systems (Amazon Dynamo, AWS S3/DynamoDB)
- AI/ML platforms (Hugging Face)
Prerequisites
- All students should be comfortable programming in at least one of the following languages: Python, Java, Go, or C/C++. This is a strong requirement because DS5110/CS5501 features hands-on programming.
- That said, comfort with Python is strongly recommended, as all programming assignments will be done in Python. Some experience with Java, Go, or C/C++ is a big plus!