CS 675 Distributed Systems (Spring 2020)

Reading List

Big Data Systems

MapReduce: Simplified Data Processing on Large Clusters [USENIX OSDI 2004]

The Google File System [ACM SOSP 2003] (optional)

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing [USENIX NSDI 2012]

The RAMCloud Storage System [ACM Transactions on Computer Systems] (optional)

The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM [ACM SIGOPS Operating Systems Review] (optional)

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks [ACM EuroSys 2007] (optional)

Distributed Consensus

Paxos Made Simple [ACM SIGACT News 2001]

In Search of an Understandable Consensus Algorithm [USENIX ATC 2014]

Paxos Made Live - An Engineering Perspective (optional)

The Chubby Lock Service for Loosely-Coupled Distributed Systems [USENIX OSDI 2006] (optional)

ZooKeeper: Wait-free coordination for Internet-scale systems [USENIX ATC 2010] (optional)

Serverless Computing

Cloud Programming Simplified: A Berkeley View on Serverless Computing [Technical report]

InfiniCache: Exploiting Ephemeral Serverless Functions to Build a Cost-Effective Memory Cache [USENIX FAST 2020]

In Search of a Fast and Efficient Serverless DAG Engine [PDSW 2019]

Serverless Computation with OpenLambda [USENIX HotCloud 2016] (optional)

Firecracker: Lightweight Virtualization for Serverless Applications [USENIX NSDI 2020] (optional)

Occupying the Cloud: Distributed Computing for the 99% [ACM SoCC 2017] (optional)

Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads [USENIX NSDI 2017] (optional)

Large-Scale Distributed Storage

Dynamo: Amazon’s Highly Available Key-value Store [ACM SOSP 2007]

Scaling Memcache at Facebook [USENIX NSDI 2013]

Distributed Machine Learning Systems

Ray: A Distributed Framework for Emerging AI Applications [USENIX OSDI 2018]

Scaling Distributed Machine Learning with the Parameter Server [USENIX OSDI 2014] (optional)

TensorFlow: A System for Large-Scale Machine Learning [USENIX OSDI 2016] (optional)

MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems (optional)

Datacenter Scheduling (optional)

Large-scale cluster management at Google with Borg [ACM EuroSys 2015]

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center [USENIX NSDI 2011]

The Datacenter as a Computer (optional)