
Reading List

Entries further out are less concrete; the reading list is updated incrementally to include more papers as we go.

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Efficient Memory Management for Large Language Model Serving with PagedAttention [ACM SOSP 2023]

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management [USENIX OSDI 2024]

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve [USENIX OSDI 2024]

Understanding Stragglers in Large Model Training Using What-if Analysis [USENIX OSDI 2025]

DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation


© 2026 Yue Cheng. Released under the CC BY-SA license