Reading List
The list is less concrete further out and is updated incrementally with more papers as we go.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Efficient Memory Management for Large Language Model Serving with PagedAttention [ACM SOSP 2023]
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management [USENIX OSDI 2024]
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve [USENIX OSDI 2024]
Understanding Stragglers in Large Model Training Using What-if Analysis [USENIX OSDI 2025]
DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation