DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving (2403.01876v1)
Abstract: Distributed LLM serving is costly and often underutilizes hardware accelerators due to three key challenges: bubbles in pipeline-parallel deployments caused by the bimodal latency of prompt and token processing, GPU memory overprovisioning, and long recovery times in case of failures. In this paper, we propose DéjàVu, a system to address all these challenges using a versatile and efficient KV cache streaming library (DéjàVuLib). Using DéjàVuLib, we propose and implement efficient prompt-token disaggregation to reduce pipeline bubbles, microbatch swapping for efficient GPU memory management, and state replication for fault-tolerance. We highlight the efficacy of these solutions on a range of large models across cloud deployments.
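The common mechanism behind all three techniques is DéjàVuLib's ability to stream KV cache out of (and back into) serving workers. As a rough illustration of the prompt-token disaggregation idea, the sketch below shows a prefill worker computing a prompt's KV cache and streaming it layer by layer to a separate decode worker, which starts token generation once the full cache has arrived. This is a minimal sketch under stated assumptions, not DéjàVuLib's actual API: the names (`KVChunk`, `prefill_worker`, `decode_worker`) are hypothetical, numpy arrays stand in for GPU tensors, and a bounded in-process queue stands in for the network or NVLink transport.

```python
# Hypothetical sketch of prompt-token disaggregation via KV-cache streaming.
# All names are illustrative, not DejaVuLib's real API; numpy arrays stand
# in for per-layer GPU KV tensors.
import queue
import threading
from dataclasses import dataclass

import numpy as np


@dataclass
class KVChunk:
    request_id: int
    layer: int          # transformer layer this chunk belongs to
    keys: np.ndarray    # [num_tokens, head_dim] key cache for the layer
    values: np.ndarray  # [num_tokens, head_dim] value cache for the layer


def prefill_worker(prompts, kv_stream, num_layers=4, head_dim=64):
    """Compute the prompt's KV cache and stream each layer's chunk as soon
    as it is ready, rather than waiting for the whole prompt pass."""
    for rid, prompt_len in prompts:
        for layer in range(num_layers):
            # Stand-in for the real attention computation on this layer.
            k = np.random.randn(prompt_len, head_dim).astype(np.float32)
            v = np.random.randn(prompt_len, head_dim).astype(np.float32)
            kv_stream.put(KVChunk(rid, layer, k, v))
    kv_stream.put(None)  # end-of-stream sentinel


def decode_worker(kv_stream, num_layers=4):
    """Receive streamed KV chunks and begin token generation for a request
    once all of its layers have arrived."""
    caches = {}  # request_id -> {layer: (keys, values)}
    while (chunk := kv_stream.get()) is not None:
        layers = caches.setdefault(chunk.request_id, {})
        layers[chunk.layer] = (chunk.keys, chunk.values)
        if len(layers) == num_layers:  # full prompt KV cache received
            print(f"request {chunk.request_id}: KV cache complete, decoding...")


if __name__ == "__main__":
    stream = queue.Queue(maxsize=8)  # bounded: applies backpressure
    prompts = [(0, 128), (1, 256)]   # (request_id, prompt_length)
    t = threading.Thread(target=decode_worker, args=(stream,))
    t.start()
    prefill_worker(prompts, stream)
    t.join()
```

Streaming per layer rather than per request lets the transfer overlap with the remaining prefill computation, the kind of compute-communication overlap a streaming library like DéjàVuLib can exploit; the same chunked-transfer pattern also underlies microbatch swapping (streaming cache to host memory and back) and state replication (streaming cache to a standby replica).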