
DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving (2403.01876v1)

Published 4 Mar 2024 in cs.DC

Abstract: Distributed LLM serving is costly and often underutilizes hardware accelerators due to three key challenges: bubbles in pipeline-parallel deployments caused by the bimodal latency of prompt and token processing, GPU memory overprovisioning, and long recovery times in case of failures. In this paper, we propose DéjàVu, a system to address all these challenges using a versatile and efficient KV cache streaming library (DéjàVuLib). Using DéjàVuLib, we propose and implement efficient prompt-token disaggregation to reduce pipeline bubbles, microbatch swapping for efficient GPU memory management, and state replication for fault-tolerance. We highlight the efficacy of these solutions on a range of large models across cloud deployments.

Summary

  • The paper introduces efficient KV-cache streaming with prompt-token disaggregation, microbatch swapping, and state replication to improve latency and fault tolerance.
  • The microbatch swapping technique dynamically manages GPU memory, enabling up to 1.8× throughput improvement by reducing memory footprint.
  • The state replication mechanism ensures rapid recovery from failures, significantly minimizing downtime in distributed LLM operations.

DéjàVu: Enhancing Generative LLM Serving with KV-cache Streaming

Introduction to DéjàVu

The advent of distributed serving for LLMs like GPT-3, OPT, and BLOOM has transformed numerous applications, including chatbots, code generation, and text summarization. However, the underlying infrastructure often suffers from hardware underutilization, overprovisioned GPU memory, and long recovery times after failures. To address these challenges, the paper presents DéjàVu, a system built around an efficient Key-Value (KV) cache streaming library, DéjàVuLib. On top of this library, DéjàVu introduces prompt-token disaggregation, microbatch swapping, and state replication to provide fast, scalable, and fault-tolerant generative LLM serving.

Disaggregating Prompt Processing from Token Generation

Prompt processing (prefill) and per-token generation have markedly different latencies, and this bimodal behavior creates stalls, or "pipeline bubbles", in pipeline-parallel deployments. DéjàVu addresses this by disaggregating prompt processing from token generation, assigning each phase its own machines so that resources can be provisioned to match each phase's demands and GPU utilization stays high. The KV cache produced during prompt processing is then streamed to the token-generation workers via DéjàVuLib, keeping the handoff between the two phases fast and off the critical path.
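
To make the disaggregated flow concrete, here is a minimal Python sketch under simplifying assumptions (dummy tensors stand in for real attention, and an in-process generator stands in for DéjàVuLib's network streaming; the function names and shapes are illustrative, not the paper's API). A prompt worker runs the prefill phase, its KV cache is streamed layer by layer, and a token worker resumes decoding from the received cache:

```python
# Hypothetical sketch of prompt-token disaggregation; not DéjàVu's actual code.
import torch

NUM_LAYERS, NUM_HEADS, HEAD_DIM = 4, 8, 64

def prompt_worker(prompt_len: int):
    """Run the prefill phase and return the resulting KV cache (one (K, V) pair per layer)."""
    kv_cache = []
    for _ in range(NUM_LAYERS):
        k = torch.randn(NUM_HEADS, prompt_len, HEAD_DIM)  # stand-in for real attention keys
        v = torch.randn(NUM_HEADS, prompt_len, HEAD_DIM)  # stand-in for real attention values
        kv_cache.append((k, v))
    return kv_cache

def stream_kv_cache(kv_cache):
    """Stand-in for library-style streaming: hand over the cache layer by layer."""
    for layer_id, (k, v) in enumerate(kv_cache):
        yield layer_id, k.clone(), v.clone()  # clone() models a copy over the network

def token_worker(kv_stream, max_new_tokens: int):
    """Receive the streamed cache, then decode, appending one token's KV entries per step."""
    kv_cache = [None] * NUM_LAYERS
    for layer_id, k, v in kv_stream:
        kv_cache[layer_id] = (k, v)
    for _ in range(max_new_tokens):
        for layer_id, (k, v) in enumerate(kv_cache):
            new_k = torch.randn(NUM_HEADS, 1, HEAD_DIM)  # KV entries of the new token
            new_v = torch.randn(NUM_HEADS, 1, HEAD_DIM)
            kv_cache[layer_id] = (torch.cat([k, new_k], dim=1), torch.cat([v, new_v], dim=1))
    return kv_cache

cache = prompt_worker(prompt_len=128)
final_cache = token_worker(stream_kv_cache(cache), max_new_tokens=16)
print(final_cache[0][0].shape)  # torch.Size([8, 144, 64]): 128 prompt + 16 generated tokens
```

In a real deployment, the layer-by-layer transfer would be overlapped with ongoing prompt processing so that the token workers can start as soon as the first layers arrive, rather than waiting for the full cache.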

Microbatch Swapping for GPU Memory Management

Conventional systems tend to overprovision GPU memory by preallocating KV cache space for all microbatches, even though the microbatches are processed sequentially. DéjàVu introduces microbatch swapping, which dynamically manages GPU memory by swapping KV cache state between the GPU and the CPU so that only the active microbatch's cache occupies GPU memory. This significantly reduces the GPU memory footprint, allowing larger batch sizes and better resource utilization.
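
The sketch below illustrates the idea under simple assumptions (PyTorch, a single GPU, a dummy compute step; the buffer layout and function names are hypothetical, not the paper's implementation). Each microbatch's KV cache lives in pinned host memory and is copied to the GPU on a side CUDA stream only while that microbatch is being processed:

```python
# Hypothetical sketch of microbatch swapping; not DéjàVu's actual implementation.
import torch

assert torch.cuda.is_available(), "this sketch assumes a CUDA device"
device = torch.device("cuda")
copy_stream = torch.cuda.Stream()  # side stream so swaps can overlap with compute

NUM_MICROBATCHES, CACHE_SHAPE = 4, (8, 1024, 64)

# All microbatch caches start in pinned host memory (cheap to DMA to/from the GPU).
host_caches = [torch.zeros(CACHE_SHAPE, pin_memory=True) for _ in range(NUM_MICROBATCHES)]

def process_microbatch(mb_id: int):
    # 1. Swap in: copy this microbatch's cache to the GPU on the side stream.
    with torch.cuda.stream(copy_stream):
        gpu_cache = host_caches[mb_id].to(device, non_blocking=True)
    torch.cuda.current_stream().wait_stream(copy_stream)  # compute waits for the swap-in

    # 2. Compute: stand-in for one pipeline stage's attention over the cache.
    gpu_cache += 1.0

    # 3. Swap out: copy the updated cache back to pinned host memory. A real system
    #    would overlap this with the next microbatch's compute; here we simply wait.
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        host_caches[mb_id].copy_(gpu_cache, non_blocking=True)
    copy_stream.synchronize()

for step in range(2):  # two decoding steps over all microbatches
    for mb in range(NUM_MICROBATCHES):
        process_microbatch(mb)
torch.cuda.synchronize()
print(host_caches[0].mean().item())  # 2.0 after two increments
```

The key design point is that only one microbatch's cache needs to reside on the GPU at a time; the freed memory can then hold more or longer sequences per batch.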

State Replication for Fault Tolerance

The paper identifies the stateful nature of LLM serving, inherent in KV caching, as a vulnerability, particularly in distributed setups prone to hardware or software failures. DéjàVu enhances fault tolerance through state replication, ensuring that KV cache states are replicated across different nodes. This approach avoids the redundant computation otherwise needed to rebuild lost state after a failure, thus reducing recovery times and improving overall reliability.
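
As a rough, in-process illustration (a stand-in, not DéjàVu's actual replication protocol or API), the sketch below mirrors each newly generated token's KV entries to a replica store; after a simulated failure, the worker restores the cache from the replica rather than recomputing the prompt and all generated tokens:

```python
# Hypothetical sketch of KV-cache state replication; not the paper's protocol.
import torch

HEADS, HEAD_DIM = 8, 64

class ReplicaStore:
    """Stand-in for a peer node (or CPU/remote storage) that mirrors the KV cache."""
    def __init__(self):
        self._chunks = {}  # request_id -> list of per-token (K, V) chunks
    def append(self, request_id, kv_chunk):
        self._chunks.setdefault(request_id, []).append(kv_chunk)
    def restore(self, request_id):
        chunks = self._chunks[request_id]
        ks = torch.cat([k for k, _ in chunks], dim=1)
        vs = torch.cat([v for _, v in chunks], dim=1)
        return ks, vs

replica = ReplicaStore()

def decode_step(request_id, kv_cache):
    """Generate one token's KV entries, extend the local cache, and replicate the delta."""
    new_k, new_v = torch.randn(HEADS, 1, HEAD_DIM), torch.randn(HEADS, 1, HEAD_DIM)
    replica.append(request_id, (new_k, new_v))  # replicate only the new entries
    if kv_cache is None:
        return new_k, new_v
    k, v = kv_cache
    return torch.cat([k, new_k], dim=1), torch.cat([v, new_v], dim=1)

# Normal operation: a few decode steps for request "r0".
cache = None
for _ in range(5):
    cache = decode_step("r0", cache)

# Simulated failure: the local cache is lost...
cache = None
# ...and recovery pulls the replicated state instead of recomputing it.
cache = replica.restore("r0")
print(cache[0].shape)  # torch.Size([8, 5, 64])
```

Replicating only the incremental KV entries keeps the replication traffic proportional to the tokens generated, rather than re-sending the whole cache at every step.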

Evaluating DéjàVu: Findings and Implications

The empirical evaluation of DéjàVu highlights its effectiveness across several metrics. Compared to existing LLM serving systems, DéjàVu demonstrates up to a 2× improvement in throughput. Moreover, the microbatch swapping mechanism enables up to a 1.8× throughput improvement by supporting larger batch sizes. In scenarios with system failures, DéjàVu achieves a notable decrease in recovery time, affirming its fault-tolerance capabilities.

Beyond these immediate performance gains, DéjàVu's contributions have broader implications. By addressing the inefficiencies in distributed LLM serving, DéjàVu not only improves the utilization of computational resources but also makes it more practical to serve larger and more complex models. Future developments could explore further optimizations in KV cache management and fault tolerance, alongside expanding DéjàVu's applicability to a broader range of distributed AI serving tasks.

Conclusion

In summary, DéjàVu emerges as a pivotal solution to the prevailing challenges in the distributed serving of generative LLMs. Through strategies like prompt-token disaggregation, microbatch swapping, and state replication, DéjàVu sets a new standard for efficiency, scalability, and resilience in LLM serving systems. As the field continues to evolve, the principles and mechanisms introduced by DéjàVu are likely to influence future research and development in AI and machine learning infrastructure.
