
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache (2401.02669v2)

Published 5 Jan 2024 in cs.DC and cs.AR

Abstract: LLMs demonstrate substantial potential across a diverse array of domains via request serving. However, as trends continue to push for expanding context sizes, the autoregressive nature of LLMs results in highly dynamic behavior of the attention layers, showcasing significant differences in computational characteristics and memory requirements from the non-attention layers. This presents substantial challenges for resource management and performance optimization in service systems. Existing static model parallelism and resource allocation strategies fall short when dealing with this dynamicity. To address the issue, we propose Infinite-LLM, a novel LLM serving system designed to effectively handle dynamic context lengths. Infinite-LLM disaggregates attention layers from an LLM's inference process, facilitating flexible and independent resource scheduling that optimizes computational performance and enhances memory utilization jointly. By leveraging a pooled GPU memory strategy across a cluster, Infinite-LLM not only significantly boosts system throughput but also supports extensive context lengths. Evaluated on a dataset with context lengths ranging from a few to 2000K tokens across a cluster with 32 A100 GPUs, Infinite-LLM demonstrates throughput improvement of 1.35-3.4x compared to state-of-the-art methods, enabling efficient and elastic LLM deployment.

Efficient Management of LLM Services for Long Contexts in Cloud Environments

Introduction to DistAttention and DistKV-LLM

In the rapidly evolving landscape of AI and machine learning, LLMs have emerged as foundational building blocks, driving advances in applications ranging from chatbots to automated content generation. However, as these models scale, particularly in cloud-based services, they pose unique challenges in managing the extensive computational and memory resources they require, especially for tasks involving long-context sequences. This paper takes a significant step toward addressing these challenges with DistAttention, a novel distributed attention algorithm, and DistKV-LLM, a serving engine optimized for efficient management of distributed Key-Value (KV) caches.

The Challenges Addressed

The dynamic, auto-regressive nature of LLM inference makes it difficult to determine resource requirements in advance, particularly for requests with highly variable context lengths. This unpredictability often leads to inefficient resource allocation, hurting performance and scalability in cloud environments. Traditional model parallelism techniques, while useful, fall short of the memory demands imposed by long-context sequences. Existing workarounds such as live migration or memory swapping either introduce significant overheads or fail to use available resources effectively.

DistAttention: A Distributed Attention Mechanism

DistAttention addresses these challenges by segmenting the KV Cache into smaller, manageable units, enabling distributed processing across a cloud-based environment. This not only facilitates efficient memory management but also circumvents the performance bottlenecks associated with data swapping or live migrations. By leveraging all accessible GPU and CPU memory resources across the data center, DistAttention optimizes resource utilization, significantly enhancing the system’s adaptability and performance for long-context tasks.
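
To make the mechanism concrete, the following is a minimal sketch (not the authors' implementation) of block-wise attention over a KV cache that has been split into fixed-size sub-blocks, with partial results merged through the standard online-softmax rescaling trick. The block size and function names such as `micro_attention` and `dist_attention` are illustrative assumptions.

```python
# Minimal sketch: attention over a KV cache partitioned into sub-blocks,
# where each block's partial result can be computed wherever that block lives
# and merged afterwards. Names and block size are illustrative assumptions.
import numpy as np

def micro_attention(q, k_blk, v_blk):
    """Attention of one query vector against a single KV sub-block.
    Returns the unnormalized output, the running max, and the softmax sum,
    so partial results from remote blocks can be merged later."""
    scores = k_blk @ q / np.sqrt(q.shape[-1])       # (blk_len,)
    m = scores.max()
    w = np.exp(scores - m)                          # numerically stable exponentials
    return w @ v_blk, m, w.sum()

def dist_attention(q, kv_blocks):
    """Aggregate micro-attention results from independently held KV blocks."""
    out, m_run, s_run = 0.0, -np.inf, 0.0
    for k_blk, v_blk in kv_blocks:                  # blocks may live on other GPUs
        o_blk, m_blk, s_blk = micro_attention(q, k_blk, v_blk)
        m_new = max(m_run, m_blk)
        out = out * np.exp(m_run - m_new) + o_blk * np.exp(m_blk - m_new)
        s_run = s_run * np.exp(m_run - m_new) + s_blk * np.exp(m_blk - m_new)
        m_run = m_new
    return out / s_run                              # final softmax-normalized output

# Toy usage: a 4096-token context split into 512-token blocks.
d, block_size, ctx = 64, 512, 4096
rng = np.random.default_rng(0)
q = rng.standard_normal(d)
blocks = [(rng.standard_normal((block_size, d)),
           rng.standard_normal((block_size, d)))
          for _ in range(ctx // block_size)]
y = dist_attention(q, blocks)
```

Because each sub-block's contribution is summarized by an unnormalized output, a running max, and a softmax sum, the blocks can be stored on different devices and their partial results combined with only a small amount of communication per block.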

DistKV-LLM: Streamlining KV Cache Management

Building on the foundation laid by DistAttention, DistKV-LLM is a distributed LLM serving engine that manages KV caches across the GPUs and CPUs of a data center. It introduces a protocol that keeps interactions among the many LLM service instances scalable and coherent, addressing the dynamic and unpredictable resource demands of long-context serving. DistKV-LLM's architecture prioritizes data locality and communication efficiency, both crucial for maintaining performance in long-context scenarios.
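
The paper describes this coordination at the protocol level; the sketch below is a toy illustration, under assumed names (`GlobalManager`, `InstancePool`, `allocate`), of the kind of borrow/lend bookkeeping such a scheme implies: an instance that exhausts its local KV-cache pool is granted spare cache blocks on another instance, with a global view tracking what each instance has lent out.

```python
# Toy sketch, not DistKV-LLM's API: a coordinator that lets an LLM instance
# whose local KV-cache pool is exhausted borrow cache blocks from another
# instance with spare capacity. All names here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class InstancePool:
    """Per-instance view of its local KV-cache block pool."""
    instance_id: str
    capacity: int                     # total cache blocks on this instance's GPUs
    used: int = 0
    lent_out: int = 0                 # blocks currently serving other instances

    @property
    def free(self) -> int:
        return self.capacity - self.used - self.lent_out

class GlobalManager:
    """Tracks free blocks cluster-wide and brokers borrow requests."""
    def __init__(self, pools):
        self.pools = {p.instance_id: p for p in pools}

    def allocate(self, instance_id, n_blocks):
        """Return a placement plan: (owner_instance, blocks) pairs covering n_blocks."""
        plan, remaining = [], n_blocks
        local = self.pools[instance_id]
        take = min(local.free, remaining)             # prefer local memory first
        if take:
            local.used += take
            plan.append((instance_id, take))
            remaining -= take
        for pool in self.pools.values():              # then borrow from neighbors
            if remaining == 0:
                break
            if pool.instance_id == instance_id or pool.free == 0:
                continue
            take = min(pool.free, remaining)
            pool.lent_out += take
            plan.append((pool.instance_id, take))
            remaining -= take
        if remaining:
            raise MemoryError("cluster-wide KV-cache pool exhausted")
        return plan

# Toy usage: instance A needs 12 blocks but only has 8 free locally.
gm = GlobalManager([InstancePool("A", capacity=8), InstancePool("B", capacity=16)])
print(gm.allocate("A", 12))   # [('A', 8), ('B', 4)]
```

A real placement decision would also weigh data locality and communication cost, which this toy coordinator deliberately ignores.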

Evaluation and Findings

The proposed system was evaluated in a cloud setup with 32 NVIDIA A100 GPUs across various configurations. Benchmarked on 18 datasets, it achieved 1.03-2.4x higher throughput and supported context lengths 2-19x longer than existing state-of-the-art LLM serving systems. These results validate the effectiveness of the proposed techniques and point to substantial performance gains in practical deployments.

Implications and Future Directions

The innovations presented in this paper, comprising DistAttention and DistKV-LLM, offer a powerful toolkit for optimizing LLM services in cloud environments. By addressing the core challenges associated with long-context sequence tasks, this research paves the way for more efficient, scalable, and adaptable LLM services. Looking ahead, the principles and mechanisms outlined here could inspire new avenues of research and development, focusing on leveraging distributed computing resources for advanced AI applications.

In summary, this paper contributes significantly to the field of AI and cloud computing by offering a robust solution to the pressing challenges of managing LLM services for long-context tasks. As LLMs continue to grow in size and complexity, the strategies and technologies developed in this work will undoubtedly play a critical role in the future evolution of cloud-based AI services.

Authors (15)
  1. Bin Lin (33 papers)
  2. Tao Peng (53 papers)
  3. Chen Zhang (403 papers)
  4. Minmin Sun (3 papers)
  5. Lanbo Li (1 paper)
  6. Hanyu Zhao (23 papers)
  7. Wencong Xiao (10 papers)
  8. Xiafei Qiu (5 papers)
  9. Shen Li (77 papers)
  10. Zhigang Ji (4 papers)
  11. Yong Li (628 papers)
  12. Wei Lin (207 papers)
  13. Anmin Liu (4 papers)
  14. Zhipeng Zhang (50 papers)
  15. Tao Xie (117 papers)