
The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving (2405.11299v2)

Published 18 May 2024 in cs.DB and cs.LG

Abstract: We survey the LLM serving area to understand the intricate dynamics between cost-efficiency and accuracy, which is magnified by the growing need for longer contextual understanding when deploying models at a massive scale. Our findings reveal that works in this space optimize along three distinct but conflicting goals: improving serving context length (C), improving serving accuracy (A), and improving serving performance (P). Drawing inspiration from the CAP theorem in databases, we propose a CAP principle for LLM serving, which suggests that any optimization can improve at most two of these three goals simultaneously. Our survey categorizes existing works within this framework. We find the definition and continuity of user-perceived measurement metrics are crucial in determining whether a goal has been met, akin to prior CAP databases in the wild. We recognize the CAP principle for LLM serving as a guiding principle, rather than a formal theorem, to inform designers of the inherent and dynamic trade-offs in serving models. As serving accuracy and performance have been extensively studied, this survey focuses on works that extend serving context length and address the resulting challenges.

References (115)
  1. Levels of agi: Operationalizing progress on the path to agi. arXiv preprint arXiv:2311.02462, 2023.
  2. The AI Index Report. https://aiindex.stanford.edu/report/.
  3. A survey of resource-efficient llm and multimodal foundation models. arXiv preprint arXiv:2401.08092, 2024.
  4. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294, 2024.
  5. Capabilities of gemini models in medicine. arXiv preprint arXiv:2404.18416, 2024.
  6. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5, 2023.
  7. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.
  8. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
  9. TensorRT LLM. https://github.com/NVIDIA/TensorRT-LLM.
  10. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
  11. Atom: Low-bit quantization for efficient and accurate llm serving. arXiv preprint arXiv:2310.19102, 2023.
  12. Efficient streaming language models with attention sinks, 2023.
  13. Orca: A distributed serving system for Transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022.
  14. Inference without interference: Disaggregate llm inference for mixed downstream workloads. arXiv preprint arXiv:2401.11181, 2024.
  15. The shift from models to compound ai systems. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/, 2024.
  16. Roformer: Enhanced transformer with rotary position embedding. CoRR, abs/2104.09864, 2021.
  17. A survey on model compression for large language models. arXiv preprint arXiv:2308.07633, 2023.
  18. Wikipedia. The cap theorem. https://en.wikipedia.org/wiki/CAP_theorem, 2024.
  19. Understanding emergent abilities of language models from the loss perspective. arXiv preprint arXiv:2403.15796, 2024.
  20. Are emergent abilities of large language models a mirage? Advances in Neural Information Processing Systems, 36, 2024.
  21. Google. Spanner, truetime & the cap theorem. https://storage.googleapis.com/gweb-research2023-media/pubtools/pdf/45855.pdf, 2017.
  22. The what, why, and how of context length extension techniques in large language models–a detailed survey. arXiv preprint arXiv:2401.07872, 2024.
  23. A survey on long text modeling with transformers. arXiv preprint arXiv:2302.14502, 2023.
  24. Beyond the limits: A survey of techniques to extend the context length in large language models. arXiv preprint arXiv:2402.02244, 2024.
  25. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864, 2023.
  26. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023.
  27. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.
  28. Memorizing transformers. arXiv preprint arXiv:2203.08913, 2022.
  29. Memformer: A memory-augmented transformer for sequence modeling. arXiv preprint arXiv:2010.06891, 2020.
  30. Memory transformer. arXiv preprint arXiv:2006.11527, 2020.
  31. Recurrent memory transformer. Advances in Neural Information Processing Systems, 35:11079–11091, 2022.
  32. Adapting language models to compress contexts. arXiv preprint arXiv:2305.14788, 2023.
  33. Leave no context behind: Efficient infinite context transformers with infini-attention. arXiv preprint arXiv:2404.07143, 2024.
  34. LegoOS: A disseminated, distributed OS for hardware resource disaggregation. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 69–87, Carlsbad, CA, October 2018. USENIX Association.
  35. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
  36. Train short, test long: Attention with linear biases enables input length extrapolation, 2022.
  37. A length-extrapolatable transformer, 2022.
  38. Clex: Continuous length extrapolation for large language models, 2024.
  39. Extending context window of large language models via positional interpolation, 2023.
  40. bloc97. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation, 2023. https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/, Last accessed on 2023-12-19.
  41. Yarn: Efficient context window extension of large language models, 2023.
  42. Functional interpolation for relative positions improves long context transformers, 2024.
  43. Longrope: Extending llm context window beyond 2 million tokens, 2024.
  44. Pose: Efficient context window extension of llms via positional skip-wise training, 2024.
  45. Attention sorting combats recency bias in long context language models. arXiv preprint arXiv:2310.01427, 2023.
  46. Fortify the shortest stave in attention: Enhancing context awareness of large language models for effective tool use. arXiv preprint arXiv:2312.04455, 2023.
  47. Found in the middle: How language models use long contexts better via plug-and-play positional encoding. arXiv preprint arXiv:2403.04797, 2024.
  48. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
  49. Efficiently programming large language models using sglang. arXiv preprint arXiv:2312.07104, 2023.
  50. Hardware-software co-design enabling static and dynamic sparse attention mechanisms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pages 1–1, 2024.
  51. Adaptively sparse transformers, 2019.
  52. Sparse sinkhorn attention, 2020.
  53. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021.
  54. Reformer: The efficient transformer, 2020.
  55. Landmark attention: Random-access infinite context length for transformers, 2023.
  56. A3: Accelerating attention mechanisms in neural networks with approximation, 2020.
  57. Spatten: Efficient sparse attention architecture with cascade token and head pruning, 2021.
  58. Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021.
  59. Dota: detect and omit weak attentions for scalable transformer acceleration. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2022.
  60. Acceltran: A sparsity-aware accelerator for dynamic inference with transformers, 2023.
  61. Fact: Ffn-attention co-optimized transformer architecture with eager correlation prediction. Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023.
  62. Energon: Toward efficient acceleration of transformers using dynamic sparse attention. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 42(1):136–149, 2023.
  63. Dtqatten: Leveraging dynamic token-based quantization for efficient attention architecture. In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 700–705, 2022.
  64. Blockwise self-attention for long document understanding. ArXiv, abs/1911.02972, 2019.
  65. Generating long sequences with sparse transformers, 2019.
  66. Longformer: The long-document transformer, 2020.
  67. Big bird: Transformers for longer sequences. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17283–17297. Curran Associates, Inc., 2020.
  68. Star-transformer, 2022.
  69. LongT5: Efficient text-to-text transformer for long sequences. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz, editors, Findings of the Association for Computational Linguistics: NAACL 2022, pages 724–736, Seattle, United States, July 2022. Association for Computational Linguistics.
  70. Longnet: Scaling transformers to 1,000,000,000 tokens, 2023.
  71. Zebra: Extending context window with layerwise grouped local-global attention, 2023.
  72. Vitcod: Vision transformer acceleration via dedicated algorithm and accelerator co-design, 2022.
  73. Salo: An efficient spatial accelerator enabling hybrid sparse attention mechanisms for long sequences, 2022.
  74. Lm-infinite: Zero-shot extreme length generalization for large language models, 2023.
  75. Model tells you what to discard: Adaptive kv cache compression for llms, 2024.
  76. H2o: Heavy-hitter oracle for efficient generative inference of large language models, 2023.
  77. Keyformer: Kv cache reduction through key tokens selection for efficient generative inference, 2024.
  78. Sparq attention: Bandwidth-efficient llm inference, 2024.
  79. On the efficacy of eviction policy for key-value constrained generative language model inference, 2024.
  80. Get more with less: Synthesizing recurrence with kv cache compression for efficient llm inference, 2024.
  81. Infllm: Unveiling the intrinsic capacity of llms for understanding extremely long sequences with training-free memory, 2024.
  82. ETC: encoding long and structured inputs in transformers. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 268–284. Association for Computational Linguistics, 2020.
  83. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020.
  84. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
  85. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 3531–3539, 2021.
  86. Scatterbrain: Unifying sparse and low-rank attention. Advances in Neural Information Processing Systems, 34:17413–17426, 2021.
  87. Vitality: Unifying low-rank and sparse approximation for vision transformer acceleration with a linear taylor attention. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 415–428. IEEE, 2023.
  88. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018.
  89. Self-attention does not need O(n^2) memory. arXiv preprint arXiv:2112.05682, 2021.
  90. Blockwise parallel transformers for large context models. Advances in Neural Information Processing Systems, 36, 2024.
  91. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889, 2023.
  92. Burstattention: An efficient distributed attention framework for extremely long sequences. arXiv preprint arXiv:2403.09347, 2024.
  93. Striped attention: Faster ring attention for causal transformers. arXiv preprint arXiv:2311.09431, 2023.
  94. Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache. arXiv preprint arXiv:2401.02669, 2024.
  95. Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism. arXiv preprint arXiv:2404.09526, 2024.
  96. Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
  97. Sequence parallelism: Long sequence training from system perspective. arXiv preprint arXiv:2105.13120, 2021.
  98. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023.
  99. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023.
  100. Yucheng Li. Unlocking context constraints of llms: Enhancing context efficiency of llms with self-information-based content filtering. CoRR, abs/2304.12102, 2023.
  101. Llmlingua: Compressing prompts for accelerated inference of large language models. arXiv preprint arXiv:2310.05736, 2023.
  102. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. arXiv preprint arXiv:2310.06839, 2023.
  103. Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. arXiv preprint arXiv:2403.12968, 2024.
  104. Learning to compress prompts with gist tokens. Advances in Neural Information Processing Systems, 36, 2024.
  105. Prompt Compression and Contrastive Conditioning for Controllability and Toxicity Reduction in Language Models. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5621–5634, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics.
  106. In-context autoencoder for context compression in a large language model. CoRR, abs/2307.06945, 2023.
  107. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560, 2023.
  108. Walking down the memory maze: Beyond context limit through interactive reading. arXiv preprint arXiv:2310.05029, 2023.
  109. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
  110. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.
  111. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023.
  112. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
  113. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36, 2024.
  114. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
  115. Mlcopilot: Unleashing the power of large language models in solving machine learning tasks. arXiv preprint arXiv:2304.14979, 2023.

Summary

  • The paper introduces the CAP principle for LLM serving, positing that any single optimization can improve at most two of the three goals of context length, accuracy, and performance.
  • It employs an extensive literature survey to assess techniques like dynamic memory, positional embeddings, and sparse attention for improving LLM deployment.
  • Findings highlight practical trade-offs where strategies such as prompt compression and distributed inference enhance scalability while balancing accuracy and computational speed.

The CAP Principle for LLM Serving

The paper entitled "The CAP Principle for LLM Serving," authored by Pai Zeng et al., proposes a framework analogous to the CAP theorem in database systems, tailored to LLM serving. The framework addresses the central trade-offs in deploying and serving LLMs: context length, accuracy, and performance.

Introduction

The prominence of LLMs built upon transformer architectures has significantly influenced the field of artificial intelligence. LLM-based applications have proliferated, and the underlying models now exceed human performance in areas such as image classification and visual reasoning. As the field moves toward artificial general intelligence (AGI), serving these models at massive scale, both efficiently and accurately, becomes paramount. The inherent trade-off between serving performance (e.g., tokens per second) and accuracy has long been a challenge, and it is further complicated by the growing demand for longer contextual understanding.

Key Observations and the CAP Principle

To explore the intricate dynamics within this space, the authors conducted an extensive survey of existing literature and proposed the CAP principle for LLM serving. Inspired by the original CAP theorem in databases, the CAP principle postulates that any optimization directed at LLM serving can enhance at most two of three goals: context length (C), accuracy (A), and performance (P). It is offered as a guiding principle rather than a formal theorem, to elucidate the inherent and dynamic trade-offs in serving models.

Observations

  1. Expanded Scope of Serving Systems: The serving system comprises a model serving layer and an agent serving layer. The former optimizes model structure, caching, and scheduling. The latter, emerging from complex real-world applications, leverages LLM-driven workflows to refine a model's accuracy and efficiency.
  2. Distinct Goals of Optimization: Three conflicting goals are identified: improving context length (C), improving accuracy (A), and enhancing performance (P). The survey categorizes existing works based on which of these goals they prioritize.
  3. The Trilemma: Progress toward one goal (e.g., positional-embedding methods that extend context) does not simultaneously advance the others. For instance, techniques like quantization improve performance but may degrade accuracy.

Improving Context (C)

Model Memory

Model memory augments the transformer architecture with a dynamic, compressive memory system, enabling the model to capture long-range dependencies effectively. Notable works include Transformer-XL, Compressive Transformer, Memformer, and, most recently, Infini-Attention, which integrates compressive and dynamic memory directly into the attention mechanism.
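
To make the idea concrete, the following minimal numpy sketch shows segment-level recurrence with a compressive memory in the spirit of Transformer-XL and the Compressive Transformer: queries from the current segment attend over cached states from earlier segments, and overflowing cache entries are compressed rather than discarded. The mean-pooling compressor, the sizes, and the omitted causal mask are illustrative assumptions, not the papers' implementations.

```python
# Minimal numpy sketch of segment-level recurrence with a compressive memory,
# in the spirit of Transformer-XL / Compressive Transformer. Queries from the
# current segment attend over cached states from earlier segments; overflowing
# cache entries are compressed (mean-pooled here) instead of discarded.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Single-head scaled dot-product attention (causal mask omitted for brevity)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def compress(mem, rate=2):
    """Toy compression: mean-pool pairs of memory slots (real systems learn this)."""
    n = (len(mem) // rate) * rate
    return mem[:n].reshape(-1, rate, mem.shape[-1]).mean(axis=1)

d, seg_len, mem_len = 64, 128, 128
rng = np.random.default_rng(0)
memory = np.zeros((0, d))        # cached hidden states from recent segments
comp_memory = np.zeros((0, d))   # compressed, older memory

for step in range(4):            # a stream of incoming segments
    segment = rng.normal(size=(seg_len, d))
    # Attend over [compressed memory; recent memory; current segment].
    context = np.concatenate([comp_memory, memory, segment], axis=0)
    out = attend(segment, context, context)

    # Slide the cache: overflowing states are compressed rather than dropped.
    memory = np.concatenate([memory, segment], axis=0)
    if len(memory) > mem_len:
        overflow, memory = memory[:-mem_len], memory[-mem_len:]
        comp_memory = np.concatenate([comp_memory, compress(overflow)], axis=0)

    print(step, out.shape, memory.shape, comp_memory.shape)
```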

Positional Embedding

Positional embedding techniques extend the model's context-handling capability. Strategies such as ALiBi, XPOS, and CLEX handle longer contexts either by extrapolating positional encodings beyond the trained length or by interpolating longer sequences back into the positional range the model saw during training.
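
As an illustration of the interpolation idea, the sketch below applies rotary position embedding (RoPE) with linear position interpolation, rescaling positions beyond the trained length back into the trained range. The linear policy is one illustrative choice among several (NTK-aware scaling, YaRN, etc.); this is a hedged sketch, not any paper's reference implementation.

```python
# Minimal sketch of RoPE with linear position interpolation: positions beyond
# the trained context length are rescaled by trained_len / target_len so they
# fall back into the range the model saw during pre-training.
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply RoPE to x of shape (seq, dim) at the given, possibly fractional, positions."""
    seq, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)      # per-pair rotation frequencies
    angles = positions[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

trained_len, target_len, dim = 4096, 16384, 128
x = np.random.default_rng(0).normal(size=(target_len, dim))

# Plain extrapolation: positions run past anything seen during training.
q_extrapolated = rope(x, np.arange(target_len, dtype=float))

# Linear interpolation: squeeze positions back into [0, trained_len).
scale = trained_len / target_len
q_interpolated = rope(x, np.arange(target_len, dtype=float) * scale)
print(q_extrapolated.shape, q_interpolated.shape)
```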

Improving Accuracy (A)

Addressing accuracy in the presence of long contexts poses unique challenges. Techniques like Attention Sorting, Attention Bucket, and Found-in-the-middle have been proposed, each with varying degrees of success. Found-in-the-middle demonstrates that preserving information from the pre-training phase can mitigate the long-term decay effect and improve the accuracy of long-context serving.
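
As a rough illustration of the attention-sorting idea, the sketch below reorders retrieved documents so that the documents receiving the most attention appear last in the prompt, where recency-biased models weight them most. The lexical-overlap scorer is only a stand-in for real summed attention weights, and the single sorting pass simplifies the iterative re-score-and-reorder loop of the original method.

```python
# Rough sketch of attention sorting for retrieved documents. The scorer below
# is a placeholder: a real implementation would decode one step and sum the
# attention each document receives, then re-score after every reorder because
# attention depends on document position.
def attention_mass(question: str, doc: str) -> float:
    q_terms = set(question.lower().split())
    d_terms = doc.lower().split()
    return sum(t in q_terms for t in d_terms) / max(len(d_terms), 1)

def attention_sort(question: str, docs: list) -> list:
    # Least-attended documents first, most-attended last (closest to the question).
    return sorted(docs, key=lambda d: attention_mass(question, d))

docs = [
    "The CAP theorem concerns consistency, availability, and partition tolerance.",
    "Pelicans are large water birds with a distinctive pouch.",
    "LLM serving must balance context length, accuracy, and performance.",
]
print(attention_sort("How does LLM serving trade off context and accuracy?", docs))
```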

Improving Performance (P)

Sparse Attention

Sparse attention techniques, categorized into static and dynamic sparsity, reduce computational and memory overhead by attending only to selected subsets of the input. Methods like Sparse Transformer, Longformer, and StreamingLLM redefine the attention pattern to improve computational performance, trading away some accuracy in exchange.
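
The sketch below builds a static sparse-attention mask that combines a causal sliding window with a handful of always-visible "sink" tokens, loosely following the Longformer / StreamingLLM pattern; the window and sink sizes are arbitrary illustrative values.

```python
# Minimal sketch of a static sparse-attention mask: causal sliding window plus
# a few global "sink" tokens that every query may attend to.
import numpy as np

def sink_window_mask(seq_len: int, window: int = 4, sinks: int = 2) -> np.ndarray:
    """Boolean (query, key) mask: True where attention is allowed."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q              # no looking ahead
    local = (q - k) < window     # recent tokens only
    sink = k < sinks             # always keep the first few tokens
    return causal & (local | sink)

mask = sink_window_mask(seq_len=10)
print(mask.astype(int))
# Each query row attends to at most `window + sinks` keys, so attention cost
# grows linearly with sequence length instead of quadratically.
```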

Linear Attention

Linear attention approximates the attention calculation, reducing the complexity from quadratic to linear. Works such as Linear Transformer and Performer utilize kernel methods to achieve this transformation, offering improved performance with manageable accuracy degradation.
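
A minimal sketch of the kernel trick, assuming the elu(x)+1 feature map used by the Linear Transformer: rewriting softmax(QK^T)V as phi(Q)(phi(K)^T V) lets the key-value summary be computed once, so cost scales as O(n d^2) instead of materializing an n-by-n attention matrix.

```python
# Minimal sketch of kernelized linear attention (Linear Transformer-style).
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1, a positive feature map

def linear_attention(Q, K, V, eps=1e-6):
    Qp, Kp = phi(Q), phi(K)                      # (n, d)
    kv = Kp.T @ V                                # (d, d_v): summarizes all keys/values
    z = Qp @ Kp.sum(axis=0)                      # (n,): per-query normalizer
    return (Qp @ kv) / (z[:, None] + eps)

n, d = 2048, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = linear_attention(Q, K, V)                  # (n, d), no n x n matrix formed
print(out.shape)
```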

Distributed Acceleration

The paper highlights advancements in sequence parallelism (SP) for distributed inference of LLMs. Noteworthy efforts include Blockwise Parallel Transformer, Ring Attention, and Elastic Sequence Parallelism, which distribute the workload across computational resources efficiently, thus optimizing performance for long-context inference tasks.
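
The common building block behind these systems is blockwise attention with an online softmax: key/value blocks are visited one at a time (locally, or streamed between devices in Ring Attention) while a running maximum and normalizer keep the result exact. The single-device numpy sketch below illustrates only this accumulation step, not the distributed communication schedule.

```python
# Minimal sketch of exact blockwise attention with an online softmax, the
# numerical core behind Blockwise Parallel Transformer / Ring Attention.
import numpy as np

def blockwise_attention(q, K, V, block=256):
    """Exact attention for a query block, with K and V processed in chunks."""
    d = q.shape[-1]
    acc = np.zeros((q.shape[0], V.shape[-1]))    # weighted-value accumulator
    norm = np.zeros(q.shape[0])                  # running softmax denominator
    running_max = np.full(q.shape[0], -np.inf)   # running max for stability

    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        scores = q @ k_blk.T / np.sqrt(d)        # (q_len, block)
        new_max = np.maximum(running_max, scores.max(axis=-1))
        scale = np.exp(running_max - new_max)    # rescale earlier contributions
        p = np.exp(scores - new_max[:, None])
        acc = acc * scale[:, None] + p @ v_blk
        norm = norm * scale + p.sum(axis=-1)
        running_max = new_max

    return acc / norm[:, None]

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=(8, 64)), rng.normal(size=(4096, 64)), rng.normal(size=(4096, 64))

# Check against ordinary full-matrix softmax attention.
scores = q @ K.T / np.sqrt(64)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
print(np.allclose(blockwise_attention(q, K, V), weights @ V))
```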

Improving Context and Performance (CP)

Prompt compression techniques like LLMLingua and Gist tokens enhance both context length and performance by condensing input sequences without significant loss of information. This line of work demonstrates that efficient prompt management can improve both scalability and responsiveness of LLM serving systems.
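
A hedged sketch of the filtering flavor of prompt compression, in the spirit of Selective Context and LLMLingua: score each token by its surprisal under a language model and drop the least informative ones. A unigram model fitted on the prompt itself stands in for the small LM these systems actually use.

```python
# Minimal sketch of self-information-based prompt compression. The unigram
# "language model" below is a placeholder for a real small LM scorer.
import math
from collections import Counter

def compress_prompt(prompt: str, keep_ratio: float = 0.6) -> str:
    tokens = prompt.split()
    counts = Counter(tokens)
    total = sum(counts.values())
    # Surprisal under the toy unigram model: rare tokens carry more information.
    surprisal = [-math.log(counts[t] / total) for t in tokens]
    k = max(1, int(len(tokens) * keep_ratio))
    threshold = sorted(surprisal, reverse=True)[k - 1]
    kept = [t for t, s in zip(tokens, surprisal) if s >= threshold]
    return " ".join(kept)

prompt = ("the report discusses the serving of long context models and the "
          "trade offs between the context length the accuracy and the performance")
print(compress_prompt(prompt, keep_ratio=0.5))
```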

Improving Context and Accuracy (CA)

Agent memory systems extend serving context implicitly by dynamically managing memory and prompts within agents. Methods like MemWalker and ChatDev employ online memory management and offline reflection, creating an illusion of infinite context and enhancing task-specific accuracy over time.
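
A minimal sketch of the memory-management side of such agents, loosely modeled on MemGPT: when the in-context history exceeds a token budget, older turns are evicted to an external archive and replaced by a rolling summary. The word-count tokenizer and first-sentence summarizer are placeholders for a real tokenizer and an LLM summarization call.

```python
# Minimal sketch of an agent memory manager with eviction plus summarization.
class AgentMemory:
    def __init__(self, budget_tokens: int = 60):
        self.budget = budget_tokens
        self.summary = ""        # rolling summary of evicted turns
        self.context = []        # recent turns kept verbatim in the prompt
        self.archive = []        # full evicted turns, retrievable on demand

    def _tokens(self, text: str) -> int:
        return len(text.split())                 # placeholder tokenizer

    def _summarize(self, text: str) -> str:
        return text.split(".")[0] + "."          # placeholder for an LLM call

    def add_turn(self, turn: str) -> None:
        self.context.append(turn)
        # Evict oldest turns until the verbatim context fits the budget again.
        while sum(map(self._tokens, self.context)) > self.budget and len(self.context) > 1:
            evicted = self.context.pop(0)
            self.archive.append(evicted)
            self.summary = (self.summary + " " + self._summarize(evicted)).strip()

    def prompt(self) -> str:
        header = f"[earlier conversation, summarized] {self.summary}\n" if self.summary else ""
        return header + "\n".join(self.context)

mem = AgentMemory(budget_tokens=20)
for i in range(6):
    mem.add_turn(f"Turn {i}: the user asked about item {i}. More detail follows here.")
print(mem.prompt())
```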

Conclusion

The CAP principle for LLM serving encapsulates the fundamental trade-offs inherent in deploying large-scale AI models. By laying out the balance between context length, accuracy, and performance, the survey offers a structured way to understand and navigate the challenges of LLM serving. As both models and hardware evolve, future innovations may eventually satisfy all three goals at once through synergistic advances in model architectures and computational platforms.
