The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving (2405.11299v2)
Abstract: We survey the LLM serving area to understand the intricate dynamics between cost-efficiency and accuracy, which are magnified by the growing need for longer contextual understanding when deploying models at massive scale. Our findings reveal that works in this space optimize along three distinct but conflicting goals: improving serving context length (C), improving serving accuracy (A), and improving serving performance (P). Drawing inspiration from the CAP theorem in databases, we propose a CAP principle for LLM serving, which suggests that any optimization can improve at most two of these three goals simultaneously. Our survey categorizes existing works within this framework. We find that the definition and continuity of user-perceived measurement metrics are crucial in determining whether a goal has been met, much as they were for CAP databases deployed in the wild. We regard the CAP principle for LLM serving as a guiding principle, rather than a formal theorem, to inform designers of the inherent and dynamic trade-offs in serving models. As serving accuracy and performance have been extensively studied, this survey focuses on works that extend serving context length and address the resulting challenges.
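To make the framing concrete, the sketch below (not taken from the paper; the goal flags, example technique names, and their tags are illustrative assumptions) shows one way to record which of the three goals (C, A, P) a serving optimization targets and to flag any claim of improving all three at once.

```python
# Hypothetical sketch of the C/A/P framing from the abstract: tag each serving
# optimization with the goals it improves, and flag any claim of improving all
# three simultaneously, which the CAP principle for LLM serving argues against.
# Technique names and goal assignments are illustrative assumptions,
# not the paper's actual categorization.
from enum import Flag, auto

class Goal(Flag):
    C = auto()  # serving context length
    A = auto()  # serving accuracy
    P = auto()  # serving performance (throughput, latency, cost-efficiency)

# Assumed tags for a few well-known technique families, for demonstration only.
optimizations = {
    "KV-cache eviction / token dropping":  Goal.C | Goal.P,  # longer contexts, less memory; may cost accuracy
    "prompt compression":                  Goal.C | Goal.P,
    "exact IO-aware attention kernels":    Goal.A | Goal.P,  # exact math keeps accuracy intact, kernels run faster
    "positional interpolation + finetune": Goal.C | Goal.A,  # extends context while preserving quality, extra compute
}

def claims_all_three(goals: Goal) -> bool:
    """True if an optimization is tagged as improving C, A, and P at once."""
    return goals == (Goal.C | Goal.A | Goal.P)

if __name__ == "__main__":
    for name, goals in optimizations.items():
        tags = [g.name for g in (Goal.C, Goal.A, Goal.P) if g & goals]
        note = "  <-- check against the CAP principle" if claims_all_three(goals) else ""
        print(f"{name:37s} improves {', '.join(tags)}{note}")
```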
- Levels of agi: Operationalizing progress on the path to agi. arXiv preprint arXiv:2311.02462, 2023.
- The AI Index Report. https://aiindex.stanford.edu/report/.
- A survey of resource-efficient llm and multimodal foundation models. arXiv preprint arXiv:2401.08092, 2024.
- A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294, 2024.
- Capabilities of gemini models in medicine. arXiv preprint arXiv:2404.18416, 2024.
- Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5, 2023.
- Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.
- Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
- TensorRT LLM. https://github.com/NVIDIA/TensorRT-LLM.
- Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
- Atom: Low-bit quantization for efficient and accurate llm serving. arXiv preprint arXiv:2310.19102, 2023.
- Efficient streaming language models with attention sinks, 2023.
- Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022.
- Inference without interference: Disaggregate llm inference for mixed downstream workloads. arXiv preprint arXiv:2401.11181, 2024.
- The shift from models to compound ai systems. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/, 2024.
- Roformer: Enhanced transformer with rotary position embedding. CoRR, abs/2104.09864, 2021.
- A survey on model compression for large language models. arXiv preprint arXiv:2308.07633, 2023.
- Wikipedia. The cap theorem. https://en.wikipedia.org/wiki/CAP_theorem, 2024.
- Understanding emergent abilities of language models from the loss perspective. arXiv preprint arXiv:2403.15796, 2024.
- Are emergent abilities of large language models a mirage? Advances in Neural Information Processing Systems, 36, 2024.
- Google. Spanner, truetime & the cap theorem. https://storage.googleapis.com/gweb-research2023-media/pubtools/pdf/45855.pdf, 2017.
- The what, why, and how of context length extension techniques in large language models–a detailed survey. arXiv preprint arXiv:2401.07872, 2024.
- A survey on long text modeling with transformers. arXiv preprint arXiv:2302.14502, 2023.
- Beyond the limits: A survey of techniques to extend the context length in large language models. arXiv preprint arXiv:2402.02244, 2024.
- The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864, 2023.
- Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023.
- Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.
- Memorizing transformers. arXiv preprint arXiv:2203.08913, 2022.
- Memformer: A memory-augmented transformer for sequence modeling. arXiv preprint arXiv:2010.06891, 2020.
- Memory transformer. arXiv preprint arXiv:2006.11527, 2020.
- Recurrent memory transformer. Advances in Neural Information Processing Systems, 35:11079–11091, 2022.
- Adapting language models to compress contexts. arXiv preprint arXiv:2305.14788, 2023.
- Leave no context behind: Efficient infinite context transformers with infini-attention. arXiv preprint arXiv:2404.07143, 2024.
- LegoOS: A disseminated, distributed OS for hardware resource disaggregation. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 69–87, Carlsbad, CA, October 2018. USENIX Association.
- The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
- Train short, test long: Attention with linear biases enables input length extrapolation, 2022.
- A length-extrapolatable transformer, 2022.
- Clex: Continuous length extrapolation for large language models, 2024.
- Extending context window of large language models via positional interpolation, 2023.
- bloc97. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation, 2023. https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/, Last accessed on 2023-12-19.
- Yarn: Efficient context window extension of large language models, 2023.
- Functional interpolation for relative positions improves long context transformers, 2024.
- Longrope: Extending llm context window beyond 2 million tokens, 2024.
- Pose: Efficient context window extension of llms via positional skip-wise training, 2024.
- Attention sorting combats recency bias in long context language models. arXiv preprint arXiv:2310.01427, 2023.
- Fortify the shortest stave in attention: Enhancing context awareness of large language models for effective tool use. arXiv preprint arXiv:2312.04455, 2023.
- Found in the middle: How language models use long contexts better via plug-and-play positional encoding. arXiv preprint arXiv:2403.04797, 2024.
- Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
- Efficiently programming large language models using sglang. arXiv preprint arXiv:2312.07104, 2023.
- Hardware-software co-design enabling static and dynamic sparse attention mechanisms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pages 1–1, 2024.
- Adaptively sparse transformers, 2019.
- Sparse sinkhorn attention, 2020.
- Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021.
- Reformer: The efficient transformer, 2020.
- Landmark attention: Random-access infinite context length for transformers, 2023.
- A3: Accelerating attention mechanisms in neural networks with approximation, 2020.
- Spatten: Efficient sparse attention architecture with cascade token and head pruning, 2021.
- Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021.
- Dota: detect and omit weak attentions for scalable transformer acceleration. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2022.
- Acceltran: A sparsity-aware accelerator for dynamic inference with transformers, 2023.
- Fact: Ffn-attention co-optimized transformer architecture with eager correlation prediction. Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023.
- Energon: Toward efficient acceleration of transformers using dynamic sparse attention. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 42(1):136–149, 2023.
- Dtqatten: Leveraging dynamic token-based quantization for efficient attention architecture. In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 700–705, 2022.
- Blockwise self-attention for long document understanding. ArXiv, abs/1911.02972, 2019.
- Generating long sequences with sparse transformers, 2019.
- Longformer: The long-document transformer, 2020.
- Big bird: Transformers for longer sequences. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17283–17297. Curran Associates, Inc., 2020.
- Star-transformer, 2022.
- LongT5: Efficient text-to-text transformer for long sequences. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz, editors, Findings of the Association for Computational Linguistics: NAACL 2022, pages 724–736, Seattle, United States, July 2022. Association for Computational Linguistics.
- Longnet: Scaling transformers to 1,000,000,000 tokens, 2023.
- Zebra: Extending context window with layerwise grouped local-global attention, 2023.
- Vitcod: Vision transformer acceleration via dedicated algorithm and accelerator co-design, 2022.
- Salo: An efficient spatial accelerator enabling hybrid sparse attention mechanisms for long sequences, 2022.
- Lm-infinite: Zero-shot extreme length generalization for large language models, 2023.
- Model tells you what to discard: Adaptive kv cache compression for llms, 2024.
- H2o: Heavy-hitter oracle for efficient generative inference of large language models, 2023.
- Keyformer: Kv cache reduction through key tokens selection for efficient generative inference, 2024.
- Sparq attention: Bandwidth-efficient llm inference, 2024.
- On the efficacy of eviction policy for key-value constrained generative language model inference, 2024.
- Get more with less: Synthesizing recurrence with kv cache compression for efficient llm inference, 2024.
- Infllm: Unveiling the intrinsic capacity of llms for understanding extremely long sequences with training-free memory, 2024.
- ETC: encoding long and structured inputs in transformers. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 268–284. Association for Computational Linguistics, 2020.
- Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020.
- Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
- Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 3531–3539, 2021.
- Scatterbrain: Unifying sparse and low-rank attention. Advances in Neural Information Processing Systems, 34:17413–17426, 2021.
- Vitality: Unifying low-rank and sparse approximation for vision transformer acceleration with a linear taylor attention. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 415–428. IEEE, 2023.
- Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018.
- Self-attention does not need $O(n^2)$ memory. arXiv preprint arXiv:2112.05682, 2021.
- Blockwise parallel transformers for large context models. Advances in Neural Information Processing Systems, 36, 2024.
- Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889, 2023.
- Burstattention: An efficient distributed attention framework for extremely long sequences. arXiv preprint arXiv:2403.09347, 2024.
- Striped attention: Faster ring attention for causal transformers. arXiv preprint arXiv:2311.09431, 2023.
- Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache. arXiv preprint arXiv:2401.02669, 2024.
- Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism. arXiv preprint arXiv:2404.09526, 2024.
- Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
- Sequence parallelism: Long sequence training from system perspective. arXiv preprint arXiv:2105.13120, 2021.
- Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023.
- Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023.
- Yucheng Li. Unlocking context constraints of llms: Enhancing context efficiency of llms with self-information-based content filtering. CoRR, abs/2304.12102, 2023.
- Llmlingua: Compressing prompts for accelerated inference of large language models. arXiv preprint arXiv:2310.05736, 2023.
- Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. arXiv preprint arXiv:2310.06839, 2023.
- Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. arXiv preprint arXiv:2403.12968, 2024.
- Learning to compress prompts with gist tokens. Advances in Neural Information Processing Systems, 36, 2024.
- Prompt Compression and Contrastive Conditioning for Controllability and Toxicity Reduction in Language Models. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5621–5634, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics.
- In-context autoencoder for context compression in a large language model. CoRR, abs/2307.06945, 2023.
- Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560, 2023.
- Walking down the memory maze: Beyond context limit through interactive reading. arXiv preprint arXiv:2310.05029, 2023.
- Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
- Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.
- Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023.
- Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
- Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36, 2024.
- Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
- Mlcopilot: Unleashing the power of large language models in solving machine learning tasks. arXiv preprint arXiv:2304.14979, 2023.