SGLang: Efficient Execution of Structured Language Model Programs (2312.07104v2)

Published 12 Dec 2023 in cs.AI and cs.PL

Abstract: LLMs are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems are lacking for programming and executing these applications. We introduce SGLang, a system for efficient execution of complex LLM programs. SGLang consists of a frontend language and a runtime. The frontend simplifies programming with primitives for generation and parallelism control. The runtime accelerates execution with novel optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. Experiments show that SGLang achieves up to 6.4x higher throughput compared to state-of-the-art inference systems on various large language and multi-modal models on tasks including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat. The code is publicly available at https://github.com/sgl-project/sglang

Authors (12)
  1. Lianmin Zheng (34 papers)
  2. Liangsheng Yin (2 papers)
  3. Zhiqiang Xie (15 papers)
  4. Jeff Huang (15 papers)
  5. Chuyue Sun (7 papers)
  6. Cody Hao Yu (13 papers)
  7. Shiyi Cao (15 papers)
  8. Christos Kozyrakis (31 papers)
  9. Ion Stoica (177 papers)
  10. Joseph E. Gonzalez (167 papers)
  11. Clark Barrett (86 papers)
  12. Ying Sheng (31 papers)
Citations (41)

Summary

Efficiently Programming LLMs using SGLang

The paper introduces SGLang (Structured Generation Language), a system designed to make programming and executing LLM applications efficient. LLMs are increasingly employed for complex tasks such as multi-round dialogue, reasoning, and agentic interactions that involve intricate control flow and multiple generation calls, yet existing systems handle these applications inefficiently.

Contributions of SGLang

SGLang is a domain-specific language embedded in Python, providing primitives for prompt construction, generation, and parallelism control that streamline LLM programming and compose naturally with Python's native control flow. Its runtime improves execution efficiency through optimizations such as parallelism, batching, caching, and compilation. A minimal example of the programming style is sketched below.
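
The following sketch illustrates how a program might use these primitives, based on the gen and fork operations described in the paper. Names such as sgl.function, sgl.gen, and fork reflect SGLang's documented frontend, but exact signatures may differ across versions, so treat this as an illustrative sketch rather than a definitive usage guide.

```python
import sglang as sgl


# A hedged sketch of an SGLang program: fork() creates parallel branches that
# share the same prompt prefix, so the runtime can reuse their KV cache.
@sgl.function
def tip_suggestion(s, topic):
    s += f"Here are two tips for {topic}.\n"
    forks = s.fork(2)  # two parallel generation branches
    for i, f in enumerate(forks):
        f += f"Tip {i + 1}:"
        f += sgl.gen("tip", max_tokens=64, stop="\n")
    # Join the branch results back into the main prompt state.
    s += "Tip 1:" + forks[0]["tip"] + "\nTip 2:" + forks[1]["tip"] + "\n"
    s += "In summary," + sgl.gen("summary", max_tokens=64)


# Usage (assuming an SGLang backend has been configured):
# state = tip_suggestion.run(topic="staying healthy")
# print(state["summary"])
```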

Central to the paper is RadixAttention, a technique for automatic KV cache reuse across generation calls. RadixAttention stores the KV cache of past requests in a radix tree keyed by token sequences and evicts entries with a least recently used (LRU) policy, so any new request that shares a prefix with a cached one can skip recomputing that prefix. This mechanism is a cornerstone of SGLang's ability to eliminate redundant computation. A simplified sketch of the idea follows.
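
The toy implementation below conveys the core idea under stated simplifications: it uses a per-token trie rather than a compressed radix tree, a placeholder object in place of real KV tensors, and a naive eviction scan. It is not SGLang's actual code.

```python
import time
from typing import Dict, List, Optional


class Node:
    """One cached prefix token; children extend the prefix by one more token."""
    def __init__(self, parent: Optional["Node"] = None):
        self.parent = parent
        self.children: Dict[int, "Node"] = {}  # token id -> child node
        self.kv = None                          # placeholder for cached KV state
        self.last_access = time.monotonic()


class PrefixCache:
    """Toy prefix cache with LRU eviction of leaves (illustrative only)."""
    def __init__(self, capacity: int):
        self.root = Node()
        self.capacity = capacity  # max number of cached token nodes
        self.size = 0

    def match_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens already have cached KV state."""
        node, matched = self.root, 0
        for t in tokens:
            child = node.children.get(t)
            if child is None or child.kv is None:
                break
            child.last_access = time.monotonic()
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: List[int], kv_states: List[object]) -> None:
        """Cache KV state along the token path, evicting LRU leaves when full."""
        node, path = self.root, {id(self.root)}
        for t, kv in zip(tokens, kv_states):
            child = node.children.get(t)
            if child is None:
                self._evict_if_full(protected=path)
                child = Node(parent=node)
                node.children[t] = child
                self.size += 1
            child.kv = kv
            child.last_access = time.monotonic()
            node = child
            path.add(id(child))

    def _evict_if_full(self, protected: set) -> None:
        while self.size >= self.capacity:
            leaves = [n for n in self._nodes(self.root)
                      if not n.children and n is not self.root and id(n) not in protected]
            if not leaves:
                return  # nothing evictable right now; tolerate overflow in this sketch
            victim = min(leaves, key=lambda n: n.last_access)  # least recently used
            for tok, ch in list(victim.parent.children.items()):
                if ch is victim:
                    del victim.parent.children[tok]
                    break
            self.size -= 1

    def _nodes(self, node: "Node"):
        yield node
        for ch in node.children.values():
            yield from self._nodes(ch)
```

In this sketch, a new request first calls match_prefix to find how many leading tokens already have cached state, runs the model only on the remaining tokens, and then calls insert to make its own prefix reusable by later requests.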

Experimental Results

The experiments demonstrate SGLang's efficacy, reporting up to 6.4x higher throughput than state-of-the-art inference systems, alongside reduced code complexity. Workloads that benefit include agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat. These improvements underline SGLang's capacity to enhance both performance and usability. A sketch of the regex-constrained generation used for structured outputs such as JSON appears below.
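
The sketch below shows how structured (JSON) output might be requested through a regex constraint on gen, the interface the paper pairs with its compressed finite state machine decoding. The regex pattern and function names here are illustrative, and exact parameter names may vary across SGLang versions.

```python
import sglang as sgl

# Illustrative regex for a small JSON object with fixed keys; the runtime
# constrains decoding so the generated text always matches this pattern.
CITY_REGEX = (
    r'\{\n'
    r'  "name": "[\w\s]+",\n'
    r'  "population": [0-9]+\n'
    r'\}'
)


@sgl.function
def city_info(s, city):
    s += f"Describe {city} in JSON format.\n"
    s += sgl.gen("json_output", max_tokens=128, regex=CITY_REGEX)


# Usage (assuming a configured backend):
# state = city_info.run(city="London")
# print(state["json_output"])
```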

Compared with other systems, SGLang demonstrates significant advantages. Frontend languages such as LMQL and Guidance also support LLM programming, but they lack SGLang's co-designed runtime optimizations and therefore run less efficiently. Inference engines such as vLLM perform well on individual generation calls but do not exploit program-level structure, such as shared prefixes across calls, which SGLang uses for broader optimizations.

Implications and Future Directions

The proposed framework holds both practical and theoretical implications. Practically, SGLang can significantly benefit industries leveraging LLMs by simplifying prompt management and execution processes, thus reducing costs and increasing throughput. Theoretically, it opens pathways for further exploration into the co-design of programmatic languages and runtime environments specifically tailored for machine learning models.

Future developments may focus on expanding the capabilities of SGLang to support other modalities and more complex control flows. Additionally, exploring deeper integration with existing AI systems and benchmarks could further expand its utility.

Conclusion

SGLang emerges as a robust solution to the inherent inefficiencies in current LLM programming paradigms. By co-designing the language and runtime, this work establishes a comprehensive framework that both enhances performance and eases the development process of sophisticated LLM applications. Through its innovative design, SGLang represents an important stride in efficiently harnessing the capabilities of LLMs.
