SGLang: Efficient Execution of Structured Language Model Programs (2312.07104v2)

Published 12 Dec 2023 in cs.AI and cs.PL

Abstract: LLMs are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems are lacking for programming and executing these applications. We introduce SGLang, a system for efficient execution of complex LLM programs. SGLang consists of a frontend language and a runtime. The frontend simplifies programming with primitives for generation and parallelism control. The runtime accelerates execution with novel optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. Experiments show that SGLang achieves up to 6.4x higher throughput compared to state-of-the-art inference systems on various large language and multi-modal models on tasks including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat. The code is publicly available at https://github.com/sgl-project/sglang

Authors (12)
  1. Lianmin Zheng (34 papers)
  2. Liangsheng Yin (2 papers)
  3. Zhiqiang Xie (15 papers)
  4. Jeff Huang (15 papers)
  5. Chuyue Sun (7 papers)
  6. Cody Hao Yu (13 papers)
  7. Shiyi Cao (15 papers)
  8. Christos Kozyrakis (31 papers)
  9. Ion Stoica (177 papers)
  10. Joseph E. Gonzalez (167 papers)
  11. Clark Barrett (86 papers)
  12. Ying Sheng (31 papers)
Citations (41)

Summary

Efficiently Programming LLMs using SGLang

The paper introduces SGLang (Structured Generation Language), a system designed to make programming and executing LLM applications efficient. LLMs are increasingly employed for complex tasks such as multi-round dialogue, reasoning, and agentic interactions that involve intricate control flow and multiple generation calls, yet existing systems handle these applications inefficiently.

Contributions of SGLang

SGLang is a domain-specific language embedded in Python, providing primitives for prompt construction, generation, and parallelism control that streamline LLM programming and compose naturally with Python's native control flow. Its runtime improves execution efficiency through optimizations such as parallelism, batching, caching, and compilation. A minimal example of the programming style is sketched below.
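
The following sketch illustrates how a program might use these primitives, based on the gen and fork operations described in the paper. Names such as sgl.function, sgl.gen, and fork reflect SGLang's documented frontend, but exact signatures may differ across versions, so treat this as an illustrative sketch rather than a definitive usage guide.

```python
import sglang as sgl


# A hedged sketch of an SGLang program: fork() creates parallel branches that
# share the same prompt prefix, so the runtime can reuse their KV cache.
@sgl.function
def tip_suggestion(s, topic):
    s += f"Here are two tips for {topic}.\n"
    forks = s.fork(2)  # two parallel generation branches
    for i, f in enumerate(forks):
        f += f"Tip {i + 1}:"
        f += sgl.gen("tip", max_tokens=64, stop="\n")
    # Join the branch results back into the main prompt state.
    s += "Tip 1:" + forks[0]["tip"] + "\nTip 2:" + forks[1]["tip"] + "\n"
    s += "In summary," + sgl.gen("summary", max_tokens=64)


# Usage (assuming an SGLang backend has been configured):
# state = tip_suggestion.run(topic="staying healthy")
# print(state["summary"])
```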

Central to the paper is RadixAttention, a technique for automatic KV cache reuse across generation calls. RadixAttention stores the KV cache of past requests in a radix tree keyed by token sequences and evicts entries with a least recently used (LRU) policy, so any new request that shares a prefix with a cached one can skip recomputing that prefix. This mechanism is a cornerstone of SGLang's ability to eliminate redundant computation. A simplified sketch of the idea follows.
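
The toy implementation below conveys the core idea under stated simplifications: it uses a per-token trie rather than a compressed radix tree, a placeholder object in place of real KV tensors, and a naive eviction scan. It is not SGLang's actual code.

```python
import time
from typing import Dict, List, Optional


class Node:
    """One cached prefix token; children extend the prefix by one more token."""
    def __init__(self, parent: Optional["Node"] = None):
        self.parent = parent
        self.children: Dict[int, "Node"] = {}  # token id -> child node
        self.kv = None                          # placeholder for cached KV state
        self.last_access = time.monotonic()


class PrefixCache:
    """Toy prefix cache with LRU eviction of leaves (illustrative only)."""
    def __init__(self, capacity: int):
        self.root = Node()
        self.capacity = capacity  # max number of cached token nodes
        self.size = 0

    def match_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens already have cached KV state."""
        node, matched = self.root, 0
        for t in tokens:
            child = node.children.get(t)
            if child is None or child.kv is None:
                break
            child.last_access = time.monotonic()
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: List[int], kv_states: List[object]) -> None:
        """Cache KV state along the token path, evicting LRU leaves when full."""
        node, path = self.root, {id(self.root)}
        for t, kv in zip(tokens, kv_states):
            child = node.children.get(t)
            if child is None:
                self._evict_if_full(protected=path)
                child = Node(parent=node)
                node.children[t] = child
                self.size += 1
            child.kv = kv
            child.last_access = time.monotonic()
            node = child
            path.add(id(child))

    def _evict_if_full(self, protected: set) -> None:
        while self.size >= self.capacity:
            leaves = [n for n in self._nodes(self.root)
                      if not n.children and n is not self.root and id(n) not in protected]
            if not leaves:
                return  # nothing evictable right now; tolerate overflow in this sketch
            victim = min(leaves, key=lambda n: n.last_access)  # least recently used
            for tok, ch in list(victim.parent.children.items()):
                if ch is victim:
                    del victim.parent.children[tok]
                    break
            self.size -= 1

    def _nodes(self, node: "Node"):
        yield node
        for ch in node.children.values():
            yield from self._nodes(ch)
```

In this sketch, a new request first calls match_prefix to find how many leading tokens already have cached state, runs the model only on the remaining tokens, and then calls insert to make its own prefix reusable by later requests.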

Experimental Results

The experiments demonstrate SGLang's efficacy, reporting up to 6.4x higher throughput than state-of-the-art inference systems, alongside reduced code complexity. Workloads that benefit include agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat. These improvements underline SGLang's capacity to enhance both performance and usability. A sketch of the regex-constrained generation used for structured outputs such as JSON appears below.
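
The sketch below shows how structured (JSON) output might be requested through a regex constraint on gen, the interface the paper pairs with its compressed finite state machine decoding. The regex pattern and function names here are illustrative, and exact parameter names may vary across SGLang versions.

```python
import sglang as sgl

# Illustrative regex for a small JSON object with fixed keys; the runtime
# constrains decoding so the generated text always matches this pattern.
CITY_REGEX = (
    r'\{\n'
    r'  "name": "[\w\s]+",\n'
    r'  "population": [0-9]+\n'
    r'\}'
)


@sgl.function
def city_info(s, city):
    s += f"Describe {city} in JSON format.\n"
    s += sgl.gen("json_output", max_tokens=128, regex=CITY_REGEX)


# Usage (assuming a configured backend):
# state = city_info.run(city="London")
# print(state["json_output"])
```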

Compared with other systems, SGLang demonstrates significant advantages. Frontend languages such as LMQL and Guidance also support LLM programming, but they lack SGLang's co-designed runtime optimizations and therefore run less efficiently. Inference engines such as vLLM perform well on individual generation calls but do not exploit program-level structure, such as shared prefixes across calls, which SGLang uses for broader optimizations.

Implications and Future Directions

The proposed framework holds both practical and theoretical implications. Practically, SGLang can significantly benefit industries leveraging LLMs by simplifying prompt management and execution processes, thus reducing costs and increasing throughput. Theoretically, it opens pathways for further exploration into the co-design of programmatic languages and runtime environments specifically tailored for machine learning models.

Future developments may focus on expanding the capabilities of SGLang to support other modalities and more complex control flows. Additionally, exploring deeper integration with existing AI systems and benchmarks could further expand its utility.

Conclusion

SGLang emerges as a robust solution to the inherent inefficiencies in current LLM programming paradigms. By co-designing the language and runtime, this work establishes a comprehensive framework that both enhances performance and eases the development process of sophisticated LLM applications. Through its innovative design, SGLang represents an important stride in efficiently harnessing the capabilities of LLMs.
