
SGLang Runtime: Scalable LLM Inference

Updated 12 December 2025
  • SGLang Runtime is an execution environment and optimization framework that integrates a front-end DSL, graph-tracing compiler, and SGVM runtime for scalable LLM inference.
  • It features innovative cache strategies like RadixAttention and disk-based LSM-tree storage, ensuring efficient key-value reuse and resource-managed batching.
  • Adaptive control logic and parallel prompt primitives drive throughput gains, reduced first-token latency, and optimal resource allocation for multi-turn and retrieval-augmented tasks.

SGLang Runtime is an execution environment and optimization framework designed to support the efficient operation of structured LLM programs, emphasizing scalable key-value (KV) cache management, seamless parallelism, prompt program abstraction, and hardware-efficient inference. SGLang runtime integrates a front-end embedded domain-specific language (DSL), a graph-tracing compiler, and a runtime engine featuring novel cache reuse (RadixAttention), resource-managed batching, and production-grade disk-based LSM-tree KV storage for long-context and multi-process scenarios. The runtime accelerates complex reasoning, agent, and retrieval-augmented systems by maximizing KV reuse, coordinating resource allocation, and dynamically adapting to evolving workload characteristics (Zheng et al., 2023, Yu et al., 20 Nov 2025).

1. Layered System Architecture and Workflow

SGLang runtime incorporates a tripartite architecture: front-end primitives, compiler/tracer, and the SGVM runtime. The interface exposes Python-based prompt programming primitives (gen, select, extend, fork, join, run) to end-users, which are traced during execution to construct a program-specific dataflow graph. The graph executor launches stream executors for each prompt stream, dispatching operations to the underlying SGVM runtime.

The runtime coordinates three major subsystems:

  • Inference Scheduler: Invokes put_batch, probe, get_batch on the KV store for prompt streams.
  • Adaptive Controller: Monitors KV usage and tunes LSM-tree parameters (T: size ratio, K: run limit) using dynamic workload analysis.
  • Prefix-Preserving Storage Engine: Implements an LSM-index for metadata, an append-only tensor-log for payloads, and exposes batch operations and resource-managed job scheduling.

This separation of concerns supports both main-memory radix attention (for hot prefixes) and disk-backed scalable caching (for persistent and high-capacity usage), integrating seamlessly with multi-turn, retrieval, and agent pipelines (Yu et al., 20 Nov 2025, Zheng et al., 2023).
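The storage-facing API shared by these subsystems can be pictured as a small batch interface. Below is a minimal Python sketch of the put_batch, probe, and get_batch operations named above; the type signatures, the use of torch tensors, and the return types are assumptions for illustration, not the runtime's actual definitions.

from typing import Optional, Protocol, Sequence
import torch

class PrefixKVStore(Protocol):
    """Hypothetical batch interface of the prefix-preserving storage engine."""

    def put_batch(self, tokens: Sequence[int], tensors: Sequence[torch.Tensor]) -> None:
        """Append per-token KV tensors to the tensor-log and atomically publish the index update."""
        ...

    def probe(self, token_prefix: Sequence[int]) -> Optional[torch.Tensor]:
        """Search the LSM index for the prefix; return its KV payload if present."""
        ...

    def get_batch(self, token_ids: Sequence[int]) -> list[torch.Tensor]:
        """Range-scan the index, cluster contiguous offsets, and bulk-read the tensors."""
        ...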

2. Programming Primitives and Prompt Graphs

SGLang extends Python with prompt-centric DSL primitives facilitating asynchrony, parallelism, and control flow:

  • gen(prompt): Asynchronous LLM call; scheduled with cache-aware batching to maximize KV prefix reuse.
  • select(prompt, choices): Constrained logit selection.
  • fork(prompt, k): Branches prompt stream k ways for parallel generation.
  • join(prompts, merge_fn): Merges multiple streams into one via a user-supplied merge function.
  • extend/+=, run, decorator: Support looped and conditional execution, embedding prompts as first-class citizens alongside structured I/O (JSON, tables).
  • Graph IR: Trace-based compilation yields an execution graph (nodes: ConstantText, Gen, Select, Fork, Join; edges: data dependencies) guiding resource scheduling, code-movement, and prefetch optimization.

This abstraction enables compositional and parallel program structures (tree-of-thought, retrieval-processing, multi-agent plans) while hiding scheduler and memory details (Zheng et al., 2023).
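As a concrete illustration, a fork/join program in this style might look like the sketch below. It mirrors the public SGLang front-end (sgl.function, gen, fork, run), but the endpoint URL, prompt text, and parameter values are illustrative assumptions, and exact signatures may differ across versions.

import sglang as sgl

# Assumes an SGLang server is already running; the endpoint URL is illustrative.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def two_branch_answer(s, question):
    s += "Question: " + question + "\n"
    forks = s.fork(2)                          # branch the prompt stream two ways
    for i, f in enumerate(forks):
        f += f"Draft answer {i + 1}: "
        f += sgl.gen("draft", max_tokens=96)   # branches decode in parallel
    # Merge: feed both drafts back into the parent stream for a final pass.
    s += "Draft A: " + forks[0]["draft"] + "\n"
    s += "Draft B: " + forks[1]["draft"] + "\n"
    s += "Final answer: " + sgl.gen("final", max_tokens=96)

state = two_branch_answer.run(question="Why does prefix sharing reduce prefill cost?")
print(state["final"])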

3. RadixAttention and KV-Cache Reuse

RadixAttention is the central cache optimization within SGLang runtime (Zheng et al., 2023). It organizes GPU KV-cache pages within a radix tree indexed by token sequences:

  • Radix Tree (CPU): Maps prefix substrings to GPU memory pool page pointers; maintains LRU timestamp and active reference count.
  • GPU Memory Pool: Stores KV arrays per token, allowing cache reuse for shared prefixes.
  • Process Algorithm: For each prompt request, matches the longest prefix in the radix tree; reuses pages for matched tokens; allocates new pages for unmatched suffix; evicts least-used leaves when space is needed.

Pseudocode:

def process_request(x):                        # x: list of N token ids
    node, l = T.match_prefix(x)                # longest cached prefix, O(N) token comparisons
    pages = node.kv_pages[:l]                  # reuse KV pages of the matched prefix
    new_tokens = x[l:]                         # uncached suffix still to be prefilled
    allocated = P.alloc(len(new_tokens))       # P: GPU memory pool
    if not allocated:
        T.evict_pages(len(new_tokens))         # evict least-recently-used leaves
        allocated = P.alloc(len(new_tokens))
    kv_new = extend_kernel(pages, new_tokens)  # prefill only the uncached suffix
    for i, tok in enumerate(new_tokens):       # insert new tokens under the matched node
        node = T.insert_child(node, tok)
        node.kv_page = kv_new[i]
    return decode_output(kv_new)
Computational efficiency is O(N) for the prefix match, O(1) for each page lookup, and memory cost is O(M × d) for M cached tokens and hidden dimension d. RadixAttention minimizes re-prefill compute: only the tokens beyond the shared prefix are forwarded to the GPU, yielding 4×–5.6× speedups on reasoning and agent benchmarks (Zheng et al., 2023).
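The radix-tree operations that the pseudocode relies on (T.match_prefix, T.insert_child) can be approximated with a token-keyed trie. The sketch below is a deliberate simplification for illustration, using one node per token rather than a compressed radix layout and omitting the eviction path; the node fields (children map, kv_page handle, LRU timestamp, reference count) follow the description above.

import time

class RadixNode:
    """One node per cached token; kv_page points into the GPU memory pool."""
    def __init__(self, kv_page=None):
        self.children = {}                      # token id -> RadixNode
        self.kv_page = kv_page                  # handle to this token's KV page
        self.last_access = time.monotonic()     # LRU timestamp for eviction
        self.ref_count = 0                      # requests currently pinning this node

class RadixTree:
    """Token-keyed trie standing in for the compressed radix tree."""
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        """Walk down along tokens; return the deepest matched node and the match length."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            child.last_access = time.monotonic()
            node, matched = child, matched + 1
        return node, matched

    def insert_child(self, node, tok):
        """Create (or reuse) the child for tok and refresh its LRU timestamp."""
        child = node.children.setdefault(tok, RadixNode())
        child.last_access = time.monotonic()
        return child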

4. Scalable Disk-Based KV Cache: LSMTree Storage Engine

To support large-context and persistent caching, SGLang runtime incorporates SGLANG-LSM (Yu et al., 20 Nov 2025), a database-inspired prefix-preserving KV cache:

  • LSM-Tree Index: Stores prefix-encoded token keys and metadata; sorted runs at each level merged via compaction; supports range scans for longest shared prefix detection.
  • Tensor-Log: Large KV tensors stored in append-only disk files; no data rewrite during compaction.
  • Batch Operations:
    • put_batch(tokens[1…m], tensors[1…m]): Appends tensors, generates WriteBatch, atomically updates LSM.
    • probe(token_prefix): Binary search in LSM for prefix; returns payload if found.
    • get_batch(token_ids[1…k]): Range scan, cluster contiguous offsets, then bulk read tensors.

Complexity is O(m·log N + size_bytes/B) for a batch write and O(log N + k + size_read/B) for a batch read. The prefix-preserving key layout ensures tokens sharing prefixes are physically co-located, improving scan efficiency and probe hit-rate. The empirical hit-rate model is HitRate ≈ 1 − exp(−λS) (Yu et al., 20 Nov 2025).
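A minimal sketch of the prefix-preserving key layout and the hit-rate model follows. The fixed-width big-endian encoding, and the reading of λ as a workload-dependent rate and S as the cache size, are assumptions for illustration rather than the engine's actual on-disk format.

import math
import struct

def encode_prefix_key(token_ids):
    """Encode a token sequence as fixed-width big-endian bytes so that sequences sharing
    a prefix sort adjacently in the LSM index (prefix-preserving layout)."""
    return b"".join(struct.pack(">I", t) for t in token_ids)

def expected_hit_rate(rate, cache_size):
    """Empirical model from the text: HitRate ≈ 1 - exp(-λ·S)."""
    return 1.0 - math.exp(-rate * cache_size)

# Any extension of a prefix extends its byte encoding, so a range scan starting at the
# prefix key visits every cached continuation.
assert encode_prefix_key([7, 12, 3]).startswith(encode_prefix_key([7, 12]))
print(f"{expected_hit_rate(0.1, 20):.3f}")   # illustrative λ and S values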

5. Adaptive Control Logic and Resource Management

The adaptive controller within SGLang-LSM dynamically tunes the LSM-tree’s architectural parameters:

  • Counters: Monitors writes (W), hits (Q), reads (R), and misses (Z) over a sliding window.
  • Objective: C(T, K) = w·W(T, K) + s·S(T, K) + r·R(T, K) + z·Z(T, K), subject to T ≥ 2 and 1 ≤ K < T.
  • Dynamic Reconfiguration: Periodically solves for the optimal (T*, K*) minimizing this cost; changes are applied lazily at flush or compaction boundaries to avoid a full-tree rewrite.
  • Runtime Services:
    • Batch Codec/Compression: Compresses tensor batches for disk writes.
    • Auto Tensor-File Merge: Merges small files in the background when the threshold F_max is exceeded.
    • Job Scheduler: Dispatches compaction and merge jobs based on CPU/I/O availability.

Concurrency is safeguarded via atomic two-phase commit; crash recovery is handled via metadata-payload decoupling, ensuring unreachable payloads are garbage-collected (Yu et al., 20 Nov 2025).
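A brute-force rendering of the controller's periodic tuning step is sketched below; the actual controller may solve the objective analytically or prune the search, and the cost_terms callable (returning modeled W, S, R, Z values for a candidate configuration) is an assumed stand-in for the LSM cost model.

def tune_lsm_parameters(weights, cost_terms, t_max=16):
    """Minimize C(T, K) = w*W(T, K) + s*S(T, K) + r*R(T, K) + z*Z(T, K) over the
    feasible region T >= 2, 1 <= K < T by exhaustive search."""
    w, s, r, z = weights
    best_cost, best_tk = float("inf"), (2, 1)
    for T in range(2, t_max + 1):
        for K in range(1, T):
            W, S, R, Z = cost_terms(T, K)          # modeled costs for this configuration
            cost = w * W + s * S + r * R + z * Z
            if cost < best_cost:
                best_cost, best_tk = cost, (T, K)
    return best_tk                                 # applied lazily at the next flush/compaction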

6. Integration, Batching, and System-Level Optimizations

SGLang runtime is designed for integration with diverse LLM inference pipelines:

  • Multi-turn/Chat: RadixAttention matches chat history prefixes for cache efficiency.
  • Retrieval-Augmented Generation: Batch-prefill for document segments; join KV trees for parallel contexts.
  • Few-shot/Reasoning: Fork/join primitives enable both prompt-level and agent-level parallelization.
  • System-Level Optimizations:
    • Graph Executor: Parallel stream execution with dependency-aware scheduling.
    • Custom Kernels: Prefill, decode, and “extend” kernels efficiently operate on non-contiguous KV allocations using hardware-specific (Triton/CUDA) routines.

On hardware, SGLang’s memory pool supports hot prefix prefetch (CPU→GPU), with negligible (<1%) overhead for radix tree maintenance. Code-movement and prefetch insertion in the compiler can reduce first-token latency by up to 80% (Zheng et al., 2023).
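Cache-aware scheduling can be illustrated with a short ordering routine: requests whose inputs share the longest already-cached prefix are prefilled first, keeping hot prefixes resident. This is a simplified sketch that reuses the RadixTree sketch from Section 3 and assumes a request.input_tokens attribute; it ignores batching limits and fairness, which a real scheduler must also handle.

def cache_aware_order(waiting_requests, radix_tree):
    """Sort waiting requests so that those with the longest already-cached prefix are
    prefilled first, maximizing KV reuse within a batch."""
    def matched_len(request):
        _, length = radix_tree.match_prefix(request.input_tokens)
        return length
    return sorted(waiting_requests, key=matched_len, reverse=True)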

7. Performance and Empirical Evaluation

Empirical evaluation demonstrates substantial runtime improvements:

Configuration     TTFT (4k prompt)   Hit rate (4k)   TTFT (16k prompt)   Hit rate (16k)
SGLang (memory)   0.45 s             32.1 %          1.12 s              15.2 %
SGLang (file)     0.51 s             18.7 %          2.35 s              33.5 %
SGLANG-LSM        0.41 s             45.4 %          1.78 s              81.6 %

SGLANG-LSM achieves a 143% relative improvement in cache hit-rate (33.5% → 81.6%) and a 24% reduction in TTFT (2.35 s → 1.78 s) on 16k-token prompts versus state-of-the-art file-based systems. For agent and multi-chain reasoning tasks, SGLang runtime yields up to 5.6× throughput gains and an 80% first-token latency reduction through cache-aware scheduling and parallel graph execution (Yu et al., 20 Nov 2025, Zheng et al., 2023).
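The relative figures follow directly from the table above; as a quick arithmetic check:

hit_rate_gain = (81.6 - 33.5) / 33.5 * 100    # ≈ 143.6% relative hit-rate improvement at 16k
ttft_reduction = (2.35 - 1.78) / 2.35 * 100   # ≈ 24.3% TTFT reduction at 16k
print(f"{hit_rate_gain:.1f}%  {ttft_reduction:.1f}%")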

8. Context and Significance

SGLang runtime and its LSM-powered extension represent the first systematic application of database architectures (LSM trees, key-value separation, adaptive control) to large-scale LLM KV cache management (Yu et al., 20 Nov 2025). The co-design philosophy encompassing language abstractions, graph compilation, radix-based cache strategies, and disk-based, adaptively tuned storage positions SGLang as an enabling technology for high-throughput, multi-process, and long-context LLM applications across agent, retrieval, and reasoning domains (Zheng et al., 2023). A plausible implication is that hybrid in-memory/disk KV caching architectures will be increasingly critical for scaling LLM inference in low-latency and high-concurrency settings.
