
XGrammar 2: High-Performance Grammar Systems

Updated 10 January 2026
  • XGrammar 2 is a framework that combines a dynamic, LLM-focused generation engine with a declarative, streaming XML parser to enhance structured text processing.
  • It leverages advanced techniques like TagDispatch, JIT mask compilation, FSM hashing, and Earley parsing to minimize computational overhead and improve runtime efficiency.
  • Empirical results show 6–10× per-token speedups and over 100× reductions in compilation time, making it highly effective for both LLM inference and XML parsing.

XGrammar 2 refers to two independent systems: (1) a dynamic, high-efficiency structured generation engine (“Contour”) targeting agentic LLMs (Li et al., 7 Jan 2026), and (2) a declarative, streaming grammar language for XML domain-specific languages (Clark, 2015). Both employ advanced formal machinery to optimize for expressivity, performance, and low overhead in structured text generation and parsing.

1. TagDispatch and Dynamic Grammar Dispatch

XGrammar 2 introduces TagDispatch, a dispatch mechanism optimized for dynamic structured generation in LLM-based agents (Li et al., 7 Jan 2026). In agentic tasks (e.g., tool-calling, stepwise conditional reasoning), sequence generation often begins with a dedicated “tag” (e.g., <function=foo>, 〈|channel|〉msg), after which generation must proceed under a specialized context-free grammar (CFG).

TagDispatch Structure

Formally, a TagDispatch instance is a triple:

TD = (T, G, S_stop)

where T is a finite set of tag strings, G = {G_i} is the set of corresponding grammars, and S_stop is a set of stop-strings. Decoding alternates between:

  • Dispatching Mode: Uses an Aho–Corasick automaton to scan for any tag in T amidst generated tokens; only very lightweight token masking is required.
  • Dispatched Mode: On matching a tag t_i, control passes to G_i, enforcing its CFG via mask generation until completed, then returning to dispatching.

A crucial performance gain derives from deferring expensive mask construction and cache occupancy for G_i until its tag t_i is seen, drastically reducing memory and computational overhead versus static approaches where all tool schemas are unioned and tracked simultaneously.
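The alternation between the two modes can be sketched as follows. This is a minimal illustration, not the real XGrammar 2 API: a naive substring scan stands in for the Aho–Corasick automaton, and grammar compilation is mocked, but the key point — compilation deferred until a tag is first matched — is shown.

```python
class TagDispatch:
    """Toy sketch of TagDispatch: dispatching mode scans for tags;
    dispatched mode enforces a grammar until a stop-string appears.
    Grammar compilation is lazy (deferred until the tag is seen)."""

    def __init__(self, tag_to_grammar_src, stop_strings):
        self.tag_to_grammar_src = tag_to_grammar_src  # tag -> grammar text
        self.stop_strings = set(stop_strings)
        self.compiled = {}            # tag -> compiled grammar (lazy)
        self.mode = "dispatching"
        self.buffer = ""              # recent output scanned for tags

    def _compile(self, tag):
        # Deferred compilation: pay the cost only once, on first match.
        if tag not in self.compiled:
            self.compiled[tag] = f"<compiled:{self.tag_to_grammar_src[tag]}>"
        return self.compiled[tag]

    def feed(self, text):
        """Consume newly generated text; switch modes on tag matches.
        Returns the active compiled grammar when a tag fires, else None."""
        self.buffer += text
        if self.mode == "dispatching":
            for tag in self.tag_to_grammar_src:   # Aho–Corasick in reality
                if tag in self.buffer:
                    self.mode = "dispatched"
                    self.buffer = ""
                    return self._compile(tag)
        elif any(s in self.buffer for s in self.stop_strings):
            self.mode = "dispatching"             # grammar completed
            self.buffer = ""
        return None
```

In dispatching mode, `feed` does only cheap scanning; the heavyweight mask machinery for a tool grammar is touched only after its tag appears.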

2. Just-in-Time Mask Compilation

Grammar-constrained decoding requires a mapping from token prefixes to allowable next-token sets (“masks”). Precomputing this full mask cache for large, highly dynamic grammars is prohibitive: compilation can take several seconds per grammar (Li et al., 7 Jan 2026).

XGrammar 2 adopts a partial-JIT (just-in-time) strategy:

  • In a prefilling phase, states are sorted by estimated compile cost; the top K are precompiled within a user-specified time budget, and the rest are marked compile-on-demand.
  • When decoding reaches an uncached state, mask computation is performed only as needed; runtime spikes are often hidden by overlapping with LLM steps.

Reported speed-ups are dramatic: on JSONSchemaBench, preprocessing drops from 4960 ms (static) to 612 ms (JIT), with the auxiliary techniques below reducing it further to single-digit milliseconds.
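The two-phase policy can be sketched in a few lines. The function names and cost model here are our assumptions for illustration, not the paper’s interface:

```python
import time

def partial_jit_prefill(states, estimate_cost, compile_state, budget_s):
    """Prefilling phase: compile states in descending estimated-cost order
    until the time budget runs out; return (cache, deferred)."""
    cache, deferred = {}, []
    deadline = time.monotonic() + budget_s
    for s in sorted(states, key=estimate_cost, reverse=True):
        if time.monotonic() < deadline:
            cache[s] = compile_state(s)
        else:
            deferred.append(s)          # marked compile-on-demand
    return cache, deferred

def get_mask(state, cache, compile_state):
    """Decoding phase: on a cache miss, compile just-in-time."""
    if state not in cache:
        cache[state] = compile_state(state)
    return cache[state]
```

In a real engine, the on-demand compile in `get_mask` would run while the GPU executes the LLM forward pass, which is how the runtime spikes are hidden.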

3. Cross-Grammar Caching via FSM Hashing

Mask computation for CFGs is expensive, but sub-grammars (e.g., “string”, “number”, common object patterns) are ubiquitous across many schemas. XGrammar 2 minimizes redundant computation by hashing minimized FSM representations of grammar fragments, thereby recognizing and reusing structurally identical subgraphs (Li et al., 7 Jan 2026).

FSM Hashing and Cache Lookup

  • Each production rule is converted to a minimized FSM and a canonical 64-bit hash is computed, resolving cycles to guarantee uniqueness.
  • The token-mask cache is keyed by (fsm-hash, lookahead signature); upon a cache miss, partial results can be reused if only lookahead differs, reducing recomputation.
  • This mechanism is critical for dynamic settings where grammars composed at runtime share repeated patterns.
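The canonicalize-and-hash step can be sketched as follows. This is our simplified illustration: we renumber states of a deterministic FSM in BFS order (with edge labels sorted) so that structurally identical fragments get identical 64-bit keys; the real system hashes minimized FSMs and keys the mask cache by (fsm-hash, lookahead signature).

```python
from collections import deque

def canonical_fsm_hash(start, edges, accepting):
    """edges: {state: {label: next_state}} for a deterministic FSM.
    Returns a 64-bit key that is identical for structurally identical
    FSMs regardless of how their states are named. Cycles are handled
    by the visited-set in the BFS renumbering, not by unrolling."""
    order, queue = {}, deque([start])
    while queue:
        s = queue.popleft()
        if s in order:
            continue
        order[s] = len(order)                   # canonical state number
        for label in sorted(edges.get(s, {})):
            queue.append(edges[s][label])
    canon_edges = tuple(
        (order[s], label, order[t])
        for s in order
        for label, t in sorted(edges.get(s, {}).items())
    )
    canon_accept = tuple(sorted(order[s] for s in accepting if s in order))
    return hash((canon_edges, canon_accept)) & 0xFFFFFFFFFFFFFFFF
```

Two schemas that both contain, say, a JSON "string" sub-grammar produce the same key and therefore hit the same cached token masks.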

4. Efficient Mask Generation Algorithms

PDA and Earley-Parser-Based Masking

Push-down automaton (PDA) approaches underlie many CFG-constrained decoders. XGrammar 2 extends this with an Earley parser backend, enabling efficient coverage of non-LL(1) grammars, where PDA states can explode combinatorially.

  • Earley’s dynamic programming table E[j] encodes “Earley items”; only states that can accept a terminal at the next token, called “scannable,” require full caching.
  • For each scannable state s, three disjoint subsets of tokens are identified: accepted (A_s), rejected (R_s), and uncertain (U_s), with uncertain entries resolved via further lookahead.
  • This yields linear growth in cache size with grammar complexity and amortizes cache lookups to O(1 + |U_s| · lookahead cost).

The worst-case time is O(n^3), but practical grammars are nearly deterministic, yielding O(n) to O(n^2) behavior.
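The token classification above rests on a prefix test: a token belongs in the mask iff appending it keeps the output a valid prefix of some sentence. The following toy Earley recognizer (our illustration; the optimized backend caches per-state results rather than reparsing) makes this concrete. Single-character strings not in the grammar dict are terminals.

```python
def earley_prefix_ok(grammar, start, text):
    """grammar: {nonterminal: [list of symbol lists]}.
    Returns True iff text is a valid prefix of some sentence."""
    chart = [set() for _ in range(len(text) + 1)]    # items per position
    for rhs in grammar[start]:
        chart[0].add((start, tuple(rhs), 0, 0))      # (lhs, rhs, dot, origin)
    for i in range(len(text) + 1):
        changed = True
        while changed:                               # predict + complete
            changed = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot < len(rhs) and rhs[dot] in grammar:     # predict
                    for prod in grammar[rhs[dot]]:
                        item = (rhs[dot], tuple(prod), 0, i)
                        if item not in chart[i]:
                            chart[i].add(item)
                            changed = True
                elif dot == len(rhs):                          # complete
                    for plhs, prhs, pdot, porig in list(chart[origin]):
                        if pdot < len(prhs) and prhs[pdot] == lhs:
                            item = (plhs, prhs, pdot + 1, porig)
                            if item not in chart[i]:
                                chart[i].add(item)
                                changed = True
        if i < len(text):                                      # scan
            for lhs, rhs, dot, origin in chart[i]:
                if dot < len(rhs) and rhs[dot] not in grammar \
                        and rhs[dot] == text[i]:
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
    return bool(chart[len(text)])

def token_mask(grammar, start, prefix, vocab):
    """Accepted set A_s for the current prefix: tokens whose full
    consumption keeps the output a valid prefix."""
    return {t for t in vocab if earley_prefix_ok(grammar, start, prefix + t)}
```

For a toy bracketed-list grammar, the mask after "[" correctly admits list elements and the empty-list close while rejecting a leading comma.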

Repetition Compression

High-arity repetition rules (e.g., A → B{0,65536}) can generate tens of thousands of distinct PDA/Earley states with negligible practical difference in masks. XGrammar 2 compresses repetition by expanding only up to a threshold T explicit copies, summarizing the intervening states with compact repetition operators. This reduces the state space to O(T) and drastically improves both cache hits and mask-inference sharpness for large, variadic structures.

5. Performance Analysis and Empirical Results

XGrammar 2 (“Contour”) has been empirically evaluated on multiple structured-generation and function-calling workloads (Li et al., 7 Jan 2026).

Engine       Per-token overhead   Grammar compile time
XGrammar     200–400 μs           1,000–1,200 ms
llguidance   250–1,000 μs         (not reported)
Contour      30–80 μs             10–15 ms
  • In function-calling tasks (BFCL-v3 and SGLang), XGrammar 2 delivered end-to-end overheads of <6% compared to unconstrained decoding, corresponding to a 7× speed-up over XGrammar.
  • Ablation studies on JSONSchemaBench show compounding reductions: Earley-only (4960 ms, 45 μs/mask), +JIT (612 ms, 722 μs), +cross-grammar (535 ms, 334 μs), +repetition compression (5.4 ms, 126 μs).

End-to-end, XGrammar 2 achieves 6–10× per-token speedup and >100× compilation-time reduction over XGrammar.

6. Integration with LLM Inference and Engine Architecture

XGrammar 2 is designed for tight coupling with modern LLM inference pipelines (e.g., SGLang, vLLM). The typical processing pipeline is:

  • Tokenizer encodes prompt and history to prefix IDs.
  • LLM encoder computes logits for possible next tokens.
  • XGrammar 2’s mask module intercepts logits, applies the relevant mask depending on dispatching mode (TagDispatch vs. specific subgrammar), and returns masked logits to the sampler.
  • Minimal hooks are required—engine-agnostic integration is enabled via a mask_callback() interface.

CPU+GPU evaluation on an RTX 5090 and a Xeon Platinum yields <10 ms per mask-generation step, netting <6% total overhead thanks to overlap with GPU LLM execution.
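The engine-side hook can be sketched as below. Only the mask_callback() name comes from the source; the surrounding function and signatures are our assumptions, and real engines operate on GPU logit tensors rather than Python lists:

```python
import math

def apply_mask(logits, mask_callback, state):
    """logits: list[float] over the vocabulary; mask_callback(state)
    returns a boolean allow-list from the grammar engine. Disallowed
    tokens get -inf so the sampler can never pick them."""
    mask = mask_callback(state)
    return [x if ok else -math.inf for x, ok in zip(logits, mask)]
```

Because this is the only point of contact, the grammar engine stays agnostic to the serving framework’s batching and scheduling internals.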

7. Declarative Grammar Processing for XML (Clark XGrammar 2)

A separate “XGrammar 2” system, as described by Clark (Clark, 2015), is a declarative grammar formalism designed for streaming, table-driven, high-performance XML parsing atop SAX. Key features include:

  • Grammars specified in BNF/EBNF-like style, supporting element patterns, bindings, repetition, disjunction, and semantic actions.
  • Parsing specified via inference rules matching input sequences to synthesized values, with the grammar’s operational semantics defined by a prediction-table-driven abstract machine.
  • All parsing occurs with LL(1) predictiveness, enabling constant-time dispatch per event and eliminating the need to materialize full XML trees in memory (unlike DOM).
  • The machine state consists of code pointer, environment, value stack, SAX event queue, and return stack, all manipulated via a fixed set of transition rules.

This grammar engine achieves O(n) parse time (for n SAX events) with constant per-event cost, and is suitable for large or streaming XML inputs, rivaling hand-tuned efficiency while being substantially more maintainable.
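The abstract machine can be sketched as a small table-driven interpreter. This is our reconstruction of the idea, not Clark’s actual formalism: events from a SAX-like stream are dispatched in O(1) against an LL(1) table keyed by (top-of-stack nonterminal, event), with semantic actions synthesizing values on a value stack and no tree ever materialized.

```python
def _key(ev):
    # "text" events dispatch on kind only; start/end on (kind, element name)
    return ev[0] if ev[0] == "text" else (ev[0], ev[1])

def run_machine(events, table, start):
    stack = [start]       # control stack: nonterminals, terminals, actions
    values = []           # synthesized-value stack
    for ev in events:
        while True:
            top = stack.pop()
            if callable(top):                      # semantic action
                top(values)
            elif isinstance(top, str) and top in table:
                # nonterminal: LL(1) predict, push production reversed
                stack.extend(reversed(table[top][_key(ev)]))
            else:                                  # terminal: consume event
                if top != _key(ev):
                    raise SyntaxError(f"expected {top!r}, got {ev!r}")
                if top == "text":
                    values.append(ev[1])
                break
    while stack:                                   # drain trailing actions
        top = stack.pop()
        if not callable(top):
            raise SyntaxError("unexpected end of input")
        top(values)
    return values

# Example table for  <items> (<item>text</item>)*  </items>
def collect(values):
    items = values[:]
    values.clear()
    values.append(items)

TABLE = {
    "DOC":  {("start", "items"):
             [("start", "items"), "LIST", ("end", "items"), collect]},
    "LIST": {("start", "item"): ["ITEM", "LIST"],    # right recursion
             ("end", "items"): []},                  # epsilon
    "ITEM": {("start", "item"):
             [("start", "item"), "text", ("end", "item")]},
}
```

Each event triggers a bounded amount of work (one table lookup per nonterminal expansion plus one terminal match), which is the source of the constant per-event cost.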

Summary Table: Principal Features of XGrammar 2 Systems

System           Domain               Core Techniques                                                Evaluated Speedup
Contour (2026)   LLM structured gen.  TagDispatch, JIT, FSM caching, Earley, repetition compression  6–10× per-token, >100× compile (Li et al., 7 Jan 2026)
Clark XGrammar   XML parsing          Declarative LL(1) grammars, table-driven stack machine         Comparable to tuned SAX (Clark, 2015)

Both strands of XGrammar 2 offer formal, efficient mechanisms for syntax-driven generation or recognition in their respective domains, with demonstrated low overhead, high reusability, and empirical scalability.
