XGrammar 2: High-Performance Grammar Systems
- XGrammar 2 denotes two independent systems: a dynamic, LLM-focused structured-generation engine and a declarative, streaming grammar language for XML.
- The generation engine leverages techniques such as TagDispatch, JIT mask compilation, FSM hashing, and Earley parsing to minimize computational overhead and improve runtime efficiency.
- Empirical results show 6–10× per-token speedups and over 100× reductions in compilation time for constrained LLM decoding; the XML system delivers streaming parsing with constant per-event cost.
XGrammar 2 refers to two independent systems: (1) a dynamic, high-efficiency structured generation engine (“Contour”) targeting agentic LLMs (Li et al., 7 Jan 2026), and (2) a declarative, streaming grammar language for XML domain-specific languages (Clark, 2015). Both employ advanced formal machinery to optimize for expressivity, performance, and low overhead in structured text generation and parsing.
1. TagDispatch and Dynamic Grammar Dispatch
XGrammar 2 introduces TagDispatch, a dispatch mechanism optimized for dynamic structured generation in LLM-based agents (Li et al., 7 Jan 2026). In agentic tasks (e.g., tool-calling, stepwise conditional reasoning), sequence generation often begins with a dedicated “tag” (e.g., <function=foo>, <|channel|>msg), after which generation must proceed under a specialized context-free grammar (CFG).
TagDispatch Structure
Formally, a TagDispatch instance is a triple
$$(T, \{G_t\}_{t \in T}, S),$$
where $T$ is a finite set of tag strings, $G_t$ are the corresponding grammars, and $S$ is a set of stop-strings. Decoding alternates between:
- Dispatching Mode: Uses an Aho–Corasick automaton to scan for any tag in $T$ amid the generated tokens; only very lightweight token masking is required.
- Dispatched Mode: On matching a tag $t$, control passes to $G_t$, enforcing its CFG via mask generation until completion, then returns to dispatching.
A crucial performance gain derives from deferring expensive mask construction and cache occupancy for each $G_t$ until its tag is seen, drastically reducing memory and computational overhead versus static approaches in which all tool schemas are unioned and tracked simultaneously.
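A minimal sketch of the two-mode loop, assuming per-tag completion predicates in place of real CFG matchers and a naive suffix scan in place of the Aho–Corasick automaton:

```python
from dataclasses import dataclass, field

@dataclass
class TagDispatch:
    """Two-mode TagDispatch sketch. `grammars` maps each tag string to a
    completion predicate standing in for a real CFG matcher; a production
    system would use an Aho-Corasick automaton, not the naive suffix scan
    below."""
    grammars: dict        # tag -> callable(text) -> bool  (hypothetical)
    stop_strings: set     # strings that terminate generation entirely
    _active: str = None   # tag currently dispatched, if any
    _buffer: str = field(default="")

    def feed(self, token_text: str) -> str:
        self._buffer += token_text
        if self._active is None:                        # dispatching mode
            if any(self._buffer.endswith(s) for s in self.stop_strings):
                return "stop"
            for tag in self.grammars:                   # naive tag scan
                if self._buffer.endswith(tag):
                    self._active, self._buffer = tag, ""
                    return f"dispatched:{tag}"
            return "free"                               # near-zero masking
        if self.grammars[self._active](self._buffer):   # dispatched mode
            self._active, self._buffer = None, ""
            return "returned"
        return f"constrained:{self._active}"

# Dispatch fires on "<function=foo>"; decoding is CFG-constrained until
# the (toy) completion predicate accepts, then dispatching resumes.
td = TagDispatch(grammars={"<function=foo>": lambda s: s.endswith("}")},
                 stop_strings={"<|end|>"})
for tok in ["I will call ", "<function=foo>", '{"x":', " 1}"]:
    print(tok, "->", td.feed(tok))
```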
2. Just-in-Time Mask Compilation
Grammar-constrained decoding necessitates a mapping from token-prefixes to allowable next-token sets (“masks”). Precomputing this mask cache for large or highly dynamic grammars is prohibitive: compilation cost can reach several seconds per grammar (Li et al., 7 Jan 2026).
XGrammar 2 adopts a partial-JIT (just-in-time) strategy:
- In a prefilling phase, grammar states are sorted by estimated compile cost; the costliest are precompiled within a user-specified time budget, and the rest are marked compile-on-demand.
- When decoding reaches an uncached state, mask computation is performed only as needed; runtime spikes are often hidden by overlapping with LLM steps.
Reported speed-ups are dramatic: on JSONSchemaBench, preprocessing drops from 4960 ms (static) to 612 ms (JIT), and the auxiliary techniques described below reduce it further to single-digit milliseconds.
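The prefill/decode split can be illustrated with a short sketch; `estimate_cost` and `compile_state` stand in for the engine's internal cost model and mask compiler, which the source does not specify:

```python
import time

def precompile_with_budget(states, estimate_cost, compile_state, budget_s):
    """Partial-JIT sketch: precompile the costliest grammar states that
    fit in a wall-clock budget; everything else compiles on first use."""
    cache = {}
    deadline = time.monotonic() + budget_s
    # Most expensive states first: they hurt most if hit at decode time.
    for state in sorted(states, key=estimate_cost, reverse=True):
        if time.monotonic() >= deadline:
            break                       # remainder stays compile-on-demand
        cache[state] = compile_state(state)
    return cache

def get_mask(state, cache, compile_state):
    """Decode-time lookup: JIT-compile on a miss, then memoize. The spike
    can often be hidden by overlapping with the LLM forward pass."""
    if state not in cache:
        cache[state] = compile_state(state)
    return cache[state]

# Usage with toy stand-ins for states, cost model, and compiler:
states = ["object", "string", "array"]
cache = precompile_with_budget(states, estimate_cost=len,
                               compile_state=lambda s: f"mask<{s}>",
                               budget_s=0.01)
print(get_mask("string", cache, lambda s: f"mask<{s}>"))
```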
3. Cross-Grammar Caching via FSM Hashing
Mask computation for CFGs is expensive, but sub-grammars (e.g., “string”, “number”, common object patterns) are ubiquitous across many schemas. XGrammar 2 minimizes redundant computation by hashing minimized FSM representations of grammar fragments, thereby recognizing and reusing structurally identical subgraphs (Li et al., 7 Jan 2026).
FSM Hashing and Cache Lookup
- Each production rule is converted to a minimized FSM and a canonical 64-bit hash is computed, resolving cycles to guarantee uniqueness.
- The token-mask cache is keyed by (fsm-hash, lookahead signature); upon a cache miss, partial results can be reused if only lookahead differs, reducing recomputation.
- This mechanism is critical for dynamic settings where grammars composed at runtime share repeated patterns.
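A minimal illustration of structural hashing, assuming a deterministic, minimized FSM stored as nested dicts (the paper's exact canonicalization scheme is not detailed here). BFS in sorted-label order assigns canonical state ids, so renamed-but-identical fragments collide in the cache as intended:

```python
import hashlib
from collections import deque

def fsm_hash(start, transitions, accepting):
    """Canonical 64-bit hash sketch for a minimized, deterministic FSM.
    `transitions` maps state -> {symbol: next_state}. Cycles are handled
    because each state receives exactly one canonical id."""
    ids, order, canon = {start: 0}, deque([start]), []
    while order:
        s = order.popleft()
        for sym in sorted(transitions.get(s, {})):
            t = transitions[s][sym]
            if t not in ids:
                ids[t] = len(ids)
                order.append(t)
            canon.append((ids[s], sym, ids[t]))
    canon.append(("accept", tuple(sorted(ids[s] for s in accepting))))
    digest = hashlib.blake2b(repr(canon).encode(), digest_size=8)
    return int.from_bytes(digest.digest(), "big")

# Two "string"-like fragments with different state names hash identically,
# so their token masks can be shared across schemas.
a = fsm_hash("s0", {"s0": {'"': "s1"}, "s1": {'"': "s2", "c": "s1"}}, {"s2"})
b = fsm_hash("q",  {"q":  {'"': "r"},  "r":  {'"': "t",  "c": "r"}},  {"t"})
assert a == b
```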
4. Efficient Mask Generation Algorithms
PDA and Earley-Parser-Based Masking
Push-down automaton (PDA) approaches underlie many CFG-constrained decoders. XGrammar 2 extends this with an Earley parser backend, enabling efficient coverage of non-LL(1) grammars, where PDA states can explode combinatorially.
- Earley’s dynamic programming table encodes “Earley items”; only states that can accept a terminal at the next token, called “scannable,” require full caching.
- For each scannable state, three disjoint subsets of tokens are identified: accepted, rejected, and uncertain, with uncertain entries resolved via further lookahead (a classification sketch follows below).
- This yields linear growth in cache size with grammar complexity and amortizes cache lookups to $O(1)$ per token.
The worst-case time is $O(n^3)$ in the input length, but practical grammars are nearly deterministic, yielding $O(n)$–$O(n^2)$ behavior.
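A sketch of the three-way token split for a single scannable state; `scannable` and `full_match` are hypothetical stand-ins for the engine's terminal-level checks:

```python
def classify_tokens(vocab, scannable, full_match):
    """Split a tokenizer vocabulary into the accepted / rejected /
    uncertain sets for one scannable Earley state. `scannable(text)`
    says whether `text` can begin an acceptable terminal; `full_match`
    says whether the whole token is provably legal in one step."""
    accepted, rejected, uncertain = [], [], []
    for tok_id, text in vocab.items():
        if not scannable(text):
            rejected.append(tok_id)      # cannot even start a terminal
        elif full_match(text):
            accepted.append(tok_id)      # provably legal
        else:
            uncertain.append(tok_id)     # resolved later via lookahead
    return accepted, rejected, uncertain

# Toy usage: the state expects a JSON string terminal.
vocab = {0: '"', 1: '"a', 2: "}", 3: '"ab"'}
acc, rej, unc = classify_tokens(
    vocab,
    scannable=lambda t: t.startswith('"'),
    full_match=lambda t: t.count('"') % 2 == 0)  # toy completeness check
print(acc, rej, unc)   # [3] [2] [0, 1]
```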
Repetition Compression
High-arity repetition rules (e.g., a rule bounded to thousands of repetitions) can generate tens of thousands of distinct PDA/Earley states whose masks are practically indistinguishable. XGrammar 2 compresses repetition by expanding only a small threshold of explicit copies and summarizing the intervening states with compact counted-repetition operators. This reduces the state space to a constant number of states per repetition rule and drastically improves both cache hits and mask-inference sharpness for large, variadic structures.
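An illustrative sketch of the idea, with a hypothetical `threshold`:

```python
def repetition_states(min_rep, max_rep, threshold=3):
    """Repetition-compression sketch: rather than materializing one
    parser state per repetition count (up to `max_rep`, possibly tens of
    thousands), expand only `threshold` explicit copies and summarize
    the rest with a single counted state whose mask is shared."""
    explicit = [f"item_{i}" for i in range(min(max_rep, threshold))]
    if max_rep > threshold:
        # One compact state carries a counter for repetitions
        # threshold..max_rep; the token mask is identical across them.
        explicit.append(f"item_counted[{threshold}..{max_rep}]")
    return explicit

print(repetition_states(0, 50000))
# ['item_0', 'item_1', 'item_2', 'item_counted[3..50000]']
```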
5. Performance Analysis and Empirical Results
XGrammar 2 (“Contour”) has been empirically evaluated on multiple structured-generation and function-calling workloads (Li et al., 7 Jan 2026).
| Engine | Per-token overhead | Grammar compile time |
|---|---|---|
| XGrammar | 200–400 μs | 1,000–1,200 ms |
| llguidance | 250–1,000 μs | – |
| Contour | 30–80 μs | 10–15 ms |
- In function-calling tasks (BFCL-v3, run on SGLang), XGrammar 2 delivered end-to-end overheads of <6% compared to unconstrained decoding, corresponding to a 7× speed-up over XGrammar.
- Ablation studies on JSONSchemaBench show how the techniques compound (compile time, per-mask latency): Earley-only (4960 ms, 45 μs), +JIT (612 ms, 722 μs), +cross-grammar caching (535 ms, 334 μs), +repetition compression (5.4 ms, 126 μs).
End-to-end, XGrammar 2 achieves 6–10× per-token speedup and >100× compilation-time reduction over XGrammar.
6. Integration with LLM Inference and Engine Architecture
XGrammar 2 is designed for tight coupling with modern LLM inference pipelines (e.g., SGLang, vLLM). The typical processing pipeline is:
- Tokenizer encodes prompt and history to prefix IDs.
- LLM encoder computes logits for possible next tokens.
- XGrammar 2’s mask module intercepts logits, applies the relevant mask depending on dispatching mode (TagDispatch vs. specific subgrammar), and returns masked logits to the sampler.
- Minimal hooks are required; engine-agnostic integration is enabled via a `mask_callback()` interface (sketched below).
CPU+GPU evaluation on an RTX 5090 and a Xeon Platinum yields <10 ms per mask-generation step, netting <6% total overhead because mask computation overlaps with GPU LLM execution.
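A sketch of what such a hook might look like, assuming numpy logits and a hypothetical `get_matcher` registry for per-request grammar state (the actual XGrammar 2 signature may differ):

```python
import numpy as np

NEG_INF = float("-inf")

class _StubMatcher:
    """Stand-in for a per-request grammar matcher (hypothetical API)."""
    def allowed_token_ids(self):
        return [2, 5, 7]         # pretend only these tokens are legal

def get_matcher(request_id: str) -> _StubMatcher:
    return _StubMatcher()        # real engines would key a registry here

def mask_callback(request_id: str, logits: np.ndarray) -> np.ndarray:
    """Called once per decode step, between the forward pass and the
    sampler; keeps only grammar-legal tokens' logits."""
    allowed = get_matcher(request_id).allowed_token_ids()
    masked = np.full_like(logits, NEG_INF)
    masked[allowed] = logits[allowed]    # everything else samples at -inf
    return masked

print(mask_callback("req-0", np.zeros(10)))
```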
7. Declarative Grammar Processing for XML (Clark XGrammar 2)
A separate “XGrammar 2” system (Clark, 2015) is a declarative grammar formalism designed for streaming, table-driven, high-performance XML parsing atop SAX. Key features include:
- Grammars specified in BNF/EBNF-like style, supporting element patterns, bindings, repetition, disjunction, and semantic actions.
- Parsing specified via inference rules matching input sequences to synthesized values, with the grammar’s operational semantics defined by a prediction-table-driven abstract machine.
- All parsing occurs with LL(1) predictiveness, enabling constant-time dispatch per event and eliminating the need to materialize full XML trees in memory (unlike DOM).
- The machine state consists of code pointer, environment, value stack, SAX event queue, and return stack, all manipulated via a fixed set of transition rules.
This grammar engine achieves $O(n)$ parse time over $n$ SAX events with constant per-event cost, and is suitable for large or streaming XML inputs, rivaling hand-tuned parsers in efficiency while being substantially more maintainable.
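As a rough illustration of table-driven LL(1) dispatch over SAX-style events (the grammar, table encoding, and event tuples here are invented; Clark's machine additionally tracks an environment, value stack, and return stack, omitted for brevity):

```python
# Toy grammar: doc -> <list> item* </list> ; item -> <item> text </item>
TABLE = {
    ("doc",   ("start", "list")): ["START list", "items", "END list"],
    ("items", ("start", "item")): ["item", "items"],
    ("items", ("end",   "list")): [],                      # epsilon
    ("item",  ("start", "item")): ["START item", "TEXT", "END item"],
}

def parse(events):
    """Predictive LL(1) parse over a SAX-like event stream: one table
    lookup per (nonterminal, event) pair, so dispatch is O(1) per event
    and no XML tree is ever materialized."""
    stack, values = ["doc"], []
    it = iter(events)
    ev = next(it)
    while stack:
        top = stack.pop()
        if top.startswith(("START", "END", "TEXT")):       # terminal
            if top == "TEXT":
                values.append(ev[1])                       # bind text value
            ev = next(it, ("eof", None))                   # consume event
        else:                                              # nonterminal
            stack.extend(reversed(TABLE[(top, ev)]))       # predict step
    return values

events = [("start", "list"), ("start", "item"), ("chars", "a"),
          ("end", "item"), ("end", "list")]
print(parse(events))   # ['a']
```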
Summary Table: Principal Features of XGrammar 2 Systems
| System | Domain | Core Techniques | Evaluated Speedup |
|---|---|---|---|
| Contour (2026) | LLM Structured Gen. | TagDispatch, JIT, FSM caching, Earley, repetition compression | 6–10× per-token, >100× compile (Li et al., 7 Jan 2026) |
| Clark XGrammar | XML Parsing | Declarative LL(1) grammars, table-driven stack machine | Comparable to tuned SAX (Clark, 2015) |
Both strands of XGrammar 2 offer formal, efficient mechanisms for syntax-driven generation or recognition in their respective domains, with demonstrated low overhead, high reusability, and empirical scalability.