- The paper introduces a framework that transforms LR(1) grammars into deterministic pushdown automata (DPDAs) ahead of time, eliminating runtime ambiguity and inefficiency in structured LLM generation.
- It leverages prefix-conditioned edges and cycle-aware DPDA construction to precompute unique transition paths, removing backtracking and speculative exploration at decode time.
- Experiments show up to 40% faster token generation and 36% higher throughput compared to state-of-the-art baselines, demonstrating practical scalability.
The paper "Pre3: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation" (2506.03887) addresses the computational inefficiencies of existing methods for generating structured output from LLMs, particularly when adhering to grammars like LR(1) (commonly used for formats like JSON). Current state-of-the-art approaches typically parse LR(1) grammars into a Pushdown Automaton (PDA). While PDAs handle the recursive nature of context-free grammars, their non-deterministic nature leads to significant runtime overhead, especially under large inference batch sizes. This overhead stems from the need for context-dependent token processing, requiring backtracking, speculative exploration, and complex management of a persistent stack.
Pre3 proposes a different approach built on Deterministic Pushdown Automata (DPDA). The core idea is to transform the LR(1) grammar into a DPDA in a preprocessing step. Because the automaton is deterministic, transition paths can be precomputed, eliminating the runtime ambiguity and associated overhead of traditional PDA-based methods.
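By contrast with the non-deterministic sketch above, a DPDA step is a single precomputed lookup: for each (state, symbol, stack condition) at most one edge exists. The interfaces here are assumed for illustration, not Pre3's actual API.

```python
def step_deterministic(state, stack, symbol, table):
    """table[(state, symbol, stack_top)] -> (next_state, n_pop, push) or absent."""
    edge = table.get((state, symbol, stack[-1]))  # unique by construction
    if edge is None:
        return None  # symbol is illegal here; mask it out of the next-token set
    next_state, n_pop, push = edge
    return next_state, stack[:len(stack) - n_pop] + push  # no search, no backtracking
```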
The key technical contributions of Pre3 are:
- Prefix-conditioned Edges: Unlike standard PDA transitions, which depend only on the current state, input symbol, and top-of-stack symbol, prefix-conditioned edges also require matching a specific prefix of symbols already on the stack (the parsing history). This ensures that for any given state, input, and stack configuration the next transition is uniquely determined, which is crucial for enabling ahead-of-time analysis and parallel processing of transitions (a representation sketch follows this list).
- Cycle-aware DPDA Construction: The paper presents an algorithm to build the DPDA directly from the LR(1) state transition graph (a reduced construction sketch also follows this list). This involves defining two types of edges:
- Acceptance Edges: Directly derived from LR(1) shift operations, corresponding to pushing state information onto the stack.
- Reduction Edges: Explicitly added to handle reduction operations (replacing a sequence of symbols with a non-terminal). Non-determinism is resolved by merging epsilon-reduction edges with compatible acceptance edges and by incorporating prefix-conditioned stack matching. A key challenge is cycles in the LR(1) graph, which could produce infinite reduction paths during construction. Pre3 addresses this by rewriting back-edges in cycles to include stack pop operations that remove the redundant states of a full cycle traversal, so the stack reflects only the net effect of traversing the cycle.
- Edge Optimization with Prefix-condition: The deterministic and precomputed nature of DPDA edges allows for structural optimizations during preprocessing. These include:
- Edge Aggregation: Merging edges that have the same stack prefix condition and operations but accept different symbols (e.g., aggregating edges for digits 0-9).
- Edge Merging: Connecting edges that share a matched stack prefix and operations, reducing the number of steps needed to reach a state. Together, these optimizations simplify the automaton and improve runtime efficiency (see the aggregation sketch below).
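To make the prefix condition concrete, here is a hypothetical representation of a prefix-conditioned edge and its matching rule; the names and fields are illustrative, not Pre3's actual data structures.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class PrefixEdge:
    src: int                       # current DPDA state
    symbol: str                    # terminal to accept
    stack_prefix: Tuple[int, ...]  # required topmost stack entries
    n_pop: int                     # entries to pop when the edge fires
    push: Tuple[int, ...]          # entries to push afterwards
    dst: int                       # next DPDA state

def matches(edge: PrefixEdge, state: int, symbol: str,
            stack: Tuple[int, ...]) -> bool:
    # Conditioning on a stack PREFIX (not just the top) is what makes the
    # chosen edge unique for any (state, input, stack) configuration.
    k = len(edge.stack_prefix)
    prefix_ok = k == 0 or stack[-k:] == edge.stack_prefix
    return state == edge.src and symbol == edge.symbol and prefix_ok
```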
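The construction itself can be pictured, in very reduced form, as a pass over the LR(1) tables: shift actions become acceptance edges and reduce actions become reduction edges whose prefix condition is the state uncovered by the pop. The ACTION/GOTO layout below is the textbook form, and the sketch omits the epsilon-edge merging and cycle back-edge rewriting described above.

```python
def build_edges(action, goto, rules):
    """action[(state, terminal)] -> ('shift', next_state) | ('reduce', rule_id)
    goto[(state, nonterminal)] -> next_state
    rules[rule_id] -> (lhs_nonterminal, rhs_length)"""
    edges = []
    for (state, terminal), act in action.items():
        if act[0] == 'shift':
            # Acceptance edge: consume the terminal, push the new LR state.
            edges.append(('accept', state, terminal, act[1]))
        elif act[0] == 'reduce':
            lhs, rhs_len = rules[act[1]]
            # One reduction edge per state that may be uncovered after the
            # pop: the uncovered state acts as the prefix condition and
            # selects the GOTO target for lhs.
            for (uncovered, nt), target in goto.items():
                if nt == lhs:
                    edges.append(('reduce', state, terminal, rhs_len,
                                  uncovered, target))
    return edges
```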
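Edge aggregation is then a straightforward grouping pass. The sketch below uses tuples mirroring the PrefixEdge fields above; for example, ten single-digit edges that differ only in the accepted symbol collapse into one edge keyed by a symbol set.

```python
from collections import defaultdict

def aggregate(edges):
    """Merge edges identical except for the accepted symbol.
    Each edge is (src, symbol, stack_prefix, n_pop, push, dst)."""
    groups = defaultdict(set)
    for src, symbol, stack_prefix, n_pop, push, dst in edges:
        groups[(src, stack_prefix, n_pop, push, dst)].add(symbol)
    return [(src, frozenset(syms), stack_prefix, n_pop, push, dst)
            for (src, stack_prefix, n_pop, push, dst), syms in groups.items()]
```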
Pre3 is implemented and integrated with the LightLLM inference framework, utilizing both Python (approx. 2000 lines) and C++ (approx. 1000 lines). The DPDA construction is a one-time preprocessing step, reported to take only a few seconds for complex grammars like JSON, and the results are cacheable.
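Since the build is a pure function of the grammar, the natural deployment pattern is to key a cache on the grammar text. This is only an assumed sketch of such caching (build_dpda and the cache layout are hypothetical, not LightLLM's or Pre3's interface).

```python
import hashlib
import pathlib
import pickle

def cached_dpda(grammar_text: str, build_dpda, cache_dir: str = "~/.cache/pre3"):
    cache = pathlib.Path(cache_dir).expanduser()
    cache.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(grammar_text.encode("utf-8")).hexdigest()
    path = cache / f"{key}.pkl"
    if path.exists():                    # hit: skip the seconds-long build
        return pickle.loads(path.read_bytes())
    dpda = build_dpda(grammar_text)      # one-time preprocessing step
    path.write_bytes(pickle.dumps(dpda))
    return dpda
```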
The practical benefits of Pre3 are demonstrated through extensive evaluation against state-of-the-art baselines like XGrammar, Outlines, and llama.cpp on various models (Llama-3-8B, Llama-2-70B, DeepSeek-V2-Lite-Chat, Qwen2-14B) and grammars (JSON, Chain-of-Thought). The experiments show:
- Lower per-step decoding overhead compared to baselines.
- Significant reductions in time per output token (TPOT), achieving up to a 40% improvement over XGrammar, particularly noticeable at larger batch sizes (e.g., 29-40% reduction for batch sizes 256-512).
- Increased throughput in real-world serving simulations, showing up to 36% higher throughput compared to XGrammar at higher concurrency levels. The performance gains are more pronounced as batch size and concurrency increase, highlighting Pre3's superior scalability.
While Pre3 demonstrates significant advancements for LR(1) grammars, the authors note limitations such as potential challenges with more complex LR(k) grammars (k>1) and that the current implementation is a research prototype that could benefit from production-level hardware and system optimizations. However, the efficiency of the preprocessing step makes it practical for real-world deployment.