
Earley-Driven Dynamic Pruning for Efficient Structured Decoding (2506.01151v1)

Published 1 Jun 2025 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs have shown remarkable capabilities, yet ensuring their outputs conform to strict structural or grammatical constraints remains challenging, which is critical in function calls and domain-specific language (DSL) generation. Constrained decoding with context-free grammar is a flexible approach to guarantee LLMs' adherence to a specific format by dynamically building a token logits mask. However, creating this mask requires checking the validity of all tokens in the LLM vocabulary at every decoding step, which often incurs significant overheads in existing constrained decoding engines. To address this challenge, we propose **ZapFormat**, a novel **dynamic pruning** strategy based on the Earley algorithm that identifies and eliminates invalid or redundant Earley states in real-time, significantly reducing memory occupation of the Earley algorithm's states. This further enables us to use a state cache to speed up structured generations on a large number of queries. We implemented ZapFormat in a new constrained decoding engine called Formatron which also incorporates existing optimizations. Through comprehensive experiments on structured generation tasks, including JSON generation, JSON Schema validation, and semantic parsing, we demonstrate that Formatron not only **consistently maintains** high-precision compliant outputs but also achieves **significant improvements** in inference speed up to 2x compared to state-of-the-art implementations. More importantly, Formatron is generally applicable across various LLM architectures. We release Formatron as open source at https://github.com/Dan-wanna-M/formatron.

Summary

  • The paper introduces Formatron, a decoding engine that dynamically prunes redundant Earley states to reduce computational overhead and memory usage.
  • It integrates novel strategies such as context-independent token caching and rejection prefix optimization to enforce strict structural constraints in outputs.
  • Experimental evaluations show up to 2x throughput improvements and lower memory consumption while maintaining competitive accuracy across various LLM architectures.

This paper (Earley-Driven Dynamic Pruning for Efficient Structured Decoding, 1 Jun 2025) introduces Formatron, a new engine designed for efficient structured decoding with LLMs. The core challenge addressed is ensuring LLM outputs strictly conform to structural or grammatical constraints, which is vital for applications like function calls, domain-specific language generation, and structured data generation (e.g., JSON).

Existing constrained decoding methods often rely on Context-Free Grammars (CFGs) to dynamically mask invalid tokens. While flexible, these approaches face significant overheads:

  1. Computational Overhead: Checking the validity of every token in a large vocabulary at each decoding step is computationally expensive, especially with long sequences or complex grammars (the brute-force sketch after this list makes this cost concrete).
  2. State Redundancy: Parsing algorithms like the Earley algorithm accumulate intermediate states. Many of these states can become obsolete but continue to consume memory, leading to cache misses and reduced performance.
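
To make the first overhead concrete, here is a minimal sketch of the brute-force mask construction that engines like Formatron are designed to avoid. `parser_accepts` and `tokenizer` are hypothetical stand-ins for a grammar-checking oracle and the model's tokenizer; they are not the paper's API.

```python
import numpy as np

def build_logits_mask(parser_accepts, tokenizer, prefix: str, vocab_size: int) -> np.ndarray:
    """Check every vocabulary token against the grammar: O(|V|) parser calls per step."""
    mask = np.zeros(vocab_size, dtype=bool)
    for token_id in range(vocab_size):
        token_text = tokenizer.decode([token_id])
        # Hypothetical oracle: does appending this token keep the output parseable?
        mask[token_id] = parser_accepts(prefix + token_text)
    return mask

def apply_mask(logits: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Invalid tokens get -inf so softmax assigns them zero probability."""
    return np.where(mask, logits, -np.inf)
```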

Formatron tackles these issues primarily through ZapFormat, a novel dynamic pruning strategy based on the Earley algorithm. The key idea is to identify and eliminate invalid or redundant Earley states in real-time, drastically reducing memory footprint and computational load.

Key Technical Components:

  1. ZapFormat (Dynamic Pruning): This is the core innovation. It extends the standard Earley item notation to include the input span covered by a rule application: $(A \to \alpha \cdot \beta, i, j)$, where $[i, j]$ is the span.
    • Dependencies: Three types of dependencies between Earley items are defined based on the Earley algorithm's operations: Predict, Scan, and Complete.
    • Dependency Graph: A directed graph is maintained where vertices are Earley items and edges represent these dependencies. This graph tracks how items are related and which items are necessary for subsequent steps.
    • Reachability and Dynamic Pruning: An item is "reachable" if there is a dependency path from it to some item in the most recently constructed Earley set. Formatron maintains an "active item set" consisting only of reachable items. After the Complete phase and before the Predict phase, a "Compact" operation prunes every item in the current set that is not in the active item set, removing "dead" states that can no longer contribute to a valid parse. Reachability is maintained with an incremental update strategy for efficiency (a minimal sketch of this compaction appears after this list).
  2. Context-Independent Token Mask Cache: Inspired by prior work (e.g., XGrammar (XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models, 22 Nov 2024)), Formatron pre-computes and caches validity masks for "context-independent" tokens: tokens whose validity can be determined solely from the next expected terminal symbols in the grammar rules, without the full parsing context. This removes redundant computation at runtime (see the mask-cache sketch after this list).
  3. Rejection Prefix Optimization: Formatron identifies "rejected prefixes", minimal token sequences that guarantee an invalid parse regardless of what follows. If the generated prefix matches a rejected prefix, parsing can be terminated immediately and the corresponding states discarded, avoiding exploration of impossible paths (see the trie sketch after this list).
  4. Grammar Transformation: The engine incorporates standard grammar optimization techniques like removing useless rules (those that cannot contribute to a valid derivation) and handling null rules (rules that can derive an empty string). These transformations reduce the size and complexity of the grammar, improving parsing efficiency.
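
The compaction step can be sketched as a reachability sweep over a dependency graph of Earley items. The item fields and graph representation below are illustrative assumptions; the paper's engine updates reachability incrementally rather than re-traversing the whole graph at each step.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Item:
    rule: str    # e.g. "A -> alpha . beta"
    start: int   # i: where this rule application began
    end: int     # j: right edge of the span covered so far

@dataclass
class Chart:
    # deps[x] = the items x was derived from via Predict/Scan/Complete
    deps: dict[Item, set[Item]] = field(default_factory=dict)
    items: set[Item] = field(default_factory=set)

    def add(self, item: Item, derived_from: set[Item]) -> None:
        self.items.add(item)
        self.deps.setdefault(item, set()).update(derived_from)

    def compact(self, frontier: set[Item]) -> None:
        """Keep only items reachable from the newest Earley set (the frontier)
        by walking dependency edges backwards; everything else is dead state."""
        active, stack = set(frontier), list(frontier)
        while stack:
            for parent in self.deps.get(stack.pop(), ()):
                if parent not in active:
                    active.add(parent)
                    stack.append(parent)
        self.items &= active
        self.deps = {it: ds & active for it, ds in self.deps.items() if it in active}
```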
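
The mask cache can similarly be sketched as a per-terminal precomputation done once at grammar-compile time; `matches_terminal` is an assumed predicate standing in for the engine's real terminal-matching logic.

```python
import numpy as np

def precompute_terminal_masks(terminals, tokenizer, vocab_size, matches_terminal):
    """Build one boolean mask per terminal, once, at grammar-compile time."""
    cache = {}
    for term in terminals:
        mask = np.zeros(vocab_size, dtype=bool)
        for tid in range(vocab_size):
            mask[tid] = matches_terminal(tokenizer.decode([tid]), term)
        cache[term] = mask
    return cache

def mask_for_step(expected_terminals, cache, vocab_size):
    """At decode time, OR together cached masks instead of re-checking tokens."""
    mask = np.zeros(vocab_size, dtype=bool)
    for term in expected_terminals:
        mask |= cache[term]
    return mask
```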
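
Rejected prefixes fit naturally in a trie keyed by token ids; the structure below is an illustrative assumption, not Formatron's internal representation.

```python
class RejectTrie:
    """Trie of minimal rejected token sequences: once a generated prefix hits a
    rejected entry, no continuation can parse, so states can be discarded."""

    def __init__(self):
        self.children: dict[int, "RejectTrie"] = {}
        self.rejected = False

    def insert(self, token_ids: list[int]) -> None:
        node = self
        for t in token_ids:
            node = node.children.setdefault(t, RejectTrie())
        node.rejected = True  # minimal prefix: any extension also fails

    def is_rejected(self, token_ids: list[int]) -> bool:
        node = self
        for t in token_ids:
            if node.rejected:
                return True
            node = node.children.get(t)
            if node is None:
                return False
        return node.rejected
```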

Implementation:

Formatron is implemented as a constrained decoding engine that integrates the ZapFormat pruning strategy with the other optimizations. It is designed to be generally applicable across various LLM architectures. The authors have released Formatron as open-source software.
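
As a hedged illustration of that integration point, a constrained decoding engine typically hooks into a HuggingFace generation loop through the `LogitsProcessor` interface. The sketch below is not Formatron's actual API; `engine.compute_mask` is a hypothetical method standing in for the engine's mask construction.

```python
import torch
from transformers import LogitsProcessor

class GrammarConstrainedProcessor(LogitsProcessor):
    """Masks grammar-invalid tokens before sampling at every decoding step."""

    def __init__(self, engine):
        self.engine = engine  # hypothetical: wraps grammar, parser state, caches

    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor) -> torch.FloatTensor:
        for row in range(scores.shape[0]):
            # Hypothetical call: returns one bool per vocabulary entry.
            mask = self.engine.compute_mask(input_ids[row].tolist())
            mask = torch.as_tensor(mask, dtype=torch.bool, device=scores.device)
            scores[row, ~mask] = float("-inf")
        return scores
```

A processor like this would be passed to generation via `model.generate(..., logits_processor=LogitsProcessorList([GrammarConstrainedProcessor(engine)]))`.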

Experimental Evaluation:

The paper evaluates Formatron on structured generation tasks including:

  • Geoquery transformation (natural language to FunQL)
  • JSON Schema validation (generating JSON compliant with a schema)
  • JSON Grammar generation (generating syntactically and semantically valid JSON)

They compare Formatron against state-of-the-art baselines like lm-format-enforcer (Sketch: A Toolkit for Streamlining LLM Operations, 5 Sep 2024), outlines (Efficient Guided Generation for Large Language Models, 2023), and XGrammar (XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models, 22 Nov 2024) using various LLMs (Gemma-2-9b-it, Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Qwen2.5-7B-Instruct).

Results:

  • Throughput: Formatron consistently achieves significant speed improvements, with up to 2x higher throughput than the baselines in many scenarios, especially in multi-run evaluations, which benefit from the state cache.
  • Robustness: Formatron demonstrates strong performance across different LLM architectures and diverse constrained decoding tasks, maintaining stable and high-efficiency performance.
  • Ablation Study: Experiments show that the dynamic pruning mechanism is crucial for performance, contributing significantly to throughput gains. The state cache also provides further benefits.
  • Memory Usage: Ablation studies also confirm that the pruning mechanism effectively reduces the maximum memory consumed during constrained decoding.
  • Accuracy: While the primary focus is efficiency, additional experiments demonstrate that Formatron maintains competitive accuracy in generating compliant outputs.

Practical Applications:

Formatron's ability to efficiently enforce complex structural constraints makes it highly practical for various real-world scenarios:

  • Function Calling: Ensuring LLM outputs adhere to predefined API or function call formats (like JSON).
  • Structured Data Generation: Generating outputs that conform to specific data formats like JSON, XML, or database query languages (e.g., SQL, FunQL).
  • Code Generation: Guiding LLMs to generate code snippets that adhere to specific programming language syntax.
  • Templated Text Generation: Filling structured templates with LLM-generated content while ensuring the output structure remains intact.

By significantly reducing the computational and memory overhead associated with CFG-based constrained decoding, Formatron enables more efficient and scalable deployment of LLMs in applications requiring strictly formatted outputs. The open-source release of Formatron lowers the barrier to entry for developers needing to implement efficient structured generation.
