Set Block Decoding is a Language Model Inference Accelerator (2509.04185v1)

Published 4 Sep 2025 in cs.LG

Abstract: Autoregressive next token prediction LLMs offer powerful capabilities but face significant challenges in practical deployment due to the high computational and memory costs of inference, particularly during the decoding stage. We introduce Set Block Decoding (SBD), a simple and flexible paradigm that accelerates generation by integrating standard next token prediction (NTP) and masked token prediction (MATP) within a single architecture. SBD allows the model to sample multiple, not necessarily consecutive, future tokens in parallel, a key distinction from previous acceleration methods. This flexibility allows the use of advanced solvers from the discrete diffusion literature, offering significant speedups without sacrificing accuracy. SBD requires no architectural changes or extra training hyperparameters, maintains compatibility with exact KV-caching, and can be implemented by fine-tuning existing next token prediction models. By fine-tuning Llama-3.1 8B and Qwen-3 8B, we demonstrate that SBD enables a 3-5x reduction in the number of forward passes required for generation while achieving the same performance as equivalent NTP training.

Summary

  • The paper introduces Set Block Decoding (SBD) that unifies autoregressive next-token prediction and masked token prediction to predict multiple future tokens in parallel.
  • The paper demonstrates a 3-5x reduction in forward passes with minimal accuracy loss using efficient attention masking and exact KV-caching.
  • The paper provides a practical fine-tuning solution compatible with existing models and hardware, enabling rapid deployment in latency-sensitive applications.

Set Block Decoding: Accelerating LLM Inference

Motivation and Background

Autoregressive next-token prediction (NTP) LLMs, particularly those based on the Transformer architecture, have achieved state-of-the-art results across a wide range of tasks. However, their practical deployment is hindered by the computational and memory bottlenecks of the decoding stage during inference. While the prefilling stage benefits from parallelism, decoding is inherently sequential, requiring a forward pass for each generated token and repeated access to model weights and cached key-value pairs. This results in high latency and resource consumption, especially for large models and long output sequences.

Existing acceleration methods, such as speculative decoding and blockwise parallel decoding, attempt to mitigate these issues by generating multiple tokens in parallel or by using draft models for verification. However, these approaches either require architectural modifications or auxiliary models, or they are limited to consecutive token blocks, which restricts their flexibility and efficiency.

Set Block Decoding: Methodology

Set Block Decoding (SBD) introduces a paradigm that unifies NTP and masked token prediction (MATP) within a single Transformer architecture. SBD enables the model to predict multiple, potentially non-consecutive future tokens in parallel, conditioned on arbitrary subsets of already revealed tokens. This is achieved by leveraging bidirectional attention within the future block, while maintaining causal attention for past tokens. The architecture remains unchanged, and SBD is implemented via fine-tuning, requiring no additional training hyperparameters and preserving compatibility with exact KV-caching.

Training Procedure

SBD training combines the standard NTP loss with a masked token prediction loss. For each training sequence, a random subset of future tokens is masked, and the model is trained to predict these masked tokens conditioned on the unmasked context and any revealed future tokens. The loss function is:

$\mathcal{L}(x, \hat{x}; \theta) = -\sum_{t=2}^{L}\log p_\theta(x_t | x_{<t}) - \sum_{t \in \mathcal{T}} \sum_{i=0}^{k-1} \mathds{1}_{\hat{x}_{t+i}=\text{m}} \log p_\theta(x_{t+i} | x_{<t}; \hat{x}_{t},\ldots,\hat{x}_{t+k-1})$

where $\mathcal{T}$ indexes block start positions, $k$ is the block size, $\text{m}$ denotes the mask token, and the masking probability $\eta$ controls how the partially masked sequence $\hat{x}$ is drawn.
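
To make the objective concrete, here is a minimal PyTorch-style sketch of how such a hybrid NTP + masked-prediction loss could be assembled, assuming a model that maps token ids to per-position logits. The two-pass structure, the single random block, and the names `mask_token_id`, `k`, and `eta` are illustrative assumptions rather than the paper's exact recipe; in particular, the sketch omits the SBD-specific attention mask described below.

```python
import torch
import torch.nn.functional as F

def sbd_loss(model, input_ids, mask_token_id, k=16, eta=0.5):
    """Hybrid NTP + masked-token-prediction loss (illustrative sketch).

    Assumes `model(ids)` returns logits of shape (B, L, V). The two-pass
    structure and hyperparameter names are assumptions for illustration.
    Requires L > k + 1.
    """
    B, L = input_ids.shape

    # Next-token-prediction term on the clean sequence.
    logits = model(input_ids)                                  # (B, L, V)
    ntp_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )

    # Masked-token-prediction term: mask a random subset of a k-token
    # future block with probability eta and recover the masked tokens.
    masked_ids = input_ids.clone()
    start = torch.randint(1, L - k, (1,)).item()
    block = slice(start, start + k)
    mask = torch.rand(B, k, device=input_ids.device) < eta     # (B, k)
    masked_ids[:, block][mask] = mask_token_id

    logits_m = model(masked_ids)                               # (B, L, V)
    if mask.any():
        matp_loss = F.cross_entropy(
            logits_m[:, block][mask],                          # (n_masked, V)
            input_ids[:, block][mask],                         # (n_masked,)
        )
    else:
        matp_loss = logits_m.sum() * 0.0                       # no masked tokens drawn

    return ntp_loss + matp_loss
```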

Inference Procedure

During inference, SBD employs advanced masked parallel decoding algorithms, such as the Entropy Bounded (EB) Sampler. At each step, the sampler selects a subset of masked tokens with low mutual information (approximated via the entropy of their marginals) and decodes them in parallel. The process iterates until all tokens in the block are revealed. A single hyperparameter $\gamma$ governs the speedup-accuracy tradeoff, allowing fine-grained control over parallelism.
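
As a rough illustration of this kind of decoding step, the sketch below greedily accepts masked positions in order of increasing entropy until a cumulative entropy budget $\gamma$ is exhausted. The cumulative-entropy acceptance rule, the function name, and the return convention are assumptions made for illustration, not a verbatim reimplementation of the EB Sampler.

```python
import torch

@torch.no_grad()
def eb_sampler_step(logits, is_masked, gamma=1.0):
    """One entropy-bounded parallel decoding step over a block (sketch).

    logits    : (k, V) logits for the k positions of the current block.
    is_masked : (k,) boolean flags for positions not yet revealed
                (the caller ensures at least one position is still masked).
    gamma     : entropy budget controlling the speedup-accuracy tradeoff.
    Returns the block positions decoded this step and their sampled tokens.
    """
    probs = torch.softmax(logits, dim=-1)                               # (k, V)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(-1)      # (k,)
    # Already-revealed positions are pushed to the end of the ordering.
    entropy = torch.where(is_masked, entropy, torch.full_like(entropy, float("inf")))

    # Visit masked positions from most to least confident (lowest entropy
    # first) and accept them while the running entropy stays within budget.
    order = torch.argsort(entropy)
    accepted, total = [], 0.0
    for idx in order.tolist():
        if not is_masked[idx]:
            break
        total += entropy[idx].item()
        if accepted and total > gamma:
            break
        accepted.append(idx)          # always accept at least one position

    accepted = torch.tensor(accepted, dtype=torch.long)
    tokens = torch.multinomial(probs[accepted], num_samples=1).squeeze(-1)
    return accepted, tokens
```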

Implementation Details

  • Attention Masking: Past tokens use causal attention; future block tokens use bidirectional attention within the block (a small mask-construction sketch follows this list).
  • KV-Caching: SBD maintains compatibility with exact KV-caching, ensuring efficient memory usage and enabling practical deployment on existing hardware.
  • Fine-tuning: SBD can be rapidly fine-tuned from any NTP model, requiring only a change in the training objective and attention mask logic.
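
The following is a minimal sketch of that mask layout, assuming queries and keys are ordered as [prefix tokens | block tokens] and that entries marked True may attend. It is one way to realize the description above, not the paper's exact implementation (which can be expressed with efficient attention kernels such as FlexAttention).

```python
import torch

def sbd_attention_mask(prefix_len, block_size):
    """Boolean attention mask for one SBD decoding step (illustrative).

    Rows are queries, columns are keys, over [prefix | block]. Prefix
    tokens attend causally; block tokens attend to the full prefix and
    bidirectionally to every token inside the block.
    """
    n = prefix_len + block_size
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Causal attention among prefix tokens (diagonal included).
    mask[:prefix_len, :prefix_len] = torch.tril(
        torch.ones(prefix_len, prefix_len, dtype=torch.bool)
    )

    # Block tokens see the whole prefix ...
    mask[prefix_len:, :prefix_len] = True
    # ... and attend bidirectionally within the block.
    mask[prefix_len:, prefix_len:] = True
    return mask

# Example: 4 prefix tokens followed by a block of 3 future positions.
print(sbd_attention_mask(4, 3).int())
```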

Experimental Results

SBD was evaluated by fine-tuning Llama-3.1 8B and Qwen-3 8B models on reasoning, coding, and mathematics benchmarks. The models were trained on 70B tokens with a 32k context length, using a mix of reasoning and instruction data. Key findings include:

  • Speedup: SBD achieves a 3-5x reduction in the number of forward passes required for generation, as measured by the number of model forwards per generated token.
  • Accuracy Preservation: For low $\gamma$ values, SBD matches the performance of NTP baselines across all benchmarks. Higher $\gamma$ values yield greater speedups with only minor accuracy degradation.
  • Wall-Clock Efficiency: Theoretical roofline analysis on H100 GPUs demonstrates that the reduction in model forwards translates nearly directly to wall-clock speedup, with minimal overhead for block sizes up to 16 (see the back-of-the-envelope calculation after this list).
  • Ablations: Removing the NTP loss term during SBD training leads to significant drops in autoregressive performance, confirming the necessity of the hybrid loss. SBD requires more fine-tuning steps to reach parity with NTP, but the gap closes after sufficient training.
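
A hedged back-of-the-envelope version of that roofline argument (the structure below is an assumption consistent with the summary, not a calculation reproduced from the paper): in the memory-bound decoding regime, the latency of one forward pass is dominated by streaming the model weights and KV cache from memory, which is nearly independent of whether the pass scores one token or a small block, so

$t_{\text{fwd}} \approx \dfrac{\text{bytes moved (weights + KV cache)}}{\text{memory bandwidth}}, \qquad T_{\text{NTP}} \approx N\, t_{\text{fwd}}, \qquad T_{\text{SBD}} \approx \dfrac{N}{r}\, t_{\text{fwd}}$

and the wall-clock speedup $T_{\text{NTP}} / T_{\text{SBD}} \approx r$ tracks the 3-5x reduction $r$ in forward passes, provided per-forward time stays roughly flat for block sizes up to 16.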

Trade-offs and Implementation Considerations

  • Block Size: Larger block sizes increase parallelism but may incur diminishing returns due to increased computational cost per forward pass. Empirically, block sizes up to 16 are optimal for current hardware.
  • Sampler Choice: The EB Sampler with the entropy proxy outperforms alternative methods such as Factor parallel decoding, especially in high-confidence regimes.
  • Training Budget: SBD fine-tuning requires more iterations to match NTP performance, but the overall resource cost is offset by inference speedups.
  • Hardware Compatibility: SBD is designed for compatibility with existing GPU architectures and leverages efficient attention implementations (e.g., FlexAttention).

Theoretical and Practical Implications

SBD demonstrates that integrating masked token prediction into standard autoregressive models enables substantial acceleration of LLM inference without architectural changes or performance loss. This approach generalizes previous blockwise and speculative decoding methods, offering greater flexibility and control over the speedup-accuracy tradeoff. The method is particularly well-suited for deployment in latency-sensitive applications and large-scale serving environments.

Theoretically, SBD bridges the gap between autoregressive and masked modeling paradigms, enabling the use of advanced solvers from the discrete diffusion literature. Practically, it provides a drop-in solution for accelerating existing models via fine-tuning, with minimal engineering overhead.

Future Directions

Potential avenues for further research include:

  • Scaling to Larger Models: Investigating SBD's scaling properties on models beyond 8B parameters.
  • Hardware-Aware Optimization: Developing custom kernels and inference pipelines to maximize wall-clock speedup in production environments.
  • Advanced Samplers: Exploring more sophisticated parallel decoding algorithms to further improve efficiency and accuracy.
  • Hybrid Modeling: Extending SBD to support additional modeling paradigms, such as any-order autoregressive or esoteric LLMs.

Conclusion

Set Block Decoding offers a principled and practical solution for accelerating LLM inference. By unifying NTP and MATP within a single architecture and leveraging flexible parallel decoding strategies, SBD achieves significant reductions in computational cost while maintaining model performance. The method is readily applicable to existing models and hardware, and its theoretical foundations suggest broad applicability to future advances in efficient language modeling.
