- The paper introduces a novel slot-level plan-and-infill paradigm that combines diffusion-based planning with autoregressive infilling, reducing complexity and enhancing coherence.
- The methodology restructures masked diffusion models to enable full key-value cache reuse with causal attention, leading to up to 18x throughput improvements.
- Empirical results demonstrate that ReFusion outperforms traditional MDMs and ARMs on multiple benchmarks, offering superior accuracy and efficiency in text generation.
ReFusion: A Diffusion LLM with Parallel Autoregressive Decoding
Introduction
ReFusion advances LLM inference by fusing the strengths of masked diffusion models (MDMs) and autoregressive models (ARMs) through a novel slot-level parallel-autoregressive decoding architecture. Existing MDMs offer flexible, parallel token generation but suffer from intractable training, owing to the exponential space of token combinations, and from architectural incompatibility with Key-Value (KV) cache reuse, resulting in low throughput and semantic incoherence. Conversely, ARMs yield coherent outputs and efficient inference via KV caching but are fundamentally constrained to left-to-right, sequential decoding, which bottlenecks throughput.
ReFusion addresses these core limitations by (a) elevating parallel decoding from the token to the slot (multi-token segment) level, mitigating conditional-independence violations; (b) restructuring MDMs to enable complete KV cache reuse via causal attention and careful slot reordering; and (c) using a hybrid training objective to learn both slot-level planning and local, autoregressive infilling.
Methodology: The Slot-Level Plan-and-Infill Paradigm
The fundamental innovation in ReFusion is its two-stage "plan-and-infill" decoding, which combines global diffusion-based planning with local autoregressive slot infilling:
- Slot Partitioning and Masking: The response sequence is partitioned into fixed-length, consecutive slots. During generation, a subset of slots remains masked.
- Step I – Diffusion-Based Planning: At each decoding iteration, the model identifies masked slots that are weakly interdependent (i.e., whose semantic dependencies on other masked slots are minimal) using certainty scores, defined as the maximum marginal probability of each slot's leading token. Slots whose scores exceed a given threshold are selected for parallel decoding.
- Step II – Autoregressive Slot Infilling: The selected slots are filled by the model's autoregressive head using speculative decoding: draft tokens are proposed and verified, and for each slot the longest prefix whose token probabilities surpass a verification threshold is accepted.
Generation proceeds iteratively: the sequence is updated by inserting completed slots before the remaining masked ones in the input buffer. This slot reordering guarantees that all preceding (filled) tokens are visible under causal attention, enabling seamless and complete KV cache reuse.
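A minimal sketch of this loop in Python, with stand-in planner and infiller routines; `plan_scores`, `infill_slot`, and `PLAN_THRESHOLD` are illustrative names and values, not the paper's API:

```python
import random

SLOT_LEN = 4          # tokens per slot (illustrative)
PLAN_THRESHOLD = 0.9  # certainty threshold for selecting slots in parallel

def plan_scores(filled, masked_ids):
    """Stand-in planner: certainty of each masked slot, i.e., the max
    marginal probability of its leading token. A real model would
    condition on the filled prefix."""
    return {sid: random.uniform(0.5, 1.0) for sid in masked_ids}

def infill_slot(filled, slot_id):
    """Stand-in for autoregressive (speculative) infilling of one slot."""
    return [f"tok{slot_id}_{i}" for i in range(SLOT_LEN)]

def plan_and_infill(num_slots):
    filled, masked = [], list(range(num_slots))
    while masked:
        # Step I: plan -- select weakly interdependent slots by certainty.
        scores = plan_scores(filled, masked)
        chosen = [s for s in masked if scores[s] >= PLAN_THRESHOLD]
        if not chosen:  # guarantee progress: take the most certain slot
            chosen = [max(masked, key=scores.get)]
        # Step II: infill the chosen slots, then reorder so completed
        # slots precede the remaining masked ones; causal attention then
        # sees every filled token, and its KV cache entries persist.
        for s in chosen:
            filled.extend(infill_slot(filled, s))
            masked.remove(s)
    return filled

print(plan_and_infill(num_slots=6))
```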
Architectural and Training Innovations
A key insight motivating ReFusion's slot design is the observed locality of inter-token dependency: adjacent tokens exhibit strong conditional dependence that decays rapidly with distance. Serializing token generation within each slot substantially reduces the incoherent predictions that stem from over-parallelization, while inter-slot parallelism is maintained. This yields two synergistic benefits:
- Drastic Reduction in Learning Complexity: The model's combinatorial generation space shrinks from the exponential token-level masking space (2^L patterns for a length-L response) to a tractable slot-level space (K!·2^K orderings for K slots); see the arithmetic sketch after this list.
- Full KV Cache Reuse: Causal attention and slot reordering together allow for universal KV caching, dramatically increasing inference throughput.
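For intuition about the first benefit, a back-of-the-envelope comparison with illustrative numbers (a 256-token response split into 8 slots; these values are hypothetical, not taken from the paper):

```python
from math import factorial, log10

L = 256  # response length in tokens (illustrative)
K = 8    # number of slots, each of length L // K = 32

token_space = 2 ** L                 # token-level masking patterns: 2^L
slot_space = factorial(K) * 2 ** K   # slot-level orderings: K! * 2^K

print(f"token-level: ~10^{log10(token_space):.0f} configurations")
print(f"slot-level : {slot_space:,} configurations (~10^{log10(slot_space):.0f})")
```

For these values the token-level space has roughly 10^77 configurations, while the slot-level space has about 10^7, a reduction of seventy orders of magnitude.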
ReFusion's training procedure randomly masks slots, permutes the unmasked slots, and reorders the input sequence accordingly. The hybrid loss combines a standard autoregressive next-token loss on clean slots with a denoising loss on masked slots, providing full token-level supervision and teaching both planning and infilling behaviors.
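A hedged PyTorch sketch of this hybrid objective, assuming per-token logits over the reordered sequence and a boolean mask marking tokens that fall in masked (noised) slots; the AR target shift and any loss weighting are simplified here and not taken from the paper:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(logits, targets, in_masked_slot):
    """logits: (B, T, V); targets: (B, T) token ids;
    in_masked_slot: (B, T) bool, True for tokens inside masked slots.
    Clean-slot tokens receive a standard AR next-token loss and
    masked-slot tokens a denoising loss; both reduce to cross-entropy
    here (the usual one-position shift for AR targets is elided)."""
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none")  # (B, T)
    ar_loss = per_token[~in_masked_slot].mean()
    denoise_loss = per_token[in_masked_slot].mean()
    return ar_loss + denoise_loss

# Toy example: batch 2, length 8, vocab 16; the second half is masked.
logits = torch.randn(2, 8, 16, requires_grad=True)
targets = torch.randint(0, 16, (2, 8))
mask = torch.zeros(2, 8, dtype=torch.bool)
mask[:, 4:] = True
hybrid_loss(logits, targets, mask).backward()
```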
Empirical Results
ReFusion's performance is evaluated on seven established benchmarks covering general language understanding (MMLU-Pro, ARC-C), mathematical and scientific reasoning (GSM8K, MATH, GPQA), and code generation (HumanEval, MBPP). Key results include:
- Surpassing Prior MDMs: ReFusion outperforms best-in-class MDMs such as LLaDA and Dream by an average of 34% in accuracy/pass@1 and delivers up to 18x higher throughput.
- Challenging and Exceeding ARMs: On certain benchmarks (e.g., GSM8K, MBPP), ReFusion exceeds the strong ARM baseline Qwen3-8B by up to 3.68 absolute points while delivering, on average, 2.33x higher throughput.
- Robustness to Hyperparameters: Performance and throughput are robust across broad ranges of slot selection thresholds, acceptance thresholds, and slot sizes, providing a substantial “sweet spot” for optimized operation.
- Controlled Comparisons: When initialized identically and trained on the same data and compute budget, ReFusion's gains remain substantial, demonstrating that the architectural advances, not data or initialization advantages, drive performance.
- KV Cache Ablation: Directly concatenating the KV caches of slots generated in parallel (sketched after this list) preserves accuracy and improves speed, whereas recomputing KV states yields no substantive accuracy benefit while incurring significant cost.
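A sketch of the ablated mechanism, with illustrative tensor shapes and layout: per-slot key/value tensors produced in one parallel step are concatenated along the sequence axis in the reordered slot order, rather than recomputed over the merged sequence.

```python
import torch

# Per-slot key/value caches from one parallel decoding step,
# each of shape (batch, heads, slot_len, head_dim).
B, H, S, D = 1, 8, 4, 64
slot_kv = [(torch.randn(B, H, S, D), torch.randn(B, H, S, D))
           for _ in range(3)]  # three slots decoded in parallel

# Direct concatenation along the sequence axis, in reordered slot order,
# instead of recomputing KV states over the merged sequence.
keys = torch.cat([k for k, _ in slot_kv], dim=2)    # (1, 8, 12, 64)
values = torch.cat([v for _, v in slot_kv], dim=2)  # (1, 8, 12, 64)
print(keys.shape, values.shape)
```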
Comparative Analysis with Contemporary Approaches
While existing MDM accelerations (e.g., dLLM-Cache, Fast-dLLM, D2F) introduce approximations or rigidify generation order to exploit KV caching, they compromise global generation flexibility or still suffer from incoherence. By combining causal slot-level parallelism with intra-slot serial dependency, ReFusion is unique in offering flexible ordering, high coherence, and maximal inference efficiency within a unified causal framework.
Additionally, speculative, verification-based decoding in ReFusion generalizes both the draft-and-verify strategies of speculative ARM decoding and the confidence-based slot selection of MDMs within a single architecture.
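For intuition, the longest-valid-prefix acceptance rule described in the methodology can be written in a few lines; the threshold value and the source of the verification probabilities are illustrative assumptions:

```python
def accept_prefix(draft_tokens, verify_probs, tau=0.8):
    """Accept the longest draft prefix whose per-token verification
    probability stays at or above tau; the remainder of the slot is
    decoded again in the next pass."""
    accepted = []
    for tok, p in zip(draft_tokens, verify_probs):
        if p < tau:
            break
        accepted.append(tok)
    return accepted

print(accept_prefix(["a", "b", "c", "d"], [0.95, 0.9, 0.6, 0.99]))
# -> ['a', 'b'] (the third token fails verification)
```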
Limitations and Future Directions
ReFusion enforces immutability for generated slots: once infilled, they are final, which precludes iterative refinement at the sub-slot level. Future work should explore sub-slot re-masking and dynamic slot resizing for fine-grained error correction. Scaling ReFusion to larger models and datasets also remains promising, as evidenced by improved throughput and performance with increased data. Finally, applying reinforcement learning to optimize planning and slot-selection policies may further enhance reasoning and structured generation.
Conclusion
ReFusion establishes a new operational frontier for MDMs, overcoming longstanding obstacles in masked parallel generation through slot-level, hybrid diffusion–autoregressive decoding. It delivers state-of-the-art MDM performance, narrows, and in several cases eliminates, the ARM–MDM efficiency/quality trade-off, and demonstrates practical viability for rapid, coherent text generation at scale. These contributions provide a robust methodological foundation for subsequent research in efficient, high-quality LLM inference.