- The paper introduces the Next Sequence Prediction (NSP) paradigm, unifying autoregressive and diffusion approaches for dynamic, efficient sequence generation.
- It leverages parallel block training and structured attention masks to retrofit pretrained models for faster inference with controlled speed-accuracy trade-offs.
- Experimental results show that SDLM achieves competitive performance with reduced training costs and strong scalability, validated across diverse benchmarks.
Sequential Diffusion LLMs: Bridging Autoregressive and Diffusion Paradigms
Introduction
The paper introduces Sequential Diffusion LLMs (SDLM), a novel framework that unifies autoregressive and diffusion-based language modeling via the Next Sequence Prediction (NSP) paradigm. SDLM is designed to overcome the limitations of both autoregressive LLMs (ALMs) and diffusion LLMs (DLMs), specifically addressing the inefficiencies of strictly sequential decoding and the rigidity of fixed-length parallel generation. The approach enables dynamic, confidence-driven generation of contiguous token subsequences, retrofits existing ALMs with minimal cost, and preserves compatibility with key-value (KV) caching for efficient inference.
Figure 1: Comparison of decoding paradigms: ALMs (token-by-token), DLMs (fixed block), SDLM (dynamic subsequence), and the speed-accuracy trade-off on MATH-500.
Methodology
Next Sequence Prediction (NSP)
NSP generalizes next-token and next-block prediction by allowing the model to adaptively determine the length of the generated sequence at each decoding step. Formally, NSP defines an autoregressive probability distribution over variable-length sequences, where the model predicts a contiguous subsequence of tokens conditioned on the prefix and a masked block. When the prediction length is set to one, NSP reduces to standard next-token prediction, ensuring seamless integration with pretrained ALMs.
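As a rough sketch (the notation below is ours, not the paper's), the NSP objective can be read as an autoregressive factorization over variable-length chunks, where s_k is the start index and l_k the model-chosen length of the k-th subsequence:

```latex
% Assumed notation: K chunks with start indices s_k and lengths l_k.
p_\theta(x_{1:N}) = \prod_{k=1}^{K}
  p_\theta\!\left(x_{s_k : s_k + l_k - 1} \mid x_{< s_k}\right),
\qquad s_1 = 1, \quad s_{k+1} = s_k + l_k .
```

Setting every l_k = 1 recovers the standard next-token factorization, which is what lets pretrained ALMs be reused directly.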
SDLM Architecture and Training
SDLM leverages parallel block training with a structured attention mask. Each block attends bidirectionally within itself and causally to its prefix, enabling efficient parallelization and dynamic sequence generation. The training objective minimizes the cross-entropy over all masked tokens in the block, with shifted prediction to exploit the strong first-token prediction capability of ALMs.
Figure 2: Structured attention mask for parallel block training and sampling, illustrating causal prefix, cross-block visibility, and intra-block bidirectional attention.
Parallel training is achieved by interleaving noise blocks and target blocks, with custom attention masks enforcing causality and local bidirectionality. This design allows SDLM to be fine-tuned from existing ALMs using standard instruction datasets, drastically reducing training cost and compute requirements.
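A minimal sketch of such a mask, assuming a single prefix followed by equally sized blocks (the function and tensor layout are illustrative, not the paper's implementation):

```python
import torch

def block_attention_mask(prefix_len: int, num_blocks: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend): causal over the prefix and
    across blocks, bidirectional within each block."""
    total = prefix_len + num_blocks * block_size
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))  # standard causal base
    for b in range(num_blocks):
        start = prefix_len + b * block_size
        # Open up full bidirectional attention inside the block.
        mask[start:start + block_size, start:start + block_size] = True
    return mask
```

For example, `block_attention_mask(4, 2, 3)` yields a 10×10 mask in which positions 4–6 and 7–9 each attend bidirectionally among themselves while attending only causally to earlier positions.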
Inference: Longest Prefix Decoding
During inference, SDLM employs Longest Prefix Decoding, dynamically selecting the number of tokens to output per step based on model confidence. Two confidence metrics are explored: logit-based (softmax probability) and entropy-normalized. The decoding length γ_τ(Z_i) is determined by a threshold τ on cumulative confidence, balancing speed and accuracy. Additionally, self-speculative decoding verifies the correctness of parallel predictions via consistency checks, further enhancing reliability at the cost of additional forward passes.
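One plausible reading of the threshold rule, as a hedged sketch (the cumulative-product criterion and the helper below are assumptions, not the paper's exact procedure):

```python
def longest_confident_prefix(confidences: list[float], tau: float) -> int:
    """Number of leading tokens to accept from a predicted block.

    `confidences` holds a per-token score (e.g. top-1 softmax probability or an
    entropy-normalized score) for each position. Accept the longest prefix whose
    cumulative confidence stays above `tau`; always emit at least one token so
    decoding makes progress.
    """
    accepted, cumulative = 1, 1.0
    for i, c in enumerate(confidences, start=1):
        cumulative *= c
        if cumulative < tau:
            break
        accepted = i
    return accepted
```

Raising τ trades throughput for accuracy: fewer tokens are committed per step, but each committed token carries higher confidence.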
Experimental Results
SDLM is evaluated on a comprehensive suite of benchmarks spanning general, mathematical, and coding tasks. Models are fine-tuned from Qwen-2.5 bases (3B and 32B) on 3.5M samples and compared against ALMs (Qwen-2.5-SFT) and DLMs (Dream-7B, LLaDA-8B).
Key findings include:
- Short-answer benchmarks (MMLU, Winogrande, HellaSwag, ARC) confirm that SDLM retains the semantic and reasoning capabilities of the base ALMs, with performance within 1 percentage point of the autoregressive baselines.
Ablation Studies
Block Size and Attention Mask
Increasing block size (D) enhances throughput with only marginal performance loss. Bidirectional attention within blocks improves parallel generation efficiency compared to causal masking, as evidenced by higher average tokens per step.
Figure 4: Ablation on attention mask type and prediction shift strategy, quantifying the impact on model performance and acceleration ratio.
Shifted prediction is critical: omitting it leads to significant performance degradation and increased output repetition, underscoring the importance of leveraging the ALM's strong first-token prediction capability.
Self-Speculative Decoding
Self-speculative decoding further increases accepted tokens per step (up to 5.4 for D=8) with comparable accuracy, at the expense of additional validation overhead. This method is particularly effective in scenarios demanding high responsiveness.
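A hypothetical sketch of one draft-then-verify step (`predict_block` and `rescore` are placeholder interfaces chosen for illustration, not SDLM's actual API):

```python
def self_speculative_step(model, prefix_ids, block_size):
    """Draft a block of tokens in parallel, then verify it with one extra
    forward pass and keep only the longest self-consistent prefix."""
    draft = model.predict_block(prefix_ids, block_size)   # parallel draft from the masked block
    rescored = model.rescore(prefix_ids, draft)           # extra pass conditioned on the draft
    accepted = 0
    for drafted, checked in zip(draft, rescored):
        if drafted != checked:                            # consistency check fails: stop here
            break
        accepted += 1
    return draft[: max(accepted, 1)]                      # always commit at least one token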
Adaptive Decoding Visualization
SDLM's dynamic decoding adapts sequence length to local context, emitting longer subsequences in fluent regions and shorter ones in uncertain or branching contexts.
Figure 5: Visualization of the sampling process, with each blue block representing a subsequence generated in a single decoding step.
Theoretical and Practical Implications
SDLM demonstrates that diffusion-based generation can be effectively retrofitted onto existing ALMs, preserving KV-cache compatibility and enabling dynamic, parallel decoding. The NSP paradigm unifies autoregressive and blockwise diffusion approaches, offering a flexible framework for efficient sequence generation. The ability to control the speed-accuracy trade-off via confidence thresholds or speculative verification is particularly valuable for deployment in latency-sensitive applications.
The results suggest strong scalability: SDLM-32B matches SFT-tuned ALMs with only 3.5M training samples, indicating that the approach is viable for large-scale models and longer training regimes. The minimal architectural changes required to extend the prediction horizon (simply appending mask tokens) further facilitate practical adoption.
Future Directions
Potential avenues for future research include:
- Extending SDLM to multimodal and multilingual settings, leveraging its flexible decoding paradigm.
- Investigating adaptive block sizing strategies conditioned on input semantics or uncertainty.
- Integrating advanced speculative decoding mechanisms to further reduce latency.
- Exploring the impact of NSP-based training on model calibration and uncertainty estimation.
Conclusion
Sequential Diffusion LLMs (SDLM) present a unified framework for efficient, dynamic sequence generation, bridging the gap between autoregressive and diffusion paradigms. By retrofitting pretrained ALMs with parallel block training and adaptive decoding, SDLM achieves competitive performance with substantial inference acceleration. The NSP paradigm offers a principled approach to balancing speed and quality, with strong empirical results and scalability. This work lays the foundation for further exploration of unified sequence generation models in both research and production environments.