Sequential Diffusion Language Model (SDLM)
- The paper introduces Next Sequence Prediction (NSP) to unify token-level and block-level generation, enabling dynamic, high-throughput text synthesis.
- SDLM leverages both autoregressive and diffusion paradigms for adaptive mask-block inference and seamless key-value caching without architectural changes.
- Experimental results show SDLM scales effectively, matching state-of-the-art quality while achieving up to 2.1× throughput improvement in diverse generation tasks.
A Sequential Diffusion LLM (SDLM) is a neural text generation paradigm that merges autoregressive sequence modeling and diffusion-based block-wise denoising. SDLMs introduce core advances—such as Next Sequence Prediction (NSP) and dynamic mask-block inference—that enable flexible, efficient, and high-throughput generation in natural language and code domains, while retaining compatibility with essential deployment features such as key-value caching. By adaptively retrofitting pre-trained autoregressive LLMs, SDLMs achieve strong scalability and can outperform or match state-of-the-art autoregressive baselines in quality and throughput (Liu et al., 28 Sep 2025).
1. Modeling Framework and Theoretical Foundation
Sequential Diffusion LLMs build upon two complementary paradigms: autoregressive LLMs (ALMs) and discrete diffusion models (DLMs). In ALMs, text is generated token by token under a strictly causal attention window, optimizing the next-token prediction objective
$$\mathcal{L}_{\text{NTP}} = -\sum_{t=1}^{T} \log p_\theta\bigl(x_t \mid x_{<t}\bigr),$$
with inference supported by key-value (KV) caches.
In standard DLMs, masked blocks are denoised in parallel by minimizing a block-diffusion loss of the form
$$\mathcal{L}_{\text{BD}} = -\,\mathbb{E}_{\tilde{X}^b}\Bigl[\sum_{i:\ \tilde{x}^b_i=\texttt{[MASK]}} \log p_\theta\bigl(x^b_i \mid \tilde{X}^b, X^{<b}\bigr)\Bigr],$$
but fixed-length outputs and blockwise rigidities preclude flexible sequence generation and KV cache reuse.
SDLMs are defined by the Next Sequence Prediction objective, which unifies token-level and block-level generation. During inference, SDLMs dynamically select the longest high-confidence unmasked prefix from each block (\emph{Longest Prefix Decoding}), thus integrating block-wise diffusion with autoregressive adaptation. The hybrid loss is
$$\mathcal{L}_{\text{NSP}} = -\sum_{b=1}^{B}\,\mathbb{E}_{\tilde{X}^b}\Bigl[\log p_\theta\bigl(X^b \mid \tilde{X}^b, X^{<b}\bigr)\Bigr],$$
where $X^b$ denotes the target block and $\tilde{X}^b$ the masked block input (Liu et al., 28 Sep 2025).
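A minimal sketch of how such a loss can be computed for a single prefix–block pair is given below. It assumes a Hugging Face-style model that returns `.logits` and already applies the SDLM attention mask internally; the helper name `nsp_loss` and the in-place target alignment (each mask position predicts its own clean token) are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def nsp_loss(model, prefix_ids, target_block, mask_id):
    """Illustrative NSP-style training loss for one example.

    prefix_ids:   (prefix_len,) long tensor of clean history tokens.
    target_block: (block_size,) long tensor of clean block tokens.
    The block is replaced by [MASK] tokens, appended to the prefix, and
    cross-entropy is taken over the block positions only.  The alignment
    (mask position i predicts clean token i) is an assumed convention.
    """
    block_size = target_block.size(-1)
    masked_block = torch.full_like(target_block, mask_id)
    inputs = torch.cat([prefix_ids, masked_block], dim=-1).unsqueeze(0)
    logits = model(inputs).logits                    # (1, seq_len, vocab)
    block_logits = logits[0, -block_size:, :]        # predictions for the block
    return F.cross_entropy(block_logits, target_block)
```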
2. Model Design and Attentional Mechanics
SDLMs operate on fixed-size mask blocks of $D$ tokens, but eschew rigid block prediction by dynamically confirming the decoded subsequence length based on prediction confidence. The custom attention mask is pivotal: it retains a strict causal (left-to-right) structure over historical tokens while employing bidirectional attention within each block. This enables rich intra-block context modeling during denoising.
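The sketch below constructs such a mask in plain PyTorch; the function name and the boolean convention (True = may attend) are assumptions for illustration.

```python
import torch

def sdlm_attention_mask(prefix_len: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask for one SDLM step (True = may attend).

    Historical tokens keep strict causal (left-to-right) attention;
    positions inside the appended mask block attend causally to the
    prefix and bidirectionally to each other.
    """
    total = prefix_len + block_size
    # Start from an ordinary causal mask over all positions.
    mask = torch.tril(torch.ones(total, total)).bool()
    # Open up bidirectional attention within the mask block.
    mask[prefix_len:, prefix_len:] = True
    return mask

# Example: 5 prefix tokens followed by a mask block of 4 tokens.
print(sdlm_attention_mask(5, 4).int())
```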
Pre-trained ALMs (e.g., Qwen-2.5, LLaMA) can be retrofitted by fine-tuning with the SDLM objective and attention mask, using as few as 3.5M training samples. No architectural changes are necessary for cache compatibility, minimizing retraining cost and maximizing deployment practicality.
During inference, the model generates a candidate block of $D$ tokens in parallel and then applies the Longest Prefix Decoding rule
$$k^\star = \max\bigl\{\,k \le D \;:\; c_i \ge \tau \ \text{for all } i \le k\,\bigr\},$$
where the decoded output length $k^\star$ is the longest prefix whose per-token confidence scores $c_i$ (softmax- or entropy-normalized) all exceed the threshold $\tau$. This strategy enables throughput scaling and error mitigation (Liu et al., 28 Sep 2025).
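A sketch of this decoding rule follows, assuming softmax confidences and a fixed threshold `tau`; the fallback of always accepting at least one token is an assumption made here so that decoding progresses even when every confidence falls below the threshold.

```python
import torch

def longest_prefix_decode(logits: torch.Tensor, tau: float = 0.9):
    """Accept the longest prefix of a candidate block whose per-token
    softmax confidence stays above the threshold tau.

    logits: (block_size, vocab_size) scores for one candidate block.
    Returns (accepted_token_ids, accepted_length).
    """
    probs = logits.softmax(dim=-1)
    conf, tokens = probs.max(dim=-1)              # per-token confidence c_i
    below = (conf < tau).nonzero(as_tuple=True)[0]
    k = below[0].item() if below.numel() > 0 else conf.numel()
    k = max(k, 1)                                 # assumed: always emit >= 1 token
    return tokens[:k], k

# Toy usage: random logits for a block of 4 tokens over a vocabulary of 10.
toy_logits = torch.randn(4, 10) * 3
ids, k = longest_prefix_decode(toy_logits, tau=0.5)
print(k, ids)
```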
3. Dynamic Decoding and Error Robustness
Unlike block diffusion—where prediction length per iteration is static—SDLMs leverage confidence-adaptive decoding to avoid error propagation in semantically or syntactically uncertain regions. If the model computes low confidence (softmax or entropy) for tokens within a block, it only outputs the longest high-confidence prefix, leaving the remaining positions for further candidate refinement in subsequent passes.
This dynamic prediction horizon imparts robustness, as the model adaptively shrinks the output span in difficult contexts or expands it in more predictable ones. Cascading errors prevalent in traditional long-span generation are mitigated by this selective acceptance mechanism.
4. Scalability and Throughput Performance
SDLMs exhibit pronounced scalability. Empirical results show that SDLM models can match or surpass leading ALMs in performance while achieving up to 2.1× higher throughput (e.g., versus Qwen-2.5), averaging about 2 tokens per forward pass. Scaling SDLM architecture to 32B parameters (SDLM-32B) yields even greater acceleration, with consistent quality retention across diverse tasks such as GSM8K (math), HumanEval (code), and knowledge-centric benchmarks (Liu et al., 28 Sep 2025).
The ability to retrofit existing models ensures that SDLM inherits scaling laws and efficiency benefits established for large ALMs, as attested by marked gains in mathematical reasoning, code synthesis, and unconstrained generation tasks.
5. Model Integration and KV Cache Compatibility
A fundamental limitation of prior diffusion models is incompatibility with KV caches, precluding fast incremental decoding. Through blockwise mask scheduling and retention of causal attention for history, SDLM recovers full KV cache compatibility for both prefill and generation blocks. This enables deployment using common LLM serving stacks and resource-efficient rollout.
Retrofit training with custom mask blocks ensures that cache states remain accessible and correctly updated during both training and inference, preserving the practical advantages of ALMs.
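A conceptual sketch of a blockwise generation loop that reuses the `longest_prefix_decode` helper above is given below. It assumes a Hugging Face-style interface (`model(input_ids, past_key_values=..., use_cache=True)` returning `.logits` and `.past_key_values`) and a cache that is returned functionally rather than mutated in place; with in-place cache implementations, the cache would need to be cropped or forked after the candidate pass. The logit-to-token alignment for the mask block is also an assumption, as noted earlier.

```python
import torch

@torch.no_grad()
def generate_blockwise(model, prompt_ids, mask_id, block_size=4,
                       tau=0.9, max_new_tokens=128):
    """Conceptual blockwise decoding loop with KV-cache reuse.

    prompt_ids: (1, prompt_len) long tensor.  The model is assumed to be
    retrofitted with the SDLM attention mask so the appended mask block
    attends bidirectionally to itself and causally to the cached prefix.
    """
    out = model(prompt_ids, use_cache=True)                 # prefill pass
    cache, generated = out.past_key_values, []
    while len(generated) < max_new_tokens:
        # Propose a candidate block of [MASK] tokens on top of the cache.
        block = torch.full((1, block_size), mask_id, dtype=torch.long)
        cand = model(block, past_key_values=cache, use_cache=True)
        ids, k = longest_prefix_decode(cand.logits[0], tau)
        generated.extend(ids.tolist())
        # Commit only the accepted tokens so the cache stores clean-token
        # states; the candidate pass's mask-token states are discarded.
        out = model(ids.unsqueeze(0), past_key_values=cache, use_cache=True)
        cache = out.past_key_values
    return generated[:max_new_tokens]
```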
6. Mathematical Characterization and Algorithmic Formulation
Key mathematical formulations characterizing SDLM include:
| Objective | Formula | Description |
|---|---|---|
| Next-Token Prediction (ALM) | $\mathcal{L}_{\text{NTP}} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$ | Standard causal LM objective |
| Block Diffusion Loss | $\mathcal{L}_{\text{BD}} = -\,\mathbb{E}_{\tilde{X}^b}\bigl[\sum_{i:\,\tilde{x}^b_i=\texttt{[MASK]}}\log p_\theta(x^b_i \mid \tilde{X}^b, X^{<b})\bigr]$ | Conventional blockwise diffusion loss |
| SDLM Training Objective | $\mathcal{L}_{\text{NSP}} = -\sum_{b=1}^{B}\mathbb{E}_{\tilde{X}^b}\bigl[\log p_\theta(X^b \mid \tilde{X}^b, X^{<b})\bigr]$ | NSP-based hybrid training loss for sequential blocks |
| Dynamic Decoding Rule | $k^\star = \max\{k \le D : c_i \ge \tau,\ \forall\, i \le k\}$ | Adaptive prefix selection for mask blocks |
These formulations allow for scheduling per-block confidence and length, dynamically aligning speed and reliability during generation.
7. Implications, Applications, and Future Directions
SDLM models serve as a hybrid generative framework bridging the parallel efficiency of diffusion models and the robust deployment capabilities of autoregressive transformers. Applications include rapid text completion, scalable code generation, mathematical reasoning, and any domain requiring flexible, adaptive-length sequence modeling.
The approach also lays foundational advances for future directions:
- Extension of dynamic block decoding to multimodal and multitask scenarios.
- Integration of speculative verification and entropy-based adaptation for finer error control.
- Further scaling of model capacity with minimal retraining cost via retrofitting.
- Deployment in settings with stringent throughput and latency requirements (e.g., conversational agents, code assistants, real-time summarization).
By reconciling mask-block generation with autoregressive serving, SDLMs position diffusion-based generation for practical adoption and continued innovation in large-scale language modeling (Liu et al., 28 Sep 2025).