Sequential Diffusion Language Model (SDLM)

Updated 30 September 2025
  • The paper introduces Next Sequence Prediction (NSP) to unify token-level and block-level generation, enabling dynamic, high-throughput text synthesis.
  • SDLM leverages both autoregressive and diffusion paradigms for adaptive mask-block inference and seamless key-value caching without architectural changes.
  • Experimental results show SDLM scales effectively, matching state-of-the-art quality while achieving up to 2.1× throughput improvement in diverse generation tasks.

A Sequential Diffusion LLM (SDLM) is a neural text generation paradigm that merges autoregressive sequence modeling and diffusion-based block-wise denoising. SDLMs introduce core advances—such as Next Sequence Prediction (NSP) and dynamic mask-block inference—that enable flexible, efficient, and high-throughput generation in natural language and code domains, while retaining compatibility with essential deployment features such as key-value caching. By adaptively retrofitting pre-trained autoregressive LLMs, SDLMs achieve strong scalability and can outperform or match state-of-the-art autoregressive baselines in quality and throughput (Liu et al., 28 Sep 2025).

1. Modeling Framework and Theoretical Foundation

Sequential Diffusion LLMs build upon two complementary paradigms: autoregressive LLMs (ALMs) and discrete diffusion models (DLMs). In ALMs, text is generated token by token using a strictly causal attention window, optimizing the objective

$$L_{ALM}(x; \theta) = -\mathbb{E}_x \left[\sum_{i=1}^L \log P_\theta\left(x^i \mid x^{<i}\right)\right]$$

with inference supported by key-value (KV) caches.
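
As a point of reference, here is a minimal PyTorch sketch of this objective; tensor names are illustrative and this is not the paper's code:

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Standard next-token prediction loss.

    logits:  (batch, seq_len, vocab) -- model outputs at each position
    targets: (batch, seq_len)        -- input ids shifted left by one position
    """
    # Cross-entropy over the vocabulary averaged across positions, i.e. a
    # Monte Carlo estimate of -E_x [ sum_i log P_theta(x^i | x^{<i}) ] / L.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```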

In standard DLMs, masked blocks are denoised in parallel by minimizing the block-diffusion loss

$$L_{BD}(X; \theta) = -\sum_{i=1}^{L/D} \mathbb{E}_{t \sim [0,1]}\, \mathbb{E}_q \left[ \frac{\alpha_t'}{1-\alpha_t} \log P_\theta\left(X^i \mid X^{<i}, X_t^i\right) \right]$$

but fixed-length outputs and rigid block boundaries preclude flexible sequence generation and KV-cache reuse.

SDLMs are defined by the Next Sequence Prediction (NSP) objective, which unifies token-level and block-level generation. During inference, SDLMs dynamically select the longest high-confidence unmasked prefix from each block (Longest Prefix Decoding), thus integrating block-wise diffusion with autoregressive adaptation. The hybrid loss is

$$L(X; \theta) = -\mathbb{E}_{X, X_T} \left[ \frac{1}{D} \sum_i \log P_\theta\left(X^i \mid x^{<(i-1)}, X_T^i\right) \right]$$

where $X^i$ denotes the target block and $X_T^i$ the masked block input (Liu et al., 28 Sep 2025).
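
A hedged, per-block sketch of how this objective could be computed in PyTorch; the full masking of each block, the helper name `sdlm_nsp_loss`, and the model interface are simplifying assumptions rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F

def sdlm_nsp_loss(model, x: torch.Tensor, D: int, mask_token_id: int) -> torch.Tensor:
    """Illustrative (inefficient) per-block form of the NSP objective.

    x: (batch, seq_len) clean token ids, seq_len assumed to be a multiple of D.
    Each block X^i is reconstructed from the clean history plus a masked
    copy of the block (here fully masked, a simplifying assumption).
    """
    _, seq_len = x.shape
    losses = []
    for start in range(0, seq_len, D):
        target_block = x[:, start:start + D]                          # X^i
        history = x[:, :start]                                        # clean prefix
        masked_block = torch.full_like(target_block, mask_token_id)   # X_T^i
        logits = model(torch.cat([history, masked_block], dim=1))     # (batch, start + D, vocab)
        block_logits = logits[:, start:start + D]                     # predictions at masked positions
        losses.append(F.cross_entropy(
            block_logits.reshape(-1, block_logits.size(-1)),
            target_block.reshape(-1),
        ))
    # Mean over blocks mirrors the 1/D-normalized sum in the objective.
    return torch.stack(losses).mean()
```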

2. Model Design and Attentional Mechanics

SDLMs operate on fixed-size mask blocks of size $D$, but eschew rigid block prediction by dynamically confirming the decoded subsequence length based on prediction confidence. The custom attention mask is pivotal: it retains strict causal (left-to-right) structure for historical tokens, while employing bidirectional attention within each block. This enables rich intra-block context modeling during denoising.
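
A minimal sketch of such a mask for a single denoising step, assuming the layout [clean history | current mask block]; the boolean convention and helper name are assumptions, not the paper's code:

```python
import torch

def sdlm_attention_mask(history_len: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend).

    History tokens keep strict token-level causality (KV-cache friendly);
    mask-block tokens attend to the full history and bidirectionally to
    every position within their own block.
    """
    total = history_len + block_size
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Strict causal attention among history tokens.
    mask[:history_len, :history_len] = torch.tril(
        torch.ones(history_len, history_len, dtype=torch.bool))
    # Mask-block tokens see the full history ...
    mask[history_len:, :history_len] = True
    # ... and all positions inside their own block (bidirectional).
    mask[history_len:, history_len:] = True
    return mask
```

Keeping the history strictly causal is what preserves compatibility with standard KV caching, while the bidirectional intra-block region gives the denoiser full context over the positions it is filling in.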

Pre-trained ALMs (e.g., Qwen-2.5, LLaMA) can be retrofitted by fine-tuning with the SDLM objective and attention mask, using as few as 3.5M training samples. No architectural changes are necessary for cache compatibility, minimizing retraining cost and maximizing deployment practicality.

During inference, the model generates a candidate block of $D$ tokens in parallel, then applies

$$\hat{X}^i = \mathrm{Decode}\left(Z^i, \gamma_\tau(Z^i)\right)$$

where the decoded output length $\gamma_\tau(Z^i)$ is determined by the maximum prefix length $j$ such that

$$\prod_{k=1}^{j} p(z_i^k) \geq \tau$$

for per-token confidence scores $p(z_i^k)$ (softmax- or entropy-normalized) and threshold $\tau$. This strategy enables throughput scaling and error mitigation (Liu et al., 28 Sep 2025).
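
Under these definitions, the prefix-selection rule can be sketched as follows; softmax confidence is assumed here (the entropy-normalized variant would change only the confidence computation), and the guarantee of at least one emitted token is an assumption to keep decoding moving:

```python
import torch

def longest_confident_prefix(block_logits: torch.Tensor, tau: float):
    """Return greedy tokens and the accepted prefix length gamma_tau for one block.

    block_logits: (D, vocab) logits for the D positions of the current mask block.
    Accepts the largest j whose cumulative product of per-token confidences
    p(z_i^1) * ... * p(z_i^j) stays >= tau; at least one token is always kept.
    """
    probs = block_logits.softmax(dim=-1)
    confidences, tokens = probs.max(dim=-1)          # per-token confidence p(z_i^k)
    cumulative = torch.cumprod(confidences, dim=0)   # prod_{k<=j} p(z_i^k), non-increasing
    accepted = max(int((cumulative >= tau).sum().item()), 1)
    return tokens[:accepted], accepted
```

Because the cumulative product is non-increasing, counting how many prefixes clear the threshold directly yields the maximal $j$.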

3. Dynamic Decoding and Error Robustness

Unlike block diffusion—where prediction length per iteration is static—SDLMs leverage confidence-adaptive decoding to avoid error propagation in semantically or syntactically uncertain regions. If the model computes low confidence (softmax or entropy) for tokens within a block, it only outputs the longest high-confidence prefix, leaving the remaining positions for further candidate refinement in subsequent passes.

This dynamic prediction horizon imparts robustness, as the model adaptively shrinks the output span in difficult contexts or expands it in more predictable ones. Cascading errors prevalent in traditional long-span generation are mitigated by this selective acceptance mechanism.
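
Putting the pieces together, a hypothetical single-sequence generation loop illustrating this selective acceptance; `longest_confident_prefix` refers to the sketch above, and the model call signature is assumed:

```python
import torch

@torch.no_grad()
def sdlm_generate(model, prompt_ids: torch.Tensor, mask_token_id: int,
                  D: int = 4, tau: float = 0.9, max_new_tokens: int = 256) -> torch.Tensor:
    """Iteratively propose D-token mask blocks and keep only confident prefixes."""
    sequence = prompt_ids                            # (1, prompt_len); batch size 1 for brevity
    while sequence.size(1) - prompt_ids.size(1) < max_new_tokens:
        # Append a fresh mask block and score all D positions in one forward pass.
        mask_block = torch.full((1, D), mask_token_id,
                                dtype=sequence.dtype, device=sequence.device)
        logits = model(torch.cat([sequence, mask_block], dim=1))   # (1, len + D, vocab)
        tokens, _ = longest_confident_prefix(logits[0, -D:], tau)
        # Accept only the confident prefix; unconfident positions are re-proposed
        # (possibly over a different span) on the next iteration.
        sequence = torch.cat([sequence, tokens.unsqueeze(0)], dim=1)
    return sequence
```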

4. Scalability and Throughput Performance

SDLMs exhibit pronounced scalability. Empirical results show that SDLM models can match or surpass leading ALMs in performance while achieving up to 2.1× higher throughput (e.g., versus Qwen-2.5), averaging about 2 tokens per forward pass. Scaling SDLM architecture to 32B parameters (SDLM-32B) yields even greater acceleration, with consistent quality retention across diverse tasks such as GSM8K (math), HumanEval (code), and knowledge-centric benchmarks (Liu et al., 28 Sep 2025).

The ability to retrofit existing models ensures that SDLM inherits scaling laws and efficiency benefits established for large ALMs, as attested by marked gains in mathematical reasoning, code synthesis, and unconstrained generation tasks.

5. Model Integration and KV Cache Compatibility

A fundamental limitation of prior diffusion models is incompatibility with KV caches, precluding fast incremental decoding. Through blockwise mask scheduling and retention of causal attention for history, SDLM recovers full KV cache compatibility for both prefill and generation blocks. This enables deployment using common LLM serving stacks and resource-efficient rollout.

Retrofit training with custom mask blocks ensures that cache states remain accessible and correctly updated during both training and inference, preserving the practical advantages of ALMs.
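
To make the cache argument concrete, here is a hedged sketch of cache reuse across blocks; the `past_key_values`/`use_cache` interface mimics common transformer serving conventions, `crop_cache` is a hypothetical helper, and handling of the intra-block attention mask is assumed to live inside `model`:

```python
import torch

def crop_cache(past, length: int):
    """Trim each layer's (key, value) tensors to the first `length` positions
    (assumes a tuple-of-(k, v) layout with shape (batch, heads, seq, dim))."""
    return tuple((k[:, :, :length], v[:, :, :length]) for k, v in past)

@torch.no_grad()
def sdlm_generate_cached(model, prompt_ids, mask_token_id, D=4, tau=0.9, max_new_tokens=256):
    """Cached variant: only accepted history ever stays in the KV cache."""
    sequence = prompt_ids
    past_key_values = None
    new_inputs = prompt_ids                          # first step: prefill the whole prompt
    while sequence.size(1) - prompt_ids.size(1) < max_new_tokens:
        mask_block = torch.full((1, D), mask_token_id,
                                dtype=sequence.dtype, device=sequence.device)
        out = model(torch.cat([new_inputs, mask_block], dim=1),
                    past_key_values=past_key_values, use_cache=True)
        tokens, _ = longest_confident_prefix(out.logits[0, -D:], tau)
        # Discard the cache entries written for the speculative mask block;
        # because history attention is strictly causal, everything before it stays valid.
        past_key_values = crop_cache(out.past_key_values, sequence.size(1))
        sequence = torch.cat([sequence, tokens.unsqueeze(0)], dim=1)
        new_inputs = tokens.unsqueeze(0)             # next step feeds only the accepted tokens
    return sequence
```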

6. Mathematical Characterization and Algorithmic Formulation

Key mathematical formulations characterizing SDLM include:

| Objective | Formula | Description |
| --- | --- | --- |
| Next-Token Prediction (ALM) | $L_{ALM}(x; \theta) = -\mathbb{E}_x \sum_{i=1}^L \log P_\theta(x^i \mid x^{<i})$ | Standard causal LM objective |
| Block Diffusion Loss | $L_{BD}(X; \theta) = -\sum_{i=1}^{L/D} \mathbb{E}_{t \sim [0,1]}\, \mathbb{E}_q \left[ \frac{\alpha_t'}{1-\alpha_t} \log P_\theta(\cdot) \right]$ | Conventional blockwise diffusion loss |
| SDLM Training Objective | $L(X; \theta) = -\mathbb{E}_{X, X_T} \left[ \frac{1}{D} \sum_i \log P_\theta(X^i \mid x^{<(i-1)}, X_T^i) \right]$ | NSP-based hybrid training loss for sequential blocks |
| Dynamic Decoding Rule | $\gamma_\tau(Z^i) = \max \{ j \in [1, D] \mid \prod_{k=1}^j p(z_i^k) \geq \tau \}$ | Adaptive prefix selection for mask blocks |

These formulations allow for scheduling per-block confidence and length, dynamically aligning speed and reliability during generation.

7. Implications, Applications, and Future Directions

SDLM models serve as a hybrid generative framework bridging the parallel efficiency of diffusion models and the robust deployment capabilities of autoregressive transformers. Applications include rapid text completion, scalable code generation, mathematical reasoning, and any domain requiring flexible, adaptive-length sequence modeling.

The approach also lays foundational advances for future directions:

  • Extension of dynamic block decoding to multimodal and multitask scenarios.
  • Integration of speculative verification and entropy-based adaptation for finer error control.
  • Further scaling of model capacity with minimal retraining cost via retrofitting.
  • Deployment in settings with stringent throughput and latency requirements (e.g., conversational agents, code assistants, real-time summarization).

By reconciling mask-block generation with autoregressive serving, SDLMs position diffusion-based generation for practical adoption and continued innovation in large-scale language modeling (Liu et al., 28 Sep 2025).
