SDLM: Sequential Diffusion Language Models
- Sequential Diffusion Language Models (SDLMs) are generative models that integrate iterative diffusion denoising with autoregressive transformers for dynamic, variable-length decoding.
- They employ a hybrid next sequence prediction framework and blockwise diffusion to fuse bidirectional and causal attention, enabling improved generation fidelity and reduced computational costs.
- SDLMs offer enhanced throughput through KV-cache compatibility and error-tolerant, non-Markovian denoising, making them adaptable to retrofitting pre-trained models.
Sequential Diffusion Language Models (SDLMs) constitute a class of generative language models that combine iterative denoising via diffusion processes with the sequential flexibility and hardware efficiency of autoregressive transformers. SDLMs reconcile the parallelism intrinsic to discrete diffusion with the dynamic, adaptive scaling and inference optimizations of causal LLMs. Research in this area investigates Markovian and non-Markovian forward processes, model architectures that unify bidirectional and causal attention, and novel training/inference paradigms to improve generation fidelity, diversity, speed, and controllability.
1. Motivation and Conceptual Foundations
Diffusion LLMs (DLMs) denoise sequences of discrete tokens through iterative refinement, commonly by traversing a sequence of increasingly corrupted (noisy) versions of the data. Standard DLMs apply either parallel denoising at each timestep (as in Masked Discrete Diffusion) or blockwise approaches, both of which introduce key limitations:
- Fixed-length decoding: Classic DLMs generate entire output sequences or blocks of a pre-fixed length, forcing padding/truncation and inhibiting adaptive inference.
- KV-cache incompatibility: Their attention patterns require recomputation of key/value states at every diffusion step, incurring an O(L²T) cost for sequence length L and T diffusion steps.
- Markovian constraints: Most DLMs restrict each step to condition solely on the current state, preventing corrections to earlier errors during denoising.
Block Diffusion partially mitigates these issues by employing localized diffusion within autoregressive blocks, but still demands a fixed block size and from-scratch training, limiting practical adoption. SDLMs aim to unify the best of autoregressive and diffusion-based modeling: supporting both variable blockwise/sequencewise decoding and compatibility with key-value caching, and are retrofittable to pre-trained causal LLMs with minimal architectural changes (Liu et al., 28 Sep 2025).
2. Modeling Methodologies
SDLMs employ a hybrid generative paradigm that flexibly combines sequential next-token prediction and block-level diffusion-based denoising.
Next Sequence Prediction (NSP) Framework
The foundational SDLM design pattern is the Next Sequence Prediction (NSP) objective (Liu et al., 28 Sep 2025). Let $\mathbf{x} = (x_1, \dots, x_L)$ denote the target sequence, which is generated in contiguous blocks of variable and dynamically-determined length according to
$p_\theta(\mathbf{x}) = \prod_{i=1}^{B} p_\theta\left(x_{s_i : s_i + d_i - 1} \mid \mathbf{x}_{<s_i}\right),$
where $s_i$ is the start index of the $i$-th block, $d_i \geq 1$ is its emitted length, and $D$ upper-bounds $d_i$ as the maximum block size. The block emission length $d_i$ is chosen by a function operating on the denoising logits of a masked block of $D$ tokens, e.g., by greedily thresholding per-token confidences or by speculative verification against secondary model passes.
When $d_i = 1$ for every block, this reduces to standard left-to-right autoregressive modeling.
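As a concrete illustration of the emission function, the following is a minimal sketch of greedy confidence thresholding, assuming a single denoising pass has produced per-position logits for a masked block of size $D$; the threshold name `tau` and the longest-confident-prefix rule are illustrative rather than the paper's exact procedure.

```python
import torch

def greedy_block_commit(block_logits: torch.Tensor, tau: float = 0.98) -> int:
    """Choose how many tokens of a denoised block to emit.

    block_logits: (D, V) logits for the D masked positions of one block.
    Returns k >= 1: the length of the longest prefix of the block whose
    per-token confidence (max softmax probability) stays above tau.
    """
    conf = block_logits.softmax(dim=-1).max(dim=-1).values  # (D,) confidences
    k = 1  # always commit at least one token (the autoregressive fallback)
    for c in conf[1:]:
        if c < tau:
            break
        k += 1
    return k
```

With $D = 1$ this always commits exactly one token, recovering ordinary autoregressive decoding.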
Diffusion Blockwise Generation
SDLMs apply full masking (extreme discrete diffusion) within each block: the next $D$ tokens are masked, and the model is tasked with recovering the clean tokens using (i) bidirectional attention within the block, and (ii) causal attention to the prefix. This yields decoding flexibility (enabling adaptive emission of variable-length subsequences) while preserving compatibility with KV-caching:
- At each generation step, the model only needs to compute KVs for the new block, sidestepping redundant recomputation for the prefix.
- Decoding continues blockwise, dynamically determining the number of tokens to commit based on output confidence.
This algorithm can retrofit arbitrary pretrained transformers: at $D = 1$ the training loss coincides with next-token autoregression. Fine-tuning for a single epoch on a few million examples is sufficient for adaptation (Liu et al., 28 Sep 2025).
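Putting the pieces together, here is a hedged sketch of the blockwise decoding loop, assuming `model` is any callable that takes the committed prefix (whose key/value states it caches internally) plus a block of $D$ mask placeholders and returns $(D, V)$ logits; `greedy_block_commit` is the illustrative helper from the previous sketch, and `eos_id` is an assumed end-of-sequence id.

```python
from typing import Callable, List
import torch

def sdlm_decode(
    model: Callable[[List[int], int], torch.Tensor],  # (prefix ids, D) -> (D, V) logits
    prompt: List[int],
    eos_id: int,
    block_size: int = 4,
    tau: float = 0.98,
    max_len: int = 256,
) -> List[int]:
    """Blockwise decoding: denoise a masked block conditioned on the prefix,
    commit only its confident leading tokens, and repeat. Because tokens are
    only ever appended, the prefix KV-cache never needs to be recomputed."""
    out = list(prompt)
    while len(out) < max_len:
        block_logits = model(out, block_size)       # denoise the next masked block
        k = greedy_block_commit(block_logits, tau)  # variable-length commit
        new_tokens = block_logits[:k].argmax(dim=-1).tolist()
        out.extend(new_tokens)
        if eos_id in new_tokens:
            break
    return out
```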
Representative Algorithms and Architectures
Several major SDLM variants have emerged:
- CaDDi (Causal Discrete Diffusion): Recasts discrete diffusion as a non-Markovian process where the reverse (denoising) model conditions on the entire noisy trajectory, i.e.,
$p_\theta(x_{t-1} \mid x_{t:T}) \quad \text{or, for the } x_0\text{-parameterization,} \quad p_\theta(x_0 \mid x_{t:T}),$
implemented in a causal transformer with a 2D extension of rotary positional embeddings that indexes both token position and diffusion time (Zhang et al., 13 Feb 2025); the trajectory conditioning is sketched after this list.
- SSD-LM / SSD-2: Models token blocks in the logits simplex, using Gaussian forward noising, blockwise cross-entropy training, and bidirectional blockwise attention. SSD-2 fuses large and small diffusion models at inference time via blockwise contrastive blending of logits (Han et al., 2023).
- SFDLM (State-Fourier Diffusion LLM): Eschews transformers altogether in favor of a U-Net-style stack of structured state-space (SSM) modules for local context and complex Fourier MLP modules for global mixing, with all denoising operating via time-and-frequency decomposition (Kiruluta et al., 16 Mar 2025).
These designs variously specialize in retrofitting, architectural innovation, or inference efficiency.
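To make CaDDi's trajectory conditioning concrete, the sketch below assembles the denoiser input by concatenating the noisy views along the sequence axis and pairing every token with a (position, timestep) index for a 2D positional encoding; the layout and helper name are an assumption for illustration, not the reference implementation.

```python
import torch

def build_caddi_context(noisy_views: list) -> tuple:
    """Concatenate the noisy trajectory x_T, ..., x_1 of a length-L sequence
    into one causal-transformer input, returning token ids together with
    (sequence position, diffusion timestep) indices for a 2D rotary encoding.

    noisy_views: list of 1-D LongTensors [x_T, ..., x_1], each of length L.
    """
    T, L = len(noisy_views), noisy_views[0].numel()
    tokens = torch.cat(noisy_views)                     # (T * L,) flattened trajectory
    pos = torch.arange(L).repeat(T)                     # sequence index per token
    step = torch.arange(T, 0, -1).repeat_interleave(L)  # timestep T down to 1 per view
    return tokens, torch.stack([pos, step], dim=-1)     # ids and (T * L, 2) indices
```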
3. Mathematical Formulations
The essential SDLM generative flow consists of a stochastic noising process and a learned reverse model for denoising. The precise instantiation varies as follows:
Diffusion (Noising) Processes
- Non-Markovian Forward (CaDDi): the noisy trajectory is sampled as
$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_0),$
where each marginal $q(x_t \mid x_0)$ is a mixture of absorbing and uniform transitions. Each $x_t$ acts as a complementary noisy view of $x_0$, establishing statistical dependence across the entire trajectory.
- Markovian Forward (SSD-LM): Additive Gaussian noise in the logits simplex, parameterized via a fixed schedule $\bar{\alpha}_t$, where
$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$
- Token Replacement (SFDLM): Each token is independently replaced with a uniformly sampled vocabulary token with probability $\beta_t$ at each step (a vectorized sketch follows below):
$q(x_t^{i} \mid x_{t-1}^{i}) = (1 - \beta_t)\, \mathbb{1}[x_t^{i} = x_{t-1}^{i}] + \beta_t / |V|.$
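As a concrete instance of the last item, one step of the token-replacement forward process can be written as a single vectorized operation; this is a minimal sketch in which `beta_t` comes from whatever noise schedule is in use.

```python
import torch

def uniform_replace_step(x_prev: torch.Tensor, beta_t: float, vocab_size: int) -> torch.Tensor:
    """One forward-noising step: each token is independently swapped for a
    uniformly sampled vocabulary token with probability beta_t, else kept.

    x_prev: LongTensor of token ids (any shape).
    """
    swap = torch.rand(x_prev.shape, device=x_prev.device) < beta_t
    noise = torch.randint(0, vocab_size, x_prev.shape, device=x_prev.device)
    return torch.where(swap, noise, x_prev)
```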
Denoising (Reverse) Models
- CaDDi: Reverse model $p_\theta(x_0 \mid x_{t:T})$, predicting the clean sequence conditioned on the entire noisy future, implemented via a causal transformer with 2D RoPE.
- SSD-LM: Transformer predicting "clean" logits from the current noised input and context; blockwise cross-entropy loss.
- SFDLM: State-space U-Net stack; at each layer, input representations are updated by both local SSM convolution and global Fourier MLP mixing.
Training Objectives
- For blockwise masking (SDLM): a shifted cross-entropy over the masked blocks (a code sketch follows this subsection),
$\mathcal{L}_{\mathrm{NSP}} = -\sum_{i} \sum_{j \in \mathcal{B}_i} \log p_\theta\left(x_j \mid \mathbf{x}_{<s_i},\, \tilde{\mathbf{x}}_{\mathcal{B}_i}\right),$
where $\mathcal{B}_i$ is the $i$-th masked block, $s_i$ its start index, and $\tilde{\mathbf{x}}_{\mathcal{B}_i}$ its mask placeholders.
- In CaDDi, the per-token loss is
$\mathcal{L} = -\mathbb{E}_{t,\, q(x_{1:T} \mid x_0)}\left[\sum_{i} \log p_\theta\left(x_0^{i} \mid x_{t:T},\, x_0^{<i}\right)\right],$
where the conditioning context includes both the future denoising steps $x_{t:T}$ and the known left context $x_0^{<i}$.
These losses are consistent with maximizing ELBOs (for non-Markovian variants) or minimizing cross-entropy with reweighting according to the diffusion schedule.
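A hedged sketch of the blockwise masked cross-entropy referenced above: given full-sequence logits and a boolean mask marking which positions belonged to masked blocks, the loss is ordinary token-level cross-entropy restricted to those positions; the tensor names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def blockwise_masked_ce(logits: torch.Tensor, targets: torch.Tensor,
                        block_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over masked block positions only.

    logits:     (B, L, V) denoising logits for the full training sequence.
    targets:    (B, L)    ground-truth token ids.
    block_mask: (B, L)    True where the input token was masked (inside a block).
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2),  # cross_entropy expects the class dim second
        targets,
        reduction="none",
    )                            # (B, L) per-token losses
    denom = block_mask.float().sum().clamp(min=1.0)
    return (per_token * block_mask.float()).sum() / denom
```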
4. Architectural Innovations
The technical underpinnings of SDLMs span transformer-based and transformer-free designs.
- 2D Positional Encoding: CaDDi introduces a two-dimensional extension of rotary positional embeddings (RoPE), with position-dependent rotations applied independently for sequence index and diffusion timestep. This ensures compatibility and initialization alignment with 1D RoPE in existing LMs, permitting seamless parameter transfer (Zhang et al., 13 Feb 2025).
- Blockwise Attention Patterns: SDLMs construct custom attention masks to enable bidirectional attention within masked blocks and causal attention to prefixes, allowing efficient parallel block training while maintaining KV-cache compatibility (Liu et al., 28 Sep 2025); a minimal mask construction is sketched at the end of this section.
- Structured State Space and Fourier Layers: SFDLM leverages state-space models (SSM) for local, linear-time convolutional updates and frequency-domain Fourier MLPs for capturing long-range dependencies, yielding O(N log N) scaling and strong ablation evidence for the complementary effect of each component (Kiruluta et al., 16 Mar 2025).
- Blockwise Inference-Time Fusion: SSD-2 supports collaborative blockwise inference between large generalist diffusion LMs and small, specialized models by blending their logits with contrastive weighting, enabling effective ensemble synergies that surpass tokenwise AR ensembles (Han et al., 2023).
These architectural adjustments allow SDLMs to inherit the strengths of both fully parallel and left-to-right decoding, while facilitating versatile decoding schedules and inference optimizations.
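A minimal sketch of the blockwise attention pattern mentioned above, assuming a single masked block appended to a committed prefix (the training-time multi-block mask is a straightforward extension): causal attention over the prefix, causal access from the block back to the prefix, and full bidirectional attention inside the block.

```python
import torch

def prefix_block_mask(prefix_len: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend): causal within the prefix,
    causal from the masked block back to the prefix, and fully bidirectional
    among the block positions themselves."""
    n = prefix_len + block_size
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))  # causal baseline
    mask[prefix_len:, prefix_len:] = True                  # bidirectional inside block
    return mask

# Example: 4 committed prefix tokens followed by a masked block of 3.
print(prefix_block_mask(4, 3).int())
```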
5. Training, Adaptation, and Efficiency
A central advantage of SDLMs is their retrofittability:
- Pretraining and Fine-tuning: The SDLM authors demonstrate that block-masking objectives can directly reuse pretrained autoregressive LM weights, requiring only minor changes to the attention mask and loss. A single fine-tuning epoch on 3.5M examples suffices to recover performance competitive with supervised fine-tuning (SFT) baselines.
- Self-conditioning: Feeding the model's past denoising predictions back as additional input for later denoising steps (applied with probability 0.5 during training) accelerates convergence, especially in SSD-LM (Han et al., 2023); a sketch appears at the end of this section.
- Scaling to Large Models: Experiments include SDLMs built atop Qwen-2.5 (3B, 32B) and SSD-LM up to 13B parameters, with efficiency and scaling properties on par with large autoregressive transformers but with enhanced throughput (Liu et al., 28 Sep 2025, Han et al., 2023).
- Semi-speculative Decoding: Both CaDDi and mainline SDLMs utilize speculative and confidence-driven block emission, striking a trade-off between quality and latency. Self-speculative decoding can yield 3.5–6.0 tokens per step depending on block size, with minimal loss in accuracy (Liu et al., 28 Sep 2025, Zhang et al., 13 Feb 2025).
These mechanisms collectively reduce inference calls and support dynamic-length sequence generation, providing 2–3x speedups over AR baselines while preserving generation quality.
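The self-conditioning trick noted above can be sketched as a two-pass training step: with probability 0.5 the model first predicts without self-conditioning, and that detached prediction is fed back as an extra input to the loss-bearing pass; the two-argument `model(x_t, prev_pred)` interface is an assumption for illustration.

```python
from typing import Callable
import torch

def self_conditioned_forward(
    model: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
    x_t: torch.Tensor,
    p_self_cond: float = 0.5,
) -> torch.Tensor:
    """Two-pass self-conditioning: optionally feed the model's own detached
    first-pass prediction back in as an extra input to the loss-bearing pass."""
    prev_pred = torch.zeros_like(x_t)            # "no self-conditioning" placeholder
    if float(torch.rand(())) < p_self_cond:
        with torch.no_grad():
            prev_pred = model(x_t, prev_pred)    # first pass, gradients blocked
    return model(x_t, prev_pred)                 # second pass used for the loss
```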
6. Empirical Results and Benchmarks
Recent SDLM implementations have been evaluated on language, protein, and conditional generation tasks:
| Model | Dataset/Task | PPL | Other Metrics | Notable Results |
|---|---|---|---|---|
| CaDDi (Zhang et al., 13 Feb 2025) | Protein (AcyP), LM1B | ~67.6 (LM1B) | pLDDT, TM, RMSD, H-prob | Outperforms all discrete diffusion baselines in PPL and accuracy. |
| SDLM (Liu et al., 28 Sep 2025) | GSM8K, general SFT benchmarks | — | Throughput, block length, accuracy | 2.1× speedup with <0.1 accuracy drop vs. SFT (Qwen-2.5-3B, D=4, τ=0.98). |
| SSD-2 (Han et al., 2023) | Dolly-15K, Vicuna ICL | — | BERTScore, PPL, human preference | SSD-2 collaboration wins 45.7% vs. OPT's 26.6%. |
| SFDLM (Kiruluta et al., 16 Mar 2025) | PTB, WikiText-103, C4 | ~15 (1.2B, PTB) | Ablation PPL increases | Removing either the SSM or Fourier module degrades PPL by 5–7 points. |
Key findings include:
- CaDDi and SDLM both attain near-autoregressive quality at greatly improved throughput (2.1× for SDLM, halved NFE for CaDDi-Spec).
- Protein generation with CaDDi achieves state-of-the-art metrics in structural and diversity benchmarks.
- SFDLM achieves perplexity rivaling mid-size transformers, with hardware-efficient scaling.
- SSD-2 enables novel, effective inference-time model fusions not possible with classic AR LMs.
7. Theoretical Insights and Future Directions
A defining theoretical property of non-Markovian SDLMs is a strict mutual information gain: conditioning the reverse denoiser on additional future noisy states strictly increases the information available about the clean sequence (equivalently, reduces its conditional entropy), yielding more robust, error-tolerant generation than Markovian approaches (Proposition 3.1; Zhang et al., 13 Feb 2025). By exploiting multiple independent noisy views of the ground-truth sequence, non-Markovian denoisers can systematically correct earlier mistakes.
Research directions suggested in recent literature include:
- Multi-step diffusion within blocks to further smooth confidence transitions and enhance semantic chunking (Liu et al., 28 Sep 2025).
- End-to-end learning of the halting function for adaptive block emission as a differentiable policy.
- Extending SDLMs to multimodal and code-specialized tasks via richer blockwise noising strategies.
- Integration with broader speculative sampling and non-monotonic decoding frameworks.
Empirical results and scaling evidence suggest SDLMs will remain central to the trade-off between controllability, throughput, and generation fidelity in language and sequence modeling.