Next-Token Diffusion Synthesis
- Next-token diffusion synthesis is a generative modeling paradigm that integrates diffusion processes with autoregressive token prediction, enhancing control and scalability.
- It employs both discrete and continuous formulations to support efficient streaming generation and block-wise parallel decoding.
- The approach achieves superior sample quality and faster sampling by uniting denoising benefits with robust context handling.
Next-token diffusion synthesis refers to a generative modeling paradigm that blends diffusion-based generative processes with the autoregressive (AR) next-token prediction structure common to modern LLMs. Rather than denoising an entire sequence simultaneously as in conventional diffusion probabilistic models (DPMs), next-token diffusion predicts or denoises individual tokens or small blocks in an autoregressive or context-causal order. This fusion yields models that combine the controllability and sample quality of diffusion with the scalability, streaming, and context-handling of AR transformers. Next-token diffusion synthesis has been instantiated for discrete and continuous domains, and serves as a bridge to unified multimodal and block-wise parallel generation.
1. Mathematical Foundations and Model Formulations
Next-token diffusion marries per-token diffusion processes—either continuous (e.g., DDPM-style) or discrete (e.g., categorical masking)—with the canonical AR factorization of sequence models.
Discrete Domains:
For discrete tokens (e.g., VQ-VAE codes or text tokens), the forward (noising) process is often defined as a categorical Markov chain $q(x_t \mid x_{t-1}) = \mathrm{Cat}\big(x_t;\ (1-\beta_t)\,x_{t-1} + \beta_t \mathbf{1}/K\big)$, with $\beta_t$ interpolating between identity and uniform noise over the $K$ codebook entries. The reverse model predicts the denoised token, $p_\theta(x_0 \mid x_t, t, \text{context})$, as in RDPM (Wu et al., 2024).
Continuous Domains:
For continuous-valued latents (images, audio, etc.), the forward process follows DDPM noise schedules: $q(z_t \mid z_{t-1}) = \mathcal{N}\big(z_t;\ \sqrt{1-\beta_t}\,z_{t-1},\ \beta_t I\big)$, or equivalently $z_t = \sqrt{\bar\alpha_t}\,z_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ and $\bar\alpha_t = \prod_{s \le t}(1-\beta_s)$, with the reverse model predicting $\epsilon$ (or $z_0$) from the noisy $z_t$ and context (Yang et al., 14 Jul 2025, Sun et al., 2024).
Autoregressive–Diffusion Factorization:
Learning proceeds via next-token cross-entropy or per-token denoising objectives:
- Discrete: $\mathcal{L} = \mathbb{E}_{i,t}\big[-\log p_\theta\big(x_0^{(i)} \mid x_t^{(i)}, x^{(<i)}, t\big)\big]$
- Continuous: $\mathcal{L} = \mathbb{E}_{i,t,\epsilon}\big[\big\|\epsilon - \epsilon_\theta\big(z_t^{(i)}, t, z^{(<i)}\big)\big\|^2\big]$
This formulation recovers standard AR modeling as a limiting case (block size $B = 1$ in (Tian et al., 7 Dec 2025)) and allows seamless adaptation to block-wise or fully parallel diffusion.
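The continuous per-token objective above can be sketched in a few lines of NumPy. This is a minimal illustration only: the linear $\beta_t$ schedule is a common default rather than any cited paper's choice, and `eps_model` is a hypothetical stand-in for a context-conditioned denoiser.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule (illustrative default); alpha_bar_t = prod(1 - beta_s)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def noise_token(z0, t, rng):
    """Forward process: z_t = sqrt(alpha_bar_t) z_0 + sqrt(1 - alpha_bar_t) eps."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return z_t, eps

def per_token_loss(z0_seq, eps_model, rng):
    """Next-token denoising loss: each token i is noised independently and the
    model predicts its noise given the clean prefix z_{<i} (teacher forcing)."""
    losses = []
    for i in range(len(z0_seq)):
        t = int(rng.integers(0, T))
        z_t, eps = noise_token(z0_seq[i], t, rng)
        eps_hat = eps_model(z_t, t, z0_seq[:i])   # conditioned on the prefix
        losses.append(np.mean((eps - eps_hat) ** 2))
    return float(np.mean(losses))

# Toy "model" that ignores its inputs and predicts zero noise (illustration only)
zero_model = lambda z_t, t, prefix: np.zeros_like(z_t)
seq = [rng.standard_normal(8) for _ in range(4)]   # 4 tokens, 8-dim latents
loss = per_token_loss(seq, zero_model, rng)
```

Swapping `zero_model` for a real context-conditioned network recovers the AR-factorized denoising objective from the formulation above.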
2. Architectural Paradigms and Implementation
Discrete Token Models
- RDPM (Wu et al., 2024):
- Encodes images via VQ-VAE (codebook of size $K$) and applies diffusion directly on the discrete code indices.
- Transformer rolls out T recurrent prediction steps, using AdaLN-Zero for timestep conditioning and concatenating class/text conditions.
- Cross-entropy loss is structurally identical to GPT-style AR models.
- D-AR (Gao et al., 29 May 2025):
- Encodes images to 1D discrete sequences via transformer and VQ quantization.
- Diffusion time is mapped to token groups, enabling a coarse-to-fine decoding schedule.
- Standard causal Transformer maximizes the next-token likelihood $\prod_i p_\theta(x_i \mid x_{<i})$; diffusion decoding is performed via groupwise ODE steps.
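The categorical noising that discrete models of this kind rely on can be sketched as follows; the linear interpolation schedule and codebook size here are illustrative assumptions, not the exact choices of RDPM or D-AR.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 16   # codebook size (illustrative)
T = 10   # diffusion steps

def gamma(t):
    """Fraction of probability mass moved to uniform noise at step t (linear schedule)."""
    return t / T

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0): keep x_0 w.p. 1 - gamma(t), else draw a uniform code."""
    if rng.random() < gamma(t):
        return int(rng.integers(0, K))
    return x0

# At t = 0 the token is untouched; at t = T it is (almost surely) uniform noise.
x0 = 3
xs = [q_sample(x0, t, rng) for t in range(T + 1)]
```

The reverse model's job is then a $K$-way classification of the clean code given the noisy one plus context, which is why the training loss is structurally identical to GPT-style cross-entropy.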
Continuous-Token and Latent Models
- LatentLM (Sun et al., 2024):
- VAE encodes images/audio into continuous latents $z_{1:N}$.
- Each hidden state $h_i$ (Transformer output) conditions a diffusion head that predicts the next latent $z_{i+1}$; discrete tokens use a standard LM head.
- Training optimizes a joint loss: cross-entropy for discrete tokens, diffusion loss for continuous.
- AudioNTP/AudioMNTP (Yang et al., 14 Jul 2025):
- Audio is encoded into continuous patches via VAE, becoming a latent sequence $z_{1:N}$.
- For each latent $z_i$, diffusion denoising is conditioned on the preceding latents $z_{<i}$ and text embeddings.
- Masked next-token prediction (MNTP) randomly skips prior tokens, improving long-range dependency modeling.
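The common structural idea in these continuous-token models — a per-position diffusion head conditioned on the backbone's hidden state — can be sketched minimally. Everything here (the linear maps, the sinusoidal timestep embedding, the dimensions) is a toy stand-in, not LatentLM's or AudioNTP's actual head architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D_H, D_Z = 16, 8   # hidden-state and latent dims (illustrative)

# Toy "diffusion head": predicts noise from (noisy latent, timestep, hidden state).
W_z = rng.standard_normal((D_Z, D_Z)) * 0.1
W_h = rng.standard_normal((D_H, D_Z)) * 0.1

def diffusion_head(z_t, t, h):
    """eps_hat = f(z_t, t, h): the head sees the noisy latent for the next
    position plus the transformer output h summarizing the prefix."""
    t_emb = np.sin(t * np.arange(1, D_Z + 1))   # toy sinusoidal timestep embedding
    return z_t @ W_z + h @ W_h + t_emb

h = rng.standard_normal(D_H)     # hidden state from a (hypothetical) AR backbone
z_t = rng.standard_normal(D_Z)   # noisy continuous latent for the next position
eps_hat = diffusion_head(z_t, t=5, h=h)
```

The key design point is that the backbone stays a standard causal LM; only the output head changes between discrete (softmax) and continuous (denoising) positions.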
Sequence and Blockwise Extensions
- Diffusion Forcing (Chen et al., 2024):
- Each token $x_i$ is assigned an independent noise level $t_i$; the model denoises arbitrary per-token noise patterns, supporting variable-length and causal prediction.
- The ELBO objective covers all possible noise assignments over the sequence.
- Causal RNN or masked Transformer architectures are used.
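The per-token noise assignment behind Diffusion Forcing can be sketched directly: each position draws its own level and is noised independently. The schedule and dimensions below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100        # maximum noise level
seq_len = 6

# Independent noise level per token: this covers fully-clean, fully-noised,
# and every mixed pattern in between -- the training distribution underlying
# Diffusion Forcing's ELBO over all noise assignments.
t_per_token = rng.integers(0, T + 1, size=seq_len)

def noise_pattern(z_seq, t_per_token, alpha_bars, rng):
    """Apply the per-token forward process: token i is noised to its own level t_i."""
    out = []
    for z0, t in zip(z_seq, t_per_token):
        eps = rng.standard_normal(z0.shape)
        out.append(np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1 - alpha_bars[t]) * eps)
    return out

alpha_bars = np.cumprod(1 - np.linspace(1e-4, 0.05, T + 1))
z_seq = [rng.standard_normal(4) for _ in range(seq_len)]
noisy = noise_pattern(z_seq, t_per_token, alpha_bars, rng)
```

Keeping future tokens at high noise while denoising the present is what yields the causal, variable-horizon rollouts described above.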
- NBDiff (Tian et al., 7 Dec 2025):
- AR decoding is cast as block-diffusion with block size $B = 1$; context-causal attention masks maintain AR semantics in the prefix and bidirectionality in active (block) regions.
- Parallel training recycles AR loss for stability and efficient adaptation as block size grows.
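The context-causal mask structure — causal across blocks, bidirectional within a block — is simple to construct; a sketch (the helper below is illustrative, not NBDiff's implementation):

```python
import numpy as np

def block_causal_mask(seq_len, block_size):
    """Boolean attention mask: position i may attend to position j iff j's block
    index <= i's block index (causal across blocks, full attention within)."""
    blocks = np.arange(seq_len) // block_size
    return blocks[None, :] <= blocks[:, None]

# block_size = 1 recovers the standard causal (lower-triangular) mask,
# matching the claim that AR decoding is the B = 1 special case.
m1 = block_causal_mask(4, 1)
m2 = block_causal_mask(4, 2)
```

Growing `block_size` widens the bidirectional regions without changing anything in the prefix, which is why the adaptation from a pretrained AR model can proceed gradually.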
3. Inference, Sampling, and Guidance Mechanisms
Next-token diffusion enables efficient and flexible sampling regimes:
- Fast Sampling:
Discrete next-token diffusion models (e.g., RDPM) reduce inference to as few as 10 transformer passes, massively outperforming conventional diffusion models requiring hundreds of denoising steps (Wu et al., 2024).
- Streaming and Online Synthesis:
AR and context-causal diffusion permit left-to-right generation, streamability, and on-the-fly previewing. In D-AR, tokens unlock incremental denoising, enabling preview of partial outputs (Gao et al., 29 May 2025).
- Classifier-Free and Classifier Guidance:
Guidance is applied by mixing conditional and unconditional logits (RDPM), stepwise scaling (D-AR), or guidance gradients as in classifier-free diffusion, even in variable-noise architectures like Diffusion Forcing (Chen et al., 2024).
- Mask-based Parallelism:
Context-causal and block-causal attention masks support scalable adaptation from AR to block-wise bidirectional generation (NBDiff), increasing throughput without sacrificing AR advantages (Tian et al., 7 Dec 2025).
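The logit-mixing form of classifier-free guidance mentioned above can be written in one line; the guidance scale `w` and the toy logits are illustrative.

```python
import numpy as np

def cfg_logits(logits_cond, logits_uncond, w):
    """Classifier-free guidance on categorical logits:
    l = l_uncond + w * (l_cond - l_uncond); w = 1 recovers the conditional model,
    w > 1 extrapolates toward the condition."""
    return logits_uncond + w * (logits_cond - logits_uncond)

lc = np.array([2.0, 0.0, -1.0])   # conditional logits (toy values)
lu = np.array([0.5, 0.5, 0.5])    # unconditional logits (toy values)
guided = cfg_logits(lc, lu, w=3.0)
```

The same mixing applies per denoising step in the continuous case, with predicted noise in place of logits.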
Sampling Pseudocode Example (Wu et al., 2024)
```
# RDPM-style sampling: T recurrent denoising steps over a VQ latent
z_hat = 0                                       # accumulated latent estimate
for t in range(1, T + 1):
    eps_t = normal(0, 1)                        # fresh Gaussian noise
    logits = f_theta(eps_t, y, t, z_hat)        # transformer pass, conditioned on y
    C_hat_t = argmax_k(logits[k] + Gumbel(tau)) # Gumbel-max code sampling
    v_hat_t = codebook[C_hat_t]                 # look up codebook vector
    z_hat = z_hat + v_hat_t                     # accumulate the prediction
x_hat = D(z_hat)                                # decode with the VQ-VAE decoder
return x_hat
```
4. Empirical Performance and Capabilities
Benchmarks demonstrate that next-token diffusion attains or exceeds the fidelity and diversity of both AR and full-sequence diffusion baselines, with substantial sampling speedups.
| Method / Model | ImageNet 256 FID | IS | Steps / Passes | Remarks |
|---|---|---|---|---|
| RDPM, 20-layer | 2.56 | 295.1 | 10 | Discrete VQ latent, competitive w/ DiT |
| D-AR-XL (775M) | 2.09 | 298.4 | 256 | AR tokens, strong streaming/preview |
| LatentLM-L (479M) | 2.24 | 253.8 | 20 (sampling) | Continuous latents, mixed discrete/cont. |
| AudioMNTP-Large | -- | -- | 256 tokens | Outperforms AudioGen on FAD/KL/OVL |
| NBDiff-7B-Base | -- | -- | 4 passes | Macro-avg 64.3% on LLM benchmarks |
- For text, next-token and block-diffusion models (NBDiff) rival or surpass AR and prior DLM approaches in code, math, and general LLM tasks (Tian et al., 7 Dec 2025).
- In audio, MNTP outperforms discrete AR and prior diffusion models, with strong Fréchet Audio Distance (FAD) scores and subjective evaluations (Yang et al., 14 Jul 2025).
- Task-specific innovations, such as masked NTP and seed- or hash-conditioning, further enhance sample diversity, creative capacity, and non-myopic planning (Nagarajan et al., 21 Apr 2025).
5. Applications, Extensions, and Limitations
Unification Across Modalities:
Next-token diffusion provides a pathway to joint modeling of text, images, audio, and video within a single transformer backbone. LatentLM demonstrates simultaneous discrete/continuous AR synthesis, and RDPM posits future extensions to video and joint text-image models (Sun et al., 2024, Wu et al., 2024).
Controllability and Planning:
Diffusion Forcing supports causal guidance, variable-horizon rollouts, and Monte Carlo Tree Guidance for planning and reinforcement learning; uniquely, noisy future tokens preserve uncertainty for policy inference (Chen et al., 2024).
Architectural Scalability:
Next-token diffusion is compatible with efficient caching and streaming (via AR), scales to billion-parameter models, and enables flexible adaptation to block-wise or partially parallel decoding (Tian et al., 7 Dec 2025).
Limitations and Open Challenges:
- Discrete diffusion relies on external tokenizers (VQ-VAEs); end-to-end continuous codebooks remain to be fully developed (Wu et al., 2024).
- Step scheduling, codebook size, and noise parameters introduce novel tuning axes.
- Sampling in continuous diffusion remains relatively slow unless advanced solvers or discretization-aware architectures are leveraged (Chen et al., 2024, Sun et al., 2024).
6. Theoretical and Conceptual Considerations
The unification of diffusion with next-token prediction addresses several limitations inherent in purely AR or traditional diffusion models:
- Myopia of AR:
Standard next-token prediction is locally greedy and fails to plan global latent structures, as documented in algorithmic creativity tasks (Nagarajan et al., 21 Apr 2025). Diffusion's multi-token and non-causal objectives enable modeling of higher-order dependencies.
- ELBO and Likelihood Bounds:
Per-token or per-subsequence diffusion (Diffusion Forcing) optimizes a valid evidence lower bound (ELBO) over all noise patterns, not just fully noised or fully clean sequences (Chen et al., 2024).
- Blockwise Decoding as a Generalization:
AR decoding is shown to be a special case ($B = 1$) of blockwise diffusion, allowing a principled adaptation pathway to increased throughput and bidirectionality (NBDiff) (Tian et al., 7 Dec 2025).
Empirically, next-token diffusion demonstrates superior sample diversity, robustness to compounding errors, and improved creative reasoning compared to conventional AR models, supporting its adoption in both unimodal and multimodal generative systems.