
Next-Token Diffusion

Updated 27 August 2025
  • Next-token diffusion is a generative modeling approach that integrates sequential autoregressive prediction with iterative diffusion denoising to enhance efficiency and control.
  • It employs learned denoising steps on both discrete and continuous tokens, blending local dependencies with global planning to boost creativity and performance.
  • The method supports multimodal applications—including text, image, audio, and motion synthesis—while offering improved parallelism and reduced inference steps over traditional techniques.

Next-token diffusion refers to a class of generative modeling techniques that combine the autoregressive, sequential factorization typical of next-token prediction with diffusion-based denoising mechanisms. These approaches unify or interpolate between token-wise autoregressive (AR) models and parallel, multi-token diffusion models, leveraging the strengths of both paradigms. Next-token diffusion has been developed and analyzed across modalities including text, image, audio, video, motion, and multimodal tasks, often with a focus on scalability, efficiency, controllability, and overcoming known limitations of either AR or pure diffusion models.

1. Conceptual Foundations and Formalism

Next-token diffusion leverages both the chain-rule factorization of next-token prediction

$$p(x) = \prod_{i=1}^{N} p(x_i \mid x_{<i})$$

and the iterative denoising process of diffusion models. The core innovation is to introduce a denoising (diffusion) mechanism into the generation of each token or group of tokens, such that the conditional distribution of the next token (or block of tokens) is defined or estimated via a diffusion process. This allows modeling and prediction of both discrete and continuous-valued tokens and, in some architectures, combines parallel denoising within blocks with sequential AR dependencies across blocks (Huang et al., 20 May 2025).
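Schematically, a block-wise instance of this idea keeps the autoregressive chain rule across blocks while defining each block-conditional through a learned reverse diffusion process; the following formulation is illustrative rather than taken verbatim from any single cited paper:

$$p(x) = \prod_{b=1}^{B} p_\theta(x_b \mid x_{<b}), \qquad p_\theta(x_b \mid x_{<b}) = \int p(z_T) \prod_{t=1}^{T} p_\theta(z_{t-1} \mid z_t, x_{<b})\, \mathrm{d}z_{1:T},$$

with $z_0 = x_b$ and $T$ denoising steps per block.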

In continuous-valued settings such as audio, images, or motion, tokens are represented as continuous vectors. Given a conditioning history and optionally other context, the model predicts each new token by initializing from Gaussian noise and applying a sequence of learned denoising steps. The forward (noising) process can be formalized as

$$z_t = \sqrt{\bar{\alpha}_t}\, z + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,$$

where $z$ is the true token, $t$ indexes the diffusion timestep, and $\epsilon \sim \mathcal{N}(0, I)$. The network is trained to predict either $z$ or $\epsilon$ via a loss such as $\mathbb{E}_{t,\epsilon}\left[\|\epsilon - \epsilon_\theta(z_t, t, h)\|^2\right]$, where $h$ is the Transformer hidden state (Sun et al., 11 Dec 2024, Yang et al., 14 Jul 2025).
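A minimal sketch of this per-token denoising objective, assuming a PyTorch-style setup; the module name `DiffusionHead`, the layer sizes, and the conditioning interface are illustrative assumptions rather than details from the cited papers:

```python
import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    """Illustrative MLP that predicts the noise added to a continuous token,
    conditioned on the Transformer hidden state h and the diffusion timestep t."""
    def __init__(self, d_token: int, d_hidden: int, n_steps: int):
        super().__init__()
        self.t_embed = nn.Embedding(n_steps, d_hidden)
        self.net = nn.Sequential(
            nn.Linear(d_token + 2 * d_hidden, d_hidden),
            nn.SiLU(),
            nn.Linear(d_hidden, d_token),
        )

    def forward(self, z_t, t, h):
        # z_t: noisy token (B, d_token); t: timestep (B,); h: hidden state (B, d_hidden)
        return self.net(torch.cat([z_t, self.t_embed(t), h], dim=-1))

def denoising_loss(head, z, h, alpha_bar):
    """E_{t,eps}[ ||eps - eps_theta(z_t, t, h)||^2 ] for a batch of continuous tokens."""
    t = torch.randint(0, alpha_bar.shape[0], (z.shape[0],), device=z.device)
    eps = torch.randn_like(z)
    a = alpha_bar[t].unsqueeze(-1)                    # \bar{alpha}_t per sample
    z_t = a.sqrt() * z + (1.0 - a).sqrt() * eps       # forward noising of the clean token z
    eps_pred = head(z_t, t, h)                        # denoiser conditioned on the prefix via h
    return (eps - eps_pred).pow(2).mean()
```

At inference, the same head would be applied iteratively, starting from Gaussian noise and denoising conditioned on the hidden state of the already-generated prefix.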

For discrete token scenarios (e.g., text or code), architectures such as RDPM recast the denoising process as recurrent prediction of discrete codes, using cross-entropy loss over codebook entries, aligning with the objective of GPT-style LLMs (Wu et al., 24 Dec 2024).
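As a hedged illustration of this discrete setting, the sketch below recovers clean codebook indices from a corrupted sequence with a plain cross-entropy loss, mirroring the GPT-style objective; the masking-based corruption and the `model` interface are stand-in assumptions, not necessarily the noise process used by RDPM:

```python
import torch
import torch.nn.functional as F

def discrete_denoising_loss(model, codes, n_steps, mask_id):
    """Cross-entropy denoising over a discrete codebook: corrupt some codes,
    then train the model to predict the clean indices (GPT-style objective)."""
    t = torch.randint(1, n_steps + 1, (codes.size(0), 1), device=codes.device)
    corrupt_prob = t.float() / n_steps                       # heavier corruption at larger t
    mask = torch.rand(codes.shape, device=codes.device) < corrupt_prob
    corrupted = codes.masked_fill(mask, mask_id)             # replace corrupted codes with a mask id
    logits = model(corrupted, t.squeeze(1))                  # (B, T, K) logits over the codebook
    return F.cross_entropy(logits.transpose(1, 2), codes)    # standard LLM-style cross-entropy
```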

2. Mechanism and Architectural Variants

Mechanistically, next-token diffusion can be implemented via several architectural strategies:

  • Self-conditioned Embedding Diffusion: Diffusion is applied to fixed (often pretrained) continuous embeddings of tokens. At each reverse step, the network receives the current noised embedding and the previous clean estimate as inputs, progressively reducing noise through self-conditioning. This mechanism enables high flexibility in both unconditional and conditional text generation (Strudel et al., 2022).
  • Token-wise Diffusion in Audio and Motion: For continuous modalities, a lightweight "diffusion head" (MLP) is appended to the Transformer decoder. During training, the model predicts the noise added to the latent token at each step, using a loss function such as mean squared error between the predicted and true noise (Yang et al., 14 Jul 2025, Tanaka et al., 8 Mar 2025). In text-to-motion, both text and motion are embedded into a joint latent space, and a joint loss is optimized for AR and diffusion objectives (Tanaka et al., 8 Mar 2025).
  • Dynamic Block-wise Diffusion: CtrlDiff segments sequences into blocks, applying discrete diffusion within each block in parallel, while maintaining AR dependencies between blocks. Block size is dynamically adjusted using reinforcement learning based on local semantic complexity, balancing quality and efficiency (Huang et al., 20 May 2025).
  • Per-token Independent Noise Schedules (Diffusion Forcing): Rather than adding uniform noise across all tokens, independent noise levels are assigned per token. This enables variable-length, causal sequence generation combined with non-causal diffusion guidance (Chen et al., 1 Jul 2024); a minimal sketch of the per-token noising follows this list.
  • Recurrent Discrete Diffusion: In models such as RDPM, the denoising process is reframed as sequential, discrete token prediction over multiple diffusion steps, bridging the gap between diffusion and GPT-style modeling (Wu et al., 24 Dec 2024).
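Building on the per-token noise schedule item above, here is a minimal sketch of Diffusion-Forcing-style corruption in which every position receives its own independently sampled noise level; shapes and names are illustrative assumptions:

```python
import torch

def noise_per_token(tokens, alpha_bar):
    """Assign each token its own diffusion timestep, rather than one shared level
    for the whole sequence, enabling causal generation with per-token denoising."""
    B, T, D = tokens.shape
    t = torch.randint(0, alpha_bar.shape[0], (B, T), device=tokens.device)  # per-token timestep
    a = alpha_bar[t].unsqueeze(-1)                                          # (B, T, 1)
    eps = torch.randn_like(tokens)
    noisy = a.sqrt() * tokens + (1.0 - a).sqrt() * eps
    # A causal denoiser is then trained to recover eps (or the clean token)
    # given noisy, the per-token timesteps t, and the preceding context.
    return noisy, t, eps
```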

3. Advantages Over Autoregressive and Pure Diffusion Models

Next-token diffusion addresses limitations inherent in purely AR or classical diffusion methods:

  • Parallelism and Efficiency: In contexts where inference latency or hardware throughput is critical, parallel denoising of multiple tokens within a block (or across tokens with independent noise) matches or exceeds the efficiency of AR sampling while retaining the flexibility of variable-length sequence generation (Strudel et al., 2022, Huang et al., 20 May 2025, Chen et al., 1 Jul 2024).
  • Controllability: Classifier-free and classifier-guided diffusion within next-token or block-wise regimes support post-hoc control over attributes (e.g., sentiment, style) via guidance at denoising steps, allowing conditioning without retraining (Huang et al., 20 May 2025); see the guidance sketch after this list.
  • Handling of Continuous Modalities: Autoregressive LLMs are limited by discrete token vocabularies. With next-token diffusion, continuous latent tokens are generated directly, which is especially beneficial for modalities such as audio, motion, or continuous image latents. This also avoids lossy quantization and reduces sequence lengths compared to VQ approaches (Sun et al., 11 Dec 2024, Yang et al., 14 Jul 2025).
  • Creative Generation and Global Planning: Classical next-token predictors are known to be "short-sighted" and prone to local optimization, leading to excessive memorization and limited algorithmic creativity. Multi-token denoising (global or within blocks) enables the model to commit to latent global plans—supported empirically by higher creativity and novelty metrics, especially when input-layer randomness (e.g., seed- or hash-conditioning) is used instead of solely output-layer temperature sampling (Nagarajan et al., 21 Apr 2025, Chen et al., 1 Jul 2024).
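As an illustration of guidance-based control at denoising time, the following shows the standard classifier-free guidance combination applied to a noise estimate; the commented call signatures are hypothetical:

```python
def classifier_free_guidance(eps_cond, eps_uncond, guidance_scale):
    """Blend conditional and unconditional noise estimates at a denoising step,
    steering generation toward the conditioning attribute without retraining."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Illustrative use inside a reverse-diffusion loop over a block of tokens
# (the model call signature below is assumed, not from any cited paper):
# eps_c = model(z_t, t, h, cond=attribute)   # conditional estimate
# eps_u = model(z_t, t, h, cond=None)        # unconditional estimate
# eps = classifier_free_guidance(eps_c, eps_u, guidance_scale=3.0)
```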

4. Cross-Modal and Multimodal Integration

Next-token diffusion provides a unifying modeling strategy for multimodal generative tasks:

  • Unified LLM Backbones: Architectures such as MoMug and LatentLM embed both continuous (motion, image, audio) and discrete (text, code) sequences into a common latent space and process them through a single LLM or Transformer backbone with lightweight modifications (e.g., separate heads or LoRA modules). This enables seamless switching between AR text generation and diffusion-based continuous output (Tanaka et al., 8 Mar 2025, Sun et al., 11 Dec 2024); a rough routing sketch follows this list.
  • Tokenization Strategies: Approaches include robust VAE or MoVQGAN-based visual tokenizers (for images, video) or continuous tokenizers with masking and "K-way" encoding for items in recommender systems (Wang et al., 27 Sep 2024, Qu et al., 16 Apr 2025).
  • Generalized Next-Token Diffusion: By embedding all modalities as (discrete or continuous) tokens, architectures such as Emu3 and D-JEPA-T2I show that next-token (AR or diffusion) strategies can scale to multimodal perception and generation, including high-resolution image and video synthesis (Wang et al., 27 Sep 2024, Chen et al., 22 Nov 2024).
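A rough sketch of how one shared backbone might route discrete and continuous tokens to separate output heads, in the spirit of such unified designs; the class, its wiring, and the layer sizes are assumptions for illustration:

```python
import torch.nn as nn

class UnifiedBackbone(nn.Module):
    """One shared Transformer decoder: discrete tokens go through a softmax LM head,
    while continuous tokens yield hidden states for a per-token diffusion head."""
    def __init__(self, d_model, vocab_size, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)    # discrete (text/code) branch

    def forward(self, x, causal_mask=None, discrete=True):
        h = self.backbone(x, mask=causal_mask)           # shared representation of the prefix
        if discrete:
            return self.lm_head(h)                       # logits for next-token cross-entropy
        return h                                         # consumed by a diffusion head (Section 1 sketch)
```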

5. Empirical Findings: Performance, Efficiency, and Trade-offs

Recent work systematically addresses the trade-offs and scaling impacts of next-token diffusion:

  • Comparative Efficiency: In image synthesis, next-token prediction achieves better efficiency (lower FLOPs per inference) and superior prompt following (CLIP scores) at low compute, while diffusion models catch up in image quality (e.g., FID) as compute is scaled up (Kilian et al., 21 May 2024).
  • Scalability: Diffusion strategies that incorporate block-level AR factorization or allow variable horizons (e.g., Diffusion Forcing) enable stable, long-horizon predictions, critical for time-series, video, and planning (Chen et al., 1 Jul 2024).
  • Parameter and Inference Efficiency: Next-token diffusion with continuous latent tokens enables fewer parameters and significantly lower inference steps compared to conventional VQ or full-sequence diffusion (e.g., ∼10× fewer decoding steps in TTS) (Sun et al., 11 Dec 2024, Yang et al., 14 Jul 2025).
  • Diversity and Creativity: Diffusion-enhanced models outperform AR baselines in creative and open-ended algorithmic tasks, as measured by diversity and novelty metrics, particularly when seed-conditioning randomizes generation at the input level (Nagarajan et al., 21 Apr 2025).
  • Control and Conditionality: Classifier-guided control enables efficient post-hoc conditioning within blocks without retraining, broadening practical applications in controllable text generation (Huang et al., 20 May 2025).

6. Theoretical Underpinnings and Limitations

Theoretical analysis of next-token diffusion reveals several foundational aspects:

  • Evidence Lower Bound (ELBO) Optimization: Training objectives can be interpreted as optimizing variational lower bounds (ELBOs) on the joint likelihoods of all subsequences, especially when noise levels are independently sampled per token (Chen et al., 1 Jul 2024).
  • Unified Cross-Entropy Optimization: Discrete diffusion frameworks, such as RDPM, utilize cross-entropy over discrete codebooks, aligning their optimization directly with AR LLMs and enabling coupled multimodal training (Wu et al., 24 Dec 2024).
  • Global vs. Local Conditioning: Multi-token (whole sequence or block) denoising provides the opportunity for implicit global coordination or planning, unlike strictly local AR approaches (Nagarajan et al., 21 Apr 2025).
  • Known Challenges: Despite increased flexibility, challenges include maintaining sample efficiency, efficiently reducing the number of required diffusion steps, end-to-end optimization of embedding spaces, and developing proper metrics for complex conditional tasks (Strudel et al., 2022, Chen et al., 1 Jul 2024).

7. Applications and Future Directions

Next-token diffusion has immediate and prospective applications in:

  • Conditional and Unconditional Text Generation: Models such as Sed and CtrlDiff support both unconditional output and infilling tasks, with flexibility for fine-grained control (Strudel et al., 2022, Huang et al., 20 May 2025).
  • Image, Audio, Video, and Motion Synthesis: Frameworks for high-resolution images (D-JEPA-T2I), continuous audio (AudioMNTP), video synthesis (Diffusion Forcing), and unified text-motion generation (MoMug) demonstrate competitive or state-of-the-art generation quality and efficiency across modalities (Chen et al., 22 Nov 2024, Yang et al., 14 Jul 2025, Chen et al., 1 Jul 2024, Tanaka et al., 8 Mar 2025).
  • Recommendation Systems: Continuous-token diffusion provides a path to more expressive, scalable recommendation pipelines that directly mirror collaborative signals and user reasoning (Qu et al., 16 Apr 2025).
  • Algorithmic Creativity and Open-Ended Generation: Diffusion’s global planning capabilities enable improved performance on tasks requiring true creativity, diversity, and novelty (Nagarajan et al., 21 Apr 2025).
  • Unified Multimodal LLMs and AGI: By enabling a single model to handle both continuous and discrete, visual and symbolic data, next-token diffusion strategies are positioned as foundational for general-purpose multimodal intelligence (Sun et al., 11 Dec 2024, Wang et al., 27 Sep 2024).

Future research directions include reducing diffusion steps for faster sampling (as achieved in the image domain), improving embedding–diffusion integration, scaling to even larger models, further unifying across modalities (e.g., cross-modal conditionality), and developing new standardized benchmarks for multi-block and multi-token conditional generation (Strudel et al., 2022, Chen et al., 1 Jul 2024, Huang et al., 20 May 2025).


Next-token diffusion thus constitutes a comprehensive family of generative modeling techniques that combine the sequential flexibility, parallel efficiency, and global planning advantages of both autoregressive and diffusion-based approaches, with validated strengths across text, vision, audio, motion, and recommendation system domains.