Auto-Regressive Diffusion Model
- Auto-regressive diffusion models are generative frameworks that combine sequential dependency modeling with diffusion-based denoising to incrementally assemble structured data.
- They employ a forward process to corrupt data and a reverse process conditioned on previous tokens, achieving high-quality reconstructions across modalities like language, images, and time series.
- Key implementation aspects include modular architectures, specialized noise schedules, and optimized inference strategies to balance structural coherence and computational cost.
An auto-regressive diffusion model is a generative modeling framework that unifies progressive, denoising-based diffusion processes with (usually causal) sequential dependency modeling. These models operate by introducing a Markov or stochastic process that incrementally transforms data into noise (the forward process), then defining and learning a denoising (reverse) process, where each variable or group in the data is generated conditional on previous variables, in a specified or learned order. This design enables the model to capture long-range dependencies and structure, while also providing the diversity and sample-quality benefits associated with diffusion models. Compared to vanilla diffusion frameworks, the auto-regressive variant is especially well-suited to sequential, structured, or compositional generation tasks in domains such as language, time series, images, video, motion, 3D shapes, graphs, and multi-modal data.
1. Mathematical Foundations of Auto-Regressive Diffusion Models
The defining characteristic of auto-regressive diffusion models (ARDMs) is the joint use of the chain rule (or causal factorization) and diffusion-based noise injection and denoising. Consider data $x$ partitioned into ordered sub-components $x^{(1)}, \dots, x^{(K)}$ (tokens, patches, frames, nodes, etc.). The ARDM factorizes the data distribution as

$$p_\theta(x) = \prod_{k=1}^{K} p_\theta\big(x^{(k)} \mid x^{(<k)}\big),$$

where each term is realized by a local diffusion model. The forward process at each stage corrupts $x^{(k)}$ (often via Gaussian, discrete, or absorbing noise) conditioned on $x^{(<k)}$, while the reverse process learns to denoise $x^{(k)}$ back from the noisy distribution, conditioned on all previous variables:
- Discrete: absorption or masking of variables (e.g., OA-ARDM (Hoogeboom et al., 2021), GraphArm (Kong et al., 2023)).
- Continuous: Gaussian or OU noising (e.g., AR-LDM (Pan et al., 2022), AR motion/video (Shi et al., 2023), TimeDART (Wang et al., 2024), ARMD (Gao et al., 2024), MMAR (Yang et al., 2024)).
This chainwise local diffusion structure allows ARDMs to incrementally assemble the data, filling in one piece at a time, with explicit control over the sequential dependency structure.
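The chainwise assembly above can be sketched as a generation loop in which each component is produced by a local denoiser conditioned on everything generated so far. The `local_diffusion_sample` below is a hypothetical toy stand-in for any per-component reverse process, not a specific model from the literature:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_diffusion_sample(context, dim, num_steps=10):
    """Hypothetical stand-in for a local reverse (denoising) process,
    conditioned on previously generated components."""
    x = rng.standard_normal(dim)  # start from pure noise
    bias = context.mean(axis=0) if len(context) else np.zeros(dim)
    for t in range(num_steps, 0, -1):
        # toy denoiser: pull the sample toward a context-dependent mean
        x = x + (bias - x) / (t + 1)
    return x

def ardm_sample(num_components, dim):
    """Assemble data one component at a time (chain-rule factorization)."""
    components = []
    for k in range(num_components):
        ctx = np.stack(components) if components else np.empty((0, dim))
        components.append(local_diffusion_sample(ctx, dim))
    return np.stack(components)

sample = ardm_sample(num_components=4, dim=8)
print(sample.shape)  # (4, 8)
```

The key structural point is that the $k$-th call receives all $k-1$ previously generated components as conditioning, mirroring the chain-rule factorization.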
2. Core Algorithms: Forward and Reverse Processes
The choice of forward (corruption) and reverse (generation) processes in ARDMs is domain- and datatype-dependent.
- Continuous case (Gaussian/OU based): For each $x^{(k)}$, the forward process follows

$$q\big(x_t^{(k)} \mid x_{t-1}^{(k)}\big) = \mathcal{N}\big(x_t^{(k)};\, \sqrt{1-\beta_t}\, x_{t-1}^{(k)},\, \beta_t I\big)$$

for a noise schedule $\{\beta_t\}_{t=1}^{T}$, as in DDPMs. The reverse process is parameterized as

$$p_\theta\big(x_{t-1}^{(k)} \mid x_t^{(k)}, x^{(<k)}\big) = \mathcal{N}\big(x_{t-1}^{(k)};\, \mu_\theta(x_t^{(k)}, t, x^{(<k)}),\, \sigma_t^2 I\big),$$

where $\mu_\theta$ is predicted by a neural network trained with denoising objectives (typically predicting the noise $\epsilon$ or the velocity $v$).
- Discrete absorbing case: Variables are absorbed (converted to a mask or special token) in a (possibly random or learned) order. The reverse process reconstructs variables using a neural conditional distribution over their possible values, with order-agnostic or data-driven orderings (Hoogeboom et al., 2021, Kong et al., 2023).
- Full factorization: Each new component $x^{(k)}$ is generated by a learned diffusion process

$$p_{\theta_k}\big(x^{(k)} \mid x^{(<k)}\big),$$

where $\theta_k$ may be position-dependent (e.g. in text (Wu et al., 2023)) or shared across positions.
- Auto-regressive conditioning: The denoiser at each step is explicitly conditioned on all available prior variables (and possibly their history or auxiliary context), typically implemented via concat/attention/cross-attention as appropriate for the data modality (Pan et al., 2022, Shi et al., 2023, Huang et al., 30 Apr 2025, Kang et al., 30 May 2025).
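As a concrete illustration of the continuous case, the sketch below implements the closed-form DDPM forward corruption of one component and a single conditioned ancestral reverse step. The network is replaced by a hypothetical `eps_model` callable, and the linear beta schedule is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.02, T)      # illustrative linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward_noise(x0, t):
    """q(x_t | x_0): closed-form Gaussian corruption at step t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    return x_t, eps

def reverse_step(x_t, t, context, eps_model):
    """One ancestral reverse step, conditioned on prior components."""
    eps_hat = eps_model(x_t, t, context)
    mean = (x_t - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

# toy "network": ignores context and predicts zero noise
x0 = np.ones(8)
x_t, eps = forward_noise(x0, t=10)
x_prev = reverse_step(x_t, t=10, context=None,
                      eps_model=lambda x, t, c: np.zeros_like(x))
print(x_prev.shape)  # (8,)
```

In a real ARDM the `context` argument would carry the previously generated components (via concatenation or cross-attention), which is what distinguishes this from an unconditional DDPM step.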
3. Model Architectures and Implementation Aspects
A range of architectures instantiate ARDMs across modalities:
- Vision: U-Nets with cross-attention to prior generated images (Pan et al., 2022, Shen et al., 2024); token-wise DDPMs with masked or learned orderings (Kang et al., 30 May 2025, Gu et al., 19 Nov 2025); 2D latent spaces for video (Weng et al., 2023, Sun et al., 10 Mar 2025).
- Language: Transformer models with per-token, per-position noise schedules; left-to-right or custom orderings (Wu et al., 2023).
- Time-series: Patch-wise diffusion with auto-regressive transformer encoders for global trends, local denoising for fine-scale structure (Wang et al., 2024, Gao et al., 2024).
- Motion/Video: Conditional MLPs or transformer denoisers with AR connection to previous/generated pose/window (Shi et al., 2023, Li et al., 2023, Shi et al., 2024).
- Graphs: GNN ordering network for node absorption, GAT for node/edge denoising (Kong et al., 2023).
- Multi-modal: LLM backbone with per-token/tile lightweight diffusion heads for continuous visual patches (Yang et al., 2024).
Key implementation patterns include:
- Causal/temporal/attention masking for AR conditioning.
- History-aware conditioning modules that fuse language, visual, or other contextual signals (Pan et al., 2022).
- Modular combination of AR diffusion "generation" and GAN or direct-predictive "draft" modules (Li et al., 2023) for speed.
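The causal-masking pattern listed above can be sketched with a standard lower-triangular attention mask, so that each position's denoiser attends only to already-generated components (a minimal single-head sketch, not any specific paper's implementation):

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with AR masking."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)   # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))
out = masked_attention(x, x, x, causal_mask(5))
print(out.shape)  # (5, 16)
```

Position 0 can attend only to itself, so its output equals its own value vector; later positions mix in progressively more context, which is exactly the AR conditioning structure.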
4. Theoretical Properties, Performance, and Error Analysis
AR diffusion models provably offer enhanced capacity to capture conditional dependencies compared to vanilla (synchronous) diffusion models. Theoretical results in (Huang et al., 30 Apr 2025) show that, under mild assumptions (bounded score error, smooth log-density), the AR variant can drive the sequence of conditional KL divergences (across each generated patch/token) to zero, while vanilla diffusion cannot guarantee that each conditional is well-modeled even if the joint is. This directly impacts tasks requiring strong inter-part relationships (consistent motifs, laws of physics, or between-patch/element dependencies).
Practical implications include:
- ARDMs incur higher inference cost ($O(K\,T)$ denoising network calls for $K$ components versus $O(T)$ for a synchronous DDPM), but the increased sample fidelity and structural correctness often justify it (Huang et al., 30 Apr 2025, Kong et al., 2023).
- In settings where the AR dependency aligns with true data structure, AR models achieve higher sample quality, lower loss, or higher scores on dependency metrics (Huang et al., 30 Apr 2025, Pan et al., 2022, Kong et al., 2023).
- For general data with weak or non-aligned dependencies, AR models offer little advantage and may suffer in efficiency (Huang et al., 30 Apr 2025).
- ARDMs can be efficiently parallelized or compressed using distillation (Gu et al., 19 Nov 2025) or budgeted sampling schedules (Hoogeboom et al., 2021).
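A budgeted sampling schedule in the spirit of the parallelization strategies above can be sketched by partitioning the positions into a fixed number of network calls, generating a group of positions jointly at each call. The even split below is an illustrative choice, not the dynamic-programming allocation of the original work:

```python
def budget_schedule(num_positions, num_calls):
    """Partition positions into contiguous groups, one group per network call."""
    base, rem = divmod(num_positions, num_calls)
    sizes = [base + (1 if i < rem else 0) for i in range(num_calls)]
    groups, start = [], 0
    for size in sizes:
        groups.append(list(range(start, start + size)))
        start += size
    return groups

# 10 positions generated with 4 network calls instead of 10
print(budget_schedule(10, 4))  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

Generating a group jointly trades a little conditional fidelity within the group for a proportional reduction in sequential network evaluations.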
5. Applications Across Modalities
Auto-regressive diffusion models have achieved state-of-the-art or highly competitive results in:
- Sequential visual tasks: Visual storytelling/archetypal narratives (Pan et al., 2022), many-to-many image series (Shen et al., 2024), story continuation, novel-view and procedure generation.
- Video synthesis: Frame-by-frame AR video generation leveraging AR diffusion for temporal coherence (Weng et al., 2023, Sun et al., 10 Mar 2025), with explicit mechanisms (e.g., non-decreasing timestep constraints, reference/anchor frames, masking) for drift prevention and long-range fidelity.
- Time series forecasting: Chain-structured AR moving diffusion (Gao et al., 2024), patchwise AR+diffusion representations (Wang et al., 2024).
- Motion and interaction: Real-time online AR diffusion for character control and motion synthesis, including multi-agent/interactive scenarios (Shi et al., 2023, Shi et al., 2024, Li et al., 2023).
- Graph generation: Node-wise absorbing AR diffusion for molecular and generic graphs (Kong et al., 2023), enabling enforcement of structural constraints per addition step.
- 3D generative modeling: Conditional 3D shape generation in latent token space by combining AR factorization with diffusion for smooth structure fidelity (Kang et al., 30 May 2025).
- Text and multi-modal modeling: AR-diffusion for language modeling with per-token variable noise steps (Wu et al., 2023), and joint AR diffusion/LLM backbones for scalable, lossless multi-modal modeling (Yang et al., 2024).
A summary table indicating representative ARDMs and their applications:
| Type/Modality | Representative Work | Characteristic Approach |
|---|---|---|
| Visual Story/Image | AR-LDM (Pan et al., 2022) | AR chain over images, latent-space |
| Video | ART·V (Weng et al., 2023), AR-Diff (Sun et al., 10 Mar 2025) | Framewise AR with diffusion, masking/reference |
| Motion | A-MDM (Shi et al., 2023), AAMDM (Li et al., 2023) | AR pose generation, diffusion polishing |
| Graph | GraphArm (Kong et al., 2023), ARDM (Hoogeboom et al., 2021) | Nodewise AR absorption/denoising |
| Time Series | TimeDART (Wang et al., 2024), ARMD (Gao et al., 2024) | AR patches, sliding/chain diffusion |
| Language | AR-Diffusion (Wu et al., 2023) | Left-to-right AR, per-token noise |
| Multi-modal | MMAR (Yang et al., 2024) | AR LLM backbone + per-token diffusion |
6. Practical Considerations, Extensions, and Empirical Insights
- Inference acceleration: Careful design (e.g., (Gu et al., 19 Nov 2025)) enables 20–30× faster AR diffusion inference via distillation/objective matching; parallel/budgeted strategies reduce network calls (Hoogeboom et al., 2021).
- Adaptability: AR conditioning enables rapid adaptation (e.g., to new characters in story/image tasks (Pan et al., 2022)), and the factorization allows for principled incorporation of constraints (e.g., in graphs (Kong et al., 2023)) or controls (e.g., in motion (Shi et al., 2023, Li et al., 2023)).
- Guidance and control: AR diffusion models facilitate classifier-free guidance by exposing rich conditioning channels for language, vision, or control signals. Reinforcement learning post-training (MARVAL-RL (Gu et al., 19 Nov 2025)) further aligns generation with human-centric or external reward functions, at practical speed.
- Theoretical and empirical tradeoffs: ARDMs increase model capacity and conditional fidelity at the expense of greater inference cost; performance gains are substantial for tasks with strong dependency structure, minimal for iid-like data (Huang et al., 30 Apr 2025).
- Curriculum and schedule design: Position-dependent noise schedules (Wu et al., 2023), learned or data-driven token/node ordering (Kong et al., 2023), and masking-based distillation (Gu et al., 19 Nov 2025) are critical for optimal ARDM performance.
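The position-dependent schedules above can be illustrated with a skewed timestep assignment in which earlier (left) tokens sit at lower noise levels than later ones at any global step, so the left of the sequence denoises ahead of the right. The linear skew is an illustrative choice, not the exact schedule of any cited work:

```python
def token_timesteps(global_step, num_tokens, max_t, skew=2):
    """Assign each token a timestep: left tokens denoise ahead of right ones."""
    return [max(0, min(max_t, global_step + i * skew)) for i in range(num_tokens)]

# at global step 5 with max_t=20: earlier tokens are less noisy
print(token_timesteps(5, 6, 20))  # [5, 7, 9, 11, 13, 15]
```

Clamping to `[0, max_t]` means leading tokens finish (reach timestep 0) while trailing tokens are still at full noise, yielding left-to-right generation from a single shared denoising loop.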
7. Limitations and Current Research Directions
Limitations of auto-regressive diffusion models include:
- Increased computational cost due to the sequential/AR factorization, especially for large $K$ (number of steps/nodes/tokens/frames).
- Dependency on the chosen partitioning/order; misaligned or poor token/patch decompositions can negate ARDM's benefits (Huang et al., 30 Apr 2025).
- Complexity of integrating fast sampling, adaptation, or RL requires additional architectural or optimization developments (Gu et al., 19 Nov 2025).
- For domains with weak causal or structural dependencies, diffusion models without AR structure may be sufficient and more efficient (Huang et al., 30 Apr 2025).
- Ongoing research explores: learnable or adaptive ordering; fully continuous/overlapping partition strategies; combining ARDMs with advanced SDE/ODE solvers; direct integration with large-scale LLMs and multi-modal transformers; theoretical limits of compositional AR diffusion (especially in lossless compression (Hoogeboom et al., 2021)).
Auto-regressive diffusion modeling therefore constitutes a foundational direction in generative modeling—yielding improvements in coherence, consistency, and controllable structure—whenever sequential or conditional dependencies are both present and well-understood. The field remains highly active across domains and architectural scales, with both rigorous theoretical insight and a rapidly expanding empirical toolkit.