Auto-Regressive Diffusion Models (ARDMs)
- ARDMs are generative models that integrate sequential autoregressive decomposition with iterative diffusion processes to capture strong conditional dependencies.
- They employ rigorous mathematical frameworks including categorical predictions, Gaussian noise injection, and ELBO-based training to ensure scalable learning and efficient inference.
- Applications span diverse modalities such as text, imagery, video, and time series, enabling practical solutions in compression, prediction, and controlled generation.
Auto-Regressive Diffusion Models (ARDMs) constitute a unified class of generative models that integrate the sequential decomposition characteristic of auto-regressive models with the flexible iterative refinement paradigm of diffusion processes. In ARDMs, data generation proceeds via a sequence of conditional denoising steps, each of which can depend on an explicit or learned history, enabling the model to capture strong conditional dependencies, temporal evolution, and multimodal structures across a broad spectrum of modalities including text, imagery, time series, 3D objects, and video. ARDMs generalize and encompass prior forms such as order-agnostic autoregressive models and absorbing-state discrete diffusion, exhibiting scalable training and highly adaptable inference regimes suitable for compression, prediction, and controlled generation (Hoogeboom et al., 2021).
1. Mathematical Foundations and Formulation
ARDMs are characterized by a latent process that progressively corrupts data points—whether discrete (by random masking) or continuous (by Gaussian noise injection)—along a predetermined or data-adaptive trajectory, followed by a learned, conditional, and often autoregressive reverse trajectory that reconstructs the high-fidelity output. In their archetypal discrete form, ARDMs model a data vector $x \in \{1,\dots,K\}^D$ by randomly permuting and masking coordinates (absorbing diffusion), then reconstructing one coordinate per step with a categorical prediction conditioned on the current (partially masked) context:

$$p_\theta(x \mid \sigma) = \prod_{t=1}^{D} p_\theta\big(x_{\sigma(t)} \mid x_{\sigma(<t)}\big), \qquad \sigma \sim \mathcal{U}(S_D),$$

where $\sigma$ is a random ordering of the $D$ coordinates and each conditional $p_\theta(x_{\sigma(t)} \mid x_{\sigma(<t)})$ is parameterized by a neural network (Hoogeboom et al., 2021).
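As a concrete illustration, the absorbing corruption for the discrete case can be sketched in a few lines of Python. This is a toy sketch under stated assumptions — the mask token, helper names, and data are illustrative, not the reference implementation:

```python
import random

MASK = None  # absorbing state standing in for a dedicated mask token

def ardm_corrupt(x, t, rng):
    """Absorbing corruption for one discrete-ARDM training step:
    sample a random generation order sigma, keep the first t-1
    coordinates visible, and mask the rest; the network would then
    predict every masked coordinate from the visible context.
    (A toy sketch, not the reference implementation.)"""
    D = len(x)
    sigma = list(range(D))
    rng.shuffle(sigma)
    context = [MASK] * D
    for i in sigma[:t - 1]:
        context[i] = x[i]
    targets = sigma[t - 1:]             # coordinates still to predict
    return context, targets

x = [3, 1, 4, 1, 5, 9]
context, targets = ardm_corrupt(x, t=3, rng=random.Random(0))
```

At step $t$, exactly $D-t+1$ coordinates remain masked, which is what the reweighting factor in the training objective accounts for.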
In continuous-space ARDMs, the forward process typically uses a stagewise Ornstein–Uhlenbeck (OU) or linear Gaussian diffusion, with either patchwise or tokenwise conditional dependency, and the reverse process is parameterized by a score network or denoising function, often leveraging a Transformer or U-Net backbone. For sequential domains (e.g., time series, language, motion, video), the autoregressive structure may be enforced via causal masking, stagewise conditional SDEs, or variable-noise schedules that assign fewer denoising steps to earlier positions, thereby creating a left-to-right generative dependency (Wu et al., 2023, Shen et al., 2024, Sun et al., 10 Mar 2025).
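The variable-noise idea can be illustrated with a minimal, hypothetical linear schedule in which each position lags the previous one by a single step; the lag rule and function name are assumptions for illustration, not the exact schedule of AR-Diffusion or any other cited model:

```python
def token_timestep(step, pos, total_steps):
    """Per-token noise level at global sampling step `step`: later
    positions lag behind earlier ones by one step each, so position 0
    reaches t=0 (fully denoised) first. A minimal linear sketch of a
    variable-noise schedule -- an illustrative assumption, not the
    exact schedule of any cited model."""
    t = total_steps - step + pos
    return max(0, min(total_steps, t))

# rows: global steps 0, 4, 8, 12; columns: token positions 0..3
schedule = [[token_timestep(s, p, total_steps=8) for p in range(4)]
            for s in (0, 4, 8, 12)]
```

Reading the rows, earlier positions always carry less noise than later ones at any intermediate step, which is precisely the left-to-right generative dependency described above.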
2. Training Objectives and Inference Regimes
The canonical ARDM training objective is an evidence lower bound (ELBO) or denoising score-matching loss, which can be efficiently estimated via importance-weighted stochastic approximation:

$$\log p_\theta(x) \;\ge\; \mathbb{E}_{t \sim \mathcal{U}(1,\dots,D)}\, \mathbb{E}_{\sigma \sim \mathcal{U}(S_D)} \left[ \frac{D}{D-t+1} \sum_{k \in \sigma(\ge t)} \log p_\theta\big(x_k \mid x_{\sigma(<t)}\big) \right],$$

or, in continuous cases, via mean squared error between corrupted and reconstructed latent states. Conditional diffusion variants optimize noise-prediction or $x_0$-prediction losses across timesteps and sequence positions, using context-sensitive conditioning (Hoogeboom et al., 2021, Huang et al., 30 Apr 2025, Wu et al., 2023).
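The importance-weighted estimator admits a compact single-sample sketch: since $D-t+1$ coordinates are masked at step $t$, scaling the *mean* masked log-likelihood by $D$ reproduces the reweighting factor. The stand-in `log_prob_fn` replaces the neural categorical predictor:

```python
import math
import random

def ardm_elbo_estimate(x, log_prob_fn, rng):
    """Single-sample estimate of the ARDM ELBO term (Hoogeboom et
    al., 2021): draw a timestep t and ordering sigma, then scale the
    mean log-likelihood of the still-masked coordinates by the
    dimension D.  `log_prob_fn(value, index, context)` is a
    hypothetical stand-in for the neural categorical predictor."""
    D = len(x)
    t = rng.randrange(1, D + 1)
    sigma = list(range(D))
    rng.shuffle(sigma)
    context = {i: x[i] for i in sigma[:t - 1]}   # visible coordinates
    masked = sigma[t - 1:]                       # D - t + 1 coordinates
    # mean over masked coords times D == (D / (D - t + 1)) * sum
    mean_ll = sum(log_prob_fn(x[i], i, context) for i in masked) / len(masked)
    return D * mean_ll

# with a uniform predictor over K classes the estimate is exactly D*log(1/K)
K = 2
est = ardm_elbo_estimate([0, 1, 1, 0], lambda v, i, ctx: math.log(1 / K),
                         random.Random(0))
```

The uniform-predictor check is a convenient sanity test because the estimate then no longer depends on the sampled $t$ or $\sigma$.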
Inference in ARDMs may be fully sequential (greedy AR decoding), blockwise (parallelization over groups of coordinates), or distillation-accelerated, e.g., via the MARVAL framework that collapses inner diffusion chains into one-step AR generation through guided score matching (Gu et al., 19 Nov 2025). The AR structure enables efficient history-conditioned sampling, minimum Bayes-risk decoding for N-best selection, and strategies for low-latency or streaming inference in domains like VSR (Shiu et al., 29 Dec 2025) or video (Sun et al., 10 Mar 2025).
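Blockwise decoding can be illustrated with a toy sampler that fills `block_size` coordinates per model call; `sample_fn` is a hypothetical stand-in for the model's conditional sampler, and the bookkeeping shows why the number of (batched) network calls drops from $D$ to $\lceil D/\text{block\_size}\rceil$:

```python
import random

def blockwise_generate(D, block_size, sample_fn, rng):
    """Blockwise ARDM sampling sketch: coordinates are revealed in a
    random order, but every coordinate in a block is predicted from
    the same frozen context, so a real model needs only one batched
    network call per block instead of one per coordinate.
    `sample_fn` is a hypothetical stand-in for the conditional
    sampler."""
    order = list(range(D))
    rng.shuffle(order)
    x = [None] * D
    network_calls = 0
    for start in range(0, D, block_size):
        block = order[start:start + block_size]
        context = list(x)            # frozen context shared by the block
        for i in block:
            x[i] = sample_fn(i, context)
        network_calls += 1           # one (batched) call per block
    return x, network_calls

x, calls = blockwise_generate(8, 4, lambda i, ctx: i, random.Random(0))
```

The trade-off is that coordinates within a block are conditionally independent given the shared context, which is exactly what the sequential regime avoids.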
3. Conditional Modeling and Contextual Dependencies
A core motivation for ARDMs is their capacity to capture nontrivial conditional dependence structures that vanilla (fully synchronous) diffusion models systematically miss, as rigorously shown in theoretical analyses of conditional KL gaps (Huang et al., 30 Apr 2025). By factorizing the data distribution into sequential conditionals—whether over spatial patches, sequence positions, or temporal steps—ARDMs match the true compositional structure of modalities marked by high-order dependencies, such as language, physical systems, and video. Stagewise ARDMs can model

$$p(x) = \prod_{i=1}^{N} p\big(x^{(i)} \mid x^{(<i)}\big),$$

and the corresponding reverse chains reconstruct each $x^{(i)}$ given the prior context (Huang et al., 30 Apr 2025).
In practice, ARDMs instantiate these dependencies via causal attention masks (e.g., in M2M and TimeDART), history-aware encoders (e.g., CLIP-BLIP in AR-LDM (Pan et al., 2022)), or explicit cross-attention/fusion strategies (e.g., prefix learning in LTM3D (Kang et al., 30 May 2025)). For video, a non-decreasing timestep constraint and temporal causal attention enforce that later frames are denoised only using earlier or concurrent information, enabling asynchronous generation with preserved temporal coherence (Sun et al., 10 Mar 2025).
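The frame-level causal constraint reduces to a lower-triangular attention mask; a minimal sketch (an illustration of the general construction, not any particular model's code):

```python
def temporal_causal_mask(num_frames):
    """Frame-level causal attention mask: entry [q][k] is True when
    query frame q may attend to key frame k, i.e. only earlier or
    concurrent frames -- so later frames are denoised without
    peeking at future content. (Illustrative sketch.)"""
    return [[k <= q for k in range(num_frames)]
            for q in range(num_frames)]

mask = temporal_causal_mask(3)
```

In a Transformer backbone the same Boolean pattern would be applied additively (as $-\infty$ on disallowed entries) before the attention softmax.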
4. Algorithmic Diversity and Application Domains
ARDMs have been adapted to an array of application domains. In motion synthesis and control, models such as AAMDM (Li et al., 2023) and A-MDM (Shi et al., 2023) utilize AR denoising to generate long-horizon motions satisfying physical and contextual constraints, achieving advances in fidelity–efficiency trade-offs through two-stage GAN-diffusion hybrids or interactive RL-based control.
For multi-image and story generation, M2M (Shen et al., 2024) and AR-LDM (Pan et al., 2022) enable coherent multi-modal visual sequence generation, leveraging auto-regressive attention over history and fine-tuned conditioning for novel views or procedure steps. In time series, ARDMs such as TimeDART (Wang et al., 2024) and ARMD (Gao et al., 2024) achieve state-of-the-art performance in representation learning and forecasting by coupling AR transformers with diffusion decoders or ARMA-inspired deterministic devolution networks.
Table: Selected ARDM Application Domains
| Domain | ARDM Reference | Key Mechanism |
|---|---|---|
| Text | (Wu et al., 2023) | Position-dependent denoising |
| Image/Story | (Shen et al., 2024, Pan et al., 2022) | Image-set attention, history fusion |
| Video | (Sun et al., 10 Mar 2025, Weng et al., 2023) | Temporal causal attention, AR denoising |
| Motion | (Li et al., 2023, Shi et al., 2023) | AR framewise denoising, RL control |
| 3D Generation | (Kang et al., 30 May 2025) | AR sequence in token space |
| Time Series | (Wang et al., 2024, Gao et al., 2024) | Patchwise/cumulative AR denoising |
| Data Assimilation | (Srivastava et al., 8 Oct 2025) | ARDMs with control augmentation |
| Compression | (Hoogeboom et al., 2021) | Parallel ARDM coding, upscaling |
5. Theoretical Guarantees and Error Analyses
ARDMs enjoy provable advantages in capturing conditional laws and compositional rules. Theoretical analyses of conditional KL gaps quantify how, for datasets with strong inter-patch dependencies, vanilla DDPMs incur an irreducible conditional KL error that ARDMs avoid whenever the AR factorization is well aligned with the true dependency structure of the data (Huang et al., 30 Apr 2025). Error analyses for AR-video diffusion (Wang et al., 12 Mar 2025) reveal two unavoidable phenomena: error accumulation (growing linearly with the number of AR steps) and a memory bottleneck, the latter provably irreducible for any finite-window model, establishing an information-theoretic Pareto frontier between long-term fidelity and inference efficiency.
Parallel and accelerated ARDMs (such as MARVAL) show that distilling the diffusion process into single-step autoregressive predictors via guided score matching enables more than $20\times$ inference acceleration without compromising image fidelity, expanding the practical scope to RL post-training and controllable generation (Gu et al., 19 Nov 2025).
6. Architectural Innovations and Efficiency Strategies
To address the inference and memory trade-offs inherent to ARDMs, models employ a variety of architectural strategies. These include:
- Hierarchical two-stage generation: e.g., Denoising Diffusion GANs for coarse drafts, followed by AR-diffusion “polishing” for high fidelity (Li et al., 2023).
- Conditional attention and memory fusion: compression modules and attention-based merging of past frames in video and procedural models, mitigating the memory bottleneck (Wang et al., 12 Mar 2025).
- Masked AR and flexible ordering: models such as MAR and its MARVAL distillation, which combine groupwise AR sampling orderings with inner diffusion chains, then compress both into efficient single-pass generative models (Gu et al., 19 Nov 2025).
- Prefix learning and reconstruction guidance: cross-modal embedding alignment (image/text to latent 3D tokens) and early-step sample fusion to reduce uncertainty (Kang et al., 30 May 2025).
- Low-dimensional latent modeling: for complex output spaces, e.g., embedded pose spaces in motion synthesis, which reduces computational burdens and generalizes better (Li et al., 2023).
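The memory-mitigation idea behind several of these strategies can be caricatured as a finite window plus a merge of older context. Here a toy mean-pool stands in for the learned compression module, and the function name is an illustrative assumption rather than any cited model's mechanism:

```python
def compress_history(frames, window, merge):
    """Finite-window context sketch: keep the last `window` frames
    verbatim and collapse everything older into one summary via
    `merge`. A toy illustration of trading long-term memory for
    bounded context -- not the mechanism of any cited model."""
    if len(frames) <= window:
        return frames
    return [merge(frames[:-window])] + frames[-window:]

mean = lambda fs: sum(fs) / len(fs)   # toy merge: mean-pool older frames
history = compress_history([1, 2, 3, 4, 5, 6], window=3, merge=mean)
```

Whatever the merge operator, the summary is lossy, which is the information-theoretic root of the memory bottleneck discussed in Section 5.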
7. Empirical Performance, Limitations, and Outlook
Empirical results across domains consistently demonstrate ARDMs’ superiority in capturing sequence dependencies, long-horizon coherence, and diverse sample quality relative to both synchronous diffusion and non-AR baselines. Examples include substantially improved FID and temporal-consistency metrics in video (Sun et al., 10 Mar 2025), task-aligned MSE/MAE in time series (Gao et al., 2024), and prompt-consistency/diversity in multi-image and story generation (Shen et al., 2024, Pan et al., 2022). ARDMs also enable flexible compression tasks, reaching near state-of-the-art per-image bits-per-dimension with modest computational budgets (Hoogeboom et al., 2021).
Current limitations include nontrivial inference overhead scaling linearly with the AR factorization granularity, a memory bottleneck that constrains effective long-term conditional modeling, and data regime specificity—i.e., AR advantages are most pronounced with clear sequential dependencies or compositional rules (Huang et al., 30 Apr 2025, Wang et al., 12 Mar 2025). Ongoing work focuses on parallel/accelerated sampling, learned context compression, and domain-specialized factorization schemes to further balance scalability and fidelity. The ARDM paradigm offers a principled and extensible framework for generative modeling in any domain where tractable, semantically meaningful decompositions exist.