Autoregressive Diffusion Models
- Autoregressive Diffusion Models (ARDMs) are a unified generative framework that blends sequential, order-sensitive factorization with iterative denoising processes for flexible and parallel data generation.
- They employ a single neural network with order-agnostic masking and dynamic grouping, enabling coarse-to-fine upscaling and efficient lossless compression with fewer computational steps.
- Empirical results demonstrate ARDMs’ superior performance in text, image, and audio modeling, achieving high quality with reduced step counts compared to traditional autoregressive and diffusion models.
Autoregressive Diffusion Models (ARDMs) are a unified class of generative models that blend the sequential, order-sensitive data factorization of autoregressive models with the iterative denoising processes of diffusion models. This synthesis enables ARDMs to support arbitrary generation orders, parallel prediction, coarse-to-fine upscaling, and practical lossless compression, while often achieving greater empirical and computational efficiency than classic autoregressive or diffusion schemes alone (Hoogeboom et al., 2021).
1. Theoretical Foundations and Model Class
ARDMs generalize both order-agnostic autoregressive models (OA-ARMs, e.g., NADE, MADE [Uria et al., 2014]) and absorbing-state discrete diffusion models (e.g., D3PM [Austin et al., 2021]). Formally, for data $x$ with $D$ dimensions, the ARDM lower-bounds the log-likelihood as

$$\log p(x) \;\ge\; D \cdot \mathbb{E}_{t \sim \mathcal{U}(1,\dots,D)}\big[\mathcal{L}_t\big],$$

with per-step objective

$$\mathcal{L}_t \;=\; \mathbb{E}_{\sigma \sim \mathcal{U}(S_D)}\left[\frac{1}{D - t + 1} \sum_{k \in \sigma(\ge t)} \log p\big(x_k \mid x_{\sigma(<t)}\big)\right],$$

where $\sigma$ is a uniformly random permutation of the $D$ data dimensions, $x_{\sigma(<t)}$ denotes the variables already generated before step $t$, and the masked variables $\sigma(\ge t)$ are predicted in parallel at each training step. This generalizes the chain rule of ARMs by permitting learning and sampling in any (randomized) order.
Sampling in an ARDM starts with all variables in a masked (absorbed) state. At each step, following a sampled permutation, a new set of variables is unmasked and predicted, conditioned on all previously assigned variables. The process runs for $D$ steps to generate $D$ variables in the fully sequential case. Parallel prediction of multiple variables is supported via dynamic grouping, an advance over fully sequential autoregressive generation, as sketched below.
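A minimal sketch of this sampling loop, assuming a hypothetical `model(x)` that returns per-position logits over classes (illustrative, not the authors' reference implementation):

```python
import numpy as np

def ardm_sample(model, D, num_classes, mask_token, rng):
    """Order-agnostic ARDM sampling: start fully masked, then reveal
    one variable per step in a random order.

    `model(x)` is assumed to return logits of shape (D, num_classes)
    for every position, conditioned on the currently unmasked context.
    """
    x = np.full(D, mask_token)        # all variables start absorbed/masked
    sigma = rng.permutation(D)        # random generation order
    for t in range(D):
        k = sigma[t]                  # next position to reveal
        logits = model(x)             # predicts all masked positions at once
        z = logits[k] - logits[k].max()           # numerically stable softmax
        probs = np.exp(z) / np.exp(z).sum()
        x[k] = rng.choice(num_classes, p=probs)
    return x
```

With dynamic grouping, the inner loop instead reveals a whole block of positions from the same `logits` call, trading likelihood for fewer network evaluations.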
ARDMs also admit upscaling (depth factorization), where variables are generated in coarse-to-fine or bit-level groups (e.g., most-to-least significant bits for images), further broadening the model space compared to canonical AR or diffusion models.
2. Architectural Characteristics and Implementation
ARDMs employ a single neural network (typically transformer or convolutional), trained to predict masked variables from observed context. Unlike classic ARMs, ARDMs do not require causal masking in self-attention; the masking is instead specified by the order and step index during training.
- Training regime: For each batch, a random order and step are sampled, and the model is trained to predict all unassigned variables given the prefix context (see the sketch after this list).
- Sampling regime: Variables are initialized in a fully masked state and are predicted stepwise or in parallel groups according to the sampled masking pattern.
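Concretely, a single training step might look like the following sketch, where `model`, the tensor shapes, and the `rng` interface are illustrative assumptions rather than the paper's code:

```python
import numpy as np

def ardm_training_loss(model, x, mask_token, rng):
    """One ARDM training step on a single example (sketch).

    Samples a random order sigma and step t, masks the not-yet-generated
    variables, predicts them all in parallel, and reweights so that the
    expectation over (sigma, t) matches the log-likelihood lower bound.
    """
    D = len(x)
    sigma = rng.permutation(D)
    t = int(rng.integers(1, D + 1))     # uniform step in {1, ..., D}
    future = sigma[t - 1:]              # sigma(>= t): still-masked variables
    x_in = x.copy()
    x_in[future] = mask_token           # absorb the future variables
    log_probs = model(x_in)             # (D, num_classes) log-probabilities
    ll = sum(log_probs[k, x[k]] for k in future)
    # 1/(D - t + 1) averages over the masked set; the factor D corrects
    # for sampling t uniformly instead of summing over all steps.
    return -(D / (D - t + 1)) * ll
```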
Parallel generation is facilitated by a dynamic programming algorithm that trades the number of network calls against likelihood degradation. This enables sampling under an adjustable computational budget, in sharp contrast to fully sequential ARMs or the long sampling trajectories of discrete diffusion.
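A direct, unoptimized sketch of such a dynamic program is given below. The cost model (a call at step t that generates k variables costs roughly k · L[t]) follows the paper's analysis, while the function name and the O(budget · D²) formulation are illustrative:

```python
import numpy as np

def parallel_schedule(L, budget):
    """Choose `budget` network calls covering D generation steps so that
    the modeled extra cost is minimized (sketch of the ARDM-style DP).

    L[t]: expected per-variable loss at step t (t = 0..D-1), measured
    once on held-out data. A call made at step t that generates the
    next (s - t) variables is assumed to cost (s - t) * L[t].
    """
    D = len(L)
    dp = np.full((budget + 1, D + 1), np.inf)   # dp[c, s]: best cost to cover s steps with c calls
    dp[0, 0] = 0.0
    parent = np.zeros((budget + 1, D + 1), dtype=int)
    for c in range(1, budget + 1):
        for s in range(1, D + 1):
            for t in range(s):
                cost = dp[c - 1, t] + (s - t) * L[t]
                if cost < dp[c, s]:
                    dp[c, s] = cost
                    parent[c, s] = t
    steps, s = [], D                            # backtrack the call positions
    for c in range(budget, 0, -1):
        s = int(parent[c, s])
        steps.append(s)
    return dp[budget, D], steps[::-1]
```

Shrinking `budget` raises the total modeled cost smoothly, which is the graceful quality/compute trade-off described above.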
Furthermore, ARDMs natively support upscaling: variables can be partitioned into "stages" (e.g., color channels, bit planes), and the generation graph decomposes as a product across upscaling transitions, with each transition defined by a customizable transition matrix.
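As a concrete instance of such a stage decomposition, 8-bit values can be split into bit planes generated most-significant-first (an illustrative factor-2 decomposition; the paper also evaluates factor-4 upscaling):

```python
import numpy as np

def to_bit_planes(x, num_bits=8):
    """Split integer values into bit planes, most significant first;
    each plane can serve as one ARDM upscaling stage."""
    return np.stack([(x >> (num_bits - 1 - i)) & 1 for i in range(num_bits)])

def from_bit_planes(planes):
    """Recombine the stages into the original integer values."""
    num_bits = planes.shape[0]
    return sum(planes[i].astype(np.int64) << (num_bits - 1 - i)
               for i in range(num_bits))

x = np.array([200, 13, 255], dtype=np.uint8)
assert np.array_equal(from_bit_planes(to_bit_planes(x)), x)
```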
3. Empirical Results: Performance and Efficiency
Text Modeling (Text8): On the character-level benchmark, OA-ARDM achieves 1.43 bits per character (bpc) with 250 steps, surpassing D3PM-absorbing (1.47 bpc at 256 steps) at a comparable step budget, and it requires far fewer steps to match D3PM's likelihood.
Image Modeling (CIFAR-10): OA-ARDM reaches 2.69 bits per dimension (bpd) with the full 3072-step factorization, and the staged upscaling variant reaches 2.64 bpd, outperforming alternative discrete-diffusion models and many ARM baselines. Even when sampling with as few as 4 × 50 parallel steps, ARDMs maintain strong performance (2.68 bpd), highlighting their computational efficiency.
Audio Modeling (SC09): The upscale ARDM achieves 6.29 bpd, versus 7.77 bpd for a WaveNet baseline.
ARDMs achieve these results with dramatically fewer steps than absorbing diffusion models. For instance, D3PM typically requires 1000+ steps for comparable likelihood; ARDMs reach strong test log-likelihoods with 250–500 steps and degrade gracefully when sampling with even fewer.
4. Distinctive Properties and Practical Advantages
| Property | ARMs | Discrete Diffusion | ARDMs |
|---|---|---|---|
| Causal masking | Required | Not needed | Not needed |
| Generation order | Fixed | Pre-set (absorbing) | Arbitrary/random |
| Single-step training | No | Often no | Yes |
| Parallel generation | No (fully seq.) | No (per-timestep seq.) | Yes, for arbitrary groupings |
| Lossless per-image compression | Problematic | High overhead | Efficient, minimal calls |
| Upscaling (multi-stage, coarse-fine) | Not native | Not native | Native |
- Simplicity: No need for causal masking or order-specific reparameterization. Order-agnostic masking facilitates implementation for structured data (e.g., images, sequences).
- Efficiency: Single-step-per-example batching and dynamic parallel generation yield major speedups in both training and inference.
- Graceful degradation: Likelihood/sampling quality degrades gradually with parallelization, unlike the severe loss observed in bits-back-based compression or traditional diffusion when sampling with fewer steps.
- Lossless compression: Exact per-sample encoding and decoding is possible with a single order and only modest network calls, avoiding dataset-level interdependence of bits-back schemes and high per-image overhead of alternatives.
- Upscaling and recursion: ARDMs support flexible upscaling, enabling multi-stage generation suitable for coarse-to-fine or bit-planed data representation.
5. Theoretical Equivalences and Extensions
ARDMs subsume both OA-ARMs and discrete absorbing diffusion [(Hoogeboom et al., 2021), Section 3 & Appendix]. An ARDM that unmasks one variable per step is exactly an OA-ARM; conversely, absorbing-state diffusion converges to the ARDM objective as the number of timesteps grows (the continuous-time limit).
The log-likelihood lower bounds are valid variational bounds [Eqs. (1)-(2) in the paper], and optimizing a single random step per batch gives an unbiased estimate of them, preserving scalability even for complex data modalities.
ARDMs' upscaling factorization allows the joint modeling of variables under structurally natural generation regimes (e.g., "from structure to detail" or "bits to pixels"), a capability not available in standard ARMs or diffusion models.
6. Lossless Compression Application
ARDMs are uniquely suited for lossless compression:
- The model provides exact next-symbol probabilities for any chosen order.
- Compression can be performed on a single data point via entropy coding (e.g., rANS), avoiding the need for complex bits-back coding, dataset overhead, or decompression of full corpora.
- In empirical CIFAR-10 experiments, ARDM upscaling models match or outperform previous state-of-the-art neural compressors on per-image bpd, setting a new benchmark for neural lossless compression with practical decoding [(Hoogeboom et al., 2021), Table 3].
| Model | Steps | CIFAR-10 Compression (per image, bpd) |
|---|---|---|
| ARDM-Upscale 4 | 500 | 2.71 |
| OA-ARDM | 500 | 2.73 |
| IDF++ | -- | 3.26 |
| HiLLoC | -- | 4.19 |
| FLIF | -- | 4.19 |
Graceful performance under step reduction makes ARDMs particularly practical for on-demand, per-example, resource-bounded applications (see the sketch below).
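To make the coding cost concrete, the sketch below computes the ideal entropy-coding cost in bpd that a coder such as rANS approaches; `model` and its interface are assumptions for illustration, not a real coder implementation:

```python
import numpy as np

def ideal_code_length_bpd(model, x, mask_token):
    """Ideal entropy-coding cost of losslessly encoding x with an ARDM,
    in bits per dimension. A practical coder such as rANS approaches
    this bound up to a small per-symbol overhead.
    """
    D = len(x)
    x_in = np.full(D, mask_token)
    total_bits = 0.0
    for k in range(D):                  # one fixed order suffices for coding
        log_probs = model(x_in)         # (D, num_classes) log-probs in nats
        total_bits += -log_probs[k, x[k]] / np.log(2.0)
        x_in[k] = x[k]                  # reveal the symbol just encoded
    return total_bits / D
```

With parallel grouping, the same conditional probabilities can be produced in far fewer network calls, which is the "minimal calls" advantage noted in the table above.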
7. Summary and Implications
ARDMs constitute a theoretically principled, empirically strong, and practically flexible class of generative models unifying the benefits of autoregressive and diffusion modeling. Their core advances—random order selection, single-step training, parallelizable sampling, efficient upscaling, and direct compatibility with lossless compression pipelines—address critical limitations of prior approaches in both efficiency and practicality. The ARDM’s framework is compatible with large-scale vision, audio, and sequential data, and is expected to see continued adoption across generative modeling and neural (de)compression research.