Diffusion via Autoregressive Models (D-AR)
- Diffusion via Autoregressive Models (D-AR) is a generative technique that merges sequential autoregressive token prediction with iterative diffusion denoising for efficient synthesis.
- It enables coarse-to-fine streaming generation by mapping token groups to diffusion steps, allowing progressive image and multimodal previews.
- Empirical findings on benchmarks such as class-conditional ImageNet 256×256 show competitive sample fidelity and inference efficiency relative to conventional autoregressive and diffusion baselines.
Diffusion via Autoregressive Models (D-AR) is a class of generative modeling techniques that recast the iterative refinement process characteristic of diffusion models as, or tightly coupled with, an explicit autoregressive (AR) modeling paradigm. D-AR frameworks unify or hybridize the next-token generative strengths of autoregression—sequential dependency modeling, cache-friendly inference, flexible conditioning—with the multi-step denoising, iterative sampling, and continuous or discrete state dynamics of diffusion models. State-of-the-art D-AR models span image, video, graph, audio, and sequence tasks, achieving competitive or superior sample quality and inference efficiency over conventional approaches. This entry systematically details the principles, canonical architectures, mathematical foundations, and empirical properties of Diffusion via Autoregressive Models, with a primary focus on the formulation in "D-AR: Diffusion via Autoregressive Models" (Gao et al., 29 May 2025) and key evolutions in the literature.
1. Formalism and Model Architecture
D-AR models recast the generation of structured data, such as images, into a token-by-token autoregressive process tightly coupled with a diffusion-decoding architecture. Given an input image $x$, a learned tokenizer maps $x$ to a sequence of $N$ discrete latent tokens via a transformer encoder and vector quantization (VQ-VAE):

$$z_{1:N} = \mathcal{Q}\big(\mathrm{Enc}(x;\, q_{1:N})\big)$$

Here, $q_{1:N}$ are learnable queries and $\mathcal{Q}$ is a quantizer with codebook size $V$ and embedding dimension $d$.
The token sequence is autoregressively modeled:

$$p(z_{1:N}) = \prod_{i=1}^{N} p(z_i \mid z_{<i})$$

The tokens are partitioned into $K$ contiguous groups $G_1, \dots, G_K$, each associated with a particular diffusion denoising step $t_j$. Each group conditions a denoising stage in the pixel or latent space, so that token order naturally corresponds to a coarse-to-fine spatial progression. The generative pipeline thus alternates between token generation (AR) and diffusion-step decoding, enabling a streaming, preview-capable generation process (Gao et al., 29 May 2025).
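The group-to-step mapping can be made concrete with a small sketch (assuming, for illustration, $N$ tokens split into $K$ equal contiguous groups; the function name is ours):

```python
def token_group_for_step(i: int, N: int, K: int) -> int:
    """Map the 1-based token index i to the 1-based diffusion step j
    whose group it belongs to, assuming N tokens in K equal groups."""
    assert N % K == 0, "equal-sized groups assumed for this sketch"
    return (i - 1) // (N // K) + 1
```

With $N=256$ and $K=8$, for example, tokens 1-32 condition the first (highest-noise) denoising step, tokens 33-64 the second, and so on.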
2. Tokenization and Diffusion Step Mapping
The D-AR tokenizer is trained via a composite objective incorporating:
- Vector-quantization loss (standard straight-through VQ-VAE objective):
$$\mathcal{L}_{\mathrm{VQ}} = \|\,\mathrm{sg}[e] - z_q\,\|_2^2 + \beta\,\|\,e - \mathrm{sg}[z_q]\,\|_2^2$$
where $e$ is the encoder output, $z_q$ its quantized embedding, and $\mathrm{sg}[\cdot]$ the stop-gradient operator.
- Flow-matching loss on raw pixels (Equation (1) of (Gao et al., 29 May 2025)):
$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,\epsilon}\,\big\|\, v_\theta(x_t, t \mid z_{1:N}) - (x - \epsilon)\,\big\|_2^2$$
where $x_t = (1-t)\,\epsilon + t\,x$ and $\epsilon \sim \mathcal{N}(0, I)$.
- Perceptual loss and representation-alignment loss.
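As an illustration of the flow-matching term, here is a minimal pure-Python sketch for scalar samples, assuming the rectified-flow convention $x_t = (1-t)\epsilon + tx$ with velocity target $x - \epsilon$ (function names are ours):

```python
import random

def flow_matching_loss(xs, v_theta, rng):
    """Monte-Carlo estimate of the flow-matching loss on a batch of
    scalar samples. xs: clean data points; v_theta: callable
    (x_t, t) -> predicted velocity. Interpolates x_t = (1-t)*eps + t*x
    and regresses onto the constant velocity x - eps."""
    total = 0.0
    for x in xs:
        eps = rng.gauss(0.0, 1.0)   # sample noise endpoint
        t = rng.random()            # sample interpolation time in [0, 1)
        x_t = (1.0 - t) * eps + t * x
        total += (v_theta(x_t, t) - (x - eps)) ** 2
    return total / len(xs)
```

In practice this runs over image tensors with a learned network for `v_theta`; the scalar version only shows the structure of the objective.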
Tokens are partitioned so that each group $G_j$ serves as the AR conditional for the $j$-th diffusion step $t_j$, with the noise level decreasing monotonically in $j$. Early groups encode high-noise, global structure; later groups add low-noise, fine detail.
This mapping enables D-AR models to realize a natural hierarchy: global layout to fine texture, strictly aligned to the stepwise nature of diffusion model denoising.
3. Training and Decoding Workflow
The core AR model is a decoder-only transformer (e.g., Llama architecture), trained with standard next-token cross-entropy:

$$\mathcal{L}_{\mathrm{AR}} = -\sum_{i=1}^{N} \log p_\theta(z_i \mid z_{<i})$$
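As a concrete sketch, the next-token cross-entropy over per-position distributions (pure Python, names ours):

```python
import math

def next_token_ce(probs, targets):
    """Average next-token cross-entropy.
    probs: per-position distributions over the codebook,
           probs[i][v] = p(z_i = v | z_<i); targets: ground-truth ids."""
    return -sum(math.log(p[z]) for p, z in zip(probs, targets)) / len(targets)
```

A uniform predictor over a codebook of size $V$ yields a loss of $\log V$, a useful sanity check during training.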
During generation, as tokens are sequentially predicted, the completion of each group $G_j$ triggers a corresponding diffusion denoising operation at step $t_j$ on the evolving image state $x_{t_j}$, using the partial token sequence as the context for decoding.

Consistent streaming previews are realizable: after generating a token prefix, the system can produce a low- or mid-resolution estimate of the final image via the jump-estimate formula $\hat{x} = x_t + (1 - t)\, v_\theta(x_t, t)$.
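Under the linear flow interpolation $x_t = (1-t)\epsilon + tx$ (assumed here), the jump estimate is a single Euler extrapolation to $t = 1$:

```python
def jump_estimate(x_t, t, v):
    """One-shot preview of the clean sample from the current state x_t
    at time t, given the predicted velocity v = v_theta(x_t, t):
    x_hat = x_t + (1 - t) * v."""
    return x_t + (1.0 - t) * v
```

If the velocity prediction were exact ($v = x - \epsilon$), this would recover the clean sample from any intermediate state in one step.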
A generic decoding scheme (pseudocode; `AR_model`, `one_diffusion_step`, and the step schedule `t` are the trained components):

```python
x = sample_noise()                       # initial noisy image state
z = []
for i in range(1, N + 1):
    z.append(AR_model.next_token(z))     # next-token prediction
    if i % (N // K) == 0:                # a token group is complete
        j = i // (N // K)
        group = z[(j - 1) * (N // K) : j * (N // K)]
        x = one_diffusion_step(x, t[j], condition=group)
```
4. Key Properties and Advantages
D-AR architectures inherit and extend favorable traits of both AR and diffusion modeling:
- Causal and KV-cache-friendly inference due to pure next-token prediction: all AR tokens are generated in sequence, so D-AR can reuse the highly optimized, parallelizable inference infrastructure of language modeling (Gao et al., 29 May 2025).
- Coarse-to-fine streaming generation by mapping token positions to denoising times enables consistent image previews and efficient human-in-the-loop synthesis.
- Zero-shot layout- and constraint-controlled generation: By clamping an initial AR token subset, D-AR can guide subsequent token and pixel generation without retraining.
- Unified next-token API enables diffusion-based image, video, or sequence generation to be integrated into large-scale language modeling frameworks with minimal architectural disturbance.
- Competitive or improved sample fidelity and recall on canonical benchmarks versus classic AR or diffusion-only models.
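The zero-shot clamping idea can be sketched as follows (`ar_next_token` stands in for the trained model's sampling step; names ours):

```python
def generate_with_clamped_prefix(ar_next_token, prefix, N):
    """Fix the first len(prefix) tokens (e.g., copied from a reference
    image to pin down the global layout) and let the AR model complete
    the remaining tokens conditioned on them, with no retraining."""
    tokens = list(prefix)
    while len(tokens) < N:
        tokens.append(ar_next_token(tokens))
    return tokens
```

Because early tokens correspond to high-noise, global-structure denoising steps, clamping them constrains layout while leaving fine detail free.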
5. Empirical Findings and Ablations
Experiments on ImageNet 256×256 (class-conditional) demonstrate:
| Model | Params | FID | IS | Prec | Rec |
|---|---|---|---|---|---|
| D-AR-L | 343M | 2.44 | 262.97 | 0.78 | 0.61 |
| D-AR-XL | 775M | 2.09 | 298.42 | 0.79 | 0.62 |
Ablation studies indicate that the best results are achieved with 8 ODE diffusion steps (Adams–Bashforth solver), and FID improves from 7.38 to 2.44 as a longer token prefix is used in jump-estimation. The method is on par with, or superior to, baselines such as LlamaGen and hybrids like MaskGIT, RAR, and VAR. The tokenizer achieves a reconstruction FID competitive with LlamaGen's 2.19 at an equal budget (Gao et al., 29 May 2025).
6. Discussion: Technical Implications, Limitations, Extensions
Recasting diffusion as an AR process facilitates:
- Streamlined integration with LLM infrastructure and hardware for efficient visual or multi-modal synthesis.
- Streaming pixel update capabilities and previews suitable for interactive or human-guided generation.
- Exact AR likelihood evaluation, facilitating compression and quantitative analysis (Hoogeboom et al., 2021).
Limitations include sub-linear scaling behavior in current experiments (models under 1B parameters), unaddressed native multi-modal text-image generation, and the potential for further enhancements via advanced quantization or dynamic grouping schedules.
Extensions such as improved quantizers, dynamic step schedules, and incorporating richer cross-modal context are identified as promising directions. The paradigm is inherently general, with documented applications in graphs, video, raw waveform synthesis, and even data assimilation in dynamical systems [ARLON: (Li et al., 27 Oct 2024); DiffAR: (Benita et al., 2023); (Kong et al., 2023); (Srivastava et al., 8 Oct 2025)].
7. Relationships to Broader Literature and Methodological Variants
- Multi-scale AR and diffusion as latent iterative refiners: VAR models formalize the D-AR connection via deterministic Laplacian pyramids, discrete code classification, and scale-wise AR decoding, highlighting the structural equivalence to discrete/latent diffusion (Hong et al., 3 Oct 2025).
- Blockwise D-AR hybrids for sequence generation: Block diffusion and SDAR partition sequences into blocks, achieving AR conditioning across blocks and parallel diffusion within, bridging performance and efficiency gaps between AR and diffusion approaches (Arriola et al., 12 Mar 2025, Cheng et al., 7 Oct 2025).
- Video and high-dimensional generative modeling: D-AR is leveraged in asynchronous and RL-enhanced video diffusion models, where AR temporal factorization is enforced alongside stochastic or deterministic denoising at each frame (Sun et al., 10 Mar 2025, Zhao et al., 9 Oct 2025).
- Practical acceleration: Diffusion step annealing (DiSA) dynamically reduces per-token diffusion steps as the AR process proceeds, exploiting the empirical observation that later tokens are increasingly constrained by their conditioning, yielding substantial inference speedups with negligible quality loss (Zhao et al., 26 May 2025).
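A linear annealing schedule in the spirit of DiSA might look like the following (illustrative only; the actual schedule used in the cited work may differ):

```python
def annealed_steps(j, K, max_steps, min_steps=1):
    """Number of inner diffusion steps for the j-th group (1-based):
    linearly anneal from max_steps at j=1 down to min_steps at j=K,
    reflecting that later groups are more tightly constrained."""
    frac = (j - 1) / max(K - 1, 1)
    return round(max_steps - frac * (max_steps - min_steps))
```

Early, weakly constrained groups get the full step budget; the final group runs with a near-trivial number of steps.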
The D-AR paradigm provides a rigorous, extensible, and empirically validated framework unifying discrete token generation, iterative refinement, and parallelizable inference in high-dimensional generative learning.