Autoregressive Diffusion Transformers
- Autoregressive Diffusion Transformers are neural architectures that combine autoregressive modeling and diffusion processes for efficient, high-quality data synthesis.
- They partition the generation process into a global autoregressive stage for structure and a diffusion stage for iterative refinement of local details.
- Empirical results across images, video, audio, and time series show these models deliver enhanced fidelity and controllable synthesis compared to standalone approaches.
Autoregressive Diffusion Transformers are a class of neural architectures that integrate the strengths of autoregressive (AR) modeling and diffusion-based generative modeling within the Transformer framework. These hybrid models have demonstrated substantial advances in high-fidelity image, audio, video, and time series generation. By partitioning the generative process into autoregressive and diffusion components (along the token/spatial axis, the depth of the network, or both), these models balance long-range dependency modeling against fine-grained iterative refinement. As a result, they enable more efficient, flexible, and controllable synthesis than either class of models alone.
1. Foundations: Integrating Autoregression and Diffusion
Autoregressive modeling factorizes the joint probability of complex data into a product of conditionals, capturing global dependencies by generating each new block or token conditioned on prior generations. Diffusion models, by contrast, learn to denoise data from a noise process, typically providing superior sample fidelity and diversity but at high computational cost due to many iterative steps.
Autoregressive Diffusion Transformers leverage both paradigms, typically by decomposing the generation process into (a) an AR stage that establishes global structure or sequence, and (b) a diffusion stage that stochastically refines local details using iterative denoising. This division can occur along the network depth (“vertical mixing” of layers), the data sequence (“blockwise” AR over spatial/temporal segments), or both (Chen et al., 9 Jun 2025, Hu et al., 2024, Deng et al., 2024).
The general approach is first to use an AR module or stage to produce a coarse or structured conditioning signal, often by predicting the next block, chunk, or patch given the prior context, and then to apply a conditional diffusion process (such as a DDPM or flow-matching model) to the predicted block to obtain high-fidelity details. This has been realized across image (Chen et al., 9 Jun 2025, Zhen et al., 11 Jun 2025), video (Song et al., 11 Aug 2025, Li et al., 2024, Chen et al., 17 Nov 2025), audio (Liu et al., 2024, Jia et al., 6 Feb 2025), layout (Wang et al., 2023), and time series (Zhang et al., 6 Feb 2026) modalities.
2. Architectural Recipes and Mathematical Formulation
Blockwise Partitioning and Conditioning
A common architectural motif divides the sequence into blocks. For example, in MADFormer, images are partitioned into non-overlapping spatial blocks which are treated as "tokens" for the AR Transformer layers (Chen et al., 9 Jun 2025). Within each block, bidirectional self-attention and diffusion steps refine local content. The joint density is factorized as

$$p(x) = \prod_{b=1}^{B} p(x_b \mid x_{<b}),$$

where each $x_b$ is a continuous block, and each conditional is realized via the AR and diffusion stages. The AR stage establishes a context embedding $c_b = f_{\mathrm{AR}}(x_{<b})$ per block, which is then provided to the conditional reverse-diffusion process:

$$p(x_b \mid x_{<b}) = p_\theta(x_b \mid c_b).$$
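As a concrete sketch of the spatial block partitioning, the following toy `patchify` helper (an illustration under the stated assumptions, not the authors' code) splits an image into the non-overlapping blocks that the AR layers treat as tokens:

```python
import numpy as np

def patchify(image: np.ndarray, block: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping (block, block, C)
    spatial blocks, flattened into a (num_blocks, block*block*C)
    sequence that blockwise AR Transformer layers treat as "tokens"."""
    H, W, C = image.shape
    assert H % block == 0 and W % block == 0
    x = image.reshape(H // block, block, W // block, block, C)
    x = x.transpose(0, 2, 1, 3, 4)           # (H/b, W/b, b, b, C)
    return x.reshape(-1, block * block * C)  # one row per block

img = np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)
blocks = patchify(img, 2)
print(blocks.shape)  # (4, 12): four 2x2x3 blocks
```

Each row of the result is one continuous block $x_b$; the block index then serves as the position over which the AR factorization runs.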
Layerwise Mixing: Vertical Composition
Several architectures vertically interleave AR and diffusion layers within a single decoder-only Transformer. For $L$ total layers with $L_d$ designated for diffusion, the first $L - L_d$ layers operate autoregressively across blocks, and the remaining $L_d$ layers perform conditional diffusion denoising within each block. Layer allocation is dataset- and compute-dependent: under strict compute budgets, AR-heavy splits outperform diffusion-heavy splits; with more compute, deeper diffusion stages can surpass AR (Chen et al., 9 Jun 2025).
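The vertical split can be expressed as a simple layer-role assignment (a hypothetical helper, not any paper's API):

```python
def assign_layer_roles(num_layers: int, num_diffusion: int) -> list[str]:
    """Assign the first (num_layers - num_diffusion) layers to blockwise
    AR attention and the remaining layers to in-block conditional
    diffusion denoising, mirroring the vertical-mixing split."""
    assert 0 <= num_diffusion <= num_layers
    num_ar = num_layers - num_diffusion
    return ["ar"] * num_ar + ["diffusion"] * num_diffusion

# Under a tight compute budget, an AR-heavy split is preferred:
print(assign_layer_roles(8, 2))
# ['ar', 'ar', 'ar', 'ar', 'ar', 'ar', 'diffusion', 'diffusion']
```

Sweeping `num_diffusion` from 0 to `num_layers` interpolates between a pure AR Transformer and a pure diffusion Transformer within one architecture.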
Generalized Attention Masking
Transformers used in these hybrids employ carefully designed attention masks to ensure proper information flow. Blockwise partitioning demands a "skip-causal" or "blockwise" causal mask that allows each block or noisy token to attend only to the clean prefix (i.e., previous blocks/tokens). For instance, ACDiT (Hu et al., 2024) implements a double-partitioned mask where noisy blocks attend only to clean blocks to the left and themselves, while clean blocks are causal among themselves.
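The skip-causal pattern can be sketched as a boolean mask over a sequence holding clean and noisy copies of each block (a toy construction in the spirit of ACDiT's mask, not its exact implementation):

```python
import numpy as np

def skip_causal_mask(num_blocks: int, block_len: int) -> np.ndarray:
    """Build a boolean attention mask over [clean tokens | noisy tokens].

    Clean tokens in block i attend to clean blocks <= i (blockwise causal,
    bidirectional within a block). Noisy tokens in block i attend to clean
    blocks strictly < i plus their own noisy block. True = allowed.
    """
    n = num_blocks * block_len
    mask = np.zeros((2 * n, 2 * n), dtype=bool)
    blk = np.repeat(np.arange(num_blocks), block_len)  # block id per token
    # clean -> clean: blockwise causal
    mask[:n, :n] = blk[:, None] >= blk[None, :]
    # noisy -> clean: strictly previous blocks only
    mask[n:, :n] = blk[:, None] > blk[None, :]
    # noisy -> noisy: own block only (bidirectional within the block)
    mask[n:, n:] = blk[:, None] == blk[None, :]
    return mask

m = skip_causal_mask(num_blocks=3, block_len=2)
print(m.shape)  # (12, 12)
```

This single mask lets one Transformer forward pass serve both roles: causal AR context building over the clean prefix and bidirectional denoising within the current noisy block.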
Joint Training Objectives
The combined training objective typically sums the AR loss (MSE or negative log-likelihood over next-block prediction) and the diffusion denoising loss (standard score-matching objective):

$$\mathcal{L} = \mathcal{L}_{\mathrm{AR}} + \lambda\,\mathcal{L}_{\mathrm{diff}},$$

where $\mathcal{L}_{\mathrm{AR}}$ may be cross-entropy or MSE, and $\mathcal{L}_{\mathrm{diff}}$ is the denoising MSE (e.g., $\mathbb{E}_{t,\epsilon}\,\lVert \epsilon - \epsilon_\theta(x_b^{(t)}, t, c_b) \rVert^2$, as in DDPM).
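A minimal numerical sketch of such a summed objective, with toy MSE stand-ins for both terms (names and shapes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def joint_loss(x_blocks, ar_pred, eps, eps_pred, lam=1.0):
    """Toy combined objective: MSE next-block AR loss plus a DDPM-style
    denoising MSE, summed with weight lam. All arrays are (B, D)."""
    loss_ar = np.mean((ar_pred - x_blocks) ** 2)  # next-block prediction
    loss_diff = np.mean((eps_pred - eps) ** 2)    # noise prediction
    return loss_ar + lam * loss_diff

x = rng.normal(size=(4, 8))
# Perfect predictions on both terms give zero loss:
print(joint_loss(x, x, np.zeros((4, 8)), np.zeros((4, 8))))  # 0.0
```

In practice the two terms are computed by different layer groups of the same Transformer, and the weight `lam` trades off global structure against local fidelity.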
3. Algorithmic Schemes and Inference
A typical generation procedure iterates autoregressively over blocks:
- For each block $b = 1, \dots, B$:
  - Compute the AR context $c_b$ from the previous blocks $x_{<b}$.
  - Initialize the block with noise $x_b^{(T)} \sim \mathcal{N}(0, I)$.
  - For $t = T, \dots, 1$, iteratively denoise via Transformer-based diffusion layers conditioned on $c_b$.
  - Set block $x_b$ to the final denoised output $x_b^{(0)}$.
This mechanism supports both efficient KV-caching on clean past blocks (for speed) and flexible trade-offs between global structure and local detail (Chen et al., 9 Jun 2025, Hu et al., 2024).
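The loop above can be sketched end-to-end with toy stand-ins for the two stages (all function bodies here are illustrative placeholders, not any paper's model):

```python
import numpy as np

rng = np.random.default_rng(0)
D, B, T = 8, 4, 10  # block dim, number of blocks, denoising steps

def ar_context(prev_blocks):
    """Toy AR stage: summarize the clean prefix into a context vector.
    A real model would run blockwise-causal Transformer layers here,
    with the prefix's keys/values KV-cached since it never changes."""
    if not prev_blocks:
        return np.zeros(D)
    return np.mean(prev_blocks, axis=0)

def denoise_step(x_t, t, ctx):
    """Toy diffusion stage: one denoising update pulled toward ctx.
    A real model would run conditional DiT layers with a DDPM/flow update."""
    return x_t - (x_t - ctx) / (t + 1)

blocks = []
for b in range(B):
    ctx = ar_context(blocks)      # conditioning from clean past blocks
    x = rng.normal(size=D)        # x_b^(T) ~ N(0, I)
    for t in range(T, 0, -1):
        x = denoise_step(x, t, ctx)
    blocks.append(x)              # clean block joins the AR prefix

print(len(blocks), blocks[0].shape)  # 4 (8,)
```

Because each finished block is clean and fixed, its attention keys/values can be cached once and reused for all later blocks and all their denoising steps, which is the source of the speed advantage.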
4. Empirical Achievements and Design Insights
Autoregressive Diffusion Transformers have achieved state-of-the-art or competitive results across several domains:
- FFHQ-1024 image synthesis: blockwise AR-diffusion markedly improves FID at fixed compute (e.g., FID reduction by up to 75% compared to pure diffusion at low inference compute) (Chen et al., 9 Jun 2025).
- ImageNet-256: hybrid AR-diffusion outperforms standalone counterparts and accelerates sampling by 50–100× (Zhen et al., 11 Jun 2025, Hu et al., 2024).
- Video: integrating AR modules for temporal coherence and VQ-VAEs for compression, models such as ARLON deliver long-form, dynamic, and temporally consistent generation with competitive overall metrics (Li et al., 2024).
- Audio: ARDiT and DiTAR demonstrate superior performance in zero-shot TTS and speech-editing tasks by leveraging blockwise AR and diffusion with flow matching and temperature-controlled sampling (Liu et al., 2024, Jia et al., 6 Feb 2025).
- Layout/geometry: Dolfin-AR outperforms previous layout models in capturing alignment, overlap, and semantic grouping (Wang et al., 2023).
- Time series: dual-stream AR-diffusion Transformers for forecasting (DiTS) factor attention into temporal and cross-variate streams, leading to SOTA performance on covariate-aware benchmarks (Zhang et al., 6 Feb 2026).
Key findings include:
- Finer AR block granularity benefits high-resolution tasks by distributing global context more evenly across blocks.
- Under limited function evaluations (NFE), assigning more layers to AR is preferable; with ample compute, deeper diffusion layers yield finer detail (Chen et al., 9 Jun 2025).
- Block size controls the interpolation between pure AR and pure diffusion, enabling fine-grained speed/quality trade-offs (Hu et al., 2024).
- Explicit cross-block attention and conditioning are crucial; ablating them causes severe quality degradation (e.g., >5× worse FID) (Chen et al., 9 Jun 2025).
- Blockwise KV-caching reduces generation overhead, making these models competitive in inference latency with purely autoregressive architectures (Hu et al., 2024, Zhen et al., 11 Jun 2025).
5. Domain-Specific and Multimodal Extensions
Video and Long-Horizon Generation
Local and global autoregressive mechanisms are instrumental for maintaining identity, smoothness, and consistency in video synthesis. For example, LaVieID integrates facial-component routers and temporal AR modules for identity-preserving video (Song et al., 11 Aug 2025), while ARLON couples an AR Transformer with a DiT backbone to bridge long-range temporal dependencies and dynamic motion (Li et al., 2024). The RAD framework adds recurrent memory (LSTM) to diffusion Transformers, enabling generation of infinitely long videos with global memory retention beyond the attention window (Chen et al., 17 Nov 2025).
Multimodal and Self-Supervised Uses
Causal Diffusion Transformers, such as CausalFusion, extend the paradigm beyond visual data to support both discrete and continuous modalities, multimodal captioning, and in-context editing by dual-factorizing over both sequence and noise axes (Deng et al., 2024). Models like TransDiff (Zhen et al., 11 Jun 2025) and D-AR (Gao et al., 29 May 2025) demonstrate that, with tailored tokenizers, the diffusion process can be recast so that "vanilla" next-token AR models can drive it, supporting unified architectures for both text and image generation.
Audio and Speech
AR-diffusion hybrids for TTS and speech (ARDiT, DiTAR) achieve strong zero-shot synthesis and editing by combining continuous latent representations, blockwise AR, and diffusion, sometimes with Integral KL (IKL) distillation to collapse iterative sampling to a single step (Liu et al., 2024, Jia et al., 6 Feb 2025).
Time Series Forecasting
DiTS models autoregressive dependencies along the temporal axis and cross-variate dependencies along the feature axis using a dual-stream Diffusion Transformer, optimizing flow-matching losses and exploiting low-rank properties for efficiency (Zhang et al., 6 Feb 2026).
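The flow-matching objective referenced here can be sketched in its standard linear-interpolation form (a generic sketch of conditional flow matching, not DiTS's exact loss):

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(x1, v_pred_fn):
    """Standard conditional flow matching with linear paths:
    x_t = (1 - t) * x0 + t * x1, with target velocity v* = x1 - x0."""
    x0 = rng.normal(size=x1.shape)            # noise endpoint
    t = rng.uniform(size=(x1.shape[0], 1))    # per-sample time in [0, 1)
    x_t = (1 - t) * x0 + t * x1
    target = x1 - x0
    return np.mean((v_pred_fn(x_t, t) - target) ** 2)

# e.g. x1 holds a batch of time-series patches; a trivial zero-velocity
# predictor yields a positive loss that a trained network would reduce.
x1 = rng.normal(size=(16, 8))
loss = flow_matching_loss(x1, lambda x_t, t: np.zeros_like(x_t))
print(loss > 0)  # True
```

At inference, the learned velocity field is integrated with an ODE solver from noise to data, which is typically cheaper than full DDPM sampling.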
6. Limitations, Trade-offs, and Design Guidance
Autoregressive Diffusion Transformers are subject to several core trade-offs:
- Block size vs. context: Smaller blocks approach standard AR and capture fine, temporally extended dependencies but may lack local context for diffusion; larger blocks yield more parallelism and local context, but can lose long-range conditioning (Hu et al., 2024).
- Vertical layer allocation: The optimal division between AR and diffusion layers depends on NFE and task fidelity requirements. Under compute constraints, AR dominance is preferable; for ultimate visual quality and high NFE, deep diffusion is optimal (Chen et al., 9 Jun 2025).
- Memory for long sequences: Explicit recurrent modules (LSTM states) or block/history caches grow with context length, and scaling beyond certain horizons may require memory compression, e.g., using state-space models or compressed KV-memories (Chen et al., 17 Nov 2025).
- Inference cost: While far more efficient than pure diffusion, blockwise AR-diffusion frameworks still require multiple ODE/denoising steps per block or patch; overall cost remains controllable via block size and distillation (Liu et al., 2024, Chen et al., 9 Jun 2025).
- Applicability across modalities: Attentive mask design, conditional context reuse, and domain-specific blocks, such as spatial routers or temporal AR modules, are often required for optimal performance in new modalities.
7. Representative Models and Comparative Summary
| Model | Domain | Block Partitioning | Vertical AR-Diffusion Split | Distillation/Acceleration | Notable Results |
|---|---|---|---|---|---|
| MADFormer (Chen et al., 9 Jun 2025) | Image | Spatial blocks | Layerwise (AR→Diff.) | – | Up to 75% FID ↓ |
| TransDiff (Zhen et al., 11 Jun 2025) | Image | Masked latents | Sequential AR + DiT | – | FID=1.42, IS=301.2 |
| LaVieID (Song et al., 11 Aug 2025) | Video | Facial regions, chunks | Global→local→temporal AR | – | SOTA face cons. |
| ARLON (Li et al., 2024) | Video | VQ-VAE tokens | AR (global) + DiT | – | SOTA long video |
| ARDiT (Liu et al., 2024) | Audio | Blockwise latent seq. | – | IKL distillation | 170 ms audio/step |
| ACDiT (Hu et al., 2024) | Image/Video | Adjustable block size | Masked blocks | – | FID=2.45, IS=267 |
| Dolfin-AR (Wang et al., 2023) | Layout | Object tokens | AR sequential over objects | – | Alignment↓, FID↓ |
| CausalFusion (Deng et al., 2024) | Multi | Patch/region subsets | Dual axis factorization | – | FID=1.57 |
| DiTAR (Jia et al., 6 Feb 2025) | Audio | Patches | LM (AR) + local DiT | ODE temp. ctrl | SOTA TTS metrics |
| DiTS (Zhang et al., 6 Feb 2026) | Time Series | Temporal patches | Dual-stream Time/Var-attn | – | SOTA forecasting |
These models exemplify the diverse instantiations and domain-specific adaptations of Autoregressive Diffusion Transformers. Their design spaces are unified by the principle of leveraging AR for global/sequential context and diffusion for local iterative refinement, made practical and efficient through advances in attention masking, architectural partitioning, and blockwise computation.
References:
- MADFormer (Chen et al., 9 Jun 2025)
- TransDiff (Zhen et al., 11 Jun 2025)
- LaVieID (Song et al., 11 Aug 2025)
- ARLON (Li et al., 2024)
- ARDiT (Liu et al., 2024)
- ACDiT (Hu et al., 2024)
- Dolfin-AR (Wang et al., 2023)
- CausalFusion (Deng et al., 2024)
- DiTAR (Jia et al., 6 Feb 2025)
- DiTS (Zhang et al., 6 Feb 2026)