MADFormer: Hybrid AR-Diffusion Transformer

Updated 10 June 2026

MADFormer is a hybrid generative Transformer that integrates both autoregressive and diffusion objectives for continuous image generation.
It leverages spatial block partitioning and vertical mixing, balancing global structure capture with local high-fidelity refinement.
Empirical studies on FFHQ and ImageNet show that AR-heavy settings perform well under low compute, while increased diffusion depth further reduces FID scores.

MADFormer is a hybrid generative Transformer model that systematically integrates autoregressive (AR) and diffusion-based generation mechanisms within a unified architecture for continuous image generation. It addresses the allocation of model capacity between AR and diffusion modules by partitioning image representations into spatial blocks and mixing AR and diffusion objectives across the depth of the Transformer network. Empirical studies on the FFHQ-1024 and ImageNet datasets show that careful spatial partitioning and vertical (layer-wise) mixing deliver substantial improvements in sample quality under a fixed computational budget (Chen et al., 9 Jun 2025).

1. Model Architecture and Data Representation

MADFormer encodes multimodal inputs via discrete and continuous channels. Input text is tokenized using the Llama 3 tokenizer, while images are encoded by a frozen VAE (e.g., the Stable Diffusion VAE) into continuous latent representations $z_{\rm image} \in \mathbb{R}^d$, which are subsequently linearized in raster order. These latents are partitioned into $L$ contiguous spatial blocks, e.g., $L=16$ blocks each covering a $256 \times 256$ pixel region for FFHQ-1024.
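A minimal sketch of the block partitioning, assuming a rectangular latent grid that divides evenly into a square grid of equal blocks, with blocks visited in raster order and positions inside each block also in raster order. The function name `partition_into_blocks` and the toy grid size are illustrative and not taken from the paper.

```python
def partition_into_blocks(height, width, blocks_per_side):
    """Split a height x width grid into blocks_per_side**2 spatial blocks.

    Returns a list of blocks in raster order; each block is a list of
    (row, col) positions, also in raster order within the block.
    """
    if height % blocks_per_side or width % blocks_per_side:
        raise ValueError("grid must divide evenly into blocks")
    bh, bw = height // blocks_per_side, width // blocks_per_side
    blocks = []
    for by in range(blocks_per_side):
        for bx in range(blocks_per_side):
            block = [(by * bh + r, bx * bw + c)
                     for r in range(bh) for c in range(bw)]
            blocks.append(block)
    return blocks


# Toy example: an 8 x 8 grid split into L = 16 blocks of 2 x 2 positions.
blocks = partition_into_blocks(8, 8, blocks_per_side=4)
print(len(blocks), blocks[1])
```

With `blocks_per_side=4` the grid yields 16 blocks, mirroring the $L=16$ layout described for FFHQ-1024 at a much smaller scale.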

All modalities propagate through a single stack of $N$ Transformer layers but use separate FFN and QKV parameters per modality. The early layers $1 \dots (N\!-\!D)$ implement a causally masked AR objective, ingesting tokenized text plus previously generated image blocks to produce per-block AR conditioning vectors $c^{(i)}$. The remaining layers $(N\!-\!D+1)\dots N$ operate under a diffusion denoising objective on the noisy target block $z^{(i)}$ at each diffusion step $t$, conditioning on the AR-computed $c^{(i)}$.
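The depth-wise split can be sketched as a simple role assignment per layer. This is only an illustration of the indexing described above; `layer_roles` and the example values of $N$ and $D$ are hypothetical.

```python
def layer_roles(num_layers, diffusion_depth):
    """Assign a role to each of the N layers: the first N - D are AR layers,
    the last D are diffusion layers."""
    if not 0 <= diffusion_depth <= num_layers:
        raise ValueError("diffusion_depth must lie in [0, num_layers]")
    ar_layers = num_layers - diffusion_depth
    return ["ar"] * ar_layers + ["diffusion"] * diffusion_depth


# Hypothetical small model: N = 8 layers, D = 2 diffusion layers.
roles = layer_roles(num_layers=8, diffusion_depth=2)
print(roles)
```

Setting `diffusion_depth` to 0 or to `num_layers` gives the pure AR and pure diffusion extremes of the same stack.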

Blockwise, the generative process comprises:

AR pass: Compute $c^{(i)}$ via causal attention.
Diffusion pass: Iteratively denoise $z_T^{(i)}$ to $z_0^{(i)}$ conditioned on $c^{(i)}$.

The architecture offers a mid-level decomposition of global structure (via AR) and local refinement (via diffusion).

2. Mathematical Formulation

2.1 Autoregressive Prior

Given latent blocks $z^{(1)}, \dots, z^{(L)}$, the AR prior factorizes as: $p(z^{(1)}, \dots, z^{(L)}) = \prod_{i=1}^{L} p(z^{(i)} \mid z^{(<i)})$. Hidden states evolve by

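$h_{\ell} = \mathrm{TransformerLayer}_{\ell}(h_{\ell-1};\ \text{causal mask}), \quad \ell = 1, \dots, N\!-\!D$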
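One common way to realize a blockwise factorization of this kind is a block-causal attention mask. The sketch below is an assumption about how such a mask could look, not the paper's implementation: text tokens are treated as block 0, attention is unrestricted within a block, and a position may attend only to positions whose block index is not larger than its own. The helper name `block_causal_mask` is hypothetical.

```python
def block_causal_mask(block_ids):
    """Return mask[q][k] = True when query position q may attend to key k.

    block_ids[p] is the block index of position p (0 for text tokens in this
    sketch). A query sees every key in the same or an earlier block.
    """
    n = len(block_ids)
    return [[block_ids[k] <= block_ids[q] for k in range(n)] for q in range(n)]


# Two text tokens (block 0) followed by two image blocks of two tokens each.
mask = block_causal_mask([0, 0, 1, 1, 2, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```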

2.2 Diffusion-Based Denoising

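The forward (noising) process for block $z^{(i)}$: $q(z_t^{(i)} \mid z_0^{(i)}) = \mathcal{N}\big(\sqrt{\bar\alpha_t}\, z_0^{(i)},\ (1-\bar\alpha_t)\, I\big)$, where $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$.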
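As an illustration of a forward noising step, the sketch below assumes the standard DDPM-style variance-preserving parameterization, in which a noisy sample is a weighted mix of the clean block and Gaussian noise. The linear beta schedule, the step count, and the helper names are assumptions made for the example and are not taken from the paper.

```python
import math
import random


def alpha_bar_schedule(betas):
    """Cumulative products of alpha_t = 1 - beta_t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def noise_block(z0, alpha_bar_t, rng):
    """Sample z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    z_t = [a * z + s * e for z, e in zip(z0, eps)]
    return z_t, eps


# Hypothetical linear schedule with 10 steps.
betas = [1e-4 + (0.02 - 1e-4) * t / 9 for t in range(10)]
alpha_bars = alpha_bar_schedule(betas)
z_t, eps = noise_block([1.0, -1.0, 0.5], alpha_bars[-1], random.Random(0))
print(alpha_bars[-1], z_t)
```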

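Reverse modeling: $p_\theta(z_{t-1}^{(i)} \mid z_t^{(i)}, c^{(i)})$, with diffusion loss: $\mathcal{L}_{\rm diff} = \mathbb{E}_{t,\epsilon}\big[\lVert \epsilon - \epsilon_\theta(z_t^{(i)}, t, c^{(i)}) \rVert^2\big]$, where $\epsilon$ is the Gaussian noise added in the forward process.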
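A conditional denoising loss of this kind is often implemented as a mean squared error between the noise that was added and the noise the network predicts. The sketch below assumes that noise-prediction form; `toy_noise_predictor` is a hypothetical stand-in for the diffusion layers and is not a trained model.

```python
def epsilon_prediction_loss(eps_true, eps_pred):
    """Mean squared error between the true and predicted noise."""
    if len(eps_true) != len(eps_pred):
        raise ValueError("noise vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred)) / len(eps_true)


def toy_noise_predictor(z_t, t, cond):
    """Hypothetical stand-in for the diffusion layers (ignores t)."""
    return [z - c for z, c in zip(z_t, cond)]


z_t = [0.9, -0.2, 0.4]        # noisy block at some step t
cond = [0.5, 0.0, 0.1]        # AR conditioning vector for this block
eps_true = [0.3, -0.1, 0.2]   # noise added in the forward process
loss = epsilon_prediction_loss(eps_true, toy_noise_predictor(z_t, t=5, cond=cond))
print(loss)
```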

2.3 Training Objective

The total loss is a weighted sum of its component losses, $\mathcal{L} = \sum_k \lambda_k \mathcal{L}_k$, with one scalar weight $\lambda_k$ per term.
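A weighted sum of loss terms can be sketched as follows. The term names and the numeric weights here are placeholders chosen for illustration; they are not the weights used in the paper.

```python
def total_loss(losses, weights):
    """Weighted sum of named loss terms; every term needs a weight."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight for loss terms: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())


# Placeholder values only (hypothetical term names and weights).
losses = {"diffusion": 0.5, "hidden": 2.0}
weights = {"diffusion": 1.0, "hidden": 0.1}
print(total_loss(losses, weights))
```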

3. Inference Procedure

Image generation proceeds blockwise, combining AR conditioning with iterative diffusion. The outline of the process is:

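For each block $i = 1, \dots, L$:
AR pass: run the AR layers once over the text and the previously generated blocks to obtain $c^{(i)}$.
Diffusion pass: draw $z_T^{(i)}$ from Gaussian noise and run the diffusion layers for $T$ steps, conditioned on $c^{(i)}$, to obtain $z_0^{(i)}$.
Append $z_0^{(i)}$ to the context used for subsequent blocks.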

This staged approach divides the generative burden: AR produces core spatial layouts, while diffusion executes high-fidelity refinement.
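The blockwise procedure can be sketched as a nested loop: one AR pass per block, followed by an inner denoising loop. The code below is a structural illustration only. `toy_ar_pass` and `toy_denoise_step` are hypothetical stand-ins for the AR and diffusion layers, and the block dimension and step count are arbitrary.

```python
import random


def generate_blocks(num_blocks, num_steps, ar_pass, denoise_step, block_dim, rng):
    """Blockwise generation: one AR pass, then num_steps denoising steps per block."""
    context = []  # clean blocks generated so far
    for _ in range(num_blocks):
        cond = ar_pass(context)                              # AR layers, run once
        z = [rng.gauss(0.0, 1.0) for _ in range(block_dim)]  # start from noise
        for t in range(num_steps, 0, -1):                    # t = T, ..., 1
            z = denoise_step(z, t, cond)                     # diffusion layers
        context.append(z)
    return context


def toy_ar_pass(context):
    # Zeros for the first block, otherwise the previous block's mean plus one.
    if not context:
        return [0.0, 0.0]
    mean = sum(context[-1]) / len(context[-1])
    return [mean + 1.0, mean + 1.0]


def toy_denoise_step(z, t, cond):
    # Move the sample halfway toward the conditioning vector.
    return [0.5 * (a + c) for a, c in zip(z, cond)]


blocks = generate_blocks(3, 9, toy_ar_pass, toy_denoise_step, block_dim=2,
                         rng=random.Random(0))
print(blocks)
```

The AR function is called once per block while the denoiser is called `num_steps` times per block, which is where the cost asymmetry between the two stages comes from.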

4. Computational Complexity and Quality-Efficiency Trade-offs

Let $N$ be the total number of Transformer layers ($N\!-\!D$ AR, $D$ diffusion), $T$ the number of diffusion steps, $L$ the number of blocks, and $C$ the per-layer evaluation cost. The regimes compare as follows:

Model Type	Function Evals (NFE)	Computational Cost (per image)
Pure AR	$L$	$L \cdot N \cdot C$
Pure Diffusion	$T$	$T \cdot N \cdot C$
Mixed MADFormer	$L\,(1+T)$	$L\,\big((N\!-\!D) + T \cdot D\big) \cdot C$
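One way to reason about these regimes is to count layer evaluations per image, under the assumption that the AR layers run once per block and the diffusion layers run once per diffusion step per block. The function below encodes only that assumption; the example sizes are hypothetical.

```python
def layer_evaluations(num_layers, diffusion_depth, num_steps, num_blocks):
    """Layer evaluations per image, assuming AR layers run once per block and
    diffusion layers run num_steps times per block."""
    ar_layers = num_layers - diffusion_depth
    return num_blocks * (ar_layers + num_steps * diffusion_depth)


# Hypothetical sizes: N = 8 layers, T = 10 steps, L = 4 blocks.
for depth in (0, 2, 8):
    print(depth, layer_evaluations(8, depth, num_steps=10, num_blocks=4))
```

Multiplying the result by a per-layer cost gives a rough per-image compute estimate, and it shows how moving layers from the diffusion stage to the AR stage reduces cost at a fixed step count.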

Empirical analysis demonstrates:

For small NFE, AR-heavy splits (large $N\!-\!D$, small $D$) improve FID by 60–75% over pure diffusion.
As NFE increases, diffusion-heavy splits (larger $D$) overtake them in final image fidelity.

This relationship can be fitted by a parametric curve of FID as a function of NFE, with its parameters estimated from the observed curves.

5. Experimental Results

Experiments on FFHQ-1024 and ImageNet 256 clarify the modeling regimes:

FFHQ-1024 ($1024 \times 1024$ resolution, $L=16$)

Diffusion-depth ablation: across the four diffusion depths tested, FID improves from 20.2 to 17.8, 16.6, and 15.9 as $D$ increases.
Optimal block partition: $L=16$ yields FID=17.8 versus 18.9 or 21.9 for the two alternative partitions tested.
Under only 9 diffusion steps (NFE $\approx$ 280): an AR-heavy split with 7 diffusion layers reaches FID $\approx$ 20 (global facial coherence with low step count).

ImageNet 256

Diffusion-depth ablation: across the four diffusion depths tested, FID improves from 34.0 to 30.0, 28.1, and 27.4 as $D$ increases.
Best AR-length: $L=1$ (single block), FID=28.4. Larger $L$ fragments context, degrading quality.

In all tested settings, low NFE favors AR-heavy mixing, while increased compute allows diffusion to ultimately minimize FID further.
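To put the ablation numbers on a common scale, the relative FID reduction between the first and last settings reported in each diffusion-depth ablation can be computed directly from the figures above. The helper name is illustrative.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


ffhq = relative_reduction(20.2, 15.9)       # FFHQ-1024: FID 20.2 -> 15.9
imagenet = relative_reduction(34.0, 27.4)   # ImageNet 256: FID 34.0 -> 27.4
print(f"FFHQ-1024: {ffhq:.1%}, ImageNet 256: {imagenet:.1%}")
```

Both ablations show a reduction of roughly one fifth between the two ends of the tested range.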

6. Design Principles and Best Practices

Several practical design principles are established:

Blockwise Partitioning: For high-res images, increase AR blocks ($L=16$ at $1024 \times 1024$). For mid-res images ($256 \times 256$), a single block suffices.
Vertical Mixing: Early AR layers (ratio $\approx$ 3:1 for AR:Diff) are critical for quality at low compute; later diffusion layers target local refinement. Adjust the AR/diffusion layer count to fit available NFE.
Auxiliary Losses: Hidden loss on AR conditioning ($c^{(i)}$) yields $\approx$ 1.6 FID gain; clean tower provides $\approx$ 2 FID gain; causal attention is necessary (removal increases FFHQ FID from 17.8 to 21.2).
Parameter Sharing: Distinct FFN/QKV sets for text, clean, and noisy blocks show negligible benefit; parameter sharing suffices.

The systematic fusion of AR and diffusion in MADFormer, via spatial partitioning and vertical task mixing, enables principled speed–quality tradeoffs. For constrained inference, AR-heavy hybrids dominate, while aggressive diffusion allocation produces the lowest FID with sufficient compute. These principles inform the architecture of future hybrid generative-vision models (Chen et al., 9 Jun 2025).


References

Chen et al. (2025). MADFormer: Mixed Autoregressive and Diffusion Transformers for Continuous Image Generation.
