MADFormer: Hybrid AR-Diffusion Transformer

Updated 10 June 2026
  • MADFormer is a hybrid generative Transformer that integrates both autoregressive and diffusion objectives for continuous image generation.
  • It leverages spatial block partitioning and vertical mixing, balancing global structure capture with local high-fidelity refinement.
  • Empirical studies on FFHQ and ImageNet show that AR-heavy settings perform well under low compute, while increased diffusion depth further reduces FID scores.

MADFormer is a hybrid generative Transformer model that systematically integrates autoregressive (AR) and diffusion-based generation mechanisms within a unified architecture for continuous image generation. It addresses the allocation of model capacity between AR and diffusion modules, using a spatial block partitioning of image representations and mixing AR and diffusion objectives across the depth of the Transformer network. Empirical studies on FFHQ-1024 and ImageNet datasets reveal that careful spatial partitioning and vertical (layer-wise) mixing deliver substantial improvements in sample quality under corresponding computational budgets (Chen et al., 9 Jun 2025).

1. Model Architecture and Data Representation

MADFormer encodes multimodal inputs via discrete and continuous channels. Input text is tokenized with the Llama 3 tokenizer, while images are encoded by a frozen VAE (e.g., Stable Diffusion VAE) into continuous latent representations $z_{\rm image} \in \mathbb{R}^d$, which are subsequently linearized in raster order. These latents are partitioned into $L$ contiguous spatial blocks, e.g., $L=16$ blocks of $256 \times 256$ for FFHQ-1024.
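The raster-order block partition can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name `partition_into_blocks`, the 4×4 grid, and the 128×128×4 latent shape (roughly what an 8× downsampling, 4-channel VAE would give for a 1024×1024 image) are assumptions made for the example.

```python
import numpy as np

def partition_into_blocks(latent, grid):
    """Split an (H, W, C) latent map into grid*grid spatial blocks.

    Blocks are returned in raster order (left to right, top to bottom),
    and the positions inside each block are flattened in raster order too.
    """
    h, w, c = latent.shape
    assert h % grid == 0 and w % grid == 0, "latent size must divide evenly"
    bh, bw = h // grid, w // grid
    # (H, W, C) -> (grid, bh, grid, bw, C) -> (grid, grid, bh, bw, C)
    tiles = latent.reshape(grid, bh, grid, bw, c).transpose(0, 2, 1, 3, 4)
    # Flatten to (L, positions_per_block, C) with L = grid * grid
    return tiles.reshape(grid * grid, bh * bw, c)

# Hypothetical sizes chosen for illustration only.
latent = np.arange(128 * 128 * 4, dtype=np.float32).reshape(128, 128, 4)
blocks = partition_into_blocks(latent, grid=4)
print(blocks.shape)  # (16, 1024, 4)
```

With a 4×4 grid, block 1 starts at latent column 32 of row 0 and block 4 starts at row 32 of column 0, which matches the raster ordering described above.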

All modalities propagate through a single stack of $N$ Transformer layers, but use separate FFN and QKV parameters per modality. The early layers $1 \dots (N\!-\!D)$ implement an AR, causally masked modeling objective, ingesting tokenized text plus previously generated image blocks to produce per-block AR conditioning vectors $c^{(i)}$. The remaining layers $(N\!-\!D+1)\dots N$ operate under a diffusion denoising objective on the noisy target block $z^{(i)}$ at each diffusion step $t$, conditioning on the AR-computed $c^{(i)}$.

Blockwise, the generative process comprises:

  1. AR pass: Compute $c^{(i)}$ via causal attention.
  2. Diffusion pass: Iteratively denoise $z^{(i)}_T$ to $z^{(i)}_0$ conditioned on $c^{(i)}$.

The architecture offers a mid-level decomposition of global structure (via AR) and local refinement (via diffusion).
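The causal masking used by the AR layers can be pictured at block granularity. The sketch below builds one plausible attention mask: text tokens are causal token by token, and each image block sees the text, all earlier blocks, and itself. Whether attention inside a block is bidirectional is an assumption here (the source only says the AR layers are causally masked), and `block_causal_mask` is a hypothetical helper name.

```python
import numpy as np

def block_causal_mask(num_text, num_blocks, tokens_per_block):
    """Return a boolean mask M where M[q, k] is True if query q may attend to key k.

    Each position gets a segment id. Text tokens have strictly increasing ids,
    so they are causal per token. All tokens of one image block share an id,
    so they see each other (assumed) plus everything that came before.
    """
    text_ids = np.arange(num_text)
    block_ids = num_text + np.repeat(np.arange(num_blocks), tokens_per_block)
    seg = np.concatenate([text_ids, block_ids])
    return seg[:, None] >= seg[None, :]

# Tiny example: 2 text tokens followed by 2 image blocks of 3 tokens each.
mask = block_causal_mask(num_text=2, num_blocks=2, tokens_per_block=3)
print(mask.astype(int))
```

Positions 2 to 4 form the first block and positions 5 to 7 the second, so the first block never attends to the second, while the second attends to everything before it.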

2. Mathematical Formulation

2.1 Autoregressive Prior

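Given latent blocks $z^{(1)}, \dots, z^{(L)}$, the AR prior factorizes as: $p(z^{(1)}, \dots, z^{(L)}) = \prod_{i=1}^{L} p(z^{(i)} \mid z^{(<i)})$. Hidden states evolve by

$h^{(i)}_{\ell} = \mathrm{Layer}_{\ell}\big(h^{(\le i)}_{\ell-1}\big), \quad \ell = 1, \dots, N\!-\!D, \qquad c^{(i)} = h^{(i)}_{N-D}$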

2.2 Diffusion-Based Denoising

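The forward (noising) process for block $z^{(i)}$: $q(z^{(i)}_t \mid z^{(i)}_0) = \mathcal{N}\big(z^{(i)}_t;\ \sqrt{\bar\alpha_t}\, z^{(i)}_0,\ (1-\bar\alpha_t)\, I\big)$, where $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$.

Reverse modeling: $p_\theta(z^{(i)}_{t-1} \mid z^{(i)}_t, c^{(i)}) = \mathcal{N}\big(z^{(i)}_{t-1};\ \mu_\theta(z^{(i)}_t, t, c^{(i)}),\ \sigma_t^2 I\big)$ with diffusion loss: $\mathcal{L}_{\rm diff} = \mathbb{E}_{z_0, \epsilon, t}\big[\lVert \epsilon - \epsilon_\theta(z^{(i)}_t, t, c^{(i)}) \rVert^2\big]$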
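A minimal numerical sketch of the noising step and the noise-prediction loss is given below. It assumes a standard DDPM-style process with a linear beta schedule, which the source does not confirm; the schedule endpoints, the step count, and the function names `make_schedule`, `q_sample`, and `diffusion_loss` are all illustrative choices.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(z0, t, alpha_bar, rng):
    """Draw z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

def diffusion_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_schedule(1000)
z0 = rng.standard_normal((1024, 4))   # one latent block, hypothetical shape
zt, eps = q_sample(z0, t=500, alpha_bar=alpha_bar, rng=rng)

# A perfect noise predictor would drive the loss to zero.
print(diffusion_loss(eps, eps))  # 0.0
```

In a real model `eps_pred` would come from the diffusion layers evaluated on `zt`, the step index, and the AR conditioning for that block.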

2.3 Training Objective

The total loss is a weighted sum of the individual objectives, $\mathcal{L} = \sum_k \lambda_k \mathcal{L}_k$, with four typical weights $\lambda_k$ [values garbled in source].

3. Inference Procedure

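Image generation proceeds blockwise, combining AR conditioning with iterative diffusion. The outline of the process is as follows. For each block $i = 1, \dots, L$ in raster order:

  1. AR pass: run layers $1 \dots (N\!-\!D)$ over the text tokens and previously generated blocks to compute $c^{(i)}$.
  2. Initialize the block from Gaussian noise, $z^{(i)}_T \sim \mathcal{N}(0, I)$.
  3. Diffusion pass: for $t = T, \dots, 1$, run layers $(N\!-\!D+1)\dots N$ to denoise $z^{(i)}_t$ into $z^{(i)}_{t-1}$, conditioned on $c^{(i)}$.
  4. Append $z^{(i)}_0$ to the generated context.

After the last block, the VAE decoder maps the assembled latents back to pixels.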

This staged approach divides the generative burden: AR produces core spatial layouts, while diffusion executes high-fidelity refinement.
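The control flow of this staged procedure can be sketched with toy stand-ins for the two halves of the network. Nothing below is the authors' implementation: `toy_ar_condition` and `toy_denoise_step` are hypothetical placeholders that only mimic the interfaces, and the shapes and step count are arbitrary.

```python
import numpy as np

def generate(num_blocks, tokens_per_block, dim, num_steps,
             ar_condition, denoise_step, rng):
    """Blockwise sampling: one AR pass per block, then num_steps denoising passes."""
    blocks = []
    for _ in range(num_blocks):
        c = ar_condition(blocks)                          # AR pass over earlier blocks
        z = rng.standard_normal((tokens_per_block, dim))  # start the block from noise
        for t in range(num_steps, 0, -1):                 # t = T, ..., 1
            z = denoise_step(z, t, c)
        blocks.append(z)                                  # becomes context for later blocks
    return np.stack(blocks)

def toy_ar_condition(prev_blocks):
    """Placeholder for the AR layers: summarize the most recent block."""
    return 0.0 if not prev_blocks else float(np.mean(prev_blocks[-1]))

def toy_denoise_step(z, t, c):
    """Placeholder for the diffusion layers: pull the sample toward the conditioning."""
    return 0.5 * z + 0.5 * c

out = generate(num_blocks=4, tokens_per_block=8, dim=2, num_steps=9,
               ar_condition=toy_ar_condition, denoise_step=toy_denoise_step,
               rng=np.random.default_rng(0))
print(out.shape)  # (4, 8, 2)
```

The point of the structure is that the AR function runs once per block while the denoiser runs once per block per step, which is where the compute trade-off between the two halves comes from.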

4. Computational Complexity and Quality-Efficiency Trade-offs

Let $N$ be the total number of Transformer layers ($N\!-\!D$ AR, $D$ diffusion), $T$ the number of diffusion steps, $L$ the number of blocks, and $C$ the per-layer evaluation cost. The regimes compare as follows:

Model Type      | Function Evals (NFE) | Computational Cost (per image)
Pure AR         | [garbled in source]  | [garbled in source]
Pure Diffusion  | [garbled in source]  | [garbled in source]
Mixed MADFormer | [garbled in source]  | [garbled in source]
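One rough way to reason about these regimes is to count Transformer layer evaluations under the two-pass scheme described earlier: each block takes one pass through the AR layers and one pass through the diffusion layers per denoising step. The helper below is an illustrative assumption, not the source's cost formula, and the layer counts in the example are hypothetical.

```python
def hybrid_layer_evals(num_layers, diff_layers, num_blocks, num_steps):
    """Count layer evaluations for a hybrid with `diff_layers` diffusion layers.

    Per block: (num_layers - diff_layers) AR layer evaluations, run once,
    plus diff_layers layer evaluations for each of num_steps denoising steps.
    """
    if not 0 <= diff_layers <= num_layers:
        raise ValueError("diff_layers must lie between 0 and num_layers")
    ar_layers = num_layers - diff_layers
    return num_blocks * (ar_layers + num_steps * diff_layers)

# Hypothetical setting: 24 layers, 16 blocks, 9 denoising steps.
for d in (6, 12, 24):
    print(d, hybrid_layer_evals(num_layers=24, diff_layers=d, num_blocks=16, num_steps=9))
# 6 1152
# 12 1920
# 24 3456
```

Under this count, moving layers from the diffusion side to the AR side lowers the work per image at a fixed step count, which is consistent with AR-heavy splits being attractive when compute is tight.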

Empirical analysis demonstrates:

  • For small NFE, AR-heavy splits (large $N\!-\!D$, small $D$) improve FID by 60–75% over pure diffusion.
  • As NFE increases, diffusion-heavy splits (larger $D$) overtake them in final image fidelity.

This relationship can be fitted by: [expression garbled in source], with parameters estimated from observed curves.

5. Experimental Results

Experiments on FFHQ-1024 and ImageNet 256 clarify the modeling regimes:

FFHQ-1024 ($1024 \times 1024$ resolution, $L=16$)

  • Diffusion-depth ablation (increasing $D$): FID = 20.2, 17.8, 16.6, 15.9.
  • Optimal block partition: $L=16$ yields FID=17.8, versus 18.9 and 21.9 for the other block counts tested.
  • Under only 9 diffusion steps (NFE $\approx$ 280): an AR-heavy split with 7 diffusion layers reaches FID $\approx$ 20 (global facial coherence with low step count).

ImageNet 256

  • Diffusion-depth ablation (increasing $D$): FID = 34.0, 30.0, 28.1, 27.4.
  • Best AR-length: $L=1$ (single block), FID=28.4. Larger $L$ fragments context, degrading quality.

In all tested settings, low NFE favors AR-heavy mixing, while increased compute allows diffusion to ultimately minimize FID further.

6. Design Principles and Best Practices

Several practical design principles are established:

  • Blockwise Partitioning: For high-res images, increase the number of AR blocks ($L=16$ at $1024 \times 1024$). For mid-res images ($256 \times 256$), a single block suffices.
  • Vertical Mixing: Early AR layers (ratio $\approx$ 3:1 for AR:Diff) are critical for quality at low compute; later diffusion layers target local refinement. Adjust the AR/diffusion layer count to fit available NFE.
  • Auxiliary Losses: A hidden loss on the AR conditioning yields a gain of about 1.6 FID; the clean tower provides a gain of about 2 FID; causal attention is necessary (removal increases FFHQ FID from 17.8 to 21.2).
  • Parameter Sharing: Distinct FFN/QKV sets for text, clean, and noisy blocks show negligible benefit; parameter sharing suffices.

The systematic fusion of AR and diffusion in MADFormer, via spatial partitioning and vertical task mixing, enables principled speed–quality tradeoffs. For constrained inference, AR-heavy hybrids dominate, while aggressive diffusion allocation produces the lowest FID with sufficient compute. These principles inform the architecture of future hybrid generative-vision models (Chen et al., 9 Jun 2025).
