D-AR: Diffusion via Autoregressive Models (2505.23660v1)

Published 29 May 2025 in cs.CV

Abstract: This paper presents Diffusion via Autoregressive models (D-AR), a new paradigm recasting the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing the tokenizer that converts images into sequences of discrete tokens, where tokens in different positions can be decoded into different diffusion denoising steps in the pixel space. Thanks to the diffusion properties, these tokens naturally follow a coarse-to-fine order, which directly lends itself to autoregressive modeling. Therefore, we apply standard next-token prediction on these tokens, without modifying any underlying designs (either causal masks or training/inference strategies), and such sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens, we can directly decode these tokens into the corresponding diffusion denoising step in the streaming manner. Our pipeline naturally reveals several intriguing properties, for example, it supports consistent previews when generating only a subset of tokens and enables zero-shot layout-controlled synthesis. On the standard ImageNet benchmark, our method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens. We hope our work can inspire future research on unified autoregressive architectures of visual synthesis, especially with LLMs. Code and models will be available at https://github.com/showlab/D-AR

Summary

  • The paper introduces a novel framework that recasts image diffusion as a standard autoregressive next-token prediction task.
  • It combines a Sequential Diffusion Tokenizer with a vanilla autoregressive model to efficiently generate high-fidelity images using 256 tokens and 8 diffusion steps.
  • Results on ImageNet demonstrate competitive FID scores and enable zero-shot controlled synthesis with streaming, coarse-to-fine generation previews.

This paper introduces Diffusion via Autoregressive models (D-AR), a novel framework that recasts the image diffusion process as a standard autoregressive next-token-prediction task. The core idea is to leverage the strengths of both diffusion models (high-quality image synthesis) and autoregressive models (efficient, scalable sequence modeling) without modifying the underlying autoregressive architecture.

The D-AR framework consists of two main components: a Sequential Diffusion Tokenizer and a standard Autoregressive Model.

Sequential Diffusion Tokenizer

The tokenizer's role is to convert images into a 1D sequence of discrete tokens, where these tokens correspond to different stages of a diffusion denoising process. This tokenizer operates directly on raw pixels and does not require an additional VAE.

  1. 1D Encoding:
    • An input image $\mathbf{I}$ is first patchified.
    • A transformer encoder $\mathcal{E}$ processes these image patches along with a set of $N$ learnable query tokens $[\mathbf{q}_1, \ldots, \mathbf{q}_N]$.
    • The output is then quantized using a vector quantizer (VQ) to produce a sequence of discrete tokens $\mathbf{z} = [\mathbf{z}_1, \ldots, \mathbf{z}_N]$.
    • The paper uses $N=256$ queries, patch size $p=16$, transformer dimension $d=768$, and $L=8$ layers for the encoder. The VQ uses a codebook of size $n_e=16384$ with $\ell_2$-normalized entries of dimension $d_e=8$.
  2. Sequential Diffusion Decoding:
    • The discrete tokens $\mathbf{z}$ are decoded back into an image using a diffusion model, specifically a diffusion transformer (similar to DiT) operating on pixel patches. This decoder is trained using a flow matching loss $\ell_{\text{fm}}$ with velocity prediction.

      $$\ell_{\text{fm}} = \mathbb{E}_{t, \mathbf{x}_0, \mathbf{x}_1} \left[ \left\| \mathbf{v}_t - \mathcal{D}_{\text{FM}}(\mathbf{x}_t, t, \mathbf{c}(t)) \right\|_2^2 \right]$$

      where $\mathbf{x}_t = t\mathbf{x}_1 + (1-t)\mathbf{x}_0$, $\mathbf{v}_t = \mathbf{x}_1 - \mathbf{x}_0$, $\mathbf{x}_0 \sim \mathcal{N}(0, 1)$ is noise, and $\mathbf{x}_1 = \mathbf{I}$ is the target image.

    • Crucially, the conditioning tokens $\mathbf{c}(t)$ for the diffusion decoder are selected from the sequence $\mathbf{z}$ based on the diffusion timestep $t$. The $N$ tokens are divided into $K$ groups $\{\mathbf{g}_1, \ldots, \mathbf{g}_K\}$. The condition schedule is defined as:

      $$\mathbf{c}(t) = \mathbf{g}_{\lceil t' \cdot K \rceil}, \quad \text{where } t' = \frac{t}{t + (1/\beta)(1-t)}$$

      A higher $\beta$ (default $\beta=2$) means earlier diffusion steps are conditioned on denser tokens. This design ensures that tokens $\mathbf{z}_i$ generated earlier by the autoregressive model correspond to earlier diffusion steps (coarse, global features), while later tokens correspond to later diffusion steps (finer details). A minimal sketch of this condition schedule is given after this list.

    • The diffusion decoder has $L_d=12$ layers, hidden dimension $d_d=768$, and patch size $p_d=8$, totaling 185M parameters. The full tokenizer has 300M parameters.

    • Additional causal transformer decoder layers are added after VQ and before diffusion decoding to introduce more non-linearity.
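
To make the condition schedule concrete, here is a minimal sketch of how a diffusion timestep $t$ is mapped to its conditioning token group. It assumes PyTorch-style tensors; the function name and the cumulative-prefix remark are illustrative, not taken from the paper's code.

```python
import math

def condition_tokens(z, t, K=8, beta=2.0):
    """Pick the conditioning group g_{ceil(t' * K)} for diffusion time t.

    z: full sequence of N (post-VQ) tokens, assumed shape [N, d].
    The warp t' = t / (t + (1/beta) * (1 - t)) allocates more token groups
    to earlier diffusion steps when beta > 1.
    """
    N = z.shape[0]
    group_size = N // K                          # e.g. 256 / 8 = 32 tokens per group
    t_warped = t / (t + (1.0 / beta) * (1.0 - t))
    k = min(max(math.ceil(t_warped * K), 1), K)  # group index clamped to 1..K
    # As literally written, c(t) is the k-th group; conditioning on the whole
    # prefix z[: k * group_size] is an equally natural streaming reading.
    return z[(k - 1) * group_size : k * group_size]
```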

Tokenizer Training:

The tokenizer is trained end-to-end to reconstruct the input image. The loss function combines flow matching, VQ loss, LPIPS perceptual loss, and REPA representation alignment loss:

$$\ell_{\text{tokenizer}} = \ell_{\text{fm}} + \ell_{\text{VQ}} + 0.5\,\ell_{\text{LPIPS}} + 0.5\,\ell_{\text{repa}}$$
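
For illustration, here is a minimal sketch of the flow matching term and the weighted combination above, assuming PyTorch tensors; `decoder` and `cond_fn` are placeholder names, not the paper's code.

```python
import torch

def flow_matching_loss(decoder, x1, cond_fn):
    """Velocity-prediction flow matching loss, following the formula above.

    decoder(x_t, t, c) is assumed to predict the velocity field;
    cond_fn(t) returns the timestep-dependent conditioning tokens c(t).
    """
    b = x1.shape[0]
    t = torch.rand(b, device=x1.device)   # t ~ U(0, 1)
    x0 = torch.randn_like(x1)             # Gaussian noise x_0
    tb = t.view(-1, 1, 1, 1)
    xt = tb * x1 + (1 - tb) * x0          # x_t = t*x_1 + (1-t)*x_0
    vt = x1 - x0                          # target velocity v_t
    pred = decoder(xt, t, cond_fn(t))
    return ((pred - vt) ** 2).mean()

def tokenizer_loss(l_fm, l_vq, l_lpips, l_repa):
    # Weighted sum with the weights stated in the paper.
    return l_fm + l_vq + 0.5 * l_lpips + 0.5 * l_repa
```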

Tokenizer Sampling (Decoding Tokens to Image):

Given a sequence of tokens (from the encoder or the AR model), the image is reconstructed by running the diffusion ODE solver. The default uses $K=8$ sampling steps (each step corresponding to one group of $N/K = 32$ tokens). The sampling timesteps $t_i$ are scheduled to be denser at earlier stages:

$$t_i = \frac{i/K}{(i/K) + \beta\,(1 - i/K)}, \quad i = 0, \ldots, K-1$$

The paper uses an Adams-Bashforth 2nd-order solver for 8 steps, achieving an rFID of 1.52.
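
Below is a hedged sketch of this sampling loop, using the timestep schedule above and a fixed-coefficient 2nd-order Adams-Bashforth update (Euler for the first step). The `decoder` signature and `z_groups` layout are assumptions, not the paper's code.

```python
import torch

def sample_image(decoder, z_groups, K=8, beta=2.0, shape=(1, 3, 256, 256)):
    """Decode a token sequence into pixels with K ODE steps.

    z_groups: list of K conditioning token groups (one per step).
    decoder(x_t, t, c) is assumed to predict the velocity at time t.
    """
    ts = [(i / K) / ((i / K) + beta * (1 - i / K)) for i in range(K)] + [1.0]
    x = torch.randn(shape)                         # start from pure noise at t = 0
    v_prev = None
    for i in range(K):
        t, dt = ts[i], ts[i + 1] - ts[i]
        v = decoder(x, torch.full((shape[0],), t), z_groups[i])
        if v_prev is None:
            x = x + dt * v                         # Euler bootstrap step
        else:
            x = x + dt * (1.5 * v - 0.5 * v_prev)  # fixed-coefficient AB2 update
        v_prev = v
    return x
```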

Autoregressive Modeling

Once images are tokenized into these diffusion-ordered sequences, a standard decoder-only transformer (Llama architecture) is trained for next-token prediction:

$$p_\theta(\mathbf{z}) = \prod_{i=1}^{N} p_\theta(\mathbf{z}_i \mid \mathbf{z}_1, \ldots, \mathbf{z}_{i-1})$$

Training uses a simple cross-entropy loss. Notably, no modifications to the AR model are needed, either to its architecture (e.g., attention masks) or to its training and inference strategies. The model uses 1D RoPE for positional embeddings, and class conditions are injected as a prefix token. Classifier-Free Guidance (CFG) is applied to the logits during AR inference. Two model sizes are explored: D-AR-L (343M params) and D-AR-XL (775M params).
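
To show how CFG on logits fits into plain next-token sampling, here is a minimal sketch; `ar_model`, `class_token`, `null_token`, and the guidance scale are assumptions, and KV caching is omitted for brevity.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_tokens(ar_model, class_token, null_token, num_tokens=256, cfg_scale=4.0):
    """Sample a token sequence with classifier-free guidance on the logits.

    ar_model(seq) is assumed to return next-token logits of shape [B, L, vocab];
    class_token / null_token are the conditional and unconditional prefix ids.
    """
    cond = torch.tensor([[class_token]])
    uncond = torch.tensor([[null_token]])
    for _ in range(num_tokens):
        logits_c = ar_model(cond)[:, -1]
        logits_u = ar_model(uncond)[:, -1]
        logits = logits_u + cfg_scale * (logits_c - logits_u)      # CFG on logits
        next_tok = torch.multinomial(F.softmax(logits, dim=-1), 1)
        cond = torch.cat([cond, next_tok], dim=1)
        uncond = torch.cat([uncond, next_tok], dim=1)
    return cond[:, 1:]                                             # drop the class prefix
```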

D-AR Framework: Properties and Advantages

The combination of the sequential diffusion tokenizer and a vanilla AR model yields several benefits:

  1. KV Cache-Friendly Inference: Standard AR inference with KV caching can be used, making generation efficient.
  2. Streaming Pixel Decoding and Consistent Previews: As the AR model generates tokens, the corresponding diffusion steps can be performed in a streaming fashion. Furthermore, one can jump-estimate the final image $\hat{\mathbf{x}}_1 = (1-t)\mathbf{v}_t + \mathbf{x}_t$ at any intermediate diffusion step $t$ to get consistent previews at almost no extra cost (see the sketch after this list). This provides a coarse-to-fine generation trajectory.
  3. Zero-Shot Controlled Synthesis: The diffusion-induced token ordering (early tokens = global structure) allows for zero-shot layout control by conditioning the AR model on a prefix of tokens from a reference image.
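
The preview mechanism in item 2 reduces to a one-line jump estimate; the sketch below only assumes the flow matching parameterization defined earlier.

```python
def preview_image(x_t, v_t, t):
    """Jump-estimate the clean image from an intermediate flow-matching state.

    With x_t = t*x_1 + (1-t)*x_0 and v_t = x_1 - x_0, the estimate
    x_1 ≈ x_t + (1-t)*v_t is exact for the true velocity and serves as a
    consistent preview when v_t is the decoder's prediction.
    """
    return x_t + (1 - t) * v_t
```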

Implementation and Experimental Results

  • Dataset: ImageNet $256 \times 256$ class-conditional generation.
  • Tokenizer Training: Trained for 210K iterations (5 days on 16 A100s).
  • AR Model Training: D-AR-L (343M) and D-AR-XL (775M) trained for 300 epochs (2-3 days on 16 A100s).
  • Tokenizer Performance: The proposed tokenizer (300M params) achieves an rFID of 1.58 with a 16384-entry codebook and 1.84 with a 4096-entry codebook, outperforming the LlamaGen tokenizer.
  • System-Level Performance:
    • D-AR-XL (775M params) achieves an FID of 2.09 on ImageNet $256 \times 256$. This is competitive with state-of-the-art methods, significantly outperforms other vanilla AR models such as LlamaGen-XXL (1.4B params, 2.34 FID), and is comparable to IBQ-XXL (2.1B params, 2.05 FID).
    • D-AR-L (343M params) achieves an FID of 2.44.
    • These results are achieved using 256 tokens and 8 diffusion sampling steps.

Practical Implications and Applications

  • Unified Vision-LLMs: The D-AR framework's adherence to standard AR practices makes it a promising candidate for building unified multimodal systems where LLMs can natively generate images without specialized components.
  • Efficient High-Quality Image Generation: By combining the generative power of diffusion with the efficiency of AR models (especially with KV caching), D-AR offers a path to faster high-fidelity image synthesis.
  • Controllable Generation: The inherent coarse-to-fine token ordering enables intuitive control over the generation process (e.g., layout) without requiring model fine-tuning. This can be valuable for interactive image editing or synthesis applications.
  • Progressive Generation: The ability to generate consistent previews allows users to see the image take shape progressively, potentially enabling early stopping or adjustments if the generation is not proceeding as desired.

Limitations

  • The models were tested up to 775M parameters. Scaling to larger AR models (common in LLMs) remains future work.
  • The paper focuses on class-conditional image generation on ImageNet. Native text-to-image generation using this framework was not explored but is highlighted as a potential future direction.

In conclusion, D-AR presents a practical and effective method for image generation by mapping the diffusion process to a sequence of discrete tokens that can be modeled by a standard autoregressive transformer. This approach successfully leverages the strengths of both paradigms, achieving competitive generation quality while maintaining the simplicity and efficiency of vanilla AR models. The code is planned to be released at https://github.com/showlab/D-AR.
