Diffusion Transformers (DiTs): Scalable Image Generation
- Diffusion Transformers (DiTs) are generative models that integrate transformer architectures with latent diffusion to overcome U-Net limitations and enhance image synthesis.
- The architecture reframes image diffusion by encoding images into latent patch tokens via a pre-trained VAE, adding positional embeddings, and applying a transformer backbone that supports flexible conditioning and efficient computation.
- Empirical results demonstrate that increased compute and finer token granularity lead to state-of-the-art results on ImageNet benchmarks, demonstrating the scalability of DiTs for high-fidelity generation.
Diffusion Transformers (DiTs) are a class of generative models that combine latent diffusion modeling with transformer architectures, aiming to overcome the design bottlenecks and scalability limitations of conventional convolutional U-Net–based diffusion backbones. By substituting the U-Net with a transformer backbone operating over sequences of latent “patch” tokens, DiTs harness the advantages of transformers, broadly adopted in vision and language modeling, for scalable, high-fidelity image generation. DiTs demonstrate empirically that model compute, measured in Gflops and governed by depth, width, and token granularity, correlates directly with sample quality as assessed by standard generative metrics.
1. Architectural Innovations and Latent Patch Tokenization
DiTs repurpose the diffusion modeling pipeline by introducing a transformer network in place of the classical U-Net. The design leverages a pre-trained variational autoencoder (VAE) to encode images into a lower-dimensional latent space (e.g., a $32 \times 32 \times 4$ latent for a $256 \times 256 \times 3$ image). This latent is further subdivided into a 1D sequence of patches (the “patchify” operation), each of which serves as an input token to a transformer. This patchification, parameterized by patch size $p$, directly controls the number of tokens $T = (I/p)^2$, where $I$ is the spatial size of the latent. Smaller patch sizes yield a larger token count, increasing transformer sequence length and thus compute, given the quadratic scaling of self-attention. Sine-cosine positional embeddings are incorporated to preserve spatial structure (a minimal patchify sketch appears after the list below). The latent transformer backbone processes these tokens through standard blocks, with support for several conditioning strategies:
- In-context conditioning and cross-attention over time-step and class-label embeddings
- Adaptive layer normalization (adaLN) schemes, including the adaLN-Zero initialization, which modulate activations based on the conditioning information
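To make the tokenization concrete, here is a minimal NumPy sketch of the patchify step together with simplified 1D sine-cosine positional embeddings; the function names are illustrative, and the paper's implementation uses a 2D sine-cosine variant applied after a learned linear embedding of each patch.

```python
import numpy as np

def patchify(latent: np.ndarray, patch_size: int) -> np.ndarray:
    """Split a VAE latent of shape (C, H, W) into a sequence of flattened patch tokens.

    With the Stable Diffusion VAE, a 256x256x3 image becomes a 4x32x32 latent;
    patch size p then yields T = (32 / p) ** 2 tokens of dimension p * p * 4.
    """
    c, h, w = latent.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "latent size must be divisible by patch size"
    # (C, H/p, p, W/p, p) -> (H/p, W/p, p, p, C) -> (T, p*p*C)
    tokens = latent.reshape(c, h // p, p, w // p, p)
    return tokens.transpose(1, 3, 2, 4, 0).reshape((h // p) * (w // p), p * p * c)

def sincos_pos_embed(num_tokens: int, dim: int) -> np.ndarray:
    """Non-learned sine-cosine positional embeddings, one vector per token (1D simplification)."""
    positions = np.arange(num_tokens)[:, None]                       # (T, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)    # (dim/2,)
    angles = positions * freqs[None, :]                              # (T, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (T, dim)

latent = np.random.randn(4, 32, 32).astype(np.float32)
tokens = patchify(latent, patch_size=2)          # (256, 16)
pos = sincos_pos_embed(tokens.shape[0], dim=16)  # in practice added after the linear embedding
print(tokens.shape, pos.shape)                   # (256, 16) (256, 16)
```

With the Stable Diffusion VAE and $p = 2$, this yields the $T = 256$ tokens processed by DiT-XL/2.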
The denoising reverse process is trained to predict the additive noise in each latent. The forward process produces

$$ z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon $$

with $\epsilon \sim \mathcal{N}(0, I)$, and the network minimizes the loss

$$ \mathcal{L}(\theta) = \mathbb{E}_{z_0, \epsilon, t}\!\left[\, \lVert \epsilon - \epsilon_\theta(z_t, t, c) \rVert^2 \,\right]. $$

Classifier-free guidance augments conditional sample quality via

$$ \hat{\epsilon}_\theta(z_t, c) = \epsilon_\theta(z_t, \varnothing) + s \cdot \big( \epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \varnothing) \big), $$

with guidance scale $s$ and conditioning $c$ (e.g., a class label).
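As a rough illustration of these formulas, the PyTorch sketch below implements the noising step, the simple noise-prediction loss, and the classifier-free guidance combination. The `model(z_t, t, y)` signature, the `alpha_bar` schedule tensor, and the `null_y` placeholder for the unconditional label are illustrative assumptions, not the paper's exact training code.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, z0, y, alpha_bar):
    """Simple DDPM noise-prediction loss on VAE latents z0 with class labels y.

    `model(z_t, t, y)` is assumed to return the predicted noise eps_theta;
    `alpha_bar` holds the cumulative schedule values (ᾱ_t), one per timestep.
    """
    b = z0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (b,), device=z0.device)
    eps = torch.randn_like(z0)
    a = alpha_bar[t].view(b, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * eps   # z_t = sqrt(ᾱ_t) z_0 + sqrt(1-ᾱ_t) ε
    return F.mse_loss(model(z_t, t, y), eps)       # ||ε - ε_θ(z_t, t, c)||²

def cfg_eps(model, z_t, t, y, null_y, scale):
    """Classifier-free guidance: push the conditional prediction away from the unconditional one."""
    eps_cond = model(z_t, t, y)
    eps_uncond = model(z_t, t, null_y)
    return eps_uncond + scale * (eps_cond - eps_uncond)
```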
2. Scalability: Compute, Tokenization, and Performance Trends
A distinguishing empirical finding is that DiTs exhibit robust scalability with respect to both architectural scale (depth, width) and tokenization granularity. Compute is measured in forward Gflops, determined by the number of transformer layers, per-layer width, and the number of input tokens. Experiments reveal a strong negative correlation between Gflops and FID (Fréchet Inception Distance): larger models with higher compute budgets and more tokens achieve consistently lower FID. This scaling trend holds across the full range of DiT configurations, from DiT-S (Small) to DiT-XL (Extra Large), and for patch sizes from $8$ down to $2$.
This observation supports a crucial distinction between capacity (parameter count) and effective compute: for DiTs, it is the Gflops—reflecting actual computational cost due to sequence length and depth—that governs generation quality, not just gross parameter scaling. Such trends provide empirical evidence that DiT sample quality is directly improved by raising the available compute budget, regardless of whether additional capacity is allocated to depth, width, or token density.
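A back-of-the-envelope estimate makes this compute scaling concrete. The Python sketch below counts only the attention and MLP matrix multiplications (as multiply-adds) and ignores the patch embedding, modulation MLPs, and final layer, so the numbers are approximate; the configuration table follows the DiT-S/B/L/XL depths and widths reported in the paper.

```python
# Rough multiply-add count for a DiT forward pass: attention + MLP matmuls only.
# (depth, hidden size) per variant, following the DiT-S/B/L/XL table in the paper.
DIT_CONFIGS = {"S": (12, 384), "B": (12, 768), "L": (24, 1024), "XL": (28, 1152)}

def approx_gflops(variant: str, patch_size: int, latent_size: int = 32) -> float:
    depth, d = DIT_CONFIGS[variant]
    t = (latent_size // patch_size) ** 2              # number of patch tokens
    attn = t * d * 3 * d + 2 * t * t * d + t * d * d  # QKV, scores + weighted sum, output proj
    mlp = 2 * t * d * 4 * d                           # two linear layers with 4x expansion
    return depth * (attn + mlp) / 1e9

for variant in DIT_CONFIGS:
    for p in (8, 4, 2):
        print(f"DiT-{variant}/{p}: ~{approx_gflops(variant, p):.1f} Gflops")
# DiT-XL/2 comes out near ~118 Gflops under this rough count,
# consistent with the reported 118.6 Gflops figure.
```

Halving the patch size quadruples the token count and (to first order) the Gflops, which is why the finest patch size dominates the compute axis of the scaling trend.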
3. State-of-the-Art Results: ImageNet Benchmarks
DiTs establish a new state of the art for class-conditional generation on the ImageNet $256 \times 256$ and $512 \times 512$ benchmarks. The DiT-XL/2 configuration, operating at about 118.6 Gflops, obtains an FID of $2.27$ on ImageNet $256 \times 256$, outperforming all prior diffusion models, including leading U-Net–based approaches such as ADM and LDM, at substantially lower computational cost in the latent space. Sample fidelity remains strong even as model size and token count are increased, affirming the practical gain obtained from scaling DiT architectures. Evaluation uses FID, sFID, Inception Score, and Precision/Recall, consistently indicating that compute, rather than U-Net–specific architectural inductive biases, drives sample quality.
4. Technical Formulation and Conditional Diffusion
DiTs implement the noise-prediction network $\epsilon_\theta$ using transformer layers that operate over latent patch tokens, augmented for conditioning as follows:
- The latent "patchify" procedure transforms the latent image into a token sequence of length .
- Each block in the transformer operates, under adaLN conditioning, as

$$ h' = h + \alpha_1 \odot \mathrm{Attn}\big( \mathrm{LN}(h)(1 + \gamma_1) + \beta_1 \big), \qquad h'' = h' + \alpha_2 \odot \mathrm{MLP}\big( \mathrm{LN}(h')(1 + \gamma_2) + \beta_2 \big), $$

where the scale, shift, and gate parameters $(\gamma_i, \beta_i, \alpha_i)$ are regressed from the sum of the time-step and label embeddings. The adaLN-Zero variant zero-initializes the gates $\alpha_i$ so that every residual block acts as the identity at initialization, improving training stability and sample quality (a minimal implementation sketch follows this list).
- Conditioning is realized through modulation of layer norm parameters, as well as token-level input augmentation for class or time-step information.
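As a concrete reference for the adaLN-Zero mechanism, here is a minimal PyTorch sketch of a single DiT block. It uses `nn.MultiheadAttention` for brevity, and the class and argument names are illustrative rather than the paper's actual implementation.

```python
import torch
import torch.nn as nn

class DiTBlockAdaLNZero(nn.Module):
    """Transformer block with adaLN-Zero conditioning (a minimal sketch).

    The conditioning vector c (sum of time-step and label embeddings) is mapped to
    six modulation signals; the output layer is zero-initialized so the gates start
    at zero and the block is the identity function at the start of training.
    """
    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )
        self.modulation = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.modulation[-1].weight)  # adaLN-Zero: shifts, scales,
        nn.init.zeros_(self.modulation[-1].bias)    # and gates all start at zero

    def forward(self, h: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # h: (B, T, dim) patch tokens; c: (B, dim) conditioning embedding
        gamma1, beta1, alpha1, gamma2, beta2, alpha2 = self.modulation(c).chunk(6, dim=-1)
        x = self.norm1(h) * (1 + gamma1.unsqueeze(1)) + beta1.unsqueeze(1)
        h = h + alpha1.unsqueeze(1) * self.attn(x, x, x, need_weights=False)[0]
        x = self.norm2(h) * (1 + gamma2.unsqueeze(1)) + beta2.unsqueeze(1)
        return h + alpha2.unsqueeze(1) * self.mlp(x)
```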
5. Implications: Architectural Unification and Future Directions
This body of evidence directly challenges the assertion that convolutional, spatially localized inductive biases inherent to U-Nets are strictly necessary for high-fidelity image diffusion. The demonstrated efficacy of a wholly transformer-based backbone in the latent domain supports architectural unification across NLP and vision, enabling diffusion model scaling with the same methodologies (deeper, wider, longer transformers, etc.) that have propelled LLMs.
Directions highlighted as promising by these results include:
- Exploring larger, deeper transformers with finer patch sizes to further exploit the scaling trend.
- Integrating DiTs into text-to-image and multimodal frameworks (e.g., as backbones for models like DALL·E 2 or Stable Diffusion) in lieu of U-Nets.
- Investigating novel conditioning mechanisms (e.g., CLIP-guided or modality-bridging embeddings) that may be more naturally realized in a transformer architecture.
- Examining the transferability of pre-trained DiT backbones across domains and resolutions, following the paradigm in vision transformers and LLMs.
In conclusion, DiTs represent a demonstrably scalable and high-performing generative model for image synthesis, powered by the transformer’s capacity to process latent patch tokens, modulate conditional information flexibly, and scale compute efficiently, providing a foundation for continued advancement in generative modeling beyond the limits of convolutional architectures (Peebles et al., 2022).