Scaling Diffusion Transformers to 16 Billion Parameters
In the paper "Scaling Diffusion Transformers to 16 Billion Parameters," the authors present DiT-MoE, a sparse variant of the Diffusion Transformer (DiT) designed for image generation. Diffusion models achieve strong performance across generative tasks, yet their computational demands, especially at large scale, pose significant challenges. The authors address this through sparse computation via Mixture-of-Experts (MoE), reducing inference cost while maintaining performance competitive with dense networks.
Contributions and Methodology
The paper makes several noteworthy contributions:
- MoE for Diffusion Transformers: The primary contribution of this work is the introduction of DiT-MoE, a sparsely activated diffusion Transformer designed for image synthesis. DiT-MoE replaces a subset of dense feedforward layers with sparse MoE layers, in which each image patch token is routed to a small subset of experts, each implemented as an MLP.
- Shared Expert Routing and Expert-Level Balance Loss: To train the sparse model effectively, two principal strategies are employed: shared expert routing, which captures common knowledge across experts, and an expert-level balance loss, which reduces redundancy by encouraging balanced utilization of the routed experts. Both are illustrated in the sketch after this list.
- Expert Specialization Analysis: The authors provide a comprehensive analysis of expert routing, observing that expert selection depends on spatial position and denoising timestep, and becomes more uniform in deeper MoE layers. An interesting observation is that experts specialize most at early timesteps, with selection gradually becoming more uniform as denoising progresses.
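To make the routing and the auxiliary loss concrete, here is a minimal PyTorch-style sketch of a sparse MoE feedforward block with top-k routing, always-active shared experts, and a Switch-style expert-level balance loss. It is an illustration under assumed hyperparameters (num_experts, num_shared, top_k, balance_weight), not the authors' implementation.

```python
# Minimal sketch of a DiT-MoE-style feedforward block (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEBlock(nn.Module):
    def __init__(self, dim, hidden_dim, num_experts=8, num_shared=2, top_k=2, balance_weight=0.01):
        super().__init__()
        self.top_k = top_k
        self.balance_weight = balance_weight
        # Routed experts: plain MLPs, only top_k are evaluated per token.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])
        # Shared experts: applied to every token, intended to capture common knowledge.
        self.shared = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_shared)
        ])
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x):
        # x: (batch, tokens, dim) patch tokens
        b, t, d = x.shape
        flat = x.reshape(-1, d)

        logits = self.router(flat)                         # (b*t, num_experts)
        probs = F.softmax(logits, dim=-1)
        topk_p, topk_i = probs.topk(self.top_k, dim=-1)    # route each token to its top-k experts
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True) # renormalize gates over the chosen experts

        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            mask = (topk_i == e)                           # tokens that selected expert e (and in which slot)
            token_idx, slot_idx = mask.nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            gate = topk_p[token_idx, slot_idx].unsqueeze(-1)
            out[token_idx] += gate * expert(flat[token_idx])

        # Expert-level balance loss: penalize the product of (fraction of tokens dispatched
        # to each expert) and (mean router probability for that expert), summed over experts.
        with torch.no_grad():
            counts = torch.zeros(len(self.experts), device=x.device)
            counts.scatter_add_(0, topk_i.reshape(-1),
                                torch.ones_like(topk_i.reshape(-1), dtype=counts.dtype))
            frac_tokens = counts / counts.sum()
        frac_probs = probs.mean(dim=0)
        balance_loss = self.balance_weight * len(self.experts) * (frac_tokens * frac_probs).sum()

        # Shared experts see every token unconditionally.
        for expert in self.shared:
            out += expert(flat)

        return out.reshape(b, t, d), balance_loss
```

The per-expert Python loop is written for readability; efficient implementations typically use batched dispatch with capacity limits instead, but the routing logic and the balance-loss computation are the same.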
Experimental Results
The authors validate DiT-MoE on class-conditional image generation using ImageNet at resolutions of 256×256 and 512×512. Models are evaluated primarily with FID-50K, and DiT-MoE outperforms comparable dense models.
At the largest scale, the authors grow DiT-MoE to 16.5 billion parameters, of which only 3.1 billion are activated during inference, achieving a state-of-the-art FID-50K score of 1.80 at 512×512 resolution. This result underscores the effectiveness of sparse scaling: generative performance improves without a commensurate increase in inference cost, as the rough parameter accounting sketched below illustrates.
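The gap between total and activated parameters follows directly from top-k routing: all experts contribute to the total count, but each token only touches the non-expert weights, the shared experts, and its k routed experts. The sketch below makes that accounting explicit for a generic top-k MoE transformer; the configuration values are placeholders chosen for illustration, not the paper's actual DiT-MoE settings.

```python
# Back-of-the-envelope accounting for total vs. activated parameters in a top-k MoE
# transformer. All numbers below are placeholders, not DiT-MoE's real configuration.

def moe_param_counts(dense_params: float, expert_params: float,
                     num_experts: int, top_k: int, num_shared: int = 0):
    """dense_params: parameters outside the expert MLPs (attention, embeddings, ...).
    expert_params: parameters of one expert MLP, summed over all MoE layers."""
    total = dense_params + (num_experts + num_shared) * expert_params
    activated = dense_params + (top_k + num_shared) * expert_params
    return total, activated

# Illustrative only: with many experts but a small top_k, the total parameter count
# grows far faster than the per-token activated count.
total, activated = moe_param_counts(dense_params=1.0e9, expert_params=0.5e9,
                                    num_experts=31, top_k=2, num_shared=1)
print(f"total ≈ {total / 1e9:.1f}B, activated ≈ {activated / 1e9:.1f}B")
```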
Implications and Future Directions
The findings in this paper have both practical and theoretical implications:
- Practical Implications: DiT-MoE significantly reduces the computational load during inference while achieving state-of-the-art performance on image generation benchmarks. This efficiency has the potential to alleviate the high costs and environmental impact associated with training and deploying large-scale diffusion models.
- Theoretical Implications: The paper provides insights into the internal routing mechanisms of MoE layers in diffusion transformers, highlighting the temporal and spatial factors influencing expert selection. This understanding can inform future model designs to better leverage expert specialization.
Looking forward, the research opens several avenues. Future work can explore more efficient training protocols that mitigate the loss spikes observed when scaling the number of experts. Heterogeneous expert architectures and improved knowledge distillation are also promising directions. Overall, this work establishes a foundational approach to scalable, efficient generative modeling with diffusion transformers, and it anticipates broader adoption and continued innovation in sparse model scaling.