Scaling Diffusion Transformers to 16 Billion Parameters
In the paper "Scaling Diffusion Transformers to 16 Billion Parameters," the authors present DiT-MoE, a sparse variant of the Diffusion Transformer (DiT) designed for image generation. Diffusion models achieve strong performance across generative tasks, yet their computational demands, especially at large scale, pose significant challenges. The authors address this through sparse computation via Mixture-of-Experts (MoE), reducing inference cost while maintaining performance competitive with dense networks.
Contributions and Methodology
The paper makes several noteworthy contributions:
- MoE for Diffusion Transformers: The primary contribution of this work is the introduction of DiT-MoE, a sparsely activated diffusion Transformer designed for image synthesis. DiT-MoE replaces a subset of dense feedforward layers with sparse MoE layers, in which each image patch token is routed to a small subset of experts, each implemented as an MLP.
- Shared Expert Routing and Expert-Level Balance Loss: To train the sparse model effectively, two principal strategies are employed: shared expert routing, which captures common knowledge across experts, and an expert-level balance loss, which reduces redundancy by encouraging balanced utilization of the routed experts. Both are illustrated in the sketch after this list.
- Expert Specialization Analysis: The authors provide a comprehensive analysis of expert routing, observing that expert selection depends on spatial position and denoising timestep, and becomes more uniform in deeper MoE layers. An interesting observation is that experts specialize most at early timesteps, with selection gradually becoming more uniform as denoising progresses.
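To make the routing and the auxiliary loss concrete, here is a minimal PyTorch-style sketch of a sparse MoE feedforward block with top-k routing, always-active shared experts, and a Switch-style expert-level balance loss. It is an illustration under assumed hyperparameters (num_experts, num_shared, top_k, balance_weight), not the authors' implementation.

```python
# Minimal sketch of a DiT-MoE-style feedforward block (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEBlock(nn.Module):
    def __init__(self, dim, hidden_dim, num_experts=8, num_shared=2, top_k=2, balance_weight=0.01):
        super().__init__()
        self.top_k = top_k
        self.balance_weight = balance_weight
        # Routed experts: plain MLPs, only top_k are evaluated per token.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])
        # Shared experts: applied to every token, intended to capture common knowledge.
        self.shared = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_shared)
        ])
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x):
        # x: (batch, tokens, dim) patch tokens
        b, t, d = x.shape
        flat = x.reshape(-1, d)

        logits = self.router(flat)                         # (b*t, num_experts)
        probs = F.softmax(logits, dim=-1)
        topk_p, topk_i = probs.topk(self.top_k, dim=-1)    # route each token to its top-k experts
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True) # renormalize gates over the chosen experts

        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            mask = (topk_i == e)                           # tokens that selected expert e (and in which slot)
            token_idx, slot_idx = mask.nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            gate = topk_p[token_idx, slot_idx].unsqueeze(-1)
            out[token_idx] += gate * expert(flat[token_idx])

        # Expert-level balance loss: penalize the product of (fraction of tokens dispatched
        # to each expert) and (mean router probability for that expert), summed over experts.
        with torch.no_grad():
            counts = torch.zeros(len(self.experts), device=x.device)
            counts.scatter_add_(0, topk_i.reshape(-1),
                                torch.ones_like(topk_i.reshape(-1), dtype=counts.dtype))
            frac_tokens = counts / counts.sum()
        frac_probs = probs.mean(dim=0)
        balance_loss = self.balance_weight * len(self.experts) * (frac_tokens * frac_probs).sum()

        # Shared experts see every token unconditionally.
        for expert in self.shared:
            out += expert(flat)

        return out.reshape(b, t, d), balance_loss
```

The per-expert Python loop is written for readability; efficient implementations typically use batched dispatch with capacity limits instead, but the routing logic and the balance-loss computation are the same.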
Experimental Results
The authors validate DiT-MoE on class-conditional image generation using ImageNet at resolutions of 256×256 and 512×512. Models are evaluated primarily with FID-50K, and DiT-MoE outperforms comparable dense models.
At the largest scale, the authors grow DiT-MoE to 16.5 billion parameters, of which only 3.1 billion are activated during inference, achieving a state-of-the-art FID-50K score of 1.80 at 512×512 resolution. This result underscores the effectiveness of sparse scaling: generative performance improves without a commensurate increase in inference cost, as the rough parameter accounting sketched below illustrates.
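The gap between total and activated parameters follows directly from top-k routing: all experts contribute to the total count, but each token only touches the non-expert weights, the shared experts, and its k routed experts. The sketch below makes that accounting explicit for a generic top-k MoE transformer; the configuration values are placeholders chosen for illustration, not the paper's actual DiT-MoE settings.

```python
# Back-of-the-envelope accounting for total vs. activated parameters in a top-k MoE
# transformer. All numbers below are placeholders, not DiT-MoE's real configuration.

def moe_param_counts(dense_params: float, expert_params: float,
                     num_experts: int, top_k: int, num_shared: int = 0):
    """dense_params: parameters outside the expert MLPs (attention, embeddings, ...).
    expert_params: parameters of one expert MLP, summed over all MoE layers."""
    total = dense_params + (num_experts + num_shared) * expert_params
    activated = dense_params + (top_k + num_shared) * expert_params
    return total, activated

# Illustrative only: with many experts but a small top_k, the total parameter count
# grows far faster than the per-token activated count.
total, activated = moe_param_counts(dense_params=1.0e9, expert_params=0.5e9,
                                    num_experts=31, top_k=2, num_shared=1)
print(f"total ≈ {total / 1e9:.1f}B, activated ≈ {activated / 1e9:.1f}B")
```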
Implications and Future Directions
The findings in this paper have both practical and theoretical implications:
- Practical Implications: DiT-MoE significantly reduces the computational load during inference while achieving state-of-the-art performance on image generation benchmarks. This efficiency has the potential to alleviate the high costs and environmental impact associated with training and deploying large-scale diffusion models.
- Theoretical Implications: The paper provides insights into the internal routing mechanisms of MoE layers in diffusion transformers, highlighting the temporal and spatial factors influencing expert selection. This understanding can inform future model designs to better leverage expert specialization.
Looking forward, the research opens several avenues. Future work can explore more efficient training protocols that mitigate the loss spikes observed when scaling the number of experts. Heterogeneous expert architectures and improved knowledge distillation are also promising directions. Overall, this work establishes a foundational approach to scalable, efficient generative modeling with diffusion transformers, and it anticipates broader adoption and continued innovation in sparse model scaling.