- The paper introduces an adaptive expert-choice routing mechanism that scales diffusion transformers up to 97B parameters while improving inference speed and image quality.
- The methodology leverages global context in routing decisions to allocate computation adaptively based on token complexity in text-to-image synthesis.
- The model achieves a GenEval score of 71.68%, outperforming previous models and setting a benchmark for efficient large-scale image synthesis.
The paper presents an approach to scaling diffusion transformers (DiT) for text-to-image synthesis through adaptive expert-choice routing, termed EC-DiT. The research addresses critical challenges in scaling diffusion models by exploiting computational heterogeneity, enabling them to operate efficiently at unprecedented scales of up to 97 billion parameters.
Methodology
The paper introduces a sparsely scaled DiT model employing expert-choice routing, a departure from the conventional token-choice routing typically used in Mixture-of-Experts (MoE) configurations. This methodology allows for adaptive computation allocation, distinguishing it from traditional approaches that uniformly distribute computational resources across tokens.
The expert-choice routing leverages global context information from image sequences, selectively activating a subset of experts to process tokens of varying importance and complexity. This is well aligned with the characteristics of diffusion models, which process complete image sequences in a single pass rather than token by token. The adaptive nature of this routing enables more efficient scaling, maintaining inference speed while substantially increasing model capacity.
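The core distinction described above can be illustrated in a minimal sketch: in expert-choice routing, each expert selects its top-capacity tokens from the full sequence (using global context), rather than each token independently picking experts. This is a simplified illustration under assumed shapes and a linear router; the names `w_router` and `capacity`, and the normalization details, are assumptions and not the paper's actual implementation.

```python
import numpy as np

def expert_choice_route(tokens, w_router, capacity):
    """Sketch of expert-choice routing.

    tokens:   (n_tokens, d)        full image-token sequence (global context)
    w_router: (d, n_experts)       assumed linear routing projection
    capacity: int                  how many tokens each expert processes
    """
    scores = tokens @ w_router                      # (n_tokens, n_experts)
    # Normalize over tokens per expert, so each expert ranks the whole sequence.
    probs = np.exp(scores - scores.max(axis=0, keepdims=True))
    probs /= probs.sum(axis=0, keepdims=True)
    # Each expert picks its top-`capacity` tokens; complex tokens can be
    # selected by many experts, simple tokens by few or none.
    chosen = np.argsort(-probs, axis=0)[:capacity]  # (capacity, n_experts)
    return chosen, probs

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))        # 16 tokens, model dim 8
w = rng.normal(size=(8, 4))              # 4 experts
chosen, probs = expert_choice_route(tokens, w, capacity=4)
```

Note the contrast with token-choice routing, where the top-k would be taken along the expert axis per token: there, every token consumes the same compute, whereas here per-token compute varies with how many experts select it.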
Numerical Results
The paper reports substantial improvements in scalability and performance. EC-DiT exhibits faster loss convergence and better image quality than both dense models and traditional token-choice MoE models. In text-to-image alignment evaluation, the model achieves a GenEval score of 71.68%, outperforming existing models such as SD3, and maintains competitive language understanding without compromising inference speed, a noteworthy result for a model of this scale.
Implications and Future Work
The research demonstrates that diffusion transformers can be scaled effectively by tailoring computation allocation to image complexities, providing a pathway to more efficient large-scale models. Practically, this approach could significantly impact applications requiring detailed and high-resolution image synthesis, such as digital art generation and advanced graphical content creation.
From a theoretical perspective, the introduction of global context into routing decisions presents a new paradigm for designing sparse models. This could inspire further research into optimizing expert allocation based on diverse contextual cues.
Future work may explore integrating additional factors, such as semantic understanding or compositionality, into EC-DiT’s routing decisions, potentially extending the model’s capabilities. Additionally, refining the adaptive routing mechanism to better exploit multimodal information could yield further gains in generation quality and efficiency.
Conclusion
This paper presents a substantial advancement in the scaling of diffusion transformers through the introduction of adaptive expert-choice routing. By aligning computational resource allocation with image complexity, EC-DiT sets a new benchmark for large model efficiency and text-to-image synthesis quality, opening new avenues for both application and theoretical exploration in the field of AI and machine learning.