- The paper introduces an adaptive expert-choice routing mechanism that scales diffusion transformers up to 97B parameters while improving inference speed and image quality.
- The methodology leverages global context in routing decisions to allocate computation adaptively based on token complexity in text-to-image synthesis.
- The model achieves a GenEval score of 71.68%, outperforming previous models and setting a benchmark for efficient large-scale image synthesis.
The paper presents an approach to scaling diffusion transformers (DiT) for text-to-image synthesis through adaptive expert-choice routing, termed EC-DiT. The research addresses critical challenges in scaling diffusion models by exploiting computational heterogeneity, enabling them to operate efficiently at unprecedented scales of up to 97 billion parameters.
Methodology
The paper introduces a sparsely scaled DiT model employing expert-choice routing, a departure from the conventional token-choice routing typically used in Mixture-of-Experts (MoE) configurations. This methodology allows for adaptive computation allocation, distinguishing it from traditional approaches that uniformly distribute computational resources across tokens.
The expert-choice routing leverages global context information from image sequences, selectively activating a subset of experts to process tokens of varying importance and complexity. This is well aligned with the characteristics of diffusion models, which process complete image sequences in a single pass rather than token by token. The adaptive nature of this routing enables more efficient scaling, maintaining inference speed while substantially increasing model capacity.
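The core distinction described above can be illustrated in a minimal sketch: in expert-choice routing, each expert selects its top-capacity tokens from the full sequence (using global context), rather than each token independently picking experts. This is a simplified illustration under assumed shapes and a linear router; the names `w_router` and `capacity`, and the normalization details, are assumptions and not the paper's actual implementation.

```python
import numpy as np

def expert_choice_route(tokens, w_router, capacity):
    """Sketch of expert-choice routing.

    tokens:   (n_tokens, d)        full image-token sequence (global context)
    w_router: (d, n_experts)       assumed linear routing projection
    capacity: int                  how many tokens each expert processes
    """
    scores = tokens @ w_router                      # (n_tokens, n_experts)
    # Normalize over tokens per expert, so each expert ranks the whole sequence.
    probs = np.exp(scores - scores.max(axis=0, keepdims=True))
    probs /= probs.sum(axis=0, keepdims=True)
    # Each expert picks its top-`capacity` tokens; complex tokens can be
    # selected by many experts, simple tokens by few or none.
    chosen = np.argsort(-probs, axis=0)[:capacity]  # (capacity, n_experts)
    return chosen, probs

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))        # 16 tokens, model dim 8
w = rng.normal(size=(8, 4))              # 4 experts
chosen, probs = expert_choice_route(tokens, w, capacity=4)
```

Note the contrast with token-choice routing, where the top-k would be taken along the expert axis per token: there, every token consumes the same compute, whereas here per-token compute varies with how many experts select it.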
Numerical Results
The paper reports substantial improvements in scalability and performance. EC-DiT exhibits faster loss convergence and better image quality than both dense models and traditional token-choice MoE models. In text-to-image alignment evaluation, the model achieves a GenEval score of 71.68%, outperforming existing models such as SD3, and maintains competitive language understanding without compromising inference speed, a noteworthy result for a model of this scale.
Implications and Future Work
The research demonstrates that diffusion transformers can be scaled effectively by tailoring computation allocation to image complexities, providing a pathway to more efficient large-scale models. Practically, this approach could significantly impact applications requiring detailed and high-resolution image synthesis, such as digital art generation and advanced graphical content creation.
From a theoretical perspective, the introduction of global context into routing decisions presents a new paradigm for designing sparse models. This could inspire further research into optimizing expert allocation based on diverse contextual cues.
Future work may explore integrating additional factors, such as semantic understanding or compositionality, into EC-DiT’s routing decisions, potentially extending the model’s capabilities. Additionally, refining the adaptive routing mechanism to better exploit multimodal information could yield further gains in generation quality and efficiency.
Conclusion
This paper presents a substantial advancement in the scaling of diffusion transformers through the introduction of adaptive expert-choice routing. By aligning computational resource allocation with image complexity, EC-DiT sets a new benchmark for large model efficiency and text-to-image synthesis quality, opening new avenues for both application and theoretical exploration in the field of AI and machine learning.