Scaling Diffusion Transformers Efficiently via $\mu$P (2505.15270v1)

Published 21 May 2025 in cs.LG, cs.AI, and cs.CV

Abstract: Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ architecturally and objectively. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $\mu$P outperform their respective baselines while requiring small tuning cost, only 5.5% of one training run for PixArt-$\alpha$ and 3% of consumption by human experts for MMDiT-18B. These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers.

Summary

Scaling Diffusion Transformers Efficiently via μP

In the paper Scaling Diffusion Transformers Efficiently via μP, the authors investigate the application of Maximal Update Parametrization (μP) to diffusion transformers, aiming to efficiently scale these models. Diffusion transformers have become integral to generative modeling in the vision domain, providing scalable architectures for tasks like image and video generation. However, their performance at scale is hindered by the high cost associated with hyperparameter tuning. The paper proposes a solution by adapting μP, originally developed for vanilla transformers, to diffusion transformers.
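
To make the parametrization concrete, the following is a minimal PyTorch sketch of the standard μP recipe applied to a Transformer-style MLP block; it illustrates the general μP rules rather than the authors' exact implementation. Hidden weights keep a 1/fan_in initialization, the readout is zero-initialized and divided by the width multiplier in the forward pass, and hidden weight matrices get their Adam learning rate scaled by the inverse width multiplier. The names BASE_WIDTH, MuPMLP, and mup_param_groups are illustrative, not from the paper.

```python
# Minimal sketch of muP-style width scaling for a Transformer MLP block.
# Illustrative only: BASE_WIDTH and all module names are placeholders.
import math
import torch
import torch.nn as nn

BASE_WIDTH = 256  # width of the small proxy model used for HP tuning


class MuPMLP(nn.Module):
    def __init__(self, width: int, out_dim: int):
        super().__init__()
        self.width_mult = width / BASE_WIDTH
        self.fc_in = nn.Linear(width, 4 * width)
        self.fc_hidden = nn.Linear(4 * width, width)
        self.readout = nn.Linear(width, out_dim)

        # Hidden weights: variance ~ 1/fan_in, as prescribed by muP.
        for layer in (self.fc_in, self.fc_hidden):
            nn.init.normal_(layer.weight, std=1.0 / math.sqrt(layer.in_features))
            nn.init.zeros_(layer.bias)
        # Readout: zero init plus a 1/width_mult multiplier in the forward
        # pass, so the output scale stays O(1) as width grows.
        nn.init.zeros_(self.readout.weight)
        nn.init.zeros_(self.readout.bias)

    def forward(self, x):
        h = torch.relu(self.fc_in(x))
        h = self.fc_hidden(h)
        return self.readout(h) / self.width_mult


def mup_param_groups(model: MuPMLP, base_lr: float):
    """Per-layer Adam learning rates under muP: hidden weight matrices get
    their LR divided by width_mult; the readout (already rescaled in the
    forward pass) and all biases keep the base LR."""
    hidden = [model.fc_in.weight, model.fc_hidden.weight]
    rest = [p for p in model.parameters() if not any(p is h for h in hidden)]
    return [
        {"params": hidden, "lr": base_lr / model.width_mult},
        {"params": rest, "lr": base_lr},
    ]
```

Under these rules, activation and update magnitudes stay roughly width-invariant, which is the property that lets a learning rate tuned on a small proxy carry over to the wide model.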

Core Contributions and Results

  1. Generalizing μP to Diffusion Transformers: The paper rigorously proves that the μP formulation for mainstream diffusion transformers, such as DiT, U-ViT, PixArt-α, and MMDiT, aligns with that of vanilla transformers. This compatibility enables the direct application of existing μP methodologies, facilitating robust hyperparameter transferability across model scales.
  2. Performance Improvements: The authors demonstrate that adopting μP in diffusion transformers yields substantial gains in training efficiency. For instance, DiT-XL-2-μP with a transferred learning rate converged 2.9 times faster than the original DiT-XL-2.
  3. Scaling Text-to-Image Applications: The paper validates the effectiveness of μP in practical applications by scaling PixArt-α from 0.04B to 0.61B parameters and MMDiT from 0.18B to 18B parameters. In both cases, the μP models outperformed their baselines at a fraction of the usual tuning cost: roughly 5.5% of one full training run for PixArt-α and about 3% of the tuning consumption of human experts for MMDiT-18B (see the workflow sketch after this list).
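
As noted in item 3, the practical payoff of μP is a cheap tuning loop: sweep the base learning rate on a narrow proxy model, then reuse the best value unchanged when instantiating the full-width model. The sketch below illustrates that workflow using the hypothetical MuPMLP and mup_param_groups helpers from the earlier snippet; the widths, learning-rate grid, and random stand-in data are illustrative and do not reflect the paper's actual settings.

```python
# Hypothetical muTransfer-style workflow built on the MuPMLP sketch above:
# tune the base learning rate on a narrow proxy, then reuse it at full width.
import torch


def train_and_score(width: int, base_lr: float, steps: int = 200) -> float:
    model = MuPMLP(width=width, out_dim=10)
    opt = torch.optim.Adam(mup_param_groups(model, base_lr))
    loss_fn = torch.nn.MSELoss()
    for _ in range(steps):
        x = torch.randn(32, width)       # stand-in for real training data
        target = torch.randn(32, 10)
        loss = loss_fn(model(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()


# 1) Cheap learning-rate sweep on the proxy width (here equal to BASE_WIDTH).
candidates = [3e-4, 1e-3, 3e-3, 1e-2]
best_lr = min(candidates, key=lambda lr: train_and_score(width=256, base_lr=lr))

# 2) Under muP the same base LR is reused directly at the target width.
large_model = MuPMLP(width=4096, out_dim=10)
large_opt = torch.optim.Adam(mup_param_groups(large_model, best_lr))
```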

Implications and Future Directions

The findings of this paper have significant implications for the field of generative modeling, especially as models continue to grow in size. Reducing the computational and time costs associated with hyperparameter tuning opens up pathways for more extensive and cost-efficient deployment of diffusion models, potentially revolutionizing applications in various domains such as art creation, virtual reality, and even synthetic data generation.

Theoretically, the paper lays the groundwork for future research into the scaling behaviors of diffusion transformers and the development of more sophisticated scaling laws that could further optimize large model training. Additionally, the robust results from applying μP suggest that similar approaches could be explored for other types of models and tasks beyond vision, such as audio or multimodal synthesis.

Conclusion

Scaling Diffusion Transformers Efficiently via μP introduces a practical and theoretically sound method for scaling diffusion transformers by leveraging the μP framework. The research provides a pathway for efficient hyperparameter tuning across different model scales, enhancing training speed and reducing computational costs. This work will likely catalyze further innovation in efficient model scaling, with broader implications for the future of artificial intelligence and generative models.
