- The paper identifies scaling laws for video diffusion transformers that predict optimal performance based on model size and compute budget.
- It reveals that video models are highly sensitive to learning rate and batch size, necessitating refined tuning strategies beyond conventional methods.
- The proposed scaling law reduces inference costs by 40.1%, offering a practical framework to balance model size, data, and computational constraints.
The paper "Towards Precise Scaling Laws for Video Diffusion Transformers" by Yuanyang Yin et al. investigates how Video Diffusion Transformers (VDTs) scale, focusing on the trade-offs among model size, data size, and compute budget, and on choosing optimal model sizes and training hyperparameters. The research matters because video diffusion models are expensive to train, so determining the optimal configuration before committing to a large-scale run is essential.
Key Contributions
- Identification of Scaling Laws: The authors establish the presence of scaling laws for video diffusion transformers, analogous to those identified for LLMs, thus providing a framework for predicting model performance based on size and training resources.
- Sensitivities in Hyperparameter Tuning: Unlike LLMs, video diffusion models exhibit heightened sensitivity to learning rate and batch size. This sensitivity is often inadequately modeled by conventional scaling laws, leading to suboptimal performance predictions.
- Proposed Scaling Law for Hyperparameters: A novel scaling law is introduced, which predicts optimal hyperparameters (learning rate and batch size) for any model size and compute budget. This law results in a 40.1% reduction in inference costs compared to traditional methods, under a constrained compute budget of 1×10^10 TFlops.
- Generalized Relationship for Validation Loss: The research delineates a precise relationship between validation loss, model size, and compute budget. This relationship facilitates accurate performance predictions even for non-optimal sizes, thereby enhancing the decision-making process when faced with practical inference constraints.
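Scaling laws of the kind described above are typically fitted as power laws with an irreducible-loss floor. The sketch below illustrates the general fitting procedure on synthetic data; the functional form, coefficients, and data points are placeholders for illustration, not the paper's fitted law.

```python
# Illustrative sketch: fitting a power-law scaling curve
#   L(C) = a * C**(-b) + c
# to observed (compute, validation loss) pairs, then extrapolating to a
# larger budget. All numbers here are synthetic, not from the paper.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(compute, a, b, c):
    # Power law in compute with an irreducible-loss floor c.
    return a * compute ** (-b) + c

# Synthetic observations: loss improves with compute toward a floor of 0.3.
compute = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
loss = 2.0 * compute ** (-0.1) + 0.3

# Fit the three coefficients from the small runs.
params, _ = curve_fit(scaling_law, compute, loss, p0=[1.0, 0.1, 0.1])
a, b, c = params

# Predict validation loss at a 10x larger compute budget before training.
predicted = scaling_law(1e11, a, b, c)
```

The practical value is the last line: once the curve is fitted on cheap small-scale runs, the expected loss of a much larger run can be read off before any expensive training begins.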
Methodology
The paper employs a systematic approach: a traditional scaling law drawn from LLMs is first applied to video diffusion transformers and assessed for applicability. Initial results revealed inaccuracies caused by improperly tuned hyperparameters. The authors therefore derived a new scaling law that models optimal hyperparameters explicitly as a function of model size and data size. Empirical results validate that, with hyperparameters optimized, the model size predicted for a given compute budget uses significantly fewer parameters without compromising performance.
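The core idea of making hyperparameter tuning an explicit function of scale can be sketched as follows. The functional form and all exponents/coefficients below are hypothetical placeholders, not the paper's fitted values; the point is only that, once such laws are fitted, optimal settings can be computed directly for any configuration.

```python
# Hypothetical sketch of hyperparameters expressed as power laws of model
# size and compute budget. Coefficients and exponents are placeholders,
# not the paper's fitted values.
def optimal_hyperparams(model_params, compute_tflops,
                        lr_coef=0.01, lr_exp=-0.25,
                        bs_coef=0.5, bs_exp=0.35):
    """Return (learning_rate, batch_size) from illustrative power laws."""
    lr = lr_coef * model_params ** lr_exp      # smaller lr for larger models
    bs = round(bs_coef * compute_tflops ** bs_exp)  # larger batches with more compute
    return lr, bs

# Read off settings for a 100M-parameter model at a 1e10-TFlop budget
# before launching the run.
lr, bs = optimal_hyperparams(model_params=1e8, compute_tflops=1e10)
```

The design choice this illustrates: instead of sweeping learning rates and batch sizes at full scale, the sweep happens once at small scale to fit the exponents, after which settings for any budget come from a closed-form evaluation.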
Implications and Future Directions
The findings have both practical and theoretical implications. Practically, the paper provides methodologies for reducing computational costs while maintaining model efficacy, which is important for deploying generative models in real-world applications. Theoretically, it deepens the understanding of scaling behaviors in video generation models, which differ from those in text and static image domains. Future research could explore:
- Extension to Higher Resolutions and Larger Models: Further investigation is warranted to assess scalability in higher resolutions and larger parameter spaces, potentially demanding more nuanced hyperparameter tuning strategies.
- Alternative Scheduling for Learning Rates: While this paper focuses on constant learning rates, adopting learning rate decay strategies could offer more efficient training regimes, albeit at increased computational demands.
- Impact of Video Specificities: Examining the influence of various video characteristics, such as resolution and frame rate, on the scaling laws would provide deeper insights into the generalized applicability of the proposed scaling laws.
Overall, the paper significantly contributes to the efficient scaling of video diffusion models by refining and expanding the understanding of scaling laws, thus fostering advancements in efficient video generation technologies.