- The paper establishes that parameter-data alignment is critical for determining stable per-layer learning rates across various network parameterizations.
- An extensive empirical study, spanning models from millions to billions of parameters, reveals distinct scaling behaviors and alignment patterns.
- The study introduces Adam-atan2, a scale-invariant variant of Adam that removes the need to tune the epsilon hyperparameter.
Overview of "Scaling Exponents Across Parameterizations and Optimizers"
The paper "Scaling Exponents Across Parameterizations and Optimizers," authored by Katie Everett et al., provides an in-depth exploration into the ramifications of model parameterization and optimizer choices on the scaling behavior of large neural networks. By proposing a novel perspective and conducting an extensive empirical investigation, this paper explores the intricacies of how learning rate prescriptions impact training dynamics across varying scales.
Summary
The paper aims to unify the understanding of parameterization and optimizer scaling by investigating how parameter-data alignment affects learning rates. It posits that alignment plays a critical role in determining stable and optimal learning rates, which in turn governs how effectively a model scales across widths.
Theoretical Contributions
The paper introduces a general space of parameterizations that quantifies alignment contributions using three variables: α_l, ω_l, and u_l. These alignment variables capture the different ways in which parameters and data can become correlated during training. By relaxing assumptions made in previous work, the authors show that parameterization choices interact with alignment in more complex ways than previously assumed.
Specifically, the authors show that the stability of neural network training can be maintained under broader conditions than previously thought, provided these alignment variables are properly accounted for. They extend the analysis to a broader family of adaptive optimizers, including Adafactor.
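As a rough schematic of this setting (notation paraphrased from the abc-parameterization literature this work builds on, not a verbatim quote of the paper's equations), each layer's parameter multiplier, initialization scale, and learning rate are assigned width exponents:

```latex
% Schematic abc-style parameterization of a layer of width n (paraphrased notation).
% a_l: multiplier exponent, b_l: initialization exponent, c_l: learning-rate exponent.
W_l = n^{-a_l} \, w_l, \qquad
w_l \sim \mathcal{N}\!\left(0,\; n^{-2 b_l}\right), \qquad
\eta_l = \eta \, n^{-c_l}
```

Roughly, the choice of learning-rate exponent c_l that keeps training stable depends on how strongly the updates align with the incoming activations, which is what the alignment variables parameterize.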
Empirical Exploration
The empirical study encompasses tens of thousands of experiments. The authors trained models ranging from millions to billions of parameters across several parameterizations, including standard parameterization, the Neural Tangent Kernel (NTK) parameterization, Maximal Update Parameterization (muP), and Mean-Field Parameterization (MFP). The analysis covers three optimizers: SGD, Adam, and Adafactor.
The key results include:
- Alignment Measurement: The alignment between parameters and data is dynamic and varies significantly throughout training. The authors observe intermediate alignment values and substantial variability across layers, with distinct patterns emerging for different parameterizations (a simple way to estimate such an alignment ratio is sketched after this list).
- Per-Layer Learning Rates: The paper explores learning rate prescriptions tailored to different alignment assumptions, showing that all parameterizations benefit from theoretically motivated per-layer learning rate exponents. Optimal performance is often achieved by tuning these learning rates at small scale and transferring them to larger models.
- Epsilon Parameter in Adam: The paper identifies Adam's epsilon hyperparameter as a crucial factor at scale: when gradient moments shrink below epsilon, the adaptive normalization is effectively disabled. It proposes "Adam-atan2," a scale-invariant variant of Adam that eliminates this hyperparameter altogether (a simplified sketch of the update appears after this list).
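To make the alignment measurements concrete, here is a minimal sketch, using a simplified ratio rather than the paper's exact metric, of one way to quantify how aligned a weight update is with the activations it multiplies. A ratio near 1 indicates strong alignment; a ratio near 1/sqrt(fan_in) is what a random, unaligned update would give.

```python
import numpy as np

def alignment_ratio(delta_w: np.ndarray, h: np.ndarray) -> float:
    """Rough alignment measure between a weight update and an incoming activation.

    delta_w: (fan_out, fan_in) update to a layer's weight matrix.
    h:       (fan_in,) activation vector entering that layer.

    Returns ||delta_w @ h|| / (||delta_w||_F * ||h||), which is ~1 when the update
    is aligned with h and ~1/sqrt(fan_in) for random, unaligned directions.
    """
    num = np.linalg.norm(delta_w @ h)
    den = np.linalg.norm(delta_w) * np.linalg.norm(h) + 1e-30  # guard against zero norms
    return float(num / den)

# Example: a rank-1 update built from h itself is fully aligned,
# while a random Gaussian update is not.
rng = np.random.default_rng(0)
fan_in, fan_out = 1024, 1024
h = rng.standard_normal(fan_in)
aligned_update = np.outer(rng.standard_normal(fan_out), h)
random_update = rng.standard_normal((fan_out, fan_in))
print(alignment_ratio(aligned_update, h))  # ~1.0
print(alignment_ratio(random_update, h))   # ~1/sqrt(1024) ~= 0.03
```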
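The sketch below illustrates the atan2 idea in isolation, as a simplified single-tensor update rather than the authors' reference implementation (which may include additional scaling constants). Because atan2 depends only on the ratio of its two arguments, rescaling the gradients leaves the update unchanged, and no epsilon is needed in the denominator.

```python
import numpy as np

def adam_atan2_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999):
    """One Adam-style step with an epsilon-free atan2 update (illustrative sketch).

    Standard Adam uses m_hat / (sqrt(v_hat) + eps). Here that ratio is replaced by
    atan2(m_hat, sqrt(v_hat)), which is bounded, needs no eps, and is invariant to
    rescaling the gradients (both arguments scale by the same positive factor).
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    update = np.arctan2(m_hat, np.sqrt(v_hat))  # ~= m_hat / sqrt(v_hat) when the ratio is small
    param = param - lr * update
    return param, m, v
```

For small |m_hat|/sqrt(v_hat), the arctan2 term reduces to the familiar Adam ratio, so the behavior matches standard Adam away from the regime where sqrt(v_hat) would otherwise fall below epsilon.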
Implications and Future Directions
The findings significantly impact both theoretical and practical aspects of training large neural networks.
- Theoretical Implications: By relaxing prior assumptions about alignment, the paper broadens the understanding of stable training regimes. This can lead to more flexible and accurate theoretical models that are better aligned with empirical observations.
- Practical Implications: The empirical results suggest that parameterization choices and learning rate prescriptions can be tuned to achieve better performance at scale. The recommendation to use per-layer learning rates, with careful tuning of the constant multipliers, is particularly relevant for practitioners aiming to scale models efficiently (a minimal per-layer sketch follows this list).
- Optimizers: The introduction of Adam-atan2 offers a promising direction for scalable, stable training with adaptive optimizers, potentially reducing the time and resources spent on hyperparameter tuning for very large models.
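As an illustration of what per-layer learning rates look like in practice, here is a minimal sketch of the widely used muP-style prescription for Adam. This is not the paper's full recipe (which also accounts for the measured alignment exponents); the layer grouping and the base_width convention are assumptions made for the example.

```python
def mup_style_adam_lrs(base_lr: float, base_width: int, width: int) -> dict:
    """Illustrative per-layer learning rates in a muP-style scheme for Adam.

    Hidden and readout weights get learning rates that shrink like 1/width,
    while embedding/input weights keep an O(1) rate. The constants are tuned
    once at base_width and then transferred to larger widths.
    """
    scale = base_width / width  # proportional to 1/width, up to the base-width constant
    return {
        "embedding": base_lr,        # O(1) in width
        "hidden": base_lr * scale,   # ~ 1/width
        "readout": base_lr * scale,  # ~ 1/width
    }

# Tune base_lr at width 512, then reuse it unchanged at width 4096:
small = mup_style_adam_lrs(base_lr=3e-3, base_width=512, width=512)
large = mup_style_adam_lrs(base_lr=3e-3, base_width=512, width=4096)
```

Only the width exponents are fixed by the theory; the per-layer constant multipliers are what get tuned at small scale and then transferred.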
Conclusion
"Scaling Exponents Across Parameterizations and Optimizers" provides a nuanced and comprehensive exploration of how parameterization and optimization choices interplay with alignment effects in neural network training. By expanding the theoretical understanding and providing practical guidelines, the paper sets the stage for more robust and efficient training of large-scale models. Future research could further explore alignment-aware learning rate schedules and delve into co-scaling multiple dimensions beyond width to optimize training in diverse neural network architectures.