- The paper challenges conventional ML regularization by demonstrating its limitations in LLM pretraining.
- It reveals a scaling law crossover where strategies effective at small scales may not work for large-scale models.
- The findings underscore the need for new scaling methodologies and automated hyperparameter tuning to boost performance at scale.
Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling
The paper by Lechao Xiao from Google DeepMind critically examines the paradigm shift in machine learning objectives and strategies brought about by LLM pretraining and the discovery of scaling laws. The focus moves from minimizing generalization error on small datasets to reducing approximation error on extensive text corpora, prompting a transition from regularization-based strategies to scaling-centric methodologies. This transformation invites a reevaluation of established machine learning principles, particularly in the context of LLMs.
Key Content and Observations
In the new "scaling-centric" paradigm, the conventional emphasis on explicit and implicit regularization is questioned. The paper scrutinizes several influential regularization-based practices that might no longer be effective in the LLM era:
- Explicit L2 Regularization: The paper shows that while L2 regularization significantly improves performance in traditional machine learning settings, such as ImageNet classification, it does not provide similar benefits for LLM pretraining.
- Implicit Regularization through Large Learning Rates and Small Batch Sizes: Contrary to conventional wisdom, which emphasizes large learning rates for better generalization, the paper finds that optimal learning rates for LLMs are substantially lower. Similarly, the performance improvements associated with small batch sizes in classic machine learning do not straightforwardly translate to LLM training, where a U-shaped relationship between batch size and performance exists.
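The small-data role of explicit L2 regularization can be made concrete with ridge regression, where the penalty shrinks weights toward zero. Below is a minimal numpy sketch on synthetic data; the data and λ values are illustrative and not taken from the paper:

```python
import numpy as np

# Synthetic regression problem (illustrative, not from the paper)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=50)

def fit_ridge(lam):
    # Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)

w_unreg = fit_ridge(0.0)    # ordinary least squares
w_ridge = fit_ridge(10.0)   # L2-regularized: weights are shrunk toward zero
```

The shrinkage is the mechanism that helps generalization in the small-data regime; the paper's observation is that this benefit does not carry over to LLM pretraining, where the model rarely revisits the same data.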
The paper further introduces a phenomenon termed "scaling law crossover," where methods that enhance performance at smaller scales fail to generalize, or even hurt, at larger ones. The phenomenon is illustrated through several scenarios, including stability issues, suboptimal learning-rate and weight-decay scaling rules, and gradient normalization. Each case underscores the complexity of scaling and poses significant challenges for model comparison when only single training runs are feasible.
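The crossover idea can be sketched numerically: if two methods follow power-law loss curves with different coefficients, the method with the lower loss at small compute can have the higher loss at large compute. The coefficients below are illustrative, chosen only to produce a crossing, and are not the paper's fits:

```python
import numpy as np

# Hypothetical loss curves L(C) = a * C**(-b), with C a compute budget.
# Coefficients are illustrative, not taken from the paper.
def loss_A(C):
    return 2.0 * C ** (-0.10)   # better at small scale, shallower slope

def loss_B(C):
    return 3.0 * C ** (-0.15)   # worse at small scale, steeper slope

C = np.logspace(3, 12, 500)
# Compute scale where the two curves intersect (the "crossover")
crossover = C[np.argmin(np.abs(loss_A(C) - loss_B(C)))]
```

The practical difficulty the paper highlights follows directly: comparing the two methods with single runs below the crossover scale would pick the method that loses at scale.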
Implications and Future Directions
The implications of these findings are both profound and multifaceted:
- Guiding Principles for Scaling: The identified limitations of traditional regularization methods highlight the absence of clear guiding principles for model scaling in the current ML landscape. Understanding the new factors influencing model performance as scale increases is essential for developing robust scaling strategies.
- Model Comparison at Scale: Traditional model-comparison methodologies are insufficient in the scaling paradigm. Emerging techniques, such as scaling law extrapolation and hyperparameter transfer, show potential but lack the reliability and theoretical grounding needed for confident application.
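Scaling law extrapolation, one of the emerging comparison techniques mentioned above, amounts to fitting a power law to small-scale runs and predicting loss at a larger scale. A minimal sketch on noise-free synthetic data (the exponent and scales are illustrative, not the paper's):

```python
import numpy as np

# Synthetic small-scale "measurements" following L(N) = a * N**(-b),
# with a = 10 and b = 0.076 (illustrative values, not from the paper).
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8])   # small-scale model sizes
L = 10.0 * N ** (-0.076)                  # observed losses (noise-free here)

# Power laws are linear in log-log space: log L = log a - b * log N
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
b_hat, a_hat = -slope, np.exp(intercept)

# Extrapolate two orders of magnitude beyond the largest fitted run
L_pred_1e10 = a_hat * 1e10 ** (-b_hat)
```

With real (noisy) runs the fit is far less forgiving, which is exactly the reliability concern raised above: small errors in the fitted exponent compound over orders of magnitude of extrapolation.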
Speculative Outlook on Future Developments
In light of these challenges, future research could focus on several fronts:
- Scaling Methodologies: There is a pressing need for more accurate scaling rules and methodologies that can reliably predict model behavior at large scales. This involves deepening our understanding of optimization dynamics, architectural choices, and data interactions at extensive scales.
- Automation in Hyperparameter Tuning: As hyperparameter tuning becomes more complex and costly at scale, automated techniques that leverage meta-learning or Bayesian optimization could provide scalable solutions.
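As one concrete instance of the automated tuning mentioned above, Bayesian optimization fits a surrogate model to observed (hyperparameter, loss) pairs and picks the next trial by maximizing expected improvement. Below is a compact numpy/scipy sketch on a toy 1-D objective with fixed Gaussian-process hyperparameters; everything here (the objective, kernel settings, search range) is an illustrative assumption, not the paper's method:

```python
import numpy as np
from scipy.special import erf

def objective(x):
    # Toy stand-in for "validation loss vs. log learning rate" (hypothetical)
    return (x - 2.0) ** 2

def rbf(a, b, lengthscale=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Standard GP regression posterior with an RBF kernel
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    imp = best - mu                                  # we are minimizing
    z = imp / sigma
    Phi = 0.5 * (1.0 + erf(z / np.sqrt(2.0)))        # standard normal CDF
    phi = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return imp * Phi + sigma * phi

X = np.array([0.5, 2.75, 4.5])                       # initial trials
y = objective(X)
candidates = np.linspace(0.0, 5.0, 201)
for _ in range(10):
    ys = (y - y.mean()) / y.std()                    # standardize targets
    mu, sigma = gp_posterior(X, ys, candidates)
    x_next = candidates[np.argmax(expected_improvement(mu, sigma, ys.min()))]
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))

best_x = X[np.argmin(y)]                             # best trial found
```

The appeal at scale is sample efficiency: each "trial" is a training run, so a method that converges in tens of evaluations rather than a grid of hundreds matters when single runs are expensive.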
- Broader Adoption and Testing of Novel Architectures: Given the performance limitations observed with traditional architectures at scale, exploring new architectural paradigms, such as grouped-query attention or non-transformer models, could yield performance improvements.
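To make the grouped-query attention example above concrete: several query heads share one key/value head, shrinking the KV projections (and KV cache) relative to standard multi-head attention. A minimal single-sequence numpy sketch, without masking or output projection; shapes and head counts are illustrative:

```python
import numpy as np

def gqa_attention(x, Wq, Wk, Wv, n_q_heads, n_kv_heads):
    """Grouped-query attention: n_q_heads query heads share n_kv_heads
    key/value heads (n_q_heads must be divisible by n_kv_heads).
    Bidirectional and unmasked for simplicity."""
    T, d = x.shape
    hd = d // n_q_heads                              # per-head dimension
    q = (x @ Wq).reshape(T, n_q_heads, hd)
    k = (x @ Wk).reshape(T, n_kv_heads, hd)          # fewer KV heads
    v = (x @ Wv).reshape(T, n_kv_heads, hd)
    group = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                              # KV head for this query head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(hd)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)           # row-wise softmax
        out[:, h] = w @ v[:, kv]
    return out.reshape(T, d)

# Usage: 4 query heads sharing 2 KV heads, so Wk/Wv are half the width of Wq
rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.normal(size=(T, d))
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, 4))                         # n_kv_heads * hd = 2 * 2
Wv = rng.normal(size=(d, 4))
out = gqa_attention(x, Wq, Wk, Wv, n_q_heads=4, n_kv_heads=2)
```

Note that GQA is primarily an inference-efficiency change (a smaller KV cache); whether such architectural choices also shift scaling behavior is exactly the kind of question the crossover phenomenon makes hard to answer cheaply.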
Conclusion
This paper by Lechao Xiao is a critical examination of evolving machine learning paradigms in the context of LLMs. By challenging the validity of long-standing regularization principles and highlighting the intricacies of scaling laws, it underscores the necessity for new guiding principles and methodologies in the scaling-centric era. The findings and discussions presented in this paper are essential for researchers navigating the rapidly evolving landscape of large-scale machine learning and offer a foundational perspective for future explorations.