
Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition (2402.15175v2)

Published 23 Feb 2024 in cs.LG

Abstract: Recent studies have uncovered intriguing phenomena in deep learning, such as grokking, double descent, and emergent abilities in LLMs, which challenge human intuition and are crucial for a deeper understanding of neural models. In this paper, we present a comprehensive framework that provides a unified view of these three phenomena, focusing on the competition between memorization and generalization circuits. This approach, initially employed to explain grokking, is extended in our work to encompass a wider range of model sizes and training data volumes. Our framework delineates four distinct training dynamics, each depending on varying combinations of model size and training data quantity. Utilizing this framework, we provide a detailed analysis of the double descent phenomenon and propose two verifiable predictions regarding its occurrence, both substantiated by our experimental results. Moreover, we expand our framework to the multi-task learning paradigm, demonstrating how algorithm tasks can be turned into emergent abilities. This offers a novel perspective to understand emergent abilities in LLMs.

Unveiling the Interplay Between Memorization and Generalization in LLMs

Overview of the Study

Recent advances in deep learning have surfaced phenomena such as grokking, double descent, and emergent abilities in LLMs. Though initially counterintuitive, these phenomena offer insight into the underlying mechanisms of neural networks. This paper introduces a framework centered on the competition between memorization and generalization circuits within neural models. Through extensive experiments, the paper identifies the critical dataset size for a range of model sizes and delineates four distinct training dynamics, each determined by the combination of model size and training data volume.

Key Contributions

The paper makes three significant contributions to the field:

  • The introduction of a novel framework for dissecting and understanding the performance and training dynamics in relation to model size and training data volume.
  • A detailed analysis of the double descent phenomenon, including two verifiable predictions of when it occurs, both confirmed experimentally.
  • The innovative concept of transforming algorithm tasks into emergent abilities through multi-task learning, thereby offering fresh insight into the understanding of emergent abilities in LLMs.

Exploring Grokking Phenomenon

The phenomenon of grokking, in which a model attains strong generalization long after reaching perfect training accuracy, is examined in relation to model size and critical dataset size. The paper highlights an inverse relationship between model size and the amount of training data required for grokking. Meanwhile, model size and memorization capacity are directly related, with larger models exhibiting more robust memorization.
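
The circuit-competition view can be caricatured as a cost comparison: regularization (e.g. weight decay) favors whichever circuit reaches low training loss with the smaller parameter-norm "cost". Below is a minimal sketch of that idea, not the paper's actual model — the per-example memorization cost and the flat generalization-circuit cost are illustrative assumptions, chosen only to show how a critical dataset size falls out of the comparison.

```python
import math

def winning_circuit(n_examples, mem_cost_per_example=1.0, gen_cost=50.0):
    """Toy circuit-competition rule (illustrative, not from the paper).

    The memorization circuit's cost grows with the number of training
    examples it must store; the generalization circuit pays a fixed cost
    regardless of dataset size. Weight decay favors the cheaper circuit.
    """
    mem_cost = mem_cost_per_example * n_examples
    return "generalization" if gen_cost < mem_cost else "memorization"

def critical_dataset_size(mem_cost_per_example=1.0, gen_cost=50.0):
    # Smallest dataset size at which the generalization circuit is
    # strictly cheaper than memorizing every example.
    return math.floor(gen_cost / mem_cost_per_example) + 1

# Small datasets favor memorization; past the crossover, generalization wins.
small = winning_circuit(10)    # "memorization"
large = winning_circuit(200)   # "generalization"
d_crit = critical_dataset_size()  # 51 under these illustrative costs
```

In this caricature, grokking corresponds to training just past the crossover: both circuits can fit the training set, but the generalization circuit is eventually preferred, so validation accuracy jumps long after training accuracy saturates.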

Illustrating Double Descent

The double descent phenomenon, in which validation error descends, rises near a critical model size, and then descends again, is investigated in depth. The research establishes that whether double descent occurs depends on the quantity of training data relative to the critical dataset size required for generalization. Models trained on less data than the critical dataset size pass through progression, ungrokking, and finally grokking stages, tracing out the double descent curve. With ample training data, validation performance instead improves monotonically and no double descent appears.
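
The model-size double descent curve itself can be reproduced in a few lines with a standard toy setup that is independent of this paper: minimum-norm least squares on frozen random ReLU features (all dimensions and noise levels below are illustrative choices). Test error peaks near the interpolation threshold, where the feature count matches the training-set size, and falls again in the overparameterized regime.

```python
import numpy as np

def random_features_test_error(n_feat, n_train=100, n_test=500,
                               d=20, noise=0.5, seed=0):
    """Test MSE of min-norm least squares on frozen random ReLU features."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d) / np.sqrt(d)        # linear teacher
    X_tr = rng.normal(size=(n_train, d))
    X_te = rng.normal(size=(n_test, d))
    y_tr = X_tr @ w_true + noise * rng.normal(size=n_train)
    y_te = X_te @ w_true
    W = rng.normal(size=(d, n_feat)) / np.sqrt(d)   # frozen random projection
    F_tr = np.maximum(X_tr @ W, 0)                  # ReLU features
    F_te = np.maximum(X_te @ W, 0)
    beta = np.linalg.pinv(F_tr) @ y_tr              # minimum-norm solution
    return float(np.mean((F_te @ beta - y_te) ** 2))

def avg_error(n_feat, seeds=range(5)):
    # Average over seeds to smooth out run-to-run variance.
    return float(np.mean([random_features_test_error(n_feat, seed=s)
                          for s in seeds]))

under = avg_error(20)    # underparameterized
peak = avg_error(100)    # interpolation threshold (n_feat == n_train)
over = avg_error(1000)   # overparameterized
```

The spike at `n_feat == n_train` is driven by the fit interpolating the label noise; past the threshold, the minimum-norm solution implicitly regularizes and error drops again, giving the characteristic second descent.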

Emergent Abilities in Multi-Task Learning

Extending the framework to multi-task learning paradigms unveils how combining algorithm tasks with pure memorization tasks transforms the former into emergent abilities. This observation underscores the inherent challenge larger models face in developing generalization circuits when also tasked with extensive memorization. The paper suggests that the unique training dynamics in LLM pretraining, which resembles multi-task learning, may lay the foundation for emergent abilities.
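
Under this framework, an emergent ability can be caricatured as a capacity threshold: co-trained memorization tasks consume capacity first, and the algorithmic task only generalizes once spare capacity covers its generalization circuit. The sketch below is a hypothetical threshold model with made-up budget numbers, intended only to show how smooth growth in model capacity can yield a sharp jump in task accuracy.

```python
def task_accuracy(model_capacity, mem_load=100.0,
                  gen_circuit_cost=30.0, chance=0.02):
    """Illustrative threshold model of emergence (numbers are assumptions).

    Capacity spent memorizing co-trained tasks (mem_load) is unavailable
    for the algorithmic task; that task stays at chance accuracy until
    spare capacity covers its generalization circuit, then jumps.
    """
    spare = model_capacity - mem_load
    return 1.0 if spare >= gen_circuit_cost else chance

# Accuracy vs. model capacity: flat at chance, then an abrupt jump --
# the signature shape of an "emergent" ability on a scaling plot.
sizes = [50, 100, 120, 130, 200]
accs = [task_accuracy(s) for s in sizes]
```

The same task trained alone (no `mem_load`) would generalize at a much smaller capacity, which is the sense in which multi-task pretraining can turn an ordinary algorithmic task into an apparently emergent one.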

Future Directions

The paper notes that, despite the insights the framework provides, further research is needed to extend these findings beyond algorithmic tasks to more realistic tasks and models. This extension is essential for a fuller understanding of deep learning mechanisms and of the varied phenomena observed in LLMs.

Conclusion

This paper provides a profound exploration of the dynamics at play between memorization and generalization circuits in neural models, offering a unified perspective on phenomena like grokking, double descent, and emergent abilities. By leveraging a novel analytical framework, the research not only elucidates these phenomena but also paves the way for future investigations into the intricate workings of LLMs and their training processes.

Authors (5)
  1. Yufei Huang
  2. Shengding Hu
  3. Xu Han
  4. Zhiyuan Liu
  5. Maosong Sun