
Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition (2402.15175v2)

Published 23 Feb 2024 in cs.LG

Abstract: Recent studies have uncovered intriguing phenomena in deep learning, such as grokking, double descent, and emergent abilities in LLMs, which challenge human intuition and are crucial for a deeper understanding of neural models. In this paper, we present a comprehensive framework that provides a unified view of these three phenomena, focusing on the competition between memorization and generalization circuits. This approach, initially employed to explain grokking, is extended in our work to encompass a wider range of model sizes and training data volumes. Our framework delineates four distinct training dynamics, each depending on varying combinations of model size and training data quantity. Utilizing this framework, we provide a detailed analysis of the double descent phenomenon and propose two verifiable predictions regarding its occurrence, both substantiated by our experimental results. Moreover, we expand our framework to the multi-task learning paradigm, demonstrating how algorithm tasks can be turned into emergent abilities. This offers a novel perspective to understand emergent abilities in LLMs.

Unveiling the Interplay Between Memorization and Generalization in LLMs

Overview of the Study

Recent advances in deep learning have surfaced phenomena such as grokking, double descent, and emergent abilities in LLMs. Though initially counterintuitive, these phenomena offer insight into the underlying mechanisms of neural networks. This paper introduces a framework centered on the competition between memorization and generalization circuits within neural models. Through extensive experiments, the paper identifies the critical dataset size for a range of model sizes and delineates four distinct training dynamics, each determined by the combination of model size and training data volume.

Key Contributions

The paper makes three significant contributions to the field:

  • The introduction of a novel framework for dissecting and understanding the performance and training dynamics in relation to model size and training data volume.
  • A detailed analysis of the double descent phenomenon, including two verifiable predictions of when it occurs, both confirmed experimentally.
  • The innovative concept of transforming algorithm tasks into emergent abilities through multi-task learning, thereby offering fresh insight into the understanding of emergent abilities in LLMs.

Exploring Grokking Phenomenon

The phenomenon of grokking, in which a model attains strong generalization long after reaching perfect training accuracy, is examined in relation to model size and critical dataset size. The paper highlights an inverse relationship between model size and the amount of training data required for grokking. Meanwhile, model size and memorization capacity are directly related, with larger models exhibiting more robust memorization.
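
The circuit-competition view can be caricatured as a cost comparison: regularization (e.g. weight decay) favors whichever circuit reaches low training loss with the smaller parameter-norm "cost". Below is a minimal sketch of that idea, not the paper's actual model — the per-example memorization cost and the flat generalization-circuit cost are illustrative assumptions, chosen only to show how a critical dataset size falls out of the comparison.

```python
import math

def winning_circuit(n_examples, mem_cost_per_example=1.0, gen_cost=50.0):
    """Toy circuit-competition rule (illustrative, not from the paper).

    The memorization circuit's cost grows with the number of training
    examples it must store; the generalization circuit pays a fixed cost
    regardless of dataset size. Weight decay favors the cheaper circuit.
    """
    mem_cost = mem_cost_per_example * n_examples
    return "generalization" if gen_cost < mem_cost else "memorization"

def critical_dataset_size(mem_cost_per_example=1.0, gen_cost=50.0):
    # Smallest dataset size at which the generalization circuit is
    # strictly cheaper than memorizing every example.
    return math.floor(gen_cost / mem_cost_per_example) + 1

# Small datasets favor memorization; past the crossover, generalization wins.
small = winning_circuit(10)    # "memorization"
large = winning_circuit(200)   # "generalization"
d_crit = critical_dataset_size()  # 51 under these illustrative costs
```

In this caricature, grokking corresponds to training just past the crossover: both circuits can fit the training set, but the generalization circuit is eventually preferred, so validation accuracy jumps long after training accuracy saturates.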

Illustrating Double Descent

The double descent phenomenon, in which validation error descends, rises near a critical model size, and then descends again, is investigated in depth. The research establishes that whether double descent occurs depends on the quantity of training data relative to the critical dataset size required for generalization. Models trained on less data than the critical dataset size pass through progression, ungrokking, and finally grokking stages, tracing out the double descent curve. With ample training data, validation performance instead improves monotonically and no double descent appears.
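
The model-size double descent curve itself can be reproduced in a few lines with a standard toy setup that is independent of this paper: minimum-norm least squares on frozen random ReLU features (all dimensions and noise levels below are illustrative choices). Test error peaks near the interpolation threshold, where the feature count matches the training-set size, and falls again in the overparameterized regime.

```python
import numpy as np

def random_features_test_error(n_feat, n_train=100, n_test=500,
                               d=20, noise=0.5, seed=0):
    """Test MSE of min-norm least squares on frozen random ReLU features."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d) / np.sqrt(d)        # linear teacher
    X_tr = rng.normal(size=(n_train, d))
    X_te = rng.normal(size=(n_test, d))
    y_tr = X_tr @ w_true + noise * rng.normal(size=n_train)
    y_te = X_te @ w_true
    W = rng.normal(size=(d, n_feat)) / np.sqrt(d)   # frozen random projection
    F_tr = np.maximum(X_tr @ W, 0)                  # ReLU features
    F_te = np.maximum(X_te @ W, 0)
    beta = np.linalg.pinv(F_tr) @ y_tr              # minimum-norm solution
    return float(np.mean((F_te @ beta - y_te) ** 2))

def avg_error(n_feat, seeds=range(5)):
    # Average over seeds to smooth out run-to-run variance.
    return float(np.mean([random_features_test_error(n_feat, seed=s)
                          for s in seeds]))

under = avg_error(20)    # underparameterized
peak = avg_error(100)    # interpolation threshold (n_feat == n_train)
over = avg_error(1000)   # overparameterized
```

The spike at `n_feat == n_train` is driven by the fit interpolating the label noise; past the threshold, the minimum-norm solution implicitly regularizes and error drops again, giving the characteristic second descent.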

Emergent Abilities in Multi-Task Learning

Extending the framework to multi-task learning paradigms unveils how combining algorithm tasks with pure memorization tasks transforms the former into emergent abilities. This observation underscores the inherent challenge larger models face in developing generalization circuits when also tasked with extensive memorization. The paper suggests that the unique training dynamics in LLM pretraining, which resembles multi-task learning, may lay the foundation for emergent abilities.
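
Under this framework, an emergent ability can be caricatured as a capacity threshold: co-trained memorization tasks consume capacity first, and the algorithmic task only generalizes once spare capacity covers its generalization circuit. The sketch below is a hypothetical threshold model with made-up budget numbers, intended only to show how smooth growth in model capacity can yield a sharp jump in task accuracy.

```python
def task_accuracy(model_capacity, mem_load=100.0,
                  gen_circuit_cost=30.0, chance=0.02):
    """Illustrative threshold model of emergence (numbers are assumptions).

    Capacity spent memorizing co-trained tasks (mem_load) is unavailable
    for the algorithmic task; that task stays at chance accuracy until
    spare capacity covers its generalization circuit, then jumps.
    """
    spare = model_capacity - mem_load
    return 1.0 if spare >= gen_circuit_cost else chance

# Accuracy vs. model capacity: flat at chance, then an abrupt jump --
# the signature shape of an "emergent" ability on a scaling plot.
sizes = [50, 100, 120, 130, 200]
accs = [task_accuracy(s) for s in sizes]
```

The same task trained alone (no `mem_load`) would generalize at a much smaller capacity, which is the sense in which multi-task pretraining can turn an ordinary algorithmic task into an apparently emergent one.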

Future Directions

The paper notes that, despite the insights the framework provides, further research is needed to extend these findings beyond algorithmic tasks to more realistic tasks and models. This extension is essential for a fuller understanding of deep learning mechanisms and of the varied phenomena observed in LLMs.

Conclusion

This paper provides a profound exploration of the dynamics at play between memorization and generalization circuits in neural models, offering a unified perspective on phenomena like grokking, double descent, and emergent abilities. By leveraging a novel analytical framework, the research not only elucidates these phenomena but also paves the way for future investigations into the intricate workings of LLMs and their training processes.

Authors (5)
  1. Yufei Huang
  2. Shengding Hu
  3. Xu Han
  4. Zhiyuan Liu
  5. Maosong Sun