Critical Data Size of Language Models from a Grokking Perspective (2401.10463v3)

Published 19 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: We explore the critical data size in LLMs, a threshold that marks a fundamental shift from quick memorization to slow generalization. We formalize this phase transition under the grokking configuration as the Data Efficiency Hypothesis and identify data-insufficiency, data-sufficiency, and data-surplus regimes in LLM training dynamics. We develop a grokking configuration that stably reproduces grokking on simplistic LLMs by rescaling initialization and weight decay. We show that generalization occurs only when LLMs reach a critical size. We analyze grokking from both sample-wise and model-wise perspectives, verifying the proposed Data Efficiency Hypothesis. Our experiments reveal smoother phase transitions at the critical dataset size for language datasets. As model size increases, this critical point also grows, indicating that larger models require more data. Our results deepen the understanding of LLM training and offer a novel perspective on the role of data in the learning mechanism of LLMs.
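
As a rough illustration of the kind of "grokking configuration" the abstract describes, the sketch below rescales a small transformer's weights after standard initialization and trains it with decoupled weight decay. It is a minimal sketch based only on the abstract: the model architecture, the rescaling factor ALPHA, and the weight-decay value are illustrative assumptions, not settings taken from the paper.

```python
import torch
import torch.nn as nn

# A minimal grokking-style setup (hyperparameter values are illustrative
# assumptions, not the paper's): rescale all weights after standard
# initialization, then train with a relatively large weight decay.

ALPHA = 8.0          # hypothetical initialization rescaling factor
WEIGHT_DECAY = 1.0   # hypothetical (large) decoupled weight decay

# A small transformer encoder standing in for the paper's "simplistic LLMs".
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(
        d_model=128, nhead=4, dim_feedforward=512, batch_first=True
    ),
    num_layers=2,
)

# Rescale every parameter in place after its default initialization.
with torch.no_grad():
    for p in model.parameters():
        p.mul_(ALPHA)

# AdamW applies decoupled weight decay, a common choice in grokking setups.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, weight_decay=WEIGHT_DECAY
)
```

In grokking studies, a larger initialization scale and stronger weight decay are commonly the knobs that control how far delayed generalization lags behind memorization, which is why the configuration above manipulates exactly those two quantities.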
