Critical Data Size of Language Models from a Grokking Perspective (2401.10463v3)
Abstract: We explore the critical data size in LLMs, a threshold that marks a fundamental shift from quick memorization to slow generalization. We formalize the phase transition under the grokking configuration as the Data Efficiency Hypothesis and identify data insufficiency, sufficiency, and surplus regimes in LLM training dynamics. We develop a grokking configuration that stably reproduces grokking in simplistic LLMs by rescaling initialization and weight decay. We show that generalization occurs only when LLMs reach the critical data size. We analyze grokking from both sample-wise and model-wise perspectives, verifying the proposed data efficiency hypothesis. Our experiments reveal that smoother phase transitions occur at the critical dataset size for language datasets. As model size increases, this critical point also grows, indicating that larger models require more data. Our results deepen the understanding of LLM training, offering a novel perspective on the role of data in the learning mechanisms of LLMs.
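The abstract describes a "grokking configuration" obtained by rescaling the initialization and applying weight decay. Below is a minimal PyTorch sketch of what such a setup could look like; the rescaling factor `alpha`, the learning rate, the weight-decay strength, and the toy model are illustrative assumptions, not the paper's actual settings.

```python
# Sketch of a grokking-style configuration: enlarge the initialization scale
# and train with strong weight decay. All numeric values are assumptions.
import torch
import torch.nn as nn


def apply_grokking_config(model: nn.Module, alpha: float = 8.0,
                          lr: float = 1e-3, weight_decay: float = 1.0):
    """Rescale the freshly initialized weights by `alpha` and return an
    optimizer with a large weight-decay coefficient."""
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(alpha)  # inflate the initialization scale in place
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)


# Usage with a toy model (any nn.Module would work the same way).
model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 128))
optimizer = apply_grokking_config(model)
```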