Towards Optimal Learning of Language Models (2402.17759v2)
Abstract: This work studies the general principles of improving the learning of language models (LMs), which aims at reducing the number of training steps needed to achieve superior performance. Specifically, we present a theory for the optimal learning of LMs. We first propose an objective that optimizes LM learning by maximizing the data compression ratio in an "LM-training-as-lossless-compression" view. We then derive a theorem, named Learning Law, that reveals the properties of the dynamics of the optimal learning process under our objective. The theorem is validated by experiments on a linear classification task and a real-world language modeling task. Finally, we empirically verify that the optimal learning of LMs essentially stems from improving the coefficients in the scaling law of LMs, indicating great promise and significance for designing practical learning acceleration methods. Our code can be found at https://aka.ms/LearningLaw.
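To make the "LM-training-as-lossless-compression" objective concrete, the sketch below (not the paper's code; the scaling-law form, coefficient values, and the 16-bits-per-raw-token assumption are illustrative) computes the approximate description length of a corpus as the area under the training-loss curve and the corresponding compression ratio. A learning policy that drives the loss down faster yields a smaller area and hence a higher compression ratio.

```python
# Hypothetical sketch of the compression-ratio view of LM training.
# Assumption: each batch is arithmetic-coded with the model held right before
# the corresponding training step, so the total code length is roughly the
# area under the per-token training-loss curve.
import numpy as np

def description_length_bits(losses_nats, tokens_per_step):
    """Approximate code length (in bits) of the corpus under the online LM."""
    losses_bits = np.asarray(losses_nats) / np.log(2.0)  # nats -> bits per token
    return float(np.sum(losses_bits * tokens_per_step))

def compression_ratio(losses_nats, tokens_per_step, bits_per_raw_token=16.0):
    """Raw corpus size divided by the model-based code length."""
    raw_bits = bits_per_raw_token * len(losses_nats) * tokens_per_step
    return raw_bits / description_length_bits(losses_nats, tokens_per_step)

if __name__ == "__main__":
    steps = np.arange(1, 10_001)
    # Two illustrative loss curves following a power-law-plus-constant form
    # L(t) = L0 + (B / t)^beta; the "improved" policy has a larger exponent,
    # mimicking better scaling-law coefficients.
    baseline = 1.8 + (2000.0 / steps) ** 0.30
    improved = 1.8 + (2000.0 / steps) ** 0.45
    for name, curve in [("baseline", baseline), ("improved", improved)]:
        print(name, "compression ratio:", round(compression_ratio(curve, 512), 3))
```

Under these assumptions, the improved curve reaches low loss in fewer steps, which shows up directly as a higher compression ratio, matching the paper's claim that learning acceleration corresponds to better scaling-law coefficients.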
Authors: Yuxian Gu, Li Dong, Yaru Hao, Qingxiu Dong, Minlie Huang, Furu Wei