Algorithmic progress in language models (2403.05812v1)
Abstract: We investigate the rate at which algorithms for pre-training language models have improved since the advent of deep learning. Using a dataset of over 200 language model evaluations on WikiText and Penn Treebank spanning 2012-2023, we find that the compute required to reach a set performance threshold has halved approximately every 8 months, with a 95% confidence interval of around 5 to 14 months, substantially faster than hardware gains per Moore's Law. We estimate augmented scaling laws, which enable us to quantify algorithmic progress and to determine the relative contributions of scaling models versus innovations in training algorithms. Despite the rapid pace of algorithmic progress and the development of new architectures such as the transformer, our analysis reveals that the increase in compute made an even larger contribution to overall performance improvements over this time period. Though limited by noisy benchmark data, our analysis quantifies the rapid progress in language modeling, shedding light on the relative contributions from compute and algorithms.
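The abstract describes fitting augmented scaling laws in which algorithmic progress acts as a year-dependent multiplier on effective compute, with the fitted growth rate then converted into a compute halving time. The sketch below illustrates the general shape of such a fit on synthetic data; it uses a deliberately simplified single-term law, L = E + A / (C·e^{g·t})^α, rather than the paper's actual specification, and every variable name, parameter value, and data point is an illustrative assumption, not the authors' code or data.

```python
# Minimal sketch (synthetic data, not the paper's model or code) of an
# augmented scaling-law fit: loss depends on training compute C and on an
# exponential effective-compute multiplier exp(g * year) standing in for
# algorithmic progress. The fitted g converts into a compute halving time.
import numpy as np
from scipy.optimize import curve_fit

def augmented_loss(X, E, A, alpha, g):
    """Simplified augmented scaling law: L = E + A / (C * exp(g*year))**alpha."""
    C, year = X
    return E + A / (C * np.exp(g * year)) ** alpha

# Synthetic "benchmark" data: 200 models spanning 2012-2023 with noisy losses.
rng = np.random.default_rng(0)
year = rng.uniform(0.0, 11.0, 200)          # years since 2012
C = 10 ** rng.uniform(17, 24, 200)          # training compute in FLOP (assumed range)
true = augmented_loss((C, year), 1.7, 8.0, 0.05, 0.8)   # illustrative "true" parameters
loss = true + rng.normal(0.0, 0.02, 200)    # benchmark noise

popt, _ = curve_fit(augmented_loss, (C, year), loss,
                    p0=[2.0, 5.0, 0.07, 0.5], maxfev=20000)
g_hat = popt[-1]

# Compute needed to hold loss fixed falls as exp(-g * year),
# so the halving time is ln(2) / g years, i.e. 12 * ln(2) / g months.
print(f"estimated compute halving time: {12 * np.log(2) / g_hat:.1f} months")
```

In the simplified form above, the halving time follows directly from the fitted rate g; the paper's richer specification separates contributions from model size and data, but the same exponential-progress term is what yields the roughly 8-month halving estimate quoted in the abstract.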