
Algorithmic progress in language models (2403.05812v1)

Published 9 Mar 2024 in cs.CL and cs.AI

Abstract: We investigate the rate at which algorithms for pre-training language models have improved since the advent of deep learning. Using a dataset of over 200 language model evaluations on Wikitext and Penn Treebank spanning 2012-2023, we find that the compute required to reach a set performance threshold has halved approximately every 8 months, with a 95% confidence interval of around 5 to 14 months, substantially faster than hardware gains per Moore's Law. We estimate augmented scaling laws, which enable us to quantify algorithmic progress and determine the relative contributions of scaling models versus innovations in training algorithms. Despite the rapid pace of algorithmic progress and the development of new architectures such as the transformer, our analysis reveals that the increase in compute made an even larger contribution to overall performance improvements over this time period. Though limited by noisy benchmark data, our analysis quantifies the rapid progress in language modeling, shedding light on the relative contributions from compute and algorithms.


Summary

  • The paper shows that algorithmic innovations exponentially increase the effective computational resources available for language model pre-training.
  • Analysis of over 200 language model evaluations on WikiText and Penn Treebank (2012-2023) indicates that the compute required to reach a given performance level has halved roughly every 8 months.
  • The study highlights the impact of the transformer architecture, which yields a substantial compute-equivalent gain over earlier architectures.

Algorithmic Progress and Compute Scaling in Pre-Training of LLMs: An Empirical Analysis

Introduction

The fast-paced advances in language modeling, propelled by increasingly capable LLMs, have captured widespread attention. These models play a pivotal role across a broad spectrum of applications, from natural language processing tasks to generating complex textual content. Central to these advances are not only growing computational resources but also significant algorithmic developments that make better use of those resources. In this context, understanding the interplay between algorithmic innovation and compute scaling is crucial for assessing future directions in language modeling.

Methodology

Our paper employs an empirical approach to disentangle the contributions of algorithmic enhancements and compute scaling to the evolution of language model performance. We ground our analysis in a dataset of over 200 language model evaluations on the WikiText and Penn Treebank benchmarks spanning 2012 to 2023. This dataset allows a nuanced exploration of the trajectory of language modeling improvements.
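
To make the data-harmonization step concrete, the sketch below shows one hypothetical way an evaluation record in such a dataset could be represented, together with the standard conversion from reported perplexity to cross-entropy loss in nats per token (perplexity = exp(loss)). The field names and example values are invented for illustration, not the paper's actual schema, and losses are only comparable across models that share a tokenization.

    import math
    from dataclasses import dataclass

    @dataclass
    class LMEvaluation:
        # Hypothetical record layout for one benchmark result; the paper's
        # actual dataset fields may differ.
        model: str
        pub_year: float          # publication date as a decimal year
        n_params: float          # model parameters
        n_tokens: float          # pre-training tokens seen
        test_perplexity: float   # reported WikiText or Penn Treebank perplexity

        @property
        def loss_nats_per_token(self) -> float:
            # Perplexity is exp(cross-entropy), so loss per token = ln(perplexity).
            # Only comparable across models evaluated under the same tokenization.
            return math.log(self.test_perplexity)

    example = LMEvaluation("example-lm", 2019.4, 1.5e9, 4.0e10, 18.3)
    print(f"{example.loss_nats_per_token:.2f} nats/token")  # ~2.91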

Model Framework

Central to our analysis is an augmented scaling law, derived from foundational work in the domain, which links model performance to model scale and training data. Through careful adjustments, we incorporate notions of 'effective data' and 'effective model size', terms that quantify algorithmic efficiency gains over time. Our approach posits that consistent algorithmic progress manifests as an exponential growth in the 'effectiveness' of these resources, allowing a given level of performance to be reached at lower compute cost.
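
As a minimal sketch of this idea, the function below implements a Chinchilla-style loss with year-dependent efficiency multipliers on parameters and data. All coefficient values (A, B, alpha, beta, g_N, g_D, E) are invented placeholders, and the exact functional form and parameterization fitted in the paper may differ; the point is only that the same physical budget yields a lower loss in later years.

    import math

    def augmented_loss(n_params, n_tokens, year,
                       A=400.0, B=1800.0, alpha=0.34, beta=0.28,
                       g_N=0.4, g_D=0.6, year0=2012.0, E=1.7):
        # 'Effective' parameters and data grow exponentially with calendar year,
        # so the same physical N and D buy more performance over time.
        # E is an irreducible-loss term; all coefficients are illustrative.
        n_eff = n_params * math.exp(g_N * (year - year0))
        d_eff = n_tokens * math.exp(g_D * (year - year0))
        return E + A / n_eff**alpha + B / d_eff**beta

    # The same 1B-parameter, 20B-token budget yields a lower (better) loss
    # when paired with 2023-era algorithms than with 2012-era ones.
    print(augmented_loss(1e9, 2e10, year=2012))
    print(augmented_loss(1e9, 2e10, year=2023))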

Empirical Findings

We estimate that the compute required to reach a given performance threshold on language modeling tasks has halved approximately every 8 months since 2012. This rate outpaces hardware improvements under Moore's Law, highlighting the brisk pace of algorithmic innovation in language modeling. Nonetheless, our analysis also finds that the larger share of recent performance gains is attributable to increases in computational resources, with algorithmic improvements playing a smaller, though still significant, role.
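
Taken at face value, a fixed halving time translates directly into an effective-compute multiplier. The back-of-the-envelope arithmetic below shows the implication of an 8-month halving time; it is illustrative arithmetic, not a figure quoted from the paper.

    # Implication of an 8-month compute-halving time at fixed performance.
    halving_months = 8
    per_year = 2 ** (12 / halving_months)             # ~2.8x less compute needed each year
    over_2012_2023 = 2 ** (12 * 11 / halving_months)  # 2**16.5, roughly 9e4x over 11 years
    print(f"{per_year:.1f}x per year, {over_2012_2023:.0f}x over 2012-2023")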

The Transformer Architecture: A Case Study

The advent of the transformer architecture marks a watershed moment in language model development. Our analysis assigns a compute-equivalent gain to the transformer, quantifying its contribution relative to preceding architectures. The transformer substantially reduces the compute required for a given level of performance, underscoring its pivotal role in the accelerated progress of language modeling capabilities.
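
To illustrate what a compute-equivalent gain means operationally, the toy calculation below compares two architectures under a single-term scaling law loss = A * C**(-gamma). The coefficients are invented for illustration and are not the paper's estimates, which come from its fitted augmented scaling law rather than this simplified form.

    def compute_equivalent_gain(A_baseline: float, A_new: float, gamma: float) -> float:
        # Under loss = A * C**(-gamma), the baseline matches the new architecture
        # at any fixed loss only by spending (A_baseline / A_new) ** (1 / gamma)
        # times more compute; that multiplier is the compute-equivalent gain.
        return (A_baseline / A_new) ** (1.0 / gamma)

    # Toy numbers: a modestly better loss coefficient compounds into a large
    # compute multiplier when the scaling exponent gamma is small.
    print(compute_equivalent_gain(A_baseline=2.0, A_new=1.6, gamma=0.05))  # ~87x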

Implications and Future Directions

Understanding the relative contributions of algorithmic progress and compute scaling offers valuable insights into the potential trajectories for the development of LLMs. While the rapid scale-up of computational resources has undeniably fueled recent advances, the sustained pace of algorithmic innovation underscores the field's robust foundation in research and development. Looking ahead, continued exploration of novel architectures, optimization techniques, and efficient training methods will be critical in navigating the computational and environmental constraints facing the next generation of LLMs.

Conclusion

Our paper provides a structured empirical analysis of advances in pre-training LLMs, emphasizing the interplay between algorithmic progress and compute scaling. By illuminating the dynamics shaping the evolution of LLMs, we contribute to a deeper understanding of past and present trends, laying the groundwork for informed speculation about the future of language modeling.
