Introduction to Scalability in LLMs
Large language models (LLMs) such as GPT-3 and Megatron-Turing NLG have become central to progress in artificial intelligence. The paper investigates how model size, training data, and computation affect LLM performance, offering a systematic evaluation of scalability and its limits. The researchers survey a wide range of training configurations, reporting quantitative results that clarify scaling laws and perplexity improvements across well-known LLM architectures.
Key Findings on Scaling Laws
At the core of the paper is an in-depth analysis of model scalability. Through systematic experimentation, the researchers identify scaling laws that predict how a training compute budget is best allocated for LLMs. They find that doubling model size generally demands a more than proportional increase in data and compute to realize the expected gains. Contrary to naive expectations, performance does not improve linearly with model size or training data; instead, gains diminish as models grow larger, a sublinear scaling behavior.
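To make this sublinear behavior concrete, here is a minimal sketch that assumes a parametric loss of the commonly used form L(N, D) = E + A/N^alpha + B/D^beta, where N is the parameter count and D the number of training tokens. The constants are illustrative placeholders, not values reported in the paper; the point is only that each doubling of N buys a smaller reduction in predicted loss.

```python
# Minimal sketch of a parametric scaling law, assuming the commonly used form
# L(N, D) = E + A / N**alpha + B / D**beta.
# All coefficients below are illustrative placeholders, not values from the paper.

E, A, B = 1.7, 400.0, 410.0      # irreducible loss and fitted amplitudes (hypothetical)
alpha, beta = 0.34, 0.28         # fitted exponents (hypothetical)

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted training loss for a model with n_params parameters
    trained on n_tokens tokens, under the assumed parametric form."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Doubling model size at a fixed token count improves the predicted loss by a
# shrinking margin, which is the sublinear scaling behavior described above.
tokens = 300e9
for n in (1e9, 2e9, 4e9, 8e9):
    print(f"N = {n:.0e}  ->  predicted loss = {predicted_loss(n, tokens):.4f}")
```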
Notably, the paper pushes back on the widely held belief that larger models inherently perform better. It underscores the importance of an efficient frontier in LLM development, demonstrating empirically that, for a given compute budget, performance is maximized only when model size, data, and compute are balanced carefully.
The Role of Computational Resources and Data Efficiency
The paper pays special attention to the interdependence between computational resources and data efficiency. Under constrained compute budgets in particular, the authors emphasize the need to allocate resources deliberately among model size, training data, and the number of training steps. They propose that future LLMs should prioritize data quality and efficiency over sheer quantity, which could offer a clearer roadmap toward more sustainable and cost-effective AI scaling.
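As a rough illustration of this kind of budgeting, the sketch below uses the common approximation that training costs roughly 6·N·D FLOPs and minimizes the same assumed loss form under that constraint. The closed-form split follows from setting the derivative of the constrained loss to zero; every constant here is hypothetical rather than taken from the paper.

```python
# Sketch of splitting a fixed compute budget C between parameters N and tokens D,
# using the common approximation C ~= 6 * N * D training FLOPs and the assumed
# loss form L(N, D) = E + A / N**alpha + B / D**beta from the sketch above.
# Minimizing L subject to 6 * N * D = C gives N_opt proportional to
# C**(beta / (alpha + beta)) and D_opt proportional to C**(alpha / (alpha + beta)).
# All constants are hypothetical placeholders.

A, B = 400.0, 410.0
alpha, beta = 0.34, 0.28

def compute_optimal_split(flops: float) -> tuple[float, float]:
    """Return (n_params, n_tokens) minimizing the assumed loss at a FLOP budget."""
    a = beta / (alpha + beta)            # how N should grow with compute
    g = ((alpha * A) / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = g * (flops / 6.0) ** a
    d_opt = (flops / 6.0) / n_opt        # enforce the budget constraint exactly
    return n_opt, d_opt

for budget in (1e21, 1e22, 1e23):
    n, d = compute_optimal_split(budget)
    print(f"C = {budget:.0e} FLOPs  ->  N ~ {n:.2e} params, D ~ {d:.2e} tokens")
```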
Moreover, the paper presents "IsoFLOPs slices" and "IsoLoss contours" to show that, at a fixed compute budget, performance is a trade-off among these factors, and a deliberate balance is needed to get the most out of the budget.
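An IsoFLOPs slice can be illustrated numerically: fix a compute budget, sweep candidate model sizes, let the token count be whatever the budget allows, and find the model size that minimizes the predicted loss. The sketch below does exactly that, reusing the hypothetical loss form and constants from the earlier sketches.

```python
# Numeric illustration of an "IsoFLOPs slice": hold total training compute fixed,
# sweep candidate model sizes, derive the token count from the budget (C ~= 6*N*D),
# and evaluate the assumed loss. The minimum of this curve marks the compute-optimal
# model size for that budget. Loss form and constants are hypothetical, as above.

import numpy as np

E, A, B = 1.7, 400.0, 410.0
alpha, beta = 0.34, 0.28

def isoflop_slice(flops: float, n_grid: np.ndarray) -> np.ndarray:
    """Predicted loss at each candidate model size, with tokens set by the budget."""
    d_grid = flops / (6.0 * n_grid)            # tokens implied by the FLOP budget
    return E + A / n_grid**alpha + B / d_grid**beta

budget = 1e22                                  # one fixed-compute slice
n_grid = np.logspace(8.5, 11, 200)             # ~300M to 100B parameters
losses = isoflop_slice(budget, n_grid)
best = int(np.argmin(losses))
print(f"Optimal size on this slice: ~{n_grid[best]:.2e} params "
      f"(predicted loss {losses[best]:.4f})")
```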
Implications and Future Directions
The analysis offers important implications for both theoretical and practical work in AI. The authors argue that while scaling up LLMs has driven remarkable advances, there is a point of diminishing returns that developers must navigate.
Looking ahead, the paper advises a shift toward applying these scaling laws more creatively, pointing to architectural innovations, alternative training methods, and a more nuanced understanding of data utilization. This strategic perspective could pave the way for more powerful and efficient LLMs that balance size, data, and computation more judiciously.
In conclusion, the paper not only expands our understanding of scalability dynamics within LLMs but also directs the community toward a more informed approach to model development, one that harmonizes the intricate interplay of model size, computational power, and data efficiency.