
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws (2401.00448v2)

Published 31 Dec 2023 in cs.LG and cs.CL

Abstract: LLM scaling laws are empirical formulas that estimate changes in model quality as a result of increasing parameter count and training data. However, these formulas, including the popular DeepMind Chinchilla scaling laws, neglect to include the cost of inference. We modify the Chinchilla scaling laws to calculate the optimal LLM parameter count and pre-training data size to train and deploy a model of a given quality and inference demand. We conduct our analysis both in terms of a compute budget and real-world costs and find that LLM researchers expecting reasonably large inference demand (~1B requests) should train models smaller and longer than Chinchilla-optimal. Furthermore, we train 47 models of varying sizes and parameter counts to validate our formula and find that model quality continues to improve as we scale tokens per parameter to extreme ranges (up to 10,000). Finally, we ablate the procedure used to fit the Chinchilla scaling law coefficients and find that developing scaling laws only from data collected at typical token/parameter ratios overestimates the impact of additional tokens at these extreme ranges.

Introduction

LLMs have significantly impacted the field of artificial intelligence, especially in understanding and generating human language. As these models grow larger, it becomes crucial to understand the scaling laws that govern how model quality changes with parameter count and training data. The Chinchilla scaling laws, introduced by DeepMind, are a set of empirical formulas that estimate the optimal parameter count and pre-training data size for LLMs. While these laws have been influential in guiding model training, they focus primarily on training costs and neglect inference costs, which can be substantial. This paper introduces a new approach to LLM scaling laws that incorporates inference costs in order to optimize both computational and financial resources.
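
For reference, the Chinchilla laws model quality with a parametric loss of the form L(N, D) = E + A/N^alpha + B/D^beta, where N is the parameter count and D is the number of training tokens. The minimal sketch below evaluates this form using the coefficients reported in the original Chinchilla paper; the example inputs are purely illustrative.

```python
# Chinchilla-style parametric loss: cross-entropy as a function of parameter
# count N and training tokens D. The constants are the fitted values reported
# by Hoffmann et al. (2022) and are shown here for illustration only.

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    E = 1.69                  # irreducible loss of the data distribution
    A, alpha = 406.4, 0.34    # parameter-count term
    B, beta = 410.7, 0.28     # training-data term
    return E + A / n_params**alpha + B / n_tokens**beta

# Example: a 7B-parameter model trained on 140B tokens, roughly the
# ~20 tokens-per-parameter ratio the original Chinchilla analysis recommends.
print(f"predicted loss: {chinchilla_loss(7e9, 140e9):.3f}")
```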

Computational Optimality

The authors present an adjusted version of the Chinchilla scaling laws that takes inference costs into account. They define model quality via cross-entropy loss and computational cost via floating-point operations (FLOPs). Their analysis shows that LLM practitioners expecting substantial inference demand should train models that are smaller, and on more data, than the Chinchilla laws recommend. As inference requests increase, the total computational cost shifts in favor of models that are trained on more data but have fewer parameters.
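
To make the adjustment concrete, here is a minimal numeric sketch (not the paper's analytical derivation): it searches over model sizes for the one that minimizes total training-plus-inference FLOPs at a fixed quality target, using the common approximations of 6ND FLOPs for training and 2N FLOPs per inference token. The target loss, lifetime inference token count, and search range are assumptions chosen for illustration.

```python
# Numeric sketch of compute optimality with inference included (the paper
# derives this analytically; this grid search is just for illustration).
# Approximations: training costs ~6*N*D FLOPs, inference ~2*N FLOPs per token.

import numpy as np

def tokens_to_reach(target_loss, N, E=1.69, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    """Training tokens needed for a model of size N to hit target_loss."""
    gap = target_loss - E - A / N**alpha
    return np.inf if gap <= 0 else (B / gap) ** (1.0 / beta)

def total_flops(N, D_train, D_inference):
    return 6 * N * D_train + 2 * N * D_inference

target_loss = 2.0          # assumed quality target (cross-entropy)
D_inference = 2e12         # assumed lifetime demand, e.g. ~1B requests of ~2k tokens
candidate_N = np.logspace(9, 11.5, 500)   # 1B to ~300B parameters

costs = [total_flops(N, tokens_to_reach(target_loss, N), D_inference)
         for N in candidate_N]
best = int(np.argmin(costs))
print(f"FLOP-optimal size ≈ {candidate_N[best]:.3g} params, "
      f"trained on ≈ {tokens_to_reach(target_loss, candidate_N[best]):.3g} tokens")
```

Raising D_inference in this sketch pushes the optimum toward smaller models trained on more tokens, which is the qualitative conclusion described above.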

Estimating Real-World Cost Optimality

Minimizing FLOPs alone may not reflect real-world conditions, where hardware utilization and the relative costs of training and inference differ significantly. The paper therefore extends the revised scaling laws with a model for estimating actual costs. The authors account for training and inference running on different hardware types, quantization of the model before inference, and differences in utilization between training and inference. This real-world cost analysis places even greater emphasis on smaller, longer-trained models to reduce inference costs, given the significant differences in utilization and cost between training and inference.
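
A hedged sketch of such a cost model is shown below. It converts training and inference FLOPs into GPU-hours using separate utilization figures and multiplies by an hourly price; every hardware and price figure in it is a placeholder assumption rather than a value from the paper.

```python
# Hedged sketch of a real-world cost estimate in the spirit of the paper's
# cost-optimality analysis. Every hardware and price figure below is a
# placeholder assumption, not a number taken from the paper.

def gpu_hours(flops: float, peak_flops_per_sec: float, utilization: float) -> float:
    """Convert a FLOP budget into GPU-hours at a given hardware utilization."""
    return flops / (peak_flops_per_sec * utilization) / 3600

def lifetime_cost_usd(n_params: float, d_train: float, d_inference: float,
                      peak_flops=312e12,         # assumed accelerator peak (BF16)
                      train_util=0.40,           # assumed training utilization
                      inference_util=0.05,       # inference utilization is usually far lower
                      price_per_gpu_hour=2.00):  # assumed hourly price
    train_hours = gpu_hours(6 * n_params * d_train, peak_flops, train_util)
    infer_hours = gpu_hours(2 * n_params * d_inference, peak_flops, inference_util)
    return (train_hours + infer_hours) * price_per_gpu_hour

# Example: 7B parameters, 2T training tokens, 2T lifetime inference tokens.
print(f"estimated lifetime cost ≈ ${lifetime_cost_usd(7e9, 2e12, 2e12):,.0f}")
```

Separate prices or peak throughputs for training and inference hardware, as well as a reduced-precision FLOP rate after quantization, could be added as further arguments in the same spirit.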

Conclusion

The paper culminates in a revised set of scaling laws for LLMs that addresses both computational efficiency and real-world cost. It argues for a more nuanced approach to model training that considers a model's expected lifetime inference demand, steering away from training the largest models possible and toward more economically optimized solutions. While noting the need for further validation and open questions about whether these laws hold in extreme regimes, the authors establish a solid foundation for future work on LLM scaling that may shape how future models are developed.

Authors
  1. Nikhil Sardana
  2. Jonathan Frankle
  3. Jacob Portes
  4. Sasha Doubov