Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws (2401.00448v3)

Published 31 Dec 2023 in cs.LG and cs.CL

Abstract: LLM scaling laws are empirical formulas that estimate changes in model quality as a result of increasing parameter count and training data. However, these formulas, including the popular Deepmind Chinchilla scaling laws, neglect to include the cost of inference. We modify the Chinchilla scaling laws to calculate the optimal LLM parameter count and pre-training data size to train and deploy a model of a given quality and inference demand. We conduct our analysis both in terms of a compute budget and real-world costs and find that LLM researchers expecting reasonably large inference demand (~1B requests) should train models smaller and longer than Chinchilla-optimal. Furthermore, we train 47 models of varying sizes and parameter counts to validate our formula and find that model quality continues to improve as we scale tokens per parameter to extreme ranges (up to 10,000). Finally, we ablate the procedure used to fit the Chinchilla scaling law coefficients and find that developing scaling laws only from data collected at typical token/parameter ratios overestimates the impact of additional tokens at these extreme ranges.

Summary

  • The paper extends existing LLM scaling laws by incorporating inference costs, offering a framework to minimize total computational expenses.
  • The research demonstrates that under high inference demands, smaller and more extensively trained models achieve significant cost savings.
  • The study employs a cost function with FLOP estimations and hardware cost adjustments to determine optimal model configurations.

Optimizing LLMs by Accounting for Inference Costs

The paper "Beyond Chinchilla-Optimal: Accounting for Inference in LLM Scaling Laws" examines the limitations of existing LLM scaling laws that predominantly consider training costs to determine optimal model configurations. By extending these scaling laws to incorporate inference costs, it provides a more comprehensive framework for determining LLM configurations that minimize both training and inference costs, especially when models face substantial inference loads.

Introduction to Inference-Adjusted Scaling

Traditional LLM scaling laws, notably the DeepMind Chinchilla laws, focus on balancing parameter count with training tokens to achieve optimal training efficiency. However, they neglect inference costs. Given that models often serve billions of inference requests, ignoring these costs can lead to suboptimal resource allocation. The paper proposes a methodology to integrate inference costs into existing scaling laws, optimizing for realistic usage scenarios where both training and inference demands are significant.

Methodological Advancements

Adjusting for Inference Costs

The authors modify the Chinchilla scaling laws by introducing a cost model that accounts for both training and inference. They use a loss function $L(N, D)$, which depends on the parameter count $N$ and the number of pre-training tokens $D$, treating pre-training cross-entropy loss as a proxy for model quality. The goal is to find the $N$ and $D$ that minimize total cost, expressed first in FLOPs and later in real-world terms, subject to a given model quality:

$\text{minimize} \quad T_{\text{FLOPs}}(N, D) + I_{\text{FLOPs}}(N, D_{\text{inf}}) \quad \text{subject to} \quad L(N, D) = \ell$

where $T_{\text{FLOPs}}$ and $I_{\text{FLOPs}}$ are the training and inference FLOP counts and $D_{\text{inf}}$ is the expected number of inference tokens.

Figure 1: Ratios illustrating the differences in FLOPs, parameters, and pre-training tokens between compute-optimal and Chinchilla-style models across varying inference demands.

Loss Function and Computational Estimation

The framework estimates compute using the standard approximation of roughly 6 FLOPs per parameter per training token and 2 FLOPs per parameter per inference token. Using a Newton root-finding method, the authors determine the optimal parameter and token counts for models at fixed pre-training losses. This analysis shows that practitioners anticipating high inference demand should opt for smaller, more extensively trained models than those prescribed by purely training-centric scaling laws.
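
To make the optimization concrete, here is a minimal Python sketch of this procedure. It assumes the Chinchilla parametric loss form $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ with the commonly cited Hoffmann et al. coefficient fits, and replaces the paper's Newton root-finding step with a simple bounded 1-D search over $N$; the coefficient values and the SciPy search are illustrative choices, not the authors' code.

```python
# Minimal sketch: choose N and D that minimize training + inference FLOPs
# while hitting a target pre-training loss, under the Chinchilla parametric
# loss L(N, D) = E + A / N**alpha + B / D**beta.
# Coefficients below are the commonly cited Hoffmann et al. (2022) fits,
# used purely for illustration.
import math
from scipy.optimize import minimize_scalar

E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def tokens_for_loss(n_params, target_loss):
    """Solve L(N, D) = target_loss for D at fixed N (inf if unreachable)."""
    residual = target_loss - E - A / n_params**ALPHA
    if residual <= 0:
        return float("inf")
    return (B / residual) ** (1 / BETA)

def total_flops(n_params, target_loss, inference_tokens):
    """~6ND training FLOPs plus ~2N FLOPs per inference token."""
    d_train = tokens_for_loss(n_params, target_loss)
    return 6 * n_params * d_train + 2 * n_params * inference_tokens

def optimal_size(target_loss, inference_tokens, lo=1e9, hi=1e12):
    """Bounded 1-D search over log N; the paper instead applies a Newton
    root-finder to the equivalent first-order conditions.
    `lo` must be large enough that the target loss is reachable."""
    res = minimize_scalar(
        lambda log_n: total_flops(10**log_n, target_loss, inference_tokens),
        bounds=(math.log10(lo), math.log10(hi)), method="bounded")
    n_opt = 10**res.x
    return n_opt, tokens_for_loss(n_opt, target_loss)

# Example: a mid-range target loss with an assumed ~2T inference tokens.
n_opt, d_opt = optimal_size(target_loss=2.1, inference_tokens=2e12)
```

With a large assumed inference demand, the returned $N$ falls below the Chinchilla-optimal size for the same loss and $D$ grows correspondingly, which is the qualitative behavior shown in Figure 1.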

Real-World Cost Optimization

Accounting for Utilization and Hardware Costs

Real-world scenarios demand consideration beyond just efficiency in FLOPs. The paper extends its cost model to include actual monetary costs, reflecting hardware utilization and cost discrepancies between training and inference phases. It incorporates factors such as Model FLOPs Utilization (MFU) and the variance in operational costs on different hardware configurations:

$\text{minimize} \quad \frac{C_{\text{tr}}}{U_{\text{tr}}}\, T_{\text{FLOPs}}(N, D) + \frac{C_{\text{inf}}}{U_{\text{inp}}}\, I_{\text{FLOPs}}(N, D_{\text{inp}}) + \frac{C_{\text{inf}}}{U_{\text{out}}}\, I_{\text{FLOPs}}(N, D_{\text{out}})$

where $C_{\text{tr}}$ and $C_{\text{inf}}$ are the per-FLOP hardware costs for training and inference, $U_{\text{tr}}$, $U_{\text{inp}}$, and $U_{\text{out}}$ are the MFUs during training, inference prefill, and inference decoding, and $D_{\text{inp}}$ and $D_{\text{out}}$ are the total expected inference input and output tokens.

Figure 2: Cost ratios for models optimized for real-world cost efficiency compared to Chinchilla-style models, showing substantial cost savings for high-demand scenarios.
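
To see how the price and utilization terms enter, here is a minimal sketch of this cost model as a Python function; the parameter names and the assumption of a single per-FLOP price per hardware class are simplifications of mine, not the paper's exact accounting.

```python
# Minimal sketch (simplified assumptions, not the paper's exact accounting):
# per-FLOP hardware prices scaled by Model FLOPs Utilization (MFU) for
# training, inference prefill, and inference decoding.

def real_world_cost(n_params, d_train, d_inp, d_out,
                    c_tr, c_inf,           # $ per peak FLOP on training / inference hardware
                    u_tr, u_inp, u_out):   # MFU during training, prefill, decoding
    """Total dollar cost of training once and serving d_inp + d_out inference tokens."""
    train   = (c_tr / u_tr)   * 6 * n_params * d_train   # ~6 FLOPs / param / training token
    prefill = (c_inf / u_inp) * 2 * n_params * d_inp     # ~2 FLOPs / param / input token
    decode  = (c_inf / u_out) * 2 * n_params * d_out     # ~2 FLOPs / param / output token
    return train + prefill + decode
```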

Practical Implications

The analysis shows significant cost savings when models expected to serve substantial inference loads adopt configurations derived from the extended scaling laws. In one example scenario, the cost of reaching the quality of a Chinchilla-style 30B model was reduced by roughly 17% by instead training a 16B model on a larger dataset.
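
As a rough illustration of this kind of comparison, the snippet below plugs hypothetical prices, utilizations, and a multi-trillion-token inference demand into the `real_world_cost` helper sketched above. All inputs are invented for illustration rather than taken from the paper, so the resulting ratio only qualitatively mirrors the reported savings.

```python
# Hypothetical inputs (not the paper's): equal $/FLOP on both hardware types,
# 40% training MFU, 50% prefill MFU, 10% decoding MFU, and 5T inference tokens
# split evenly between input and output. Token counts are illustrative.
kw = dict(d_inp=2.5e12, d_out=2.5e12, c_tr=1e-18, c_inf=1e-18,
          u_tr=0.40, u_inp=0.50, u_out=0.10)

chinchilla_30b = real_world_cost(30e9, d_train=0.6e12, **kw)   # ~20 tokens/param
smaller_16b    = real_world_cost(16e9, d_train=3.3e12, **kw)   # trained much longer

print(f"16B config costs {smaller_16b / chinchilla_30b:.0%} of the 30B config")
```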

Conclusion

This research extends scaling laws to account for both training and inference costs. The findings indicate that for models expecting high inference demand, smaller models trained on more tokens are preferable, counter to the Chinchilla prescriptions. The authors validate their formula by training 47 models at tokens-per-parameter ratios up to 10,000, and future work can assess the framework's applicability across still broader regimes and deployment settings. Integrating inference considerations into LLM scaling laws offers a more complete basis for optimizing both computational efficiency and cost-effectiveness when deploying LLMs.
