
Abstract

Large language model (LLM) scaling laws are empirical formulas that estimate changes in model quality as a result of increasing parameter count and training data. However, these formulas, including the popular DeepMind Chinchilla scaling laws, neglect to include the cost of inference. We modify the Chinchilla scaling laws to calculate the optimal LLM parameter count and pre-training data size to train and deploy a model of a given quality and inference demand. We conduct our analysis both in terms of a compute budget and real-world costs and find that LLM researchers expecting reasonably large inference demand (~1B requests) should train models smaller and longer than Chinchilla-optimal.

Overview

  • Introduces an approach to scaling laws for LLMs that accounts for inference costs in addition to training costs.

  • Adjusts the Chinchilla scaling laws to propose smaller, longer-trained models for scenarios with high inference demand to optimize computational and financial resources.

  • Analyzes the real-world costs of LLMs considering hardware types, quantization, and utilization differences between training and inference.

  • Recommends a shift in training strategies towards models that are less costly during inference while maintaining quality.

  • Acknowledges the need for validation and exploration of the revised scaling laws' applicability in extreme conditions.

Introduction

LLMs have significantly impacted the field of artificial intelligence, especially in understanding and generating human language. As these models grow larger, it becomes crucial to understand the scaling laws that govern changes in model quality with increases in parameter count and training data. The Chinchilla scaling laws, introduced by DeepMind, are a set of empirical formulas that estimate the optimal parameter count and pre-training data size for LLMs. While these have been influential in guiding model training, they focus solely on training costs and neglect inference costs, which can be substantial over a model's deployed lifetime. This paper introduces a new approach to LLM scaling laws that incorporates inference costs in order to optimize both computational and financial resources.

Computational Optimality

The authors present an adjusted version of the Chinchilla scaling laws that takes inference costs into account. They define model quality via cross-entropy loss and computational cost via floating-point operations (FLOPs). Their analysis shows that LLM practitioners expecting substantial inference demand should train models that are smaller and trained on more data than the Chinchilla laws recommend. In this adjusted framework, as the number of expected inference requests grows, the compute-optimal configuration shifts toward models with fewer parameters trained on more tokens.
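As a rough numerical illustration of this trade-off (a minimal sketch, not the authors' exact procedure), one can combine the Chinchilla parametric loss L(N, D) = E + A/N^α + B/D^β, using the coefficients fitted by Hoffmann et al., with the standard approximations of ~6ND training FLOPs and ~2N FLOPs per inference token, then search for the configuration that reaches a target loss at minimum lifetime compute. The target loss, inference-token count, and grid-search range below are illustrative assumptions.

```python
import numpy as np

# Chinchilla parametric loss (Hoffmann et al., 2022); the published fitted
# coefficients are used here purely for illustration.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    """Estimated pre-training cross-entropy loss for N parameters, D tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

def total_flops(n_params, n_train_tokens, n_inference_tokens):
    """Approximate lifetime compute: ~6ND for training plus ~2N per inference token."""
    return 6 * n_params * n_train_tokens + 2 * n_params * n_inference_tokens

def cheapest_config(target_loss, n_inference_tokens,
                    n_grid=np.logspace(8, 12, 400)):
    """Grid-search parameter counts; for each N, solve for the D that hits the
    target loss, then keep the configuration with the lowest total FLOPs."""
    best = None
    for n in n_grid:
        gap = target_loss - E - A / n**alpha
        if gap <= 0:  # this N can never reach the target loss, even with infinite data
            continue
        d = (B / gap) ** (1 / beta)
        flops = total_flops(n, d, n_inference_tokens)
        if best is None or flops < best[0]:
            best = (flops, n, d)
    return best

# Example: a loss target roughly at the Chinchilla-70B level under these fits,
# with an assumed ~2T lifetime inference tokens.
flops, n, d = cheapest_config(target_loss=1.93, n_inference_tokens=2e12)
print(f"params ~ {n:.3g}, train tokens ~ {d:.3g}, total FLOPs ~ {flops:.3g}")
```

Under these assumptions, raising the inference-token count pushes the minimizer toward smaller N and larger D, which is the qualitative result the paper reports.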

Estimating Real-World Cost Optimality

Focusing purely on minimizing FLOPs may not reflect real-world conditions, where hardware utilization and the per-FLOP costs of training and inference differ significantly. The paper therefore extends the revised scaling laws with a model for estimating actual costs, accounting for training and inference on different hardware types, quantization of the model before inference, and the gap in utilization between training and inference. Because inference typically runs at much lower utilization and can dominate lifetime cost, this analysis shifts the optimum even further toward smaller, longer-trained models.
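A similarly hedged sketch of the dollar-cost framing: training and inference FLOPs are converted into GPU-hours using assumed peak throughput and utilization figures, then priced at an assumed hourly rate. The hardware numbers, utilization values, and prices below are placeholders chosen for illustration, not figures from the paper; quantization could be folded in by raising the effective inference throughput.

```python
def lifetime_cost_usd(n_params, n_train_tokens, n_inference_tokens,
                      train_peak_flops=312e12,   # e.g. A100 BF16 peak (assumed)
                      infer_peak_flops=312e12,
                      train_mfu=0.40,            # assumed model FLOPs utilization for training
                      infer_mfu=0.05,            # inference typically runs far less utilized
                      train_price_per_gpu_hr=1.80,   # placeholder cloud prices (USD)
                      infer_price_per_gpu_hr=1.80):
    """Rough lifetime dollar cost: convert training and inference FLOPs into
    GPU-hours at the assumed utilization, then multiply by the hourly price."""
    train_flops = 6 * n_params * n_train_tokens
    infer_flops = 2 * n_params * n_inference_tokens
    train_gpu_hours = train_flops / (train_peak_flops * train_mfu) / 3600
    infer_gpu_hours = infer_flops / (infer_peak_flops * infer_mfu) / 3600
    return (train_gpu_hours * train_price_per_gpu_hr
            + infer_gpu_hours * infer_price_per_gpu_hr)

# Comparing a Chinchilla-style 70B/1.4T-token configuration against a smaller
# 30B model trained on more tokens, at the same assumed ~2T inference tokens:
# the low-utilization inference term shrinks with N, so the smaller model can
# be cheaper overall even though its training FLOPs are higher.
print(lifetime_cost_usd(70e9, 1.4e12, 2e12))
print(lifetime_cost_usd(30e9, 6.0e12, 2e12))
```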

Conclusion

The study culminates in a revised set of scaling laws for LLMs that addresses both computational efficiency and real-world cost. It argues for a more nuanced approach to model training that considers a model's deployed lifespan and inference demand, steering away from training the largest models possible and toward more economically optimized configurations. While acknowledging the need for experimental validation and open questions about whether these laws hold in extreme regimes, the authors establish a foundation for future work on LLM scaling that may influence how future models are developed.
