Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws (2401.00448v3)

Published 31 Dec 2023 in cs.LG and cs.CL

Abstract: LLM scaling laws are empirical formulas that estimate changes in model quality as a result of increasing parameter count and training data. However, these formulas, including the popular Deepmind Chinchilla scaling laws, neglect to include the cost of inference. We modify the Chinchilla scaling laws to calculate the optimal LLM parameter count and pre-training data size to train and deploy a model of a given quality and inference demand. We conduct our analysis both in terms of a compute budget and real-world costs and find that LLM researchers expecting reasonably large inference demand (~1B requests) should train models smaller and longer than Chinchilla-optimal. Furthermore, we train 47 models of varying sizes and parameter counts to validate our formula and find that model quality continues to improve as we scale tokens per parameter to extreme ranges (up to 10,000). Finally, we ablate the procedure used to fit the Chinchilla scaling law coefficients and find that developing scaling laws only from data collected at typical token/parameter ratios overestimates the impact of additional tokens at these extreme ranges.

Summary

  • The paper extends existing LLM scaling laws by incorporating inference costs, offering a framework to minimize total computational expenses.
  • The research demonstrates that under high inference demands, smaller and more extensively trained models achieve significant cost savings.
  • The study employs a cost function with FLOP estimations and hardware cost adjustments to determine optimal model configurations.

Optimizing LLMs by Accounting for Inference Costs

The paper "Beyond Chinchilla-Optimal: Accounting for Inference in LLM Scaling Laws" examines the limitations of existing LLM scaling laws that predominantly consider training costs to determine optimal model configurations. By extending these scaling laws to incorporate inference costs, it provides a more comprehensive framework for determining LLM configurations that minimize both training and inference costs, especially when models face substantial inference loads.

Introduction to Inference-Adjusted Scaling

Traditional LLM scaling laws, notably the DeepMind Chinchilla laws, focus on balancing parameter count with training tokens to achieve optimal training efficiency. However, they neglect inference costs. Given that models often serve billions of inference requests, ignoring these costs can lead to suboptimal resource allocation. The paper proposes a methodology to integrate inference costs into existing scaling laws, optimizing for realistic usage scenarios where both training and inference demands are significant.

Methodological Advancements

Adjusting for Inference Costs

The authors modify the Chinchilla scaling laws by introducing a cost model that accounts for both training and inference. They use a loss function $L(N, D)$, which depends on the parameter count $N$ and the number of pre-training tokens $D$, treating pre-training cross-entropy loss as a proxy for model quality. The goal is to find the $N$ and $D$ that minimize total cost, expressed first in FLOPs and later in real-world terms, subject to a given model quality:

$\text{minimize} \quad T_{\text{FLOPs}}(N, D) + I_{\text{FLOPs}}(N, D_{\text{inf}}) \quad \text{subject to} \quad L(N, D) = \ell$

where $T_{\text{FLOPs}}$ and $I_{\text{FLOPs}}$ are the training and inference FLOP counts and $D_{\text{inf}}$ is the expected number of inference tokens.

Figure 1: Ratios illustrating the differences in FLOPs, parameters, and pre-training tokens between compute-optimal and Chinchilla-style models across varying inference demands.

Loss Function and Computational Estimation

The framework estimates compute using the standard approximation of roughly 6 FLOPs per parameter per training token and 2 FLOPs per parameter per inference token. Using a Newton root-finding method, the authors determine the optimal parameter and token counts for models at fixed pre-training losses. This analysis shows that practitioners anticipating high inference demand should opt for smaller, more extensively trained models than those prescribed by purely training-centric scaling laws.
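
To make the optimization concrete, here is a minimal Python sketch of this procedure. It assumes the Chinchilla parametric loss form $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ with the commonly cited Hoffmann et al. coefficient fits, and replaces the paper's Newton root-finding step with a simple bounded 1-D search over $N$; the coefficient values and the SciPy search are illustrative choices, not the authors' code.

```python
# Minimal sketch: choose N and D that minimize training + inference FLOPs
# while hitting a target pre-training loss, under the Chinchilla parametric
# loss L(N, D) = E + A / N**alpha + B / D**beta.
# Coefficients below are the commonly cited Hoffmann et al. (2022) fits,
# used purely for illustration.
import math
from scipy.optimize import minimize_scalar

E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def tokens_for_loss(n_params, target_loss):
    """Solve L(N, D) = target_loss for D at fixed N (inf if unreachable)."""
    residual = target_loss - E - A / n_params**ALPHA
    if residual <= 0:
        return float("inf")
    return (B / residual) ** (1 / BETA)

def total_flops(n_params, target_loss, inference_tokens):
    """~6ND training FLOPs plus ~2N FLOPs per inference token."""
    d_train = tokens_for_loss(n_params, target_loss)
    return 6 * n_params * d_train + 2 * n_params * inference_tokens

def optimal_size(target_loss, inference_tokens, lo=1e9, hi=1e12):
    """Bounded 1-D search over log N; the paper instead applies a Newton
    root-finder to the equivalent first-order conditions.
    `lo` must be large enough that the target loss is reachable."""
    res = minimize_scalar(
        lambda log_n: total_flops(10**log_n, target_loss, inference_tokens),
        bounds=(math.log10(lo), math.log10(hi)), method="bounded")
    n_opt = 10**res.x
    return n_opt, tokens_for_loss(n_opt, target_loss)

# Example: a mid-range target loss with an assumed ~2T inference tokens.
n_opt, d_opt = optimal_size(target_loss=2.1, inference_tokens=2e12)
```

With a large assumed inference demand, the returned $N$ falls below the Chinchilla-optimal size for the same loss and $D$ grows correspondingly, which is the qualitative behavior shown in Figure 1.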

Real-World Cost Optimization

Accounting for Utilization and Hardware Costs

Real-world scenarios demand consideration beyond just efficiency in FLOPs. The paper extends its cost model to include actual monetary costs, reflecting hardware utilization and cost discrepancies between training and inference phases. It incorporates factors such as Model FLOPs Utilization (MFU) and the variance in operational costs on different hardware configurations:

$\text{minimize} \quad \frac{C_{\text{tr}}}{U_{\text{tr}}}\, T_{\text{FLOPs}}(N, D) + \frac{C_{\text{inf}}}{U_{\text{inp}}}\, I_{\text{FLOPs}}(N, D_{\text{inp}}) + \frac{C_{\text{inf}}}{U_{\text{out}}}\, I_{\text{FLOPs}}(N, D_{\text{out}})$

where $C_{\text{tr}}$ and $C_{\text{inf}}$ are the per-FLOP hardware costs for training and inference, $U_{\text{tr}}$, $U_{\text{inp}}$, and $U_{\text{out}}$ are the MFUs during training, inference prefill, and inference decoding, and $D_{\text{inp}}$ and $D_{\text{out}}$ are the total expected inference input and output tokens.

Figure 2: Cost ratios for models optimized for real-world cost efficiency compared to Chinchilla-style models, showing substantial cost savings for high-demand scenarios.
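
To see how the price and utilization terms enter, here is a minimal sketch of this cost model as a Python function; the parameter names and the assumption of a single per-FLOP price per hardware class are simplifications of mine, not the paper's exact accounting.

```python
# Minimal sketch (simplified assumptions, not the paper's exact accounting):
# per-FLOP hardware prices scaled by Model FLOPs Utilization (MFU) for
# training, inference prefill, and inference decoding.

def real_world_cost(n_params, d_train, d_inp, d_out,
                    c_tr, c_inf,           # $ per peak FLOP on training / inference hardware
                    u_tr, u_inp, u_out):   # MFU during training, prefill, decoding
    """Total dollar cost of training once and serving d_inp + d_out inference tokens."""
    train   = (c_tr / u_tr)   * 6 * n_params * d_train   # ~6 FLOPs / param / training token
    prefill = (c_inf / u_inp) * 2 * n_params * d_inp     # ~2 FLOPs / param / input token
    decode  = (c_inf / u_out) * 2 * n_params * d_out     # ~2 FLOPs / param / output token
    return train + prefill + decode
```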

Practical Implications

The analysis shows significant cost savings when models expected to serve substantial inference loads adopt configurations derived from the extended scaling laws. In one example scenario, the cost of reaching the quality of a Chinchilla-style 30B model was reduced by roughly 17% by instead training a 16B model on a larger dataset.
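
As a rough illustration of this kind of comparison, the snippet below plugs hypothetical prices, utilizations, and a multi-trillion-token inference demand into the `real_world_cost` helper sketched above. All inputs are invented for illustration rather than taken from the paper, so the resulting ratio only qualitatively mirrors the reported savings.

```python
# Hypothetical inputs (not the paper's): equal $/FLOP on both hardware types,
# 40% training MFU, 50% prefill MFU, 10% decoding MFU, and 5T inference tokens
# split evenly between input and output. Token counts are illustrative.
kw = dict(d_inp=2.5e12, d_out=2.5e12, c_tr=1e-18, c_inf=1e-18,
          u_tr=0.40, u_inp=0.50, u_out=0.10)

chinchilla_30b = real_world_cost(30e9, d_train=0.6e12, **kw)   # ~20 tokens/param
smaller_16b    = real_world_cost(16e9, d_train=3.3e12, **kw)   # trained much longer

print(f"16B config costs {smaller_16b / chinchilla_30b:.0%} of the 30B config")
```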

Conclusion

This research extends scaling laws to account for both training and inference costs. The findings indicate that for models expecting high inference demand, smaller models trained on more tokens are preferable, counter to the Chinchilla prescriptions. The authors validate their formula by training 47 models at tokens-per-parameter ratios up to 10,000, and future work can assess the framework's applicability across still broader regimes and deployment settings. Integrating inference considerations into LLM scaling laws offers a more complete basis for optimizing both computational efficiency and cost-effectiveness when deploying LLMs.
