Reconciling Kaplan and Chinchilla Scaling Laws (2406.12907v3)

Published 12 Jun 2024 in cs.LG and cs.CL

Abstract: Kaplan et al. 2020 and Hoffmann et al. 2022 studied the scaling behavior of transformers trained on next-token language prediction. These studies produced different estimates for how the number of parameters ($N$) and training tokens ($D$) should be set to achieve the lowest possible loss for a given compute budget ($C$). Kaplan: $N_\text{optimal} \propto C^{0.73}$, Chinchilla: $N_\text{optimal} \propto C^{0.50}$. This paper finds that much of this discrepancy can be attributed to Kaplan counting non-embedding rather than total parameters, combined with their analysis being performed at small scale. Simulating the Chinchilla study under these conditions produces biased scaling coefficients close to Kaplan's. Hence, this paper reaffirms Chinchilla's scaling coefficients, by explaining the primary cause of Kaplan's original overestimation. As a second contribution, the paper explains differences in the reported relationships between loss and compute. These findings lead us to recommend that future scaling studies use total parameters and compute.

Citations (5)

Summary

  • The paper reconciles the difference between Kaplan's and Chinchilla's scaling laws by showing it stems largely from Kaplan counting non-embedding rather than total parameters, combined with working at small scale.
  • The paper adapts the Chinchilla loss model to non-embedding parameters and compute, and validates the derived scaling coefficients with synthetic data simulations.
  • The paper shows that proper parameter accounting is needed for reliable compute-optimal training recipes, with direct consequences for resource allocation and cost efficiency.

Reconciling Kaplan and Chinchilla Scaling Laws: A Detailed Examination

The paper "Reconciling Kaplan and Chinchilla Scaling Laws" by Tim Pearce and Jinyeop Song addresses a pivotal aspect of the development of LLMs by reconciling discrepancies between the scaling laws proposed by Kaplan et al. (2020) and Hoffmann et al. (2022). Despite both studies focusing on the scaling behavior of transformers trained on next-token language prediction, they arrived at conflicting conclusions regarding the optimal trade-off between the number of parameters (N) and training tokens (D) for a given compute budget (C). This paper meticulously investigates the cause of this discrepancy and reaffirms the scaling coefficients proposed by the Chinchilla paper, demonstrating that Kaplan's overestimation can be attributed to counting non-embedding parameters and conducting their analysis at a smaller scale.

Introduction

The Kaplan and Chinchilla studies had a significant impact on the direction of resource allocation in LLMs. Kaplan's results suggested that "big models may be more important than big data," encouraging a focus on model size. This was reflected in their scaling laws: $N_\text{optimal} \propto C^{0.73}$ and $D_\text{optimal} \propto C^{0.27}$. In contrast, Chinchilla proposed $N_\text{optimal} \propto C^{0.50}$ and $D_\text{optimal} \propto C^{0.50}$, urging a shift towards training more modest-sized models on larger datasets. The paper investigates why these two influential studies, both using transformer models, arrived at such different recommendations.
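The two sets of exponents are linked through the standard compute approximation $C \approx 6ND$: once the exponent for $N_\text{optimal}$ is fixed, the exponent for $D_\text{optimal}$ follows directly, as the short derivation below shows.

$$
D_\text{optimal} \;=\; \frac{C}{6\,N_\text{optimal}} \;\propto\; \frac{C}{C^{a}} \;=\; C^{\,1-a}
$$

So $a = 0.73$ gives $D_\text{optimal} \propto C^{0.27}$ (Kaplan), while $a = 0.50$ gives $D_\text{optimal} \propto C^{0.50}$ (Chinchilla).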

Analysis Process

The reconciliation process is divided into four clear steps:

  1. Parameter Function Fitting: The authors propose a function to relate non-embedding parameters to total parameters, using empirical data from the Chinchilla paper and theoretical insights from the Kaplan paper. They suggest the function $N_T = N_{\setminus E} + \gamma N_{\setminus E}^{1/3}$, where $\gamma$ is a constant derived from model configurations.
  2. Loss Model Modification: They modify the loss function used by Chinchilla to be in terms of non-embedding parameters and compute, acknowledging that embedding parameters are not negligible at small scales.
  3. Scaling Relationship Derivation: The relationship between optimal parameters and compute is analytically derived, showing that scaling coefficients at smaller scales can exhibit different behavior compared to large scales. They derive that a correct interpretation at small scales would lead to a local scaling coefficient in line with Kaplan's findings.
  4. Synthetic Data Simulation: Utilizing the Chinchilla setup, synthetic training curves are generated to fit local power laws and estimate scaling coefficients in the model parameter ranges used by Kaplan. This empirical verification closely aligns with Kaplan's reported coefficients (a minimal sketch of this step appears after this list).
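The simulation step can be outlined in a few lines of NumPy. The sketch below is illustrative only, not the paper's exact procedure: it uses approximately the parametric loss constants reported by Hoffmann et al. (2022), the standard $C \approx 6N_T D$ approximation, a hypothetical value for $\gamma$, and an assumed small-scale compute range. It measures the local exponent obtained when compute-optimal model sizes are expressed in non-embedding parameters.

```python
import numpy as np

# Sketch: simulate compute-optimal model sizes under a Chinchilla-style loss
# while (a) counting non-embedding rather than total parameters and
# (b) restricting attention to small compute budgets, then fit a local power law.

# Chinchilla parametric loss, L(N, D) = E + A/N^alpha + B/D^beta
# (values approximately as reported by Hoffmann et al. 2022).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

GAMMA = 1e5  # hypothetical constant in N_T = N_nonemb + GAMMA * N_nonemb**(1/3)

def total_params(n_ne):
    """Total parameters as a function of non-embedding parameters (paper's proposed form)."""
    return n_ne + GAMMA * n_ne ** (1.0 / 3.0)

def loss(n_ne, compute):
    """Chinchilla-style loss at a given non-embedding size and compute budget (C ~ 6*N_T*D)."""
    n_t = total_params(n_ne)
    tokens = compute / (6.0 * n_t)
    return E + A / n_t ** alpha + B / tokens ** beta

# Small compute budgets, roughly the regime Kaplan probed (assumed range).
computes = np.logspace(17, 20, 16)
n_grid = np.logspace(5, 10, 5000)  # candidate non-embedding parameter counts

# For each budget, pick the non-embedding size that minimizes the loss.
n_opt = np.array([n_grid[np.argmin(loss(n_grid, c))] for c in computes])

# Local power-law fit: exponent a in N_opt ∝ C^a, measured in non-embedding terms.
a_local, _ = np.polyfit(np.log(computes), np.log(n_opt), 1)
print(f"local exponent (non-embedding count, small scale): {a_local:.2f}")
```

Because the embedding term $\gamma N_{\setminus E}^{1/3}$ is a sizeable fraction of the total at small scale, the fitted local exponent in non-embedding terms comes out larger than the Chinchilla value of $0.50$, which is the bias the paper identifies; the exact number depends on the assumed $\gamma$ and compute range.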

Numerical Results and Empirical Verification

The paper provides robust numerical results, demonstrating that using non-embedding parameter counts and smaller model sizes results in scaling coefficients remarkably close to those reported by Kaplan. Specifically, they report coefficients of $0.74$ and $0.78$, compared to Kaplan's $0.73$. This empirical alignment firmly establishes the source of discrepancy between the two studies.

Moreover, the paper includes experimental validation by training small-scale models and verifying the derived scaling coefficients. Such thorough experimental validation strengthens the argument that the choice of parameter counting and the model scale significantly impact the resulting scaling laws.

Implications and Future Directions

This reconciliation has important implications for the design and deployment of LLMs. The findings suggest that both parameter types (embedding and non-embedding) should be considered when studying scaling laws, especially at smaller scales. This reconciled understanding can lead to more efficient resource allocation, potentially reducing the environmental and financial costs of training large models.

From a theoretical perspective, the paper invites further exploration into the relative importance of different parameter types and their specific roles in model performance. Future work could explore the contributions of embedding layers to the broader scaling behavior of LLMs. Additionally, the methodology of using synthetic data to explore scaling laws provides a template for further studies on other aspects of LLM scaling.

Conclusion

The reconciled understanding of scaling laws presented in this paper provides a clearer roadmap for optimizing the trade-offs between model size and data quantity. By accounting for previously overlooked factors, such as the inclusion of embedding parameters and the impact of model scale, the authors offer a nuanced perspective that can guide future research and development in AI.
