
Chinchilla-Style Scaling Practices

Updated 30 November 2025
  • Chinchilla-style scaling practices are empirical methods that balance model size (N) and dataset size (D) to minimize training loss using power-law formulations.
  • They employ robust log-space fitting, precise parameter counting, and hyperparameter tuning to ensure reliable extrapolation across diverse compute regimes.
  • These methods extend to applications in sparse training, code language models, and inference-optimized architectures, demonstrating broad relevance.

Chinchilla-style scaling practices refer to empirical methodologies for allocating compute budgets between model parameter count ($N$) and dataset size ($D$) when training transformer-based LLMs, with the aim of minimizing loss for a given amount of compute. Originating in Hoffmann et al. (2022) and subsequently rigorously replicated and extended, these practices are now foundational in LLM development. The key insight is that performance scales optimally when $N$ and $D$ increase in tandem, typically $\propto C^{1/2}$ for compute budget $C$, yielding a characteristic "tokens-per-parameter" ratio. Precise loss-surface modeling, robust fitting procedures, and careful parameter counting are essential for consistency and reliable extrapolation.

1. Mathematical Formulation and Parametric Loss Laws

Chinchilla-style scaling models the pretraining cross-entropy loss ($L$) as a function of model size ($N$) and training tokens ($D$) using a power-law sum:

$$L(N,D) = E + A N^{-\alpha} + B D^{-\beta}$$

where $E$ is the irreducible loss floor, and $A, B$ scale the contributions from model size and data size. The exponents $\alpha$ and $\beta$ (both $\approx 0.33\!-\!0.37$ in typical natural language settings) are empirically determined through large sweeps and robust loss fitting, employing Huber minimization in log-space and bootstrap resampling for uncertainty quantification (Besiroglu et al., 15 Apr 2024). Robust fitting in log-space, with BFGS optimization and summation (not averaging) of per-point residuals, is essential to obtain valid confidence intervals and reproducible exponents.

Correct parameter estimation yields values such as:

  • $A = 482.01\ (\pm 124.58)$
  • $B = 2085.43\ (\pm 1293.23)$
  • $E = 1.8172\ (\pm 0.03)$
  • $\alpha = 0.3478\ (\pm 0.02)$
  • $\beta = 0.3658\ (\pm 0.02)$

Thus, at scale:

$$L(N,D) = 1.8172 + 482.01\,N^{-0.3478} + 2085.43\,D^{-0.3658}$$

Under a compute budget $C$ (FLOPs, with $C \approx 6ND$), minimizing $L(N,D)$ subject to $6ND = C$ yields the optimal allocation:

$$N_{\rm opt} \propto C^{a}, \quad D_{\rm opt} \propto C^{1-a}$$

with $a = \frac{\beta}{\alpha+\beta} \approx 0.513$. This enforces a near-constant ratio $D/N \approx 20$ at large scale (Besiroglu et al., 15 Apr 2024, Pearce et al., 12 Jun 2024, Hoffmann et al., 2022).
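
For concreteness, a minimal sketch (not drawn from the cited papers) of this closed-form allocation, using the fitted constants quoted above and the $C \approx 6ND$ approximation:

```python
# Minimal sketch: compute-optimal (N, D) allocation from the fitted constants above.
# Assumes C ~ 6*N*D; the constants are the Besiroglu et al. re-estimates quoted in this section.
A, B = 482.01, 2085.43
alpha, beta = 0.3478, 0.3658

def chinchilla_optimal(C):
    """Minimize E + A*N^-alpha + B*D^-beta subject to 6*N*D = C (closed form)."""
    a = beta / (alpha + beta)                               # ~0.513
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))  # prefactor from the first-order condition
    N_opt = G * (C / 6.0) ** a
    D_opt = (C / 6.0) / N_opt                               # enforce the compute constraint
    return N_opt, D_opt

N, D = chinchilla_optimal(1e24)   # e.g. a 1e24-FLOP budget
print(f"N ~ {N:.2e} params, D ~ {D:.2e} tokens, D/N ~ {D / N:.1f}")
```

For a $10^{24}$-FLOP budget this yields a ratio close to the $D/N \approx 20$ rule of thumb.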

2. Best Practices and Estimation Workflow

Accurate Chinchilla-style scaling requires:

  • Raw data reconstruction: Careful digitization from primary plots, including color-bar sampling and estimation of digitization noise.
  • Log-space robust fitting: Summing robust Huber penalties over residuals (never averaging) and verifying optimizer convergence; a minimal fitting sketch follows this list.
  • Confidence interval reporting: Full bootstrap on the fitting objective; meaningful CIs on exponents $\lesssim 0.4$ may demand tens to hundreds of thousands of runs.
  • Parameter counting: Always include all parameters, embedding as well as non-embedding layers (Pearce et al., 12 Jun 2024, Porian et al., 27 Jun 2024).
  • Hyperparameter tuning: Learning rate, batch size, and AdamW $\beta_2$ should be scaled with $N$; at small batch size, $\beta_2 = 0.99$ is recommended (Porian et al., 27 Jun 2024).
  • Architectural calibration: Conditional scaling laws account for hidden size, MLP-to-attention ratio, and grouped-query attention, yielding Pareto-efficient designs with loss and inference cost jointly optimized (Bian et al., 21 Oct 2025).
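
The following is a minimal fitting sketch under these practices, assuming a list of `(N, D, loss)` measurements from a sweep; the log-sum-exp parameterization, random restarts, and SciPy routines are reasonable choices for illustration rather than the exact configuration used in the cited works:

```python
# Sketch: robust log-space fit of L(N, D) = E + A*N^-alpha + B*D^-beta.
# runs = [(N_params, D_tokens, observed_loss), ...] is assumed to come from a training sweep.
import numpy as np
from scipy.optimize import minimize
from scipy.special import huber, logsumexp

def fit_chinchilla(runs, delta=1e-3, n_restarts=32, seed=0):
    Ns, Ds, Ls = (np.array(x, dtype=float) for x in zip(*runs))

    def objective(theta):
        a, b, e, alpha, beta = theta                        # A = exp(a), B = exp(b), E = exp(e)
        log_pred = logsumexp(                               # log of the predicted loss, stable in log-space
            [a - alpha * np.log(Ns), b - beta * np.log(Ds), e * np.ones_like(Ns)], axis=0)
        resid = log_pred - np.log(Ls)
        return huber(delta, resid).sum()                    # SUM of per-point Huber penalties, never the mean

    rng = np.random.default_rng(seed)
    inits = rng.uniform([0, 0, -1, 0, 0], [10, 10, 1, 1, 1], size=(n_restarts, 5))
    best = min((minimize(objective, x0, method="BFGS") for x0 in inits), key=lambda r: r.fun)
    a, b, e, alpha, beta = best.x
    return dict(A=np.exp(a), B=np.exp(b), E=np.exp(e), alpha=alpha, beta=beta)
```

Bootstrap confidence intervals then follow by repeating the fit on resampled subsets of the runs.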

3. Extensions: Sparse Training, Code LLMs, Data-Constrained Regimes

Sparse Pretraining

Unified scaling laws using the average parameter count $\bar{N}$ over the pretraining schedule yield accurate loss predictions and inference-time savings (Jin et al., 21 Jan 2025):

$$L(\bar{N}, D) = A \bar{N}^{-\alpha} + B D^{-\beta} + E$$

where $\bar{N} = (1/T)\sum_{k=1}^{T} N_k$ is the average over pruning iterations. A typical 25%–50%–25% schedule (dense burn-in, iterative pruning, sparse recovery) achieves near-dense performance with smaller deployable models.
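
As an illustrative sketch of $\bar{N}$, assuming the 25%–50%–25% split above together with a cubic pruning trajectory (the trajectory and the densities below are assumptions for illustration):

```python
# Illustrative sketch: average parameter count N_bar over a sparse-pretraining run.
# Assumes a 25%-50%-25% schedule (dense burn-in, gradual pruning, sparse recovery)
# with a cubic-style density decay during pruning; the exact trajectory is an assumption.
import numpy as np

def average_param_count(n_dense, final_density, total_steps):
    steps = np.arange(total_steps)
    density = np.ones(total_steps)                          # burn-in phase: fully dense
    prune_start, prune_end = int(0.25 * total_steps), int(0.75 * total_steps)
    t = (steps[prune_start:prune_end] - prune_start) / (prune_end - prune_start)
    density[prune_start:prune_end] = final_density + (1 - final_density) * (1 - t) ** 3
    density[prune_end:] = final_density                     # sparse recovery phase
    return n_dense * density.mean()

n_bar = average_param_count(n_dense=1e9, final_density=0.25, total_steps=10_000)
print(f"N_bar ~ {n_bar:.2e} parameters")
```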

Scaling Laws for Code LLMs

Empirical fits on code datasets reveal a substantially more data-hungry regime. For code LLMs:

L(N,D)=0.2193+534.374N0.4853+76.0743D0.2983L(N, D) = 0.2193 + 534.374 N^{-0.4853} + 76.0743 D^{-0.2983}

Optimal $D/N$ rises to $O(100\text{–}300)$ at large compute, far greater than the natural-language value ($\sim 20$). Mixtures with natural language regularize loss at low compute but degrade high-compute, code-centric training (Luo et al., 9 Oct 2025).
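
Plugging the code-specific constants into the same closed-form allocation used in Section 1 gives a quick sense of the larger optimal $D/N$ (the budgets below are arbitrary illustrations):

```python
# Sketch: optimal D/N under the code-LLM fit above, reusing the Chinchilla closed form.
A, B, alpha, beta = 534.374, 76.0743, 0.4853, 0.2983     # code-specific constants quoted above

for C in (1e21, 1e22, 1e23):                              # illustrative compute budgets (FLOPs)
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N = G * (C / 6.0) ** (beta / (alpha + beta))
    D = (C / 6.0) / N
    print(f"C={C:.0e}: D/N ~ {D / N:.0f}")
```

The printed ratios land in the hundreds rather than near 20, with the exact figures depending on the budget range.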

Data-Constrained Scaling

In limited-data regimes, the marginal value of repeated data and of excess parameters decays exponentially, so the effective data $D'$ and effective parameters $N'$ saturate beyond the critical values $R_D^*$ (repeated epochs) and $R_N^*$ (parameter excess):

$$D' = U_D + U_D R_D^* \left(1 - e^{-R_D/R_D^*}\right)$$

$$N' = U_N + U_N R_N^* \left(1 - e^{-R_N/R_N^*}\right)$$

Loss plateaus after $\sim$16 epochs on repeated data, so increasing the number of epochs is preferable to growing parameters up to this threshold (Muennighoff et al., 2023).
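
A sketch of the data-repetition correction, where $R_D$ is taken to be the number of excess epochs beyond the first pass and the decay constant is a placeholder assumption rather than the published fit:

```python
# Sketch: effective data D' under repetition, following the functional form above.
# r_d_star is a placeholder decay constant, NOT the value fitted by Muennighoff et al.
import math

def effective_data(unique_tokens, total_tokens, r_d_star=15.0):
    r_d = total_tokens / unique_tokens - 1.0               # excess epochs beyond the first pass
    return unique_tokens + unique_tokens * r_d_star * (1.0 - math.exp(-r_d / r_d_star))

# With this placeholder constant, 16 epochs over a 100B-token corpus count for roughly 10x,
# not 16x, the unique data: diminishing returns near the ~16-epoch plateau noted above.
print(f"{effective_data(100e9, 1.6e12):.2e}")
```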

4. Recent Developments: Farseer Law and Compute-Centric Scaling

Farseer Scaling Law

Farseer refines Chinchilla by letting the data-scaling exponent and coefficient be explicit functions of $N$, yielding better cross-scale extrapolation:

$$L(N,D) = e^{a_3 N^{\gamma} + b_3} + e^{a_2 N^{\beta} + b_2}\, D^{-\exp(a_1 N^{\alpha} + b_1)}$$

This captures nuances missed by fixed-exponent, additive models and predicts rising optimal $D/N$ for extreme budgets (Li et al., 12 Jun 2025).
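
For reference, the Farseer functional form translates directly into code; the coefficients are placeholders meant only to show where the $N$-dependent data exponent enters, not the fitted values:

```python
# Sketch of the Farseer functional form; coefficients (a_i, b_i) and exponents are
# placeholders for illustration, not the fitted values from Li et al.
import math

def farseer_loss(N, D, a1, b1, a2, b2, a3, b3, alpha, beta, gamma):
    data_exponent = math.exp(a1 * N ** alpha + b1)         # the data exponent now depends on N
    return math.exp(a3 * N ** gamma + b3) + math.exp(a2 * N ** beta + b2) * D ** (-data_exponent)
```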

Unified Compute Scaling

Independent empirical fits show that model performance (measured in bits-per-character) is log-linear in total training compute $C = ND$ (here the parameter–token product), largely agnostic to the specific $D/N$ allocation:

$$\mathrm{BPC}(N, D) = -0.031 \log(ND) + 0.572$$

For inference efficiency, the minimal $N$ with $D = C/N$ is optimal, subject to downstream quality (Guo, 30 Apr 2024).

5. Inference-Adjusted and Architecture-Aware Scaling

Recent work has incorporated inference demand and architectural constraints into scaling law optimization.

  • Inference penalty: Lifetime compute (training + inference) is minimized by reducing $N$ and increasing $D$ for models expected to serve heavy inference loads, often pushing $D/N \gg 20$ (Sardana et al., 2023, Bian et al., 30 Jan 2025). Optimization can be carried out in closed form jointly over $N$, $D$, and model shape (depth/width); see the numerical sketch after this list.
  • Model shape: Wider and shallower architectures (high hidden-size, low layer count) yield lower latency for fixed accuracy. Pareto-optimal accuracy/latency curves can be traced via penalized loss functions and hardware measurements (Bian et al., 30 Jan 2025, Bian et al., 21 Oct 2025).
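
The numerical sketch below illustrates the inference-adjusted trade-off under the Chinchilla fit from Section 1; the $2N$ FLOPs-per-served-token approximation, the target loss, and the SciPy solver are assumptions of this sketch, not the exact procedure of the cited papers:

```python
# Sketch: inference-adjusted allocation in the spirit of Sardana et al.
# Minimize lifetime FLOPs (~6*N*D for training plus ~2*N per served token for inference)
# subject to reaching a target loss under the Chinchilla fit from Section 1.
import numpy as np
from scipy.optimize import minimize_scalar

A, B, E = 482.01, 2085.43, 1.8172
alpha, beta = 0.3478, 0.3658

def lifetime_optimal(target_loss, inference_tokens):
    n_floor = (A / (target_loss - E)) ** (1.0 / alpha)      # below this N the target loss is unreachable

    def tokens_needed(N):
        # Solve E + A*N^-alpha + B*D^-beta = target_loss for D at a given N.
        gap = target_loss - E - A * N ** -alpha
        return (B / gap) ** (1.0 / beta)

    def lifetime_flops(log_n):
        N = 10.0 ** log_n
        return 6.0 * N * tokens_needed(N) + 2.0 * N * inference_tokens

    res = minimize_scalar(lifetime_flops, method="bounded",
                          bounds=(np.log10(n_floor) + 0.01, np.log10(n_floor) + 4.0))
    N = 10.0 ** res.x
    return N, tokens_needed(N)

N, D = lifetime_optimal(target_loss=2.1, inference_tokens=1e13)  # heavy inference demand
print(f"N ~ {N:.2e}, D ~ {D:.2e}, D/N ~ {D / N:.0f}")
```

Under heavy inference demand the optimum shifts to a smaller $N$ and larger $D$ than the training-only allocation, i.e. $D/N$ well above 20.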
| Scaling Law | Loss Formula | Exponent Range | Optimal $D/N$ at Large $C$ | Application |
|---|---|---|---|---|
| Chinchilla | $E + A N^{-\alpha} + B D^{-\beta}$ | $\alpha, \beta \sim 0.33$ | $\sim 20$ | NL LLMs |
| Chinchilla (code) | $E + A N^{-\alpha} + B D^{-\beta}$ (code-specific fit) | $\alpha \approx 0.48$, $\beta \approx 0.30$ | 100–300 | Code LLMs |
| Sparse Chinchilla | $E + A \bar{N}^{-\alpha} + B D^{-\beta}$ | $\alpha$, $\beta$ fitted | Depends on $\bar{N}$ | Sparse training |
| Farseer | $e^{a_3 N^{\gamma} + b_3} + e^{a_2 N^{\beta} + b_2} D^{-\exp(a_1 N^{\alpha} + b_1)}$ | Flexible | Rises with $C$ | Cross-scale, ablation |
| Compute-centric | $\alpha \log(ND) + \beta$ | $\alpha = -0.031$ | Any | Efficiency, hardware |

6. Caveats, Pitfalls, and Regimes of Validity

  • Parameter counting: Always count all parameters—including embeddings and head—especially at small scale to avoid exponent inflation (Pearce et al., 12 Jun 2024, Porian et al., 27 Jun 2024).
  • Confidence intervals: Robust CIs require large bootstraps; overly narrow intervals frequently indicate misconfigured fitting (e.g., averaging losses) (Besiroglu et al., 15 Apr 2024).
  • Architectural impact: Non-standard shapes, large/small sequence lengths, or alternative modalities may break classic scaling and require retuning (Bian et al., 21 Oct 2025).
  • Extreme regimes: Data-constrained and high-inference applications often deviate from standard Chinchilla ratios, requiring alternate formulations or regime identification.
  • Extrapolation: Fixed-exponent models (Chinchilla) may over- or under-estimate loss when extrapolated beyond the fitted range; Farseer or differential piecewise fits are preferred for large-scale prediction.

7. Practical Recipe for Compute-Optimal Training

  1. Determine the compute budget $C$ (FLOPs).
  2. Parametric fitting: Fit $L(N,D)$ from baseline experiments, ensuring full parameter counts and robust loss minimization.
  3. Allocate: For NL LLMs,
    • $N_{\rm opt} = \left(\frac{C}{6 \times 20}\right)^{1/2}$,
    • $D_{\rm opt} = 20\,N_{\rm opt}$.
  4. Hyperparameter scaling: Learning rate $\eta = 0.022\,(N/10^6)^{-0.31}$, batch size $B = 6\,(N/10^6)^{0.67}$, AdamW $\beta_2 = 0.99$ for small $B$ (steps 1–4 are condensed in the sketch after this list).
  5. Architectural optimization: For inference efficiency, select hidden size, MLP/attention ratio, and GQA using conditional scaling laws and Pareto analysis.
  6. Sparse/pruned training: Select the initial width such that $\bar{N}$ matches the desired dense equivalent; schedule pruning between 25–75% of the budget (Jin et al., 21 Jan 2025).
  7. If code or limited data: Fit code-specific exponents and allocate heavily to data ($D/N \gg 20$). In unique-text-starved regimes, repeat data for up to $\sim$16 epochs before returns diminish (Luo et al., 9 Oct 2025, Muennighoff et al., 2023).
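
A condensed sketch of steps 1–4 for a natural-language LLM; the constants are those quoted in this section and should be treated as starting points, and the batch-size units follow the fit referenced in step 4:

```python
# Condensed sketch of steps 1-4 above for an NL LLM; constants are the ones quoted in this section.
def compute_optimal_recipe(C_flops):
    N = (C_flops / (6 * 20)) ** 0.5               # step 3: N_opt = sqrt(C / 120)
    D = 20 * N                                    # step 3: D_opt = 20 * N_opt
    lr = 0.022 * (N / 1e6) ** -0.31               # step 4: learning-rate scaling with N
    batch = 6 * (N / 1e6) ** 0.67                 # step 4: batch-size scaling (units as in the referenced fit)
    return dict(N=N, D=D, lr=lr, batch_size=batch, adamw_beta2=0.99)

print(compute_optimal_recipe(1e23))               # e.g. a 1e23-FLOP budget
```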

Chinchilla-style scaling remains the central paradigm for efficient, predictable LLM training, supporting diverse contexts from sparse models through data-constrained and inference-optimized deployments. Consistency in data handling, parameter accounting, and robust statistical methods is essential for reproducibility and optimal resource utilization (Besiroglu et al., 15 Apr 2024, Pearce et al., 12 Jun 2024, Hoffmann et al., 2022, Li et al., 12 Jun 2025, Luo et al., 9 Oct 2025, Muennighoff et al., 2023, Bian et al., 30 Jan 2025, Porian et al., 27 Jun 2024).
