Compute-Optimal LLMs Provably Generalize Better With Scale
(2504.15208v1)
Published 21 Apr 2025 in cs.LG and cs.AI
Abstract: Why do larger LLMs generalize better? To investigate this question, we develop generalization bounds on the pretraining objective of LLMs in the compute-optimal regime, as described by the Chinchilla scaling laws. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. This generalization bound can be decomposed into three interpretable components: the number of parameters per token, the loss variance, and the quantization error at a fixed bitrate. As compute-optimal LLMs are scaled up, the number of parameters per data point remains constant; however, both the loss variance and the quantization error decrease, implying that larger models should have smaller generalization gaps. We examine why larger models tend to be more quantizable from an information theoretic perspective, showing that the rate at which they can integrate new information grows more slowly than their capacity on the compute-optimal frontier. From these findings we produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.
Summary
The paper introduces a novel empirical Freedman inequality that tightens the generalization bound for compute-optimal LLMs.
It demonstrates that increased model scale reduces per-token loss variance, thereby narrowing the generalization gap.
By linking quantization techniques with scaling laws, the study offers practical insights into enhancing LLM compressibility and efficiency.
This paper investigates why larger LLMs tend to generalize better, focusing specifically on models trained in the "compute-optimal" regime described by the Chinchilla scaling laws (2203.15556). The core idea is to develop and analyze generalization bounds for the pretraining objective (the next-token negative log-likelihood, NLL) that can explain the empirically observed small generalization gap (the difference between test and training loss) and its behavior with increasing model scale.
The analysis breaks down the expected test error into three components: irreducible error (inherent randomness in data), approximation gap (training error), and generalization gap (difference between test and train error). The paper focuses on bounding the token-wise generalization gap, building upon prior work (2402.06073) that used martingale theory for non-IID sequence data.
A key contribution is a novel, fully empirical Freedman-type martingale concentration inequality (Theorem 3.1). Unlike standard Azuma-Hoeffding inequalities which depend only on the range of the loss, or Freedman's inequality which depends on the (often unknown) conditional variance, this new bound incorporates a fully empirical variance proxy term, denoted Σ.
Empirical Freedman Inequality (Theorem 3.1): For a martingale difference sequence $X_k - \mathbb{E}[X_k \mid \mathcal{F}_{k-1}]$, this theorem bounds the average difference:
$$\frac{1}{n}\sum_{k=1}^{n}\Big(\mathbb{E}[X_k \mid \mathcal{F}_{k-1}] - X_k\Big) \;\le\; \Delta C + \sqrt{\Sigma C}$$
where C relates to complexity (on the order of log(1/δ)/n), Δ is a bound on the loss differences, and Σ depends on the empirical variation of the sequence relative to a predictable quantity $Y_k$. Crucially, Σ can be estimated directly from training data. When Σ is small, the bound tightens significantly, moving from a √C dependence towards a linear C dependence.
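The following minimal sketch illustrates how the quantities in this inequality could be computed from a loss sequence. It is an assumption-laden illustration, not the paper's construction: the losses are synthetic, $Y_k$ is taken as a causal running mean of earlier losses, and Σ is taken as the mean squared deviation from $Y_k$.

```python
import numpy as np

# Sketch of the quantities in the empirical Freedman-type bound reconstructed
# above. The exact definitions of Delta, Sigma, and the predictable Y_k in the
# paper may differ; the choices below are assumptions for illustration only.
rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=1.0, size=100_000)   # stand-in per-token NLL losses

n = len(X)
delta = 0.01
C = np.log(1 / delta) / n                           # complexity-like term log(1/delta)/n
Delta = X.max()                                     # crude bound on the loss range

# Predictable Y_k: mean of strictly earlier losses (Y_1 set to an arbitrary constant).
Y = np.concatenate([[0.0], np.cumsum(X)[:-1] / np.arange(1, n)])
Sigma = np.mean((X - Y) ** 2)                       # empirical variation proxy

print(f"Delta*C = {Delta * C:.2e}, sqrt(Sigma*C) = {np.sqrt(Sigma * C):.2e}")
```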
This inequality is then incorporated into a generalization bound for LLMs (Theorem 3.4). The bound requires quantizing the model to b bits per parameter to make the hypothesis class finite and relate complexity to model size. It also uses prediction smoothing (mixing model output with a uniform distribution) to bound the maximum possible NLL loss (Δ).
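A brief sketch of prediction smoothing, assuming a simple mixture form with an illustrative mixing weight α (the paper's exact parameterization may differ); it shows how mixing with the uniform distribution caps the per-token NLL at log(V/α), which is what supplies a finite Δ:

```python
import numpy as np

# Minimal sketch of prediction smoothing as described above: mix the model's
# next-token distribution with a uniform one so the per-token NLL is bounded.
# The mixing weight alpha and the function name are illustrative choices.
def smoothed_nll(model_probs: np.ndarray, targets: np.ndarray, alpha: float) -> np.ndarray:
    V = model_probs.shape[-1]
    p_smooth = (1 - alpha) * model_probs + alpha / V      # every probability >= alpha / V
    return -np.log(p_smooth[np.arange(len(targets)), targets])

V, alpha = 50_000, 0.1
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(V), size=4)                 # stand-in model outputs
targets = rng.integers(V, size=4)
print(smoothed_nll(probs, targets, alpha))
# However wrong the model is, the smoothed NLL never exceeds
# Delta = -log(alpha / V) = log(V / alpha).
print("Delta =", np.log(V / alpha))
```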
Main Generalization Bound (Theorem 3.4): With probability 1−δ, the token-wise population risk of a smoothed, quantized model ($R_{sq}$) is bounded by:
$$R_{sq} \;\le\; \hat{R}_h + C\log V + \sqrt{\Sigma C} + 2\sqrt{C} + \big(\hat{R}_q - \hat{R}_h\big)$$
where the terms are as follows (a numerical sketch of assembling them appears after this list):
$\hat{R}_h$: Empirical training loss of the original model h.
$C = \frac{N}{D}\, b \log 2 + \frac{1}{D}\log\frac{|K|}{\delta}$: Complexity term. N is the number of parameters, D the number of training tokens. In the Chinchilla regime, N/D is constant (≈ 1/20). b is the number of bits per parameter used for quantization.
log V: NLL of random guessing over a vocabulary of size V. The C log V term is often the largest.
Σ: Empirical loss variation term (from Theorem 3.1), computed on training data using the quantized model q.
2√C: Cost associated with prediction smoothing.
$(\hat{R}_q - \hat{R}_h)$: Quantization gap (difference in training loss between the quantized and original models).
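As noted above, here is a minimal sketch of assembling the bound from these terms, using the reconstructed form of Theorem 3.4 and purely illustrative placeholder values; in particular, reading |K| as the size of a small discrete hyperparameter grid covered by a union bound is an assumption.

```python
import numpy as np

# Sketch of assembling the bound of Theorem 3.4 in the reconstructed form above.
# All numerical values are illustrative placeholders, not values from the paper.
N = 1.4e9            # parameters
D = 20 * N           # tokens (compute-optimal: N/D ~ 1/20)
b = 4                # bits per parameter after quantization
V = 50_000           # vocabulary size
delta = 0.01         # failure probability
K_size = 10          # |K| (assumed meaning: size of a discrete hyperparameter grid)
R_h = 2.10           # empirical training NLL of the original model (placeholder)
R_q = 2.14           # empirical training NLL of the quantized model (placeholder)
Sigma = 0.5          # empirical loss-variation proxy from Theorem 3.1 (placeholder)

C = (N / D) * b * np.log(2) + np.log(K_size / delta) / D      # complexity term
bound = R_h + C * np.log(V) + np.sqrt(Sigma * C) + 2 * np.sqrt(C) + (R_q - R_h)
print(f"C = {C:.4f}, bound on R_sq <= {bound:.3f}")
```

With these placeholder numbers the C log V term contributes about 1.5 nats, consistent with the observation below that it dominates the bound.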
The paper evaluates this bound empirically using the Pythia model suite (2304.01373) trained on the Pile dataset (2101.00027), selecting checkpoints along the compute-optimal frontier (N/D ≈ 1/20).
Empirical Findings:
The empirical loss variation term Σ decreases predictably with model size N, roughly as $c + kN^{-0.5}$. Larger models have lower loss variance per token.
The quantization gap $(\hat{R}_q - \hat{R}_h)$ for a fixed bitrate (e.g., b = 4 using GPTQ (2210.17323)) also tends to decrease as models get larger.
As N and D increase along the compute-optimal path, the overall bound on $R_{sq}$ decreases, mirroring the decrease in the empirical training loss $\hat{R}_h$. This supports the idea that the theory captures the trend of better generalization with scale.
The C log V term dominates the bound's value. Since N/D is constant and b is fixed, this term does not inherently decrease with scale in this formulation (see the sketch after this list).
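The sketch below, referenced at the end of the list, evaluates the generalization-gap terms along the D = 20N frontier using placeholder Chinchilla-style constants and an assumed Σ(N) = c + kN^{-0.5} trend (none of these numbers are fits from the paper); it illustrates why the C log V term stays flat while the variance-driven term shrinks.

```python
import numpy as np

# How the bound's terms behave along the compute-optimal frontier (D = 20 N,
# fixed b = 4). The loss-law constants and the Sigma(N) = c + k * N**-0.5 fit
# are illustrative placeholders; the quantization gap is omitted here.
V, b, delta, K_size = 50_000, 4, 0.01, 10
E, A, B, a_exp, beta = 1.69, 406.4, 410.7, 0.34, 0.28    # Chinchilla-style placeholders
c, k = 0.3, 2e3                                          # placeholder Sigma(N) trend

for N in [7e7, 4e8, 1.4e9, 7e9, 7e10]:
    D = 20 * N
    R_h = E + A * N**-a_exp + B * D**-beta               # placeholder training loss
    Sigma = c + k * N**-0.5                              # assumed variance trend
    C = (N / D) * b * np.log(2) + np.log(K_size / delta) / D
    gap = C * np.log(V) + np.sqrt(Sigma * C) + 2 * np.sqrt(C)
    print(f"N={N:9.2e}  R_h={R_h:.3f}  Sigma={Sigma:.3f}  "
          f"C*logV={C * np.log(V):.3f}  gap <= {gap:.3f}")
```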
The paper further argues that the complexity term C might actually decrease with scale, making the bound even tighter for larger models. Two arguments are presented:
Improved Quantizability (Appendix B): Drawing on theory from QuIP quantization (2307.13304, 2310.16669), it is argued that if the Hessian spectrum decays rapidly (which empirical estimates suggest), larger models become inherently more quantizable. This means a smaller b could achieve the same quantization gap, reducing the dominant $(N/D)\, b \log 2$ part of C. The analysis involves Hessian incoherence and relies on theoretical quantization schemes (LDLQ with random rotations) that are computationally expensive but illustrate the principle. Empirical estimates of $\mathrm{Tr}(H^{1/2})$ suggest it grows sublinearly with N, supporting the idea that b can decrease with scale.
Information-Theoretic Complexity (Section 5.2): Using prequential coding arguments [Rissanen 1984, Dawid 1984], the information content K(h) of the model h (related to its description length) is upper-bounded by the cumulative reduction in loss during training (the area between the initial and final loss curves). For models following scaling laws $R(N,D) \approx E + AN^{-\alpha} + BD^{-\beta}$, this analysis yields $K(h) \propto D^{1-\beta}$. Since β < 1, the information stored grows sublinearly with the data D. If complexity L(h) ∝ K(h), then $C \propto L(h)/D \propto D^{-\beta}$. This implies that complexity decreases with scale along the compute-optimal path (where D grows as the square root of compute). Evaluating the bound with this information-theoretic complexity shows it decreases predictably with scale ($O(D^{-\beta/2})$, dominated by the smoothing term), although the absolute values are looser than the quantization-based bounds for current model sizes; a numerical check of the $K(h) \propto D^{1-\beta}$ scaling follows below.
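Below is the numerical check referenced above: under the assumption that the data-dependent part of the loss follows $B d^{-\beta}$, the area between the running loss curve and its final value grows as $D^{1-\beta}$. The values of β and B are placeholders, not the paper's fitted constants.

```python
import numpy as np

# Numerical check of the prequential-coding argument: if the data-dependent part
# of the loss is R(d) = E + B * d**(-beta), the cumulative loss reduction
# K(D) = integral_0^D (R(d) - R(D)) dd should scale as D**(1 - beta).
beta, B = 0.28, 410.7      # illustrative placeholders

def K_of_D(D: float, n: int = 100_000) -> float:
    d = np.geomspace(1.0, D, n)                     # log-spaced integration grid
    f = B * d**-beta - B * D**-beta                 # loss in excess of its final value
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(d)))   # trapezoid rule

Ds = np.array([1e9, 1e10, 1e11, 1e12])
Ks = np.array([K_of_D(D) for D in Ds])
slopes = np.diff(np.log(Ks)) / np.diff(np.log(Ds))
print("log-log slopes of K vs D:", np.round(slopes, 3), " expected:", 1 - beta)
```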
Conclusion: The paper provides theoretical and empirical evidence that LLMs trained compute-optimally generalize better with scale because:
The ratio of parameters to data points (N/D) remains constant.
The per-token loss variance (Σ) decreases.
The models become more compressible/quantizable (effectively a smaller b, or K(h) growing sublinearly in D), reducing the effective complexity C.
These factors combine within the derived generalization bound to predict a shrinking generalization gap as models scale. The work introduces a novel empirical Freedman inequality applicable to martingales and connects scaling laws, generalization theory, quantization, and information theory to explain a key phenomenon in LLMs. Limitations include the pessimism of the smoothing term and the need for deeper explanations of why loss variance and compressibility behave as observed.