
Predictive Power of Pre-training Loss

Updated 21 October 2025
  • The paper reveals that raw pre-training loss, uncorrected for scaling, fails as a reliable predictor of generalization in deep networks.
  • Layerwise normalization of weights disentangles capacity effects, yielding a nearly linear relationship between training and test cross-entropy losses across datasets.
  • The findings provide practical guidance by demonstrating that normalized loss metrics offer tight generalization bounds and robust model selection criteria.

The predictive power of pre-training loss concerns the degree to which the loss measured during or after pre-training of neural (particularly deep) networks anticipates a model's generalization, downstream performance, or transferability. While the intuitive assumption is that lower pre-training loss signals better future performance, extensive empirical and theoretical investigation has revealed subtleties in this relationship—most notably, the impact of normalization, loss decomposition, capacity scaling, and architecture-specific properties on the interpretability and reliability of pre-training loss as a predictor.

1. Decomposition of Cross-Entropy Loss and the Role of Capacity

A central finding is that, in deep networks trained with cross-entropy or other exponential-type losses, the raw pre-training loss can be misleading due to positive homogeneity properties and scaling effects. For deep ReLU networks, the function $f$ can be decomposed as

$$f(W^1,\dots,W^K; x) = \left( \prod_{k=1}^K \rho_k \right) f(\widetilde{W}^1,\dots, \widetilde{W}^K; x)$$

with each $W^k = \rho_k \widetilde{W}^k$ and $\Vert \widetilde{W}^k \Vert = 1$. This scaling invariance leaves the predicted class unchanged (classification depends only on the sign of $f$), but, due to the exponential sensitivity of the cross-entropy objective, it allows the training loss $L$ to decrease arbitrarily, regardless of the actual separation (or margin) between classes:

$$L = \sum_{n} \ln\left[1 + \exp\left(-y_n f(x_n)\right) \right] = \sum_{n} \ln\left[1 + \exp\left(-y_n \left(\prod_{k=1}^K \rho_k\right) f(\widetilde{W}; x_n) \right) \right]$$

Thus, overparameterization or capacity scaling can artificially drive down training loss, invalidating its direct use as a generalization predictor.
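
The following numpy sketch makes the scaling argument concrete. It is not the paper's code: the two-layer network, synthetic data, and $\rho$ values are invented for illustration. For a ReLU network that already separates its training set, multiplying every layer by $\rho > 1$ leaves each predicted label unchanged while the logistic (binary cross-entropy) training loss shrinks toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))

# Random two-layer ReLU network f(W1, W2; x) = W2 @ relu(W1 @ x).
W1 = rng.normal(size=(32, 10)) * 0.1
W2 = rng.normal(size=(1, 32)) * 0.1

def f(W1, W2, X):
    return (np.maximum(X @ W1.T, 0.0) @ W2.T).ravel()

# Labels chosen so the network classifies every point correctly (positive margin),
# mimicking a model trained to interpolation.
y = np.sign(f(W1, W2, X))

def logistic_loss(scores, y):
    return np.mean(np.log1p(np.exp(-y * scores)))

base_preds = np.sign(f(W1, W2, X))
for rho in [1.0, 2.0, 5.0, 10.0]:
    scores = f(rho * W1, rho * W2, X)   # positive homogeneity: scores scale by rho**2
    preds = np.sign(scores)
    print(f"rho={rho:>4}: loss={logistic_loss(scores, y):.6f}, "
          f"predictions unchanged: {np.all(preds == base_preds)}")
```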

2. Layerwise Normalization and the Linear Relationship

To address these scaling artifacts, layerwise normalization (scaling each $W^k$ to unit norm) was shown to "factor out" capacity-related contributions, yielding a normalized pre-training loss that is highly predictive of test loss. Once normalized, a tight linear (often near the identity) relationship emerges between training and test cross-entropy losses. Experimental evidence on datasets including CIFAR-10, CIFAR-100, and MNIST demonstrates regression slopes close to one (with small intercepts) and adjusted $R^2$ close to unity.

| Normalization | Training → Test Loss Relationship | Robustness |
| --- | --- | --- |
| None | Weak, capacity-dependent, non-monotonic | Not robust |
| Layerwise | Strong, nearly linear with slope $\approx 1$ | Robust across datasets and norms |

This result holds across different initializations (including varying Gaussian initialization variance and pretraining on corrupted labels), architectures, and norm choices (Frobenius, $L_1$).
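
A minimal sketch of the normalization step itself, under the same toy setup as above (the helper names `layerwise_normalize` and `relu_net` are hypothetical, and the Frobenius norm is used by default): dividing each layer by its norm removes the $\rho_k$ freedom, so a network and a rescaled copy of it report identical normalized losses.

```python
import numpy as np

def layerwise_normalize(weights):
    # Rescale each weight matrix to unit Frobenius norm; other norms
    # (e.g. an entrywise L1 norm) can be substituted here.
    return [W / np.linalg.norm(W) for W in weights]

def relu_net(weights, X):
    h = X
    for W in weights[:-1]:
        h = np.maximum(h @ W.T, 0.0)
    return (h @ weights[-1].T).ravel()

def logistic_loss(scores, y):
    return np.mean(np.log1p(np.exp(-y * scores)))

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))
weights = [rng.normal(size=(32, 10)), rng.normal(size=(1, 32))]
y = np.sign(relu_net(weights, X))      # toy labels the network separates
scaled = [3.0 * W for W in weights]    # same decision rule, inflated scale

# Unnormalized losses differ purely because of scale ...
print(logistic_loss(relu_net(weights, X), y),
      logistic_loss(relu_net(scaled, X), y))
# ... while the layerwise-normalized losses coincide exactly.
print(logistic_loss(relu_net(layerwise_normalize(weights), X), y),
      logistic_loss(relu_net(layerwise_normalize(scaled), X), y))
```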

3. Generalization Bounds and Theoretical Implications

With normalized loss, classical generalization bounds regain practical tightness; empirically, the difference $|E_S(\ell) - E(\ell)|$ is small. For cross-entropy loss with normalization,

$$|E(\ell) - E_S(\ell)| \leq c_1 \cdot \text{Complexity}(f) + c_2 \sqrt{\frac{\ln(1/\delta)}{2N}}$$

where the complexity is a function of the normalized weights and $N$ is the sample size. Thus, normalized pre-training loss provides a near-direct estimator of expected loss.

Theoretical analysis extends to the use of $\psi$-transforms to connect the excess risk $(R(f) - R^*)$ to the normalized cross-entropy loss, for example:

$$\psi(x) = 1 - \sqrt{1 - x^2}$$

where $\psi$ provides a lower bound (though empirical error tracks the bound closely without attaining it exactly).
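
For concreteness, a small numeric sketch of the two ingredients just described: the confidence term $c_2\sqrt{\ln(1/\delta)/2N}$ and the transform $\psi(x) = 1 - \sqrt{1 - x^2}$. The values of $c_2$, $\delta$, and $N$ below are illustrative, not taken from the paper.

```python
import math

def confidence_term(N, delta, c2=1.0):
    # Second term of the bound: c2 * sqrt(ln(1/delta) / (2N)).
    return c2 * math.sqrt(math.log(1.0 / delta) / (2.0 * N))

def psi(x):
    # psi-transform relating excess risk to normalized cross-entropy loss.
    return 1.0 - math.sqrt(1.0 - x ** 2)

for N in (1_000, 50_000, 1_000_000):
    print(f"N={N:>9}, delta=0.05: confidence term = {confidence_term(N, 0.05):.5f}")

for x in (0.1, 0.3, 0.5):
    print(f"psi({x}) = {psi(x):.4f}")
```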

4. Empirical Evidence on Prediction and Monotonicity

Empirical investigation shows that:

  • Pre-training on corrupted labels (introducing label noise before retraining on clean data) creates a family of models with the same final (unnormalized) loss but widely varying test performance.
  • Differently scaled initializations also yield a spectrum of generalization at the same training loss.
  • After normalization, not only do training and test losses align almost perfectly, but the test classification error becomes an approximately monotonic function of normalized loss (i.e., lower normalized loss predicts lower error across the solution space).

These findings are robust: the alignment between normalized training and test loss persists across models, initializations, optimizers, and datasets. The correlation between the product of layerwise norms (interpreted as a capacity surrogate) and test loss further underscores that unnormalized loss mainly reflects network scaling, not decision rule quality.
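
The regression diagnostic behind these claims is straightforward to reproduce. The sketch below fits test loss against normalized training loss across a family of models and reports the slope, intercept, and $R^2$; the arrays are placeholder values for illustration, not measurements from the paper.

```python
import numpy as np

# Hypothetical normalized train/test cross-entropy losses for six trained models.
train_loss = np.array([0.21, 0.34, 0.47, 0.58, 0.66, 0.71])
test_loss  = np.array([0.23, 0.36, 0.48, 0.60, 0.67, 0.73])

slope, intercept = np.polyfit(train_loss, test_loss, deg=1)
pred = slope * train_loss + intercept
ss_res = np.sum((test_loss - pred) ** 2)
ss_tot = np.sum((test_loss - test_loss.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# A slope near 1, a small intercept, and R^2 near 1 are the signature of the
# near-identity relationship reported for layerwise-normalized losses.
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.4f}")
```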

5. Practical Guidance and Implications for Model Selection

For practitioners, these results emphasize that:

  • Raw training loss should not be solely relied upon for model selection, early stopping, or comparison, as it is susceptible to scaling manipulation.
  • Layerwise-normalized training loss is a reliable and empirically validated indicator of test loss, supporting more robust model evaluation and selection.
  • Measures of complexity or model capacity based on products of layer norms appear theoretically justified as indicators of generalization (even in overparameterized networks); a toy computation of this surrogate follows the list.
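
As a toy illustration of that capacity surrogate (weights invented for the example, and `product_of_layer_norms` is a hypothetical helper, not an API from the paper), the product of layerwise norms grows under pure rescaling even though the decision rule is unchanged, which is exactly why unnormalized loss tracks scale rather than decision quality.

```python
import numpy as np

def product_of_layer_norms(weights):
    # Capacity surrogate: product of the Frobenius norms of the layer matrices.
    return float(np.prod([np.linalg.norm(W) for W in weights]))

rng = np.random.default_rng(1)
model_a = [rng.normal(size=(32, 10)) * 0.5, rng.normal(size=(1, 32)) * 0.5]
model_b = [4.0 * W for W in model_a]   # same predictions, inflated scale

print(product_of_layer_norms(model_a))  # baseline value
print(product_of_layer_norms(model_b))  # 16x larger (factor 4 per layer, two layers)
```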

These insights bear on the interpretation of the implicit regularization effects of SGD, the significance of flat minima (since some training practices affect the effective scale of solutions), and the role of architectural and initialization choices.

6. Broader Perspectives and Connections

The normalization principle articulated here has been extended and connected to more recent research. For example, subsequent works have examined the relationship between flat minima and transferability (Liu et al., 2022), curriculum effects of data corruption or initialization, and extensions to Bayesian and probabilistic perspectives on predictive uncertainty (Shwartz-Ziv et al., 2022). The findings continue to motivate new loss metrics and normalization strategies for capacity-independent evaluation of deep models, with the normalized pre-training loss increasingly used as an anchor for generalization analysis, scaling law extrapolation, and resource allocation in large-scale training.

7. Summary

The predictive power of pre-training loss depends critically on proper normalization. Without correcting for the exponential scaling freedom of deep architectures, raw cross-entropy loss is not a reliable indicator of generalization. Normalized training loss, obtained via systematic layerwise norm scaling, exhibits a robust linear relationship with test loss and enables rigorous, tight generalization bounds—guiding both theoretical understanding and practical procedures for model selection, training monitoring, and analysis of deep neural networks (Liao et al., 2018).

Key formula:

$$f(W;x) = \left(\prod_{k=1}^K \rho_k \right) f(\widetilde{W}; x),\quad W^k = \rho_k \widetilde{W}^k,\quad \|\widetilde{W}^k\| = 1$$

$$L = \sum_n \ln\left[1 + \exp\left(-y_n \left(\prod_{k=1}^K \rho_k\right) f(\widetilde{W}; x_n)\right)\right],$$

with normalization ($\rho_k = 1$) yielding $L_{\text{train}} \approx L_{\text{test}}$. This framework underlies the modern view that, after normalization, pre-training loss is a strong predictor of generalization and test performance in deep networks.
