Generalization Error in Machine Learning
- Generalization error is the gap between population risk and training risk, quantifying how well a model performs on new data.
- It is analyzed using capacity measures, mutual information bounds, and stability assessments to control overfitting and ensure robustness.
- Recent advances address overparameterization, double descent, and domain generalization, offering deeper insights into high-dimensional learning.
Generalization error (GE) is the fundamental statistical quantity quantifying the ability of a learned model or predictor to perform well on previously unseen data drawn from the same or related distributions, beyond the training sample. GE is central to statistical learning theory and modern machine learning, as it both measures and bounds the risk of overfitting and is the key target for algorithm design, evaluation, and theoretical analysis. As research has advanced from classical linear models to high-dimensional and deep-learning regimes, the rigorous characterization of GE—via both explicit formulas and upper bounds—has become a multifaceted field leveraging capacity measures, information theory, algorithmic stability, spectral properties, and empirical resampling.
1. Definitions, Formalism, and Alternative Notions
The generalization error is typically defined as the difference between population (expected) risk and empirical (training) risk: for a hypothesis $h$ and loss $\ell$, the population risk is
$$R(h) = \mathbb{E}_{Z \sim \mathcal{D}}\left[\ell(h, Z)\right]$$
and the empirical risk is
$$\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^{n} \ell(h, Z_i)$$
for i.i.d. data $Z_1, \dots, Z_n \sim \mathcal{D}$; the generalization error of $h$ is then $\mathrm{gen}(h) = R(h) - \hat{R}_n(h)$.
Multiple, non-equivalent versions of GE have been identified (Laber et al., 2018):
- The population-optimal GE, $R(h^*) = \inf_{h \in \mathcal{H}} R(h)$, incurred by the best possible classifier in the class $\mathcal{H}$.
- The conditional GE, $R(\hat{h}_n)$, of the fitted rule on fresh data, conditional on the observed training set.
- The algorithmic or expected GE, $\mathbb{E}[R(\hat{h}_n)]$, averaging over training-set randomness.

These functionals need not agree, especially when the classification boundary is nonregular or in the presence of data-dependent estimators.
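These distinct functionals can be estimated by Monte Carlo on simulated data. The sketch below is a minimal illustration on a hypothetical two-class Gaussian problem with a class-mean threshold rule (the data model and classifier are assumptions for illustration, not from the cited work): a large held-out sample stands in for the population, the risk of each fitted rule estimates the conditional GE, and averaging over training draws estimates the expected GE.

```python
import random
import statistics

random.seed(0)

def sample(n, mu=1.0):
    # Two balanced classes; feature x ~ N(y * mu, 1) given label y in {-1, +1}.
    out = []
    for _ in range(n):
        y = random.choice([-1, 1])
        out.append((random.gauss(y * mu, 1.0), y))
    return out

def fit_threshold(train):
    # Fitted rule: threshold at the midpoint of the two class means.
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == -1]
    t = (statistics.mean(pos) + statistics.mean(neg)) / 2
    return lambda x: 1 if x >= t else -1

def risk(h, data):
    # 0-1 risk of classifier h on a dataset.
    return sum(1 for x, y in data if h(x) != y) / len(data)

proxy_population = sample(20_000)    # large sample standing in for the population
gaps, cond_risks = [], []
for _ in range(100):                 # average over training-set randomness
    train = sample(30)
    h = fit_threshold(train)
    R = risk(h, proxy_population)    # conditional risk of this fitted rule
    cond_risks.append(R)
    gaps.append(R - risk(h, train))  # population risk minus training risk

print("expected generalization gap:", statistics.mean(gaps))
print("mean conditional risk      :", statistics.mean(cond_risks))
```

The spread of `cond_risks` across training draws is exactly the gap between the conditional and expected notions: a single draw can be unluckier than the average.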
2. Theoretical Bounds, Key Principles, and Rates
Classical and modern learning theory provide several families of generalization error guarantees and upper bounds. Some of the most prominent frameworks include:
Capacity-Based and Margin Norm Bounds
- VC-Dimension: For a hypothesis class $\mathcal{H}$ with VC-dimension $d$, with probability at least $1 - \delta$,
$$\sup_{h \in \mathcal{H}} \left| R(h) - \hat{R}_n(h) \right| \le O\!\left( \sqrt{\frac{d \log(n/d) + \log(1/\delta)}{n}} \right).$$
- Rademacher Complexity: Replaces VC-dimension with a data-dependent complexity, yielding bounds of similar form.
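The data-dependent quantity behind the Rademacher bound, $\hat{\mathfrak{R}}_n(\mathcal{H}) = \mathbb{E}_\sigma\big[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_i \sigma_i h(x_i)\big]$ with i.i.d. sign variables $\sigma_i$, can be estimated directly by Monte Carlo for a small finite class. A minimal sketch, assuming a toy class of 41 threshold classifiers on Gaussian inputs:

```python
import random

random.seed(0)
n = 50
xs = [random.gauss(0.0, 1.0) for _ in range(n)]

# Finite class: h_t(x) = sign(x - t) over a grid of 41 thresholds.
thresholds = [-2.0 + 0.1 * k for k in range(41)]
H = [[1 if x >= t else -1 for x in xs] for t in thresholds]

def empirical_rademacher(H, n, trials=1000):
    # Monte Carlo estimate of E_sigma[ sup_h (1/n) sum_i sigma_i h(x_i) ].
    total = 0.0
    for _ in range(trials):
        sigma = [random.choice([-1, 1]) for _ in range(n)]
        total += max(sum(s * hx for s, hx in zip(sigma, h)) / n for h in H)
    return total / trials

rad = empirical_rademacher(H, n)
print("empirical Rademacher complexity:", rad)
```

For a finite class the estimate is bounded by Massart's lemma, $\hat{\mathfrak{R}}_n \le \sqrt{2\log|\mathcal{H}|/n}$, which here is about $0.39$.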
Information-Theoretic (Mutual Information) Bounds
- Sample Mutual Information: For a loss that is $\sigma$-sub-Gaussian under the data distribution,
$$\left| \mathbb{E}[\mathrm{gen}] \right| \le \frac{1}{n} \sum_{i=1}^{n} \sqrt{2\sigma^2\, I(W; Z_i)},$$
with $I(W; Z_i)$ the mutual information between the algorithm output $W$ and each datum $Z_i$ (Wu et al., 2023, Pensia et al., 2018).
- Fast-Rate Conditions: When the excess risk is sub-Gaussian or satisfies the central condition, information-theoretic rates can improve from $O(1/\sqrt{n})$ to $O(1/n)$, matching parametric rates in favorable regimes (Wu et al., 2023).
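For a simple estimator the individual-sample bound $\frac{1}{n}\sum_i \sqrt{2\sigma^2 I(W; Z_i)}$ is computable in closed form. The sketch below uses Gaussian mean estimation (an illustrative assumption, not from the cited papers): with i.i.d. Gaussian $Z_i$ and output $W$ the sample mean, $\mathrm{corr}(W, Z_i) = 1/\sqrt{n}$, so $I(W; Z_i) = \tfrac{1}{2}\log\frac{n}{n-1}$, and the bound decays at the expected $O(1/\sqrt{n})$ rate.

```python
import math

def mi_per_sample(n):
    # For i.i.d. Gaussian Z_i and W = sample mean, corr(W, Z_i) = 1/sqrt(n);
    # for a jointly Gaussian pair, I = -0.5 * ln(1 - rho^2) = 0.5 * ln(n/(n-1)).
    return 0.5 * math.log(n / (n - 1))

def mi_gen_bound(n, sigma=1.0):
    # Individual-sample bound (1/n) * sum_i sqrt(2 * sigma^2 * I(W; Z_i));
    # all n terms are identical here, so the average equals one term.
    return math.sqrt(2 * sigma**2 * mi_per_sample(n))

for n in (10, 100, 1000):
    print(n, mi_gen_bound(n))
```

Since $\log\frac{n}{n-1} \approx 1/n$, the bound behaves like $\sigma\sqrt{2/(2n)} \approx \sigma/\sqrt{n}$, recovering the slow rate discussed above.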
Algorithmic Stability
- Uniform Stability: An algorithm with uniform stability $\beta$ satisfies $\left| \mathbb{E}[\mathrm{gen}] \right| \le \beta$, so stability rates of $\beta = O(1/n)$ translate directly into $O(1/n)$ expected generalization error.
- SGD's stochasticity endows it with implicit regularization by decorrelating successive updates; its generalization can be analyzed through the mismatch between the covariance of minibatch gradients and the full-batch gradient, yielding stability-type bounds governed by this gradient-covariance term.
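Stability can be probed empirically by running the same algorithm, with identical randomness, on two "neighboring" datasets that differ in a single datum and measuring how far apart the outputs land. A minimal sketch, assuming SGD on the scalar loss $(w - z)^2/2$ (whose minimizer is the data mean):

```python
import random

def sgd_mean(data, lr=0.05, epochs=20):
    # SGD on the loss (w - z)^2 / 2; the minimizer is the mean of the data.
    w = 0.0
    idx = list(range(len(data)))
    for _ in range(epochs):
        random.shuffle(idx)          # ordering is shared across runs via the seed below
        for i in idx:
            w -= lr * (w - data[i])  # gradient step
    return w

random.seed(0)
n = 100
S = [random.gauss(0.0, 1.0) for _ in range(n)]
S_prime = S[:]
S_prime[0] = S[0] + 5.0              # neighboring dataset: one datum perturbed

random.seed(1); w1 = sgd_mean(S)
random.seed(1); w2 = sgd_mean(S_prime)   # identical randomness, one point changed
print("output difference:", abs(w1 - w2))  # far smaller than the 5.0 perturbation
```

The output difference is a fraction of the perturbation because each update touches the changed point only once per epoch and its influence is geometrically damped, which is the mechanism uniform-stability arguments formalize.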
PAC-Bayesian and Compression Bounds
- PAC-Bayes: For a posterior $Q$ and prior $P$ over hypotheses, risk differences scale with their KL divergence; a McAllester-style form states that with probability at least $1 - \delta$,
$$\mathbb{E}_{h \sim Q}[R(h)] \le \mathbb{E}_{h \sim Q}[\hat{R}_n(h)] + \sqrt{\frac{D_{\mathrm{KL}}(Q \,\|\, P) + \log(n/\delta)}{2(n-1)}}.$$
- Compression: If a network can be compressed to small code length while maintaining low training error, GE can be sharply controlled (Jakubovitz et al., 2018).
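The McAllester-style PAC-Bayes bound is directly computable once the KL term is known. A minimal sketch with scalar Gaussian posterior and prior (the particular means, variances, empirical risk, and sample size are illustrative assumptions):

```python
import math

def kl_gaussians(mu_q, s_q, mu_p, s_p):
    # KL( N(mu_q, s_q^2) || N(mu_p, s_p^2) ) for scalar Gaussians.
    return math.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5

def pac_bayes_bound(emp_risk, kl, n, delta=0.05):
    # McAllester-style bound: R(Q) <= R_hat(Q) + sqrt((KL + ln(n/delta)) / (2(n-1))).
    return emp_risk + math.sqrt((kl + math.log(n / delta)) / (2 * (n - 1)))

# A posterior concentrated near the learned weights vs. a broad prior at zero.
kl = kl_gaussians(mu_q=0.3, s_q=0.1, mu_p=0.0, s_p=1.0)
print("KL(Q||P)       :", kl)
print("PAC-Bayes bound:", pac_bayes_bound(emp_risk=0.05, kl=kl, n=10_000))
```

Note the compression connection: a posterior that can sit close to the prior (small KL, i.e., a short description of the learned model) yields a tight bound, mirroring the compression-based control of GE described above.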
Spectral and Invariance-Based Bounds
- The spectral profile of kernel algorithms directly determines GE; for example, in kernel ridge regression and gradient descent, GE can be written as a quadratic functional of the spectrum, with localization phenomena at certain scales (KRR saturation) and explicit asymptotics under power-law eigenvalue decay (Velikanov et al., 18 Mar 2024).
- For invariant classifiers, GE is proportional to the base-space complexity rather than the input-space complexity, often yielding a reduction by a factor of $\sqrt{|G|}$, where $|G|$ is the order of the symmetry group (Sokolic et al., 2016).
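The spectral viewpoint can be made concrete with the classical fixed-design bias-variance decomposition of kernel ridge regression under power-law decay (a simplified textbook form, not the exact asymptotics of the cited paper): with eigenvalues $\lambda_k \sim k^{-a}$ and target coefficients $c_k^2 \sim k^{-b}$, the GE is a quadratic functional of the spectrum.

```python
def krr_spectral_ge(n, lam, a=2.0, b=2.0, noise=0.1, K=10_000):
    # Eigenvalues lambda_k ~ k^{-a}; squared target coefficients c_k^2 ~ k^{-b}.
    # Fixed-design KRR decomposition:
    #   bias^2   = sum_k (lam / (lambda_k + lam))^2 * c_k^2
    #   variance = (noise^2 / n) * sum_k (lambda_k / (lambda_k + lam))^2
    bias2 = var = 0.0
    for k in range(1, K + 1):
        lk, ck2 = k ** -a, k ** -b
        shrink = lk / (lk + lam)
        bias2 += (1 - shrink) ** 2 * ck2
        var += noise**2 / n * shrink**2
    return bias2 + var

# GE as a function of the ridge parameter: the bias-variance tradeoff is explicit.
for lam in (1e-4, 1e-2, 1.0):
    print(lam, krr_spectral_ge(n=1000, lam=lam))
```

Sweeping `lam` shows over-regularization inflating the bias term, the spectral analogue of the saturation effects discussed above.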
3. Information-Theoretic and Algorithm-Dependent GE Analyses
Recent research incorporates information-theoretic quantities and algorithmic specifics to derive tight, data-dependent generalization bounds:
- Noisy, Iterative Algorithms: For SGLD and related stochastic gradient methods with noisy iterates, the mutual information between dataset and output admits a bound that aggregates per-step contributions; for $L$-Lipschitz loss, step sizes $\eta_t$, and injected-noise variances $\sigma_t^2$, the resulting generalization bound scales as
$$\left| \mathbb{E}[\mathrm{gen}] \right| \lesssim \sqrt{\frac{1}{n} \sum_{t=1}^{T} \frac{\eta_t^2 L^2}{\sigma_t^2}}$$
(Pensia et al., 2018). This approach is robust to pathwise output selection and Markovian (even non-uniform) sampling schedules. For convex or bounded-Lipschitz losses, these bounds can match the optimal rates (Li et al., 2019).
- Method of Gaps: The GE is expressed as an average "gap" (variation in risk expectation) between two measures, leading to closed-form expressions in terms of KL divergences, mutual information, and relative entropies, unifying PAC-Bayes, mutual information, and other known formulas (Perlaza et al., 18 Nov 2024). For Gibbs posteriors, the method reproduces exact mutual- and lautum-information-based formulas.
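For noisy iterative methods such as SGLD, a Pensia et al.-style bound aggregates a per-step information term of order $\eta_t^2 L^2 / \sigma_t^2$ across iterations. The sketch below (constants elided, which is an assumption; only the scaling is shown) evaluates this aggregate for two step-size schedules:

```python
import math

def sgld_gen_bound(n, lrs, sigmas, L=1.0):
    # Schematic SGLD-style aggregation: each noisy step releases at most
    # ~ eta_t^2 * L^2 / sigma_t^2 nats of information about the data, and the
    # generalization bound scales as sqrt((1/n) * sum_t eta_t^2 L^2 / sigma_t^2).
    s = sum((eta**2 * L**2) / (sig**2) for eta, sig in zip(lrs, sigmas))
    return math.sqrt(s / n)

T = 1000
decaying = [0.1 / math.sqrt(t + 1) for t in range(T)]  # eta_t ~ 1/sqrt(t)
constant = [0.1] * T
sigmas = [0.05] * T

print("decaying lr:", sgld_gen_bound(10_000, decaying, sigmas))
print("constant lr:", sgld_gen_bound(10_000, constant, sigmas))
```

The decaying schedule keeps the aggregate sum logarithmic in $T$ rather than linear, which is why such bounds stay non-vacuous over long training runs.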
4. Empirical Estimation and Practical Approaches
GE estimation in practical settings requires resampling techniques, with nontrivial modifications for non-i.i.d. data:
- Clustered, Spatial, and Streaming Data: Standard random cross-validation may yield biased GE estimates under dependent or structured data. Tailored approaches—grouped CV for clusters, spatial blocks or buffered holdout for spatial/temporal data, Horvitz-Thompson correction for sampling weights, stratification for hierarchical outcomes—restore unbiasedness, validated empirically by reductions of 30–50% bias relative to conventional approaches (Hornung et al., 2023).
- Ensembles: For ensembles, analytic formulas for GE and its bias-variance decomposition can be estimated via parametric modeling of base classifier score distributions (Normal/Beta) at each point in feature space, and used to optimize ensemble size and bite subsample size (Mahajan et al., 2017).
- Input Compression and Infinite-Width DNNs: For deep networks in the NTK regime, the generalization error can be empirically upper-bounded via an input-compression term involving the mutual information between inputs $X$ and final-layer representations $T$, of the form
$$\mathrm{GE} \lesssim \sqrt{\frac{2^{I(X;T)} + \log(1/\delta)}{2n}},$$
where $I(X;T)$ can be computed in closed form in the GP/NTK limit (Galloway et al., 2022).
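The cross-validation biases above can be illustrated with a deliberately adversarial toy construction (an assumption for illustration, not the cited study's benchmark): clustered 1-D data where a 1-NN classifier can "cheat" by matching within-cluster neighbors. Random (leave-one-out) CV is then wildly optimistic, while grouped CV, holding out whole clusters, reveals the true failure to generalize across clusters.

```python
def one_nn_error(train, test):
    # 0-1 error of a 1-nearest-neighbor classifier; points are (x, label, group).
    errs = 0
    for x, y, _ in test:
        nearest = min(train, key=lambda p: abs(p[0] - x))
        errs += nearest[1] != y
    return errs / len(test)

# 4 tight clusters at x = 0, 10, 20, 30; label = cluster parity.
data = [(10 * c + 0.1 * i, c % 2, c) for c in range(4) for i in range(10)]

# Random (leave-one-out) CV: the nearest neighbor sits in the same cluster.
loo = sum(one_nn_error([p for p in data if p != q], [q]) for q in data) / len(data)

# Grouped CV: hold out one whole cluster at a time.
gcv = sum(
    one_nn_error([p for p in data if p[2] != c], [p for p in data if p[2] == c])
    for c in range(4)
) / 4

print("random CV error :", loo)  # 0.0 -- optimistic: within-cluster leakage
print("grouped CV error:", gcv)  # 1.0 -- nearest foreign cluster has the opposite label
```

The construction is extreme by design, but the direction of the bias (random CV underestimating GE under cluster dependence) is exactly the effect the grouped-CV corrections address.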
5. Modern Phenomena: Overparameterization, Double Descent, and Structure
A comprehensive characterization of GE in the high-dimensional and overparameterized regime reveals several regimes and sharp transitions:
- Overfitting Peak and Double Descent: In interpolating models, GE diverges at the interpolation threshold (e.g., when the number of parameters equals the sample size), but may subsequently decrease as overparameterization increases; this is the mathematical "double descent" (Mitra, 2019, Emami et al., 2020). Precise risk formulas show that the location and nature of the overfitting spike and the subsequent regimes depend intricately on the penalty (e.g., $\ell_1$ vs. $\ell_2$), noise, and latent model sparsity.
- Spectral Saturation and Universality: In kernel methods, GE is governed by the spectral profile of the data and target, with phenomena such as KRR saturation (failure to achieve minimax rates when spectrum/target is too "smooth" relative to noise) and universality (identical GE curves for a wide class of kernels/structures in the large-sample, noisy regime) (Velikanov et al., 18 Mar 2024).
- Effect of Regularization, Loss, and Initialization: The structure of GE formulas highlights the critical importance of regularization tuning, loss selection, and, to a lesser extent, initialization. For instance, in generalized linear models, the ridge parameter $\lambda$ and the sample-to-dimension ratio $n/d$ precisely determine GE phase transitions, and overparameterization with no regularization can be optimal in high-SNR regimes (Emami et al., 2020).
- Margin, Norm, and Architectural Effects: For neural networks, tighter GE is associated with large margin, small product of spectral norms, and network compressibility (Jakubovitz et al., 2018). PAC-Bayes and spectral margin analyses show margin and depth act multiplicatively, and that algorithmic procedures (e.g., SGD) induce effective regularization and flatness in loss landscapes (Roberts, 2021).
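The overfitting peak at the interpolation threshold can be reproduced in a few lines with minimum-norm least squares on a well-specified linear model (a standard textbook simulation, used here as an illustrative assumption rather than a reproduction of the cited risk formulas): test error spikes near $p = n$ and falls again in the overparameterized regime.

```python
import numpy as np

rng = np.random.default_rng(0)

def test_error_min_norm(n=20, p=10, n_trials=40, sigma=0.5):
    # y = X @ w* + noise; fit by pseudoinverse, which gives the ordinary
    # least-squares solution for p < n and the minimum-norm interpolator for p >= n.
    errs = []
    for _ in range(n_trials):
        w_star = rng.standard_normal(p) / np.sqrt(p)   # unit-scale signal
        X = rng.standard_normal((n, p))
        y = X @ w_star + sigma * rng.standard_normal(n)
        w_hat = np.linalg.pinv(X) @ y                  # minimum-norm solution
        X_test = rng.standard_normal((500, p))
        errs.append(np.mean((X_test @ w_hat - X_test @ w_star) ** 2))
    return float(np.mean(errs))

for p in (5, 20, 100):   # under-, critically, and over-parameterized (n = 20)
    print(p, test_error_min_norm(p=p))
```

At $p = n$ the design matrix is barely invertible and the variance term explodes; past the threshold the minimum-norm solution shrinks the fitted noise, so test error descends again.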
6. Extensions, Open Problems, and Limitations
- Nonregularity and Confidence Sets: For classifiers with discontinuous loss functionals (e.g., 0-1 loss), plug-in and bootstrap CI procedures may be invalid. Data-adaptive projection and bound-based procedures are needed for valid GE inference (Laber et al., 2018).
- Unsupervised and Domain Generalization: In unsupervised settings, GE decomposes into model and data error, with no direct analog of bias-variance; the optimal capacity is driven by the data's intrinsic complexity (Kim et al., 2023). Domain generalization introduces further challenges, but kernel-based bounds with cross-domain Rademacher complexity can be derived, with GE scaling logarithmically in the number of classes (Deshmukh et al., 2019).
- Robustness and Compression: Robustness-based bounds tie GE to local Lipschitz properties or output sensitivity, while deep architectural properties such as invariance and compression ratio yield alternative axes of control.
- Open Problems: No existing theory completely characterizes the coexistence of memorization and generalization in overparameterized deep networks, nor fully unites worst-case (adversarial), data-dependent, and average-case GE (Jakubovitz et al., 2018). Extensions to generative models, multi-stage decision processes, and non-i.i.d. covariate shifts remain active research areas.
In summary, generalization error is the central quantitative target governing both theoretical and applied machine learning. Its rigorous study underpins not only the validity of deployed models, but also the modern understanding of learning algorithms, especially as overparameterization, noise, regularization, and data geometry interact in complex ways. Recent work has unified multiple strands—spectral, information-theoretic, stability-based, and resampling—into a comprehensive, if still evolving, mathematical theory of generalization.