Minimum-Norm Interpolating Estimator
- The minimum-norm interpolating estimator selects, among all fits that exactly match the data, the one with the smallest norm, ensuring minimal complexity and controlled smoothness.
- It reveals surprising generalization properties such as benign overfitting and double-descent risk curves in overparameterized settings like linear regression and RKHS.
- The estimator bridges implicit bias and explicit regularization, providing insights into risk minimization and model stability across diverse functional spaces.
A minimum-norm interpolating estimator is a solution to an interpolation problem where, among all possible fits that exactly match observed data, one selects the candidate with minimal norm according to a specified geometry. These estimators arise across a range of functional and statistical settings, including classical overparameterized linear regression, kernel methods in reproducing kernel Hilbert spaces (RKHS), Banach spaces, and variational function extension problems. Rigorous analysis of such estimators reveals both their surprising generalization properties—such as benign overfitting and double/multiple-descent risk curves—and their structural role as the unique representers of implicit regularization in overparameterized regimes.
1. Formal Definition and General Framework
Given observations $(x_i, y_i)_{i=1}^{n}$, a function space $\mathcal{F}$, and a norm $\|\cdot\|$, the minimum-norm interpolator is defined as $\hat f \in \arg\min\{\|f\| : f \in \mathcal{F},\ f(x_i) = y_i,\ i = 1, \dots, n\}$. In the finite-dimensional setting (e.g., the linear model with design matrix $X \in \mathbb{R}^{n \times d}$, $d > n$), this becomes minimization of the Euclidean, $\ell_1$, or another norm over the solution set of $X\beta = y$. In functional data settings—including Sobolev, RKHS, and Banach interpolants—corresponding norms enforce smoothness or structural simplicity (Chinot et al., 2020, Rangamani et al., 2020, Li, 2020, Herbert-Voss et al., 2014, Chandrasekaran et al., 2017).
Key explicit forms include:
- Linear, $\ell_2$ case: $\hat\beta = X^\top (X X^\top)^{-1} y = X^{+} y$, the minimum-Euclidean-norm solution of $X\beta = y$.
- RKHS case: $\hat f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i)$ where $\alpha = K(X, X)^{-1} y$. The function $\hat f$ uniquely minimizes $\|f\|_{\mathcal{H}}$ among all interpolants (Rangamani et al., 2020, Li, 2020). Both closed forms are illustrated in the sketch following this list.
- Minimum weighted norm / Sobolev extension: Interpolants minimize a seminorm associated with derivative control or spectral weights, often yielding smooth, stable extensions (Herbert-Voss et al., 2014, Chandrasekaran et al., 2017).
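The two closed forms above can be checked directly. The following minimal sketch uses synthetic Gaussian data, an arbitrary Gaussian kernel bandwidth, and illustrative dimensions (none of which come from the cited works); it computes both estimators and verifies exact interpolation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 200                         # overparameterized: d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Linear, l2 case: beta_hat = X^T (X X^T)^{-1} y (Moore-Penrose pseudoinverse solution).
beta_hat = X.T @ np.linalg.solve(X @ X.T, y)
assert np.allclose(X @ beta_hat, y)    # exact interpolation
# np.linalg.pinv(X) @ y yields the same vector, the smallest-Euclidean-norm solution.

# RKHS case with an (assumed) Gaussian kernel: f_hat(x) = sum_i alpha_i K(x, x_i), alpha = K^{-1} y.
def gaussian_kernel(A, B, gamma=0.05):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K, y)

def f_hat(X_new):
    return gaussian_kernel(X_new, X) @ alpha

assert np.allclose(f_hat(X), y)        # the kernel interpolant also fits the data exactly
```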
2. Theoretical Properties: Bias–Variance, Generalization, and Risk Bounds
Linear Regression
In high-dimensional regression, the minimum-norm interpolator achieves, with high probability, a prediction risk bounded by the sum of a bias term and a variance term whose size is governed by the effective dimension $\sum_{i>k} \lambda_i(\Sigma)$, the sum of trailing eigenvalues of the covariance (Chinot et al., 2020, Chinot et al., 2020, Lecué et al., 2022). This decomposition reflects a phase transition:
- High signal-to-noise: The "bias term" dominates, often decaying rapidly with the spectrum.
- Low signal-to-noise: The "variance term" dominates, saturating at the noise level $\sigma^2$; overfitting the noise is "benign" and the prediction error is comparable to the irreducible noise floor.
Analogous bounds hold for other norms and problem structures, with the corresponding dependencies, e.g., logarithmic (for $\ell_1$ under sparsity) or on group/block sizes (group Lasso) (Chinot et al., 2020, Wang et al., 2021, Li et al., 2021).
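As a rough illustration of this phase transition, the following Monte Carlo sketch evaluates the excess prediction risk of the minimum-$\ell_2$-norm interpolator under a spiked covariance; the spectrum, dimensions, and noise levels are illustrative assumptions rather than the settings analyzed in the cited papers.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 5000
eigs = np.full(d, 0.002)                # nearly flat tail of small eigenvalues
eigs[0] = 1.0                           # one strong "signal" direction (spiked covariance)
sqrt_eigs = np.sqrt(eigs)
beta_star = np.zeros(d)
beta_star[0] = 1.0                      # signal aligned with the spike

def mean_excess_risk(sigma, reps=100):
    risks = []
    for _ in range(reps):
        X = rng.standard_normal((n, d)) * sqrt_eigs      # rows ~ N(0, diag(eigs))
        y = X @ beta_star + sigma * rng.standard_normal(n)
        beta_hat = X.T @ np.linalg.solve(X @ X.T, y)     # minimum-norm interpolator
        risks.append(np.sum(eigs * (beta_hat - beta_star) ** 2))  # E[(x^T(beta_hat - beta*))^2]
    return float(np.mean(risks))

for sigma in (0.0, 0.5, 1.0):
    print(f"sigma = {sigma:3.1f}   excess prediction risk ~ {mean_excess_risk(sigma):.3f}")
# At sigma = 1.0 the printed risk stays well below sigma**2 = 1 even though the
# estimator fits the noisy labels exactly: the "benign" variance regime.
```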
RKHS and Nonparametric Regression
For kernel interpolation, the minimum-norm interpolator minimizes the RKHS norm, enjoys optimality properties for leave-one-out stability, and delivers generalization rates via stability-to-risk conversions: $\mathbb{E}_S \big[ I[f^*] - \inf_{f \in \mathcal{H}} I[f] \big] \leq \beta_{CV}$, where $I[\cdot]$ denotes expected risk and the leave-one-out (cross-validation) stability $\beta_{CV}$ is minimized for the minimum-norm solution and controlled by the condition number of the kernel matrix (Rangamani et al., 2020, Li, 2020, Liang et al., 2021). Associated risk curves in high dimension can exhibit "double-descent" or even multiple-descent behavior due to phase transitions in random matrix spectra (1908.10292, Rangamani et al., 2020).
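The leave-one-out stability of the kernel interpolant can be probed numerically. The sketch below (a Laplacian kernel on synthetic 1-d data, with arbitrary bandwidth and sample size) compares the closed-form leave-one-out residuals $\alpha_i / (K^{-1})_{ii}$ against brute-force refits, and reports the kernel matrix condition number that enters the stability bound.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
x = np.linspace(-3.0, 3.0, n)
y = np.sin(x) + 0.1 * rng.standard_normal(n)

def kernel(a, b, gamma=1.0):
    # Laplacian kernel, chosen here for its well-conditioned Gram matrix.
    return np.exp(-gamma * np.abs(a[:, None] - b[None, :]))

K = kernel(x, x)
K_inv = np.linalg.inv(K)
alpha = K_inv @ y                        # minimum-RKHS-norm interpolant: f(t) = kernel(t, x) @ alpha

# Closed-form LOO residuals for the interpolant: y_i - f_{-i}(x_i) = alpha_i / (K^{-1})_{ii}.
loo_closed = alpha / np.diag(K_inv)

# Brute-force check: refit without point i and predict it.
loo_brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    a_i = np.linalg.solve(K[np.ix_(keep, keep)], y[keep])
    loo_brute[i] = y[i] - (kernel(x[i:i + 1], x[keep]) @ a_i)[0]

print("max gap between closed-form and brute-force LOO residuals:",
      np.abs(loo_closed - loo_brute).max())
print("kernel matrix condition number:", np.linalg.cond(K))
# The stability quantity beta_CV in the bound above is tied to this condition number.
```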
Factor Models and Model Structure
If the covariates and responses are generated by a low-rank or factor structure, explicit risk decompositions show that the minimum-norm interpolator can achieve excess risk near the oracle benchmark, provided the effective rank of the covariance $\Sigma$ is smaller than $n$ and the signal loading is strong. In contrast, in the high effective-rank ("junk features") regime, the interpolator's risk approaches that of the null predictor (Bunea et al., 2020, Mahdaviyeh et al., 2019).
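A stylized one-factor simulation illustrates this dichotomy; the loadings, dimensions, and noise level below are illustrative assumptions and not the exact models of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, n_test, sigma = 50, 2000, 2000, 0.5

def test_risk(loading):
    lam = np.full(d, loading)                          # factor loadings on every feature
    f_tr, f_te = rng.standard_normal(n), rng.standard_normal(n_test)
    X = np.outer(f_tr, lam) + rng.standard_normal((n, d))
    X_te = np.outer(f_te, lam) + rng.standard_normal((n_test, d))
    y = f_tr + sigma * rng.standard_normal(n)          # response driven by the latent factor
    y_te = f_te + sigma * rng.standard_normal(n_test)
    beta_hat = X.T @ np.linalg.solve(X @ X.T, y)       # minimum-norm interpolator
    return float(np.mean((X_te @ beta_hat - y_te) ** 2))

print("strong loading:", round(test_risk(1.0), 3), "  (oracle noise floor:", sigma ** 2, ")")
print("junk features :", round(test_risk(0.0), 3), "  (null-predictor risk ~", 1 + sigma ** 2, ")")
```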
3. Geometry, Universality, and Self-Induced Regularization
A robust geometric interpretation separates signal and noise directions in feature space:
- The estimator decomposes into a ridge (regularized) estimator in the leading eigenspaces and an overfitting component on the residual subspace (Lecué et al., 2022); a numerical sketch of this decomposition follows below.
- "Self-induced regularization" arises because the solution must interpolate in-sample noise in a high-dimensional, low-spectral-density subspace; the effective degrees of freedom and estimation error are governed by the spectral decay of the covariance matrix.
- The phenomena and bounds proved are universal across Gaussian and heavy-tailed designs (requiring only a finite number of moments), owing to high-dimensional concentration results and generalizations of the Dvoretzky–Milman theorem.
Benign overfitting: Provided the spectrum is appropriately "spiked" or decays rapidly, the overfitting component's contribution vanishes asymptotically ("benign overfitting") (Lecué et al., 2022, Mahdaviyeh et al., 2019, Chinot et al., 2020).
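The sketch below illustrates the decomposition mentioned above: on a spiked covariance, the leading coordinates of the minimum-norm interpolator approximately coincide with a ridge estimator fit on the leading features whose penalty equals the trailing eigenvalue mass. The dimensions and spectrum are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 50, 5000, 2
eigs = np.full(d, 0.002)
eigs[:k] = 1.0                                    # k spiked directions plus a nearly flat tail
X = rng.standard_normal((n, d)) * np.sqrt(eigs)
y = X[:, :k] @ np.ones(k) + 0.3 * rng.standard_normal(n)

beta_mn = X.T @ np.linalg.solve(X @ X.T, y)       # minimum-norm interpolator

lam = eigs[k:].sum()                              # trailing eigenvalue mass = "self-induced" ridge penalty
X_lead = X[:, :k]
beta_ridge = X_lead.T @ np.linalg.solve(X_lead @ X_lead.T + lam * np.eye(n), y)

print("leading coords of the min-norm interpolator:", np.round(beta_mn[:k], 3))
print("ridge estimator on the leading features    :", np.round(beta_ridge, 3))
print("norm of the overfitting (tail) component   :", round(float(np.linalg.norm(beta_mn[k:])), 3))
```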
4. Extensions, Regularization, and Implicit Bias
Explicit and Implicit Regularization
- Explicit regularization: Adding vanishing penalties to empirical risk minimization enforces convergence to minimum-norm interpolants, as rigorously shown for wide two-layer ReLU neural networks (Park et al., 2023). Exact scaling results dictate the required vanishing rate of weight decay.
- Implicit regularization: Even without any explicit penalty, gradient descent and its variants (SGD, momentum), when initialized appropriately, frequently converge to the minimum-norm or minimum-Barron-seminorm interpolant in function space (Park et al., 2023, Li, 2020). This phenomenon is established both theoretically (via $\Gamma$-convergence) and empirically; the simplest linear instance is sketched below.
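In the simplest linear special case this implicit bias can be verified directly: gradient descent on an underdetermined least-squares objective, initialized at zero, converges to the minimum-$\ell_2$-norm interpolator. The sketch below uses arbitrary synthetic dimensions; the cited results concern richer neural and Barron-norm settings.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 30, 200
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

beta = np.zeros(d)                            # zero initialization keeps iterates in the row space of X
step = 1.0 / np.linalg.norm(X, 2) ** 2        # step size from the largest singular value
for _ in range(2000):
    beta -= step * X.T @ (X @ beta - y)       # gradient of 0.5 * ||X beta - y||^2

beta_mn = np.linalg.pinv(X) @ y               # explicit minimum-norm interpolator
print("training residual                :", np.linalg.norm(X @ beta - y))
print("distance to minimum-norm solution:", np.linalg.norm(beta - beta_mn))
```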
Algorithmic Implications and Batch Partitioning
Naïve minimum-norm interpolation in linear regression can suffer from singularities and double descent near the interpolation threshold $d \approx n$. Batch-based correction (as in the batch minimum-norm estimator) regularizes this behavior, eliminates the double descent, and yields stable risk curves that are monotonic in the overparameterization ratio (Ioushua et al., 2023).
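One plausible instantiation of such a batch-based correction, not necessarily the exact estimator of Ioushua et al. (2023), is to partition the sample, compute a minimum-norm interpolator per batch, and average. The sketch below compares this construction with the single-shot interpolator near the threshold on synthetic isotropic data.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, sigma, n_batches = 100, 110, 0.5, 5       # d ~ n: close to the interpolation threshold
beta_star = np.ones(d) / np.sqrt(d)

def min_norm(X, y):
    return X.T @ np.linalg.solve(X @ X.T, y)

def one_trial():
    X = rng.standard_normal((n, d))
    y = X @ beta_star + sigma * rng.standard_normal(n)
    full = min_norm(X, y)                                        # single-shot interpolator
    batches = np.array_split(np.arange(n), n_batches)
    batch_avg = np.mean([min_norm(X[idx], y[idx]) for idx in batches], axis=0)
    return np.sum((full - beta_star) ** 2), np.sum((batch_avg - beta_star) ** 2)

errs = np.array([one_trial() for _ in range(50)])
print("median estimation error, single-shot:", round(float(np.median(errs[:, 0])), 3))
print("median estimation error, batch-avg  :", round(float(np.median(errs[:, 1])), 3))
# Each batch sub-problem sits well away from the threshold, so the averaged
# estimator avoids the near-singular X X^T that destabilizes the single-shot fit.
```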
5. Consistency, Limitations, and Practical Considerations
Consistency and Optimality
- For interpolation in low effective rank or factor models, asymptotic consistency is achievable.
- For minimum-$\ell_1$-norm interpolation under sparsity and an isotropic design, sharp matching upper and lower bounds are obtained, implying that consistency is possible only if the noise level vanishes sufficiently fast as the sample size grows (Wang et al., 2021); a basis-pursuit sketch follows this list.
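For concreteness, the minimum-$\ell_1$-norm interpolator referenced above can be computed as a basis-pursuit linear program; the sketch below uses an arbitrary synthetic sparse design and the standard $\beta = u - v$ splitting.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(7)
n, d, s = 40, 300, 5
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:s] = 1.0                                   # s-sparse ground truth
y = X @ beta_star + 0.05 * rng.standard_normal(n)

# min ||beta||_1  subject to  X beta = y, with beta = u - v and u, v >= 0.
c = np.ones(2 * d)
A_eq = np.hstack([X, -X])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
beta_l1 = res.x[:d] - res.x[d:]

print("interpolates the data :", np.allclose(X @ beta_l1, y, atol=1e-6))
print("l1 norm of interpolator:", round(float(np.abs(beta_l1).sum()), 3),
      "  l1 norm of truth:", round(float(np.abs(beta_star).sum()), 3))
```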
Uniform Convergence
Classic uniform convergence over norm balls does not explain the consistency of minimum-norm interpolation in the overparameterized regime. However, uniform convergence over the set of zero-error predictors with bounded norm suffices and explains the observed generalization behavior (Zhou et al., 2020).
Non-Optimality and Alternatives
Although the minimum-norm interpolator is optimal in specific senses (e.g., smallest norm among interpolants), it is generally suboptimal when population information is available. Alternative interpolators—optimized for population risk conditional on known or estimable model structure and noise—can provably outperform minimum-norm solutions, especially in pathological spectral regimes (Oravkin et al., 2021).
6. Summary Table: Key Instances of Minimum-Norm Interpolants
| Problem Setting | Solution Definition/Formula | Core Generalization Property |
|---|---|---|
| Linear least-squares ($\ell_2$) | $\hat\beta = X^\top (X X^\top)^{-1} y$ (smallest Euclidean norm subject to $X\beta = y$) | Benign overfitting if effective rank is low |
| RKHS/kernel methods | $\hat f(x) = \sum_i \alpha_i K(x, x_i)$ with $\alpha = K^{-1} y$ (minimum RKHS norm) | Double/multiple descent, stability optimality |
| Sobolev/Banach function extension | Minimize weighted norm or Sobolev seminorm under interpolation | Explicit optimality with unique extension |
| Sparse/interpolating ($\ell_1$) | $\hat\beta$ minimizing the $\ell_1$ norm subject to $X\beta = y$ | Consistency only if the noise vanishes sufficiently fast |
| Two-layer ReLU networks (Barron norm) | Minimize the Barron seminorm among interpolating networks | Implicit bias, function/parameter norm separation |
7. Impact and Open Questions
The theory of minimum-norm interpolating estimators illuminates the role of high-dimensional geometry, implicit/explicit regularization, and spectral structure in modern statistical learning. This understanding underpins the phenomena of benign overfitting, stability risk minimization, and the empirical success of overparameterized models without explicit complexity control. Open questions remain in characterizing universal consistency for more general data and kernel classes, quantifying implicit bias in deeper and non-convex neural architectures, and formulating minimax-optimal population-aware interpolators in practical regimes (Chinot et al., 2020, Chinot et al., 2020, Lecué et al., 2022, Oravkin et al., 2021, Park et al., 2023).