Minimax-Optimal Uniform Approximation Rates
- The article develops a rigorous framework for minimax-optimal uniform approximation rates that capture the sharp decay of sup-norm errors across diverse function classes and data regimes.
- It leverages both classical nonparametric regression techniques and modern deep learning tools to derive rates influenced by smoothness, ambient dimension, sample size, and noise level.
- The analysis unifies deterministic approximation and statistical estimation, offering insights into optimal recovery, derivative control, and adaptive estimator performance.
Minimax-optimal uniform approximation rates quantify the sharpest possible decay of approximation or estimation error, measured in the supremum norm, for a given function class and data regime. These rates capture the optimal tradeoff between function smoothness, ambient dimension, sample size, and, when relevant, noise level or model ill-posedness. This article develops the formulation, classical results, and recent advances in minimax-optimal uniform rates for a wide range of function classes and stochastic models.
1. Formulation and General Principles
For a normed function class $\mathcal{F}$, typically a Hölder or Besov ball on a domain $\Omega \subset \mathbb{R}^d$, the minimax uniform (sup-norm) risk over $n$ samples is defined as

$$R_n^\infty(\mathcal{F}) \;=\; \inf_{\hat f}\, \sup_{f \in \mathcal{F}}\, \mathbb{E}\,\|\hat f - f\|_{L^\infty(\Omega)}$$

for nonparametric estimation, or

$$E_n^\infty(\mathcal{F}) \;=\; \inf_{g \in \Sigma_n}\, \sup_{f \in \mathcal{F}}\, \|f - g\|_{L^\infty(\Omega)}$$

for non-statistical (deterministic) uniform approximation, where $\Sigma_n$ ranges over a class of $n$-term dictionaries (splines, polynomials, neural networks, etc.).
Minimax-optimality means that, up to constants and possibly logarithmic factors, the attained rate of $R_n^\infty(\mathcal{F})$ or $E_n^\infty(\mathcal{F})$ cannot be improved by any algorithm or approximant sequence, uniformly over $\mathcal{F}$.
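The deterministic branch of this definition can be illustrated with a minimal, self-contained sketch (the helper `piecewise_constant_sup_error` and the test function are illustrative, not from the article): approximating a 1-Lipschitz function on $[0,1]$ by $n$-piece piecewise constants realizes the classical $n^{-1}$ sup-norm behavior for the Lipschitz ball.

```python
import math

def piecewise_constant_sup_error(f, n_bins: int, grid: int = 2000) -> float:
    """Sup-norm error (measured on a fine grid) of the midpoint
    piecewise-constant approximation of f on [0, 1] with n_bins pieces."""
    err = 0.0
    for j in range(grid + 1):
        x = j / grid
        k = min(int(x * n_bins), n_bins - 1)  # index of the bin containing x
        mid = (k + 0.5) / n_bins              # bin midpoint used as the constant
        err = max(err, abs(f(x) - f(mid)))
    return err

f = math.sin  # 1-Lipschitz on [0, 1]
e10 = piecewise_constant_sup_error(f, 10)    # bounded by 1/(2*10) = 0.05
e100 = piecewise_constant_sup_error(f, 100)  # bounded by 1/(2*100) = 0.005
```

Doubling the dictionary size halves the worst-case error, which is the sharp $n^{-1}$ rate for this class.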
2. Classical Sup-Norm Minimax Rates in Nonparametric Regression
For nonparametric regression, minimax uniform rates are governed by the smoothness $\beta$ and the effective dimension $d$, with the archetypal result

$$R_n^\infty\big(C^\beta([0,1]^d)\big) \;\asymp\; \left(\frac{\log n}{n}\right)^{\beta/(2\beta + d)}$$

holding for the Hölder class $C^\beta$ on $[0,1]^d$, as established in kernel and spline approximation theory.
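As a quick numeric illustration (a sketch, not taken from any of the cited papers), the classical rate $(\log n / n)^{\beta/(2\beta+d)}$ can be tabulated to show how the exponent degrades with ambient dimension:

```python
import math

def holder_supnorm_rate(n: int, beta: float, d: int) -> float:
    """Classical sup-norm minimax rate (log n / n)^(beta / (2*beta + d))."""
    return (math.log(n) / n) ** (beta / (2 * beta + d))

n = 10_000
# Smoother target in low dimension: exponent 2/(4+1) = 0.4, fast decay.
rate_smooth_lowdim = holder_supnorm_rate(n, beta=2.0, d=1)
# Rough target in high dimension: exponent 1/(2+10) = 1/12, very slow decay.
rate_rough_highdim = holder_supnorm_rate(n, beta=1.0, d=10)
```

The comparison makes the curse of dimensionality in the sup norm concrete: at the same sample size, the high-dimensional rate is an order of magnitude larger.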
For dyadic-dependent outcomes (as in nonparametric dyadic regression), the effective sample size and dimensionality change fundamentally:
- Although $N = n(n-1)/2$ dyadic outcomes are observed among $n$ units, dyadic dependence yields an effective sample size of $n$ and an effective dimension of $d$ (half the dimension of the regressor pair).
- The minimax rate for sup-norm estimation in the dyadic model is
$$\left(\frac{\log n}{n}\right)^{\beta/(2\beta + d)},$$
driven by the effective sample size $n$ and effective dimension $d$, and is matched (up to a logarithmic factor) by a dyadic Nadaraya–Watson kernel estimator, as proven in (Graham et al., 2020).
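A hypothetical numeric check of this effective-sample-size phenomenon (function name and parameter values below are illustrative): with $n$ units the researcher observes $N = n(n-1)/2$ dyads, yet the attainable rate behaves as if only $n$ observations in dimension $d$ were available.

```python
import math

def supnorm_rate(n_eff: int, beta: float, d_eff: int) -> float:
    """Generic sup-norm rate (log n / n)^(beta / (2*beta + d))."""
    return (math.log(n_eff) / n_eff) ** (beta / (2 * beta + d_eff))

n = 500                   # number of sampled units
N = n * (n - 1) // 2      # number of observed dyads
beta, d = 2.0, 2          # smoothness; dimension of a single unit's regressor

# What naive i.i.d. intuition would suggest: N observations in dimension 2d.
naive_rate = supnorm_rate(N, beta, 2 * d)
# The actual effective rate under dyadic dependence (Graham et al., 2020).
dyadic_rate = supnorm_rate(n, beta, d)
```

The naive calculation is overly optimistic: dyadic dependence makes the true rate markedly slower than the pairwise count $N$ would suggest.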
A similar paradigm applies to nonparametric IV regression. Under mildly or severely ill-posed operators (conditional-expectation operators with polynomial or exponential singular-value decay), the minimax sup-norm rate is

$$\left(\frac{n}{\log n}\right)^{-p/(2(p+a)+d)}$$

for mildly ill-posed problems (smoothness $p$, ill-posedness index $a$), and

$$(\log n)^{-p/a}$$

for severely ill-posed problems, as established in (Chen et al., 2015). The series 2SLS ("sieve NPIV") estimator attains these rates.
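To see how severe ill-posedness slows convergence, the two rate formulas (polynomial versus merely logarithmic decay in $n$) can be compared numerically; the parameter values below are illustrative, not from the reference:

```python
import math

def npiv_mild(n: int, p: float, a: float, d: int) -> float:
    """Mildly ill-posed rate: (n / log n)^(-p / (2*(p + a) + d))."""
    return (n / math.log(n)) ** (-p / (2 * (p + a) + d))

def npiv_severe(n: int, p: float, a: float) -> float:
    """Severely ill-posed rate: (log n)^(-p / a)."""
    return math.log(n) ** (-p / a)

# Illustrative parameters: smoothness p, ill-posedness index a, dimension d.
p, a, d = 2.0, 2.0, 1
mild_rate = npiv_mild(10**12, p, a, d)    # polynomial decay in n
severe_rate = npiv_severe(10**12, p, a)   # only logarithmic decay in n
```

At large sample sizes the mildly ill-posed rate is far smaller; under severe ill-posedness even enormous samples buy little accuracy.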
3. Minimax Rates for Uniform Approximation by Dictionaries
In deterministic approximation, optimal polynomial approximations to singular functions such as the checkmark function $f_\alpha(x) = |x - \alpha|$ achieve errors decaying at the rate $n^{-1}$ for large degree $n$, with the limiting constant governed by the Bernstein constant, and where the error curves, viewed as functions of the singularity location $\alpha$, evolve in piecewise-analytic "V-shapes" as characterized in (Dragnev et al., 2021).
For variable dictionaries, such as shallow neural nets with ReLU$^k$ activations, uniform approximation rates on Barron-type (variation) classes are

$$\|f - f_n\|_{L^\infty} \;\lesssim\; n^{-\frac{1}{2} - \frac{2k+1}{2d}}$$

for an $n$-term shallow ReLU$^k$ net, with $k$ the activation order and $d$ the ambient dimension. These exponents are minimax-optimal, matching lower bounds up to constants, without extraneous logarithmic penalties (Siegel, 2023).
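The dimension-explicit exponent $\tfrac{1}{2} + \tfrac{2k+1}{2d}$ can be computed directly (a small sketch, not code from the paper): it always exceeds the $\tfrac{1}{2}$ Monte-Carlo-style baseline, with the dimension-dependent bonus shrinking as $d$ grows.

```python
def siegel_exponent(k: int, d: int) -> float:
    """Sup-norm rate exponent for n-term shallow ReLU^k nets on
    Barron-type classes: error ~ n^(-(1/2 + (2k+1)/(2d)))."""
    return 0.5 + (2 * k + 1) / (2 * d)

exp_relu_d2 = siegel_exponent(k=1, d=2)      # 0.5 + 3/4 = 1.25
exp_relu_d100 = siegel_exponent(k=1, d=100)  # close to the 1/2 baseline
```

Higher activation order $k$ buys a faster rate, but the gain is diluted by $d$, consistent with the high-dimensional character of variation classes.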
For classical Hölder classes approximated by shallow ReLU networks, a different regime appears: uniform approximation rates by width-$n$, weight-bounded networks attain a dimension-dependent polynomial exponent in $n$ for Hölder smoothness in the admissible range (Yang et al., 2023); this exponent is minimax-optimal up to log factors. Nonlinear $n$-widths and VC-dimension arguments confirm this sharpness.
4. Noise-Level-Aware Minimax Uniform Rates and Optimal Recovery
Recent advances reconcile minimax estimation under noise with classical optimal recovery theory, yielding "Noise-Level-Aware" (NLA) rates for minimax sup-norm risk over Besov balls in $d$ dimensions of the form

$$\left(\frac{\sigma^2}{m}\right)^{\frac{s}{2s+d}} + m^{-s/d}$$

(up to logarithmic factors) for $m$ samples and noise variance $\sigma^2$, with $s$ the Besov smoothness. As $\sigma \to 0$, the rate collapses to the classical optimal recovery exponent $m^{-s/d}$ (DeVore et al., 24 Feb 2025). For fixed $\sigma > 0$, the two terms exchange dominance at a $\sigma$-dependent sample size, separating the noise-limited regime from the regime limited primarily by the function class and dimension.
This interpolating rate robustly quantifies the transition between noise-limited statistical learning and noiseless approximation scenarios.
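A hedged sketch of this transition, assuming a two-term bound of the form $(\sigma^2/m)^{s/(2s+d)} + m^{-s/d}$ (the precise constants and logarithmic factors are in the reference; the function name is illustrative):

```python
def nla_rate(m: int, sigma: float, s: float, d: int) -> float:
    """Assumed NLA-style two-term bound: noise term + optimal-recovery term."""
    noise_term = (sigma**2 / m) ** (s / (2 * s + d))
    recovery_term = m ** (-s / d)
    return noise_term + recovery_term

m, s, d = 10_000, 2.0, 2
noisy = nla_rate(m, sigma=1.0, s=s, d=d)
# As sigma -> 0 the rate approaches the noiseless optimal-recovery rate m^(-s/d).
noiseless_limit = nla_rate(m, sigma=0.0, s=s, d=d)
```

With the noise switched off, only the optimal-recovery term survives, recovering the deterministic approximation regime.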
5. Minimax Rates for Score Estimation and Transformers
For high-dimensional generative modeling, minimax uniform approximation rates for conditional score functions by transformer architectures under Hölder smoothness decay polynomially in the discretization (grid) resolution, with an exponent governed by the Hölder smoothness and the total input dimension. The phenomenon persists for conditional diffusion transformers (DiTs) under various assumptions: even one-head, one-block transformers attain the minimax-optimal rates (up to polylogarithmic factors), matching scalar nonparametric regression benchmarks (Hu et al., 2024).
The critical insight is that error exponents are governed strictly by composite smoothness and total input dimensionality, regardless of architecture (provided universal approximation is retained).
6. Simultaneous Uniform Approximation of Functions and Derivatives
For variation classes of shallow neural networks (e.g., Barron-type ReLU$^k$ spaces), the analysis extends to simultaneous uniform sup-norm control of the error and all its derivatives up to a fixed order, with dimension-explicit exponents of the same form as the function-level bounds. This result is the first to give explicit uniform error rates for all derivatives in high dimension without superfluous logarithmic factors (Siegel, 2023).
7. Connections to Adaptive Estimators and Implementation
Adaptivity in minimax risk appears in estimators such as the Kozachenko–Leonenko nearest-neighbor entropy estimator, which attains the minimax rate for differential entropy over Hölder balls,

$$(n \log n)^{-\frac{s}{s+d}} + n^{-1/2},$$

up to logarithmic factors and without knowledge of the smoothness parameter $s$. This adaptivity property, a nearly minimax-optimal rate across smoothness classes and dimensions, is particularly valuable in practical, high-dimensional contexts (Jiao et al., 2017).
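Taking the two-term rate $(n \log n)^{-s/(s+d)} + n^{-1/2}$ at face value, a quick computation shows how the parametric $n^{-1/2}$ term dominates for smooth densities ($s \ge d$) while the nonparametric term governs the rough case (a sketch with illustrative parameters):

```python
import math

def entropy_rate(n: int, s: float, d: int) -> float:
    """Two-term rate (n * log n)^(-s/(s+d)) + n^(-1/2) for entropy estimation."""
    return (n * math.log(n)) ** (-s / (s + d)) + n ** -0.5

n = 10**6
smooth_case = entropy_rate(n, s=2.0, d=1)  # s >= d: n^(-1/2) term dominates
rough_case = entropy_rate(n, s=0.5, d=1)   # s < d: nonparametric term dominates
```

The crossover at $s = d$ (where $s/(s+d) = 1/2$) marks the boundary between the parametric and nonparametric regimes.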
| Function Class / Model | Minimax Sup-norm Rate | Reference |
|---|---|---|
| Hölder $C^\beta$ (scalar regression) | $(\log n / n)^{\beta/(2\beta+d)}$ | (Graham et al., 2020, Yang et al., 2023) |
| Dyadic regression | $(\log n / n)^{\beta/(2\beta+d)}$, effective size $n$, dimension $d$ | (Graham et al., 2020) |
| NPIV, mildly ill-posed | $(n/\log n)^{-p/(2(p+a)+d)}$ | (Chen et al., 2015) |
| NPIV, severely ill-posed | $(\log n)^{-p/a}$ | (Chen et al., 2015) |
| ReLU$^k$ Barron class | $n^{-\frac{1}{2}-\frac{2k+1}{2d}}$ | (Siegel, 2023) |
| Hölder via shallow ReLU nets | dimension-dependent polynomial exponent (see reference) | (Yang et al., 2023, Siegel, 2023) |
| Besov, Noise-Level-Aware | $(\sigma^2/m)^{s/(2s+d)} + m^{-s/d}$ | (DeVore et al., 24 Feb 2025) |
| Conditional DiT score | polynomial in grid resolution (see reference) | (Hu et al., 2024) |
| Entropy over Hölder balls | $(n\log n)^{-s/(s+d)} + n^{-1/2}$ | (Jiao et al., 2017) |
These rates provide a unified framework for quantifying the best attainable uniform approximation in high-dimensional, nonparametric, and noise-limited regimes.
Further Reading
For detail on proofs, constants, and architectures, see:
- (Graham et al., 2020): Nonparametric dyadic regression minimax rates.
- (Siegel, 2023): Dimension-explicit rates for shallow ReLU networks and uniform approximation of derivatives.
- (Yang et al., 2023): Minimax rates for Hölder functions via shallow neural nets; regression implications.
- (Dragnev et al., 2021): Minimax structure for polynomial approximation to checkmark functions.
- (Chen et al., 2015): NPIV minimax rates and convergence under ill-posedness.
- (DeVore et al., 24 Feb 2025): Noise-Level-Aware minimax estimation and optimal recovery for Besov balls.
- (Hu et al., 2024): Minimax-optimal rates for conditional diffusion transformers and score function approximation.
- (Jiao et al., 2017): Adaptive minimax rates for entropy estimation over Hölder classes.