
Minimax-Optimal Uniform Approximation Rates

Updated 5 February 2026
  • The paper develops a rigorous framework for minimax-optimal uniform approximation rates that capture the sharp decay of sup-norm errors across diverse function classes and data regimes.
  • It leverages both classical nonparametric regression techniques and modern deep learning tools to derive rates influenced by smoothness, ambient dimension, sample size, and noise level.
  • The analysis unifies deterministic approximation and statistical estimation, offering insights into optimal recovery, derivative control, and adaptive estimator performance.

Minimax-optimal uniform approximation rates quantify the sharpest possible decay of approximation or estimation error, measured in the supremum norm, for a given function class and data regime. These rates capture the optimal tradeoff between function smoothness, ambient dimension, sample size, and, when relevant, noise level or model ill-posedness. This article develops the formulation, classical results, and recent advances in minimax-optimal uniform rates for a wide range of function classes and stochastic models.

1. Formulation and General Principles

For a normed function class $\mathcal{F}$ (typically a Hölder or Besov ball on a domain $\Omega \subset \mathbb{R}^d$), the minimax uniform (sup-norm) risk over $n$ samples is defined as

$$R_n^*(\mathcal{F}) = \inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} \mathbb{E}\,\|\hat{f}_n - f\|_{\infty}$$

for nonparametric estimation, or

$$E_N^*(\mathcal{F}) = \inf_{f_N \in \mathcal{A}_N} \sup_{f \in \mathcal{F}} \|f_N - f\|_{\infty}$$

for deterministic (non-statistical) uniform approximation, where $\mathcal{A}_N$ is the set of $N$-term approximants built from a fixed dictionary (splines, polynomials, neural networks, etc.).

Minimax-optimality means that, up to constants and possibly logarithmic factors, the attained rate of $R_n^*$ or $E_N^*$ cannot be improved by any algorithm or approximant sequence, uniformly over $\mathcal{F}$.

2. Classical Sup-Norm Minimax Rates in Nonparametric Regression

For nonparametric regression, minimax uniform rates are governed by the smoothness $\alpha$ and the effective dimension $d$, with the archetypal result

$$R_n^*(\mathcal{H}^{\alpha}) \asymp n^{-\alpha/(2\alpha + d)}$$

holding for the Hölder class $\mathcal{H}^{\alpha}$ on $[0,1]^d$, as established in kernel and spline approximation theory.
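As a quick numerical illustration of the curse of dimensionality encoded in this exponent, the following Python sketch (an illustrative computation, not taken from any of the cited papers) evaluates the rate for several ambient dimensions:

```python
def holder_supnorm_rate(n: int, alpha: float, d: int) -> float:
    """Classical minimax sup-norm rate n^{-alpha/(2*alpha + d)} for a
    Holder-alpha ball on [0, 1]^d, up to constants and log factors."""
    return n ** (-alpha / (2 * alpha + d))

# Curse of dimensionality: with fixed smoothness alpha = 2 and n = 10^4
# samples, the guaranteed sup-norm accuracy degrades sharply in d.
for d in (1, 5, 20):
    print(f"d={d:2d}: rate ~ {holder_supnorm_rate(10_000, 2.0, d):.2e}")
```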

For dyadic-dependent outcomes (as in nonparametric dyadic regression), the effective sample size and dimensionality change fundamentally:

  • Although $N(N-1)$ outcomes are observed, dyadic dependence yields effective sample size $N$ and dimension $d_X$ (half the regressor-pair dimension).
  • The minimax rate for sup-norm estimation in the dyadic model is

$$R_*^{(\infty)}(N) \ge C \cdot N^{-\alpha/(2\alpha + d_X)}$$

and is matched (up to a logarithmic factor) by a dyadic Nadaraya–Watson kernel estimator

$$\|\hat{g}_N - g\|_\infty = O_p\!\left((\ln N / N)^{\alpha/(2\alpha + d_X)}\right)$$

as proven in (Graham et al., 2020).
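A minimal NumPy sketch of a dyadic Nadaraya–Watson smoother of this kind is given below. It is an illustrative toy (Gaussian kernel, arbitrary bandwidth, synthetic additive model), not the exact estimator or tuning analyzed in Graham et al. (2020):

```python
import numpy as np

def gaussian_kernel(u, h):
    """Product Gaussian kernel with bandwidth h, applied row-wise."""
    return np.exp(-0.5 * np.sum((u / h) ** 2, axis=-1))

def dyadic_nw(x1, x2, X, Y, h):
    """Dyadic Nadaraya-Watson estimate of g(x1, x2) = E[Y_ij | X_i = x1,
    X_j = x2] from node-level regressors X (N x d_X) and dyadic outcomes
    Y (N x N, diagonal unused). The rate in the text suggests a bandwidth
    h ~ (ln N / N)^{1/(2*alpha + d_X)}; no optimality is claimed here."""
    w1 = gaussian_kernel(X - x1, h)   # kernel weight in the i slot
    w2 = gaussian_kernel(X - x2, h)   # kernel weight in the j slot
    W = np.outer(w1, w2)              # weight for the ordered dyad (i, j)
    np.fill_diagonal(W, 0.0)          # exclude self-loops i == j
    return np.sum(W * Y) / np.sum(W)

# Toy usage: g(x1, x2) = x1 + x2 with noise on each dyad.
rng = np.random.default_rng(0)
N = 200
X = rng.uniform(size=(N, 1))
Y = X + X.T + 0.1 * rng.standard_normal((N, N))  # Y_ij = X_i + X_j + noise
print(dyadic_nw(np.array([0.5]), np.array([0.5]), X, Y, h=0.1))  # ~ 1.0
```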

A similar paradigm applies to nonparametric IV (NPIV) regression; under mildly or severely ill-posed operators ($T$ with polynomial or exponential singular-value decay), the minimax sup-norm rate is

$$\|\hat{h}_n - h_0\|_\infty = O_p\!\left((n/\log n)^{-p/(2(p+a)+d)}\right)$$

for mildly ill-posed, and

$$O_p\!\left((\log n)^{-p/b}\right)$$

for severely ill-posed problems, as established in (Chen et al., 2015). The series 2SLS ("sieve NPIV") estimator attains these rates.
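The following NumPy sketch implements a bare-bones version of such a series 2SLS procedure with polynomial sieves; the basis families, the dimensions J and K, and the toy data-generating process are all illustrative choices, not those of Chen et al. (2015):

```python
import numpy as np

def sieve_npiv(Y, X, W, J=6, K=8):
    """Minimal sketch of a series 2SLS ("sieve NPIV") estimator for
    Y = h0(X) + U with E[U | W] = 0, using polynomial sieves.

    psi: J-dim sieve for the endogenous regressor X
    b:   K-dim sieve for the instrument W (K >= J)
    Returns a callable x -> h_hat(x). In the theory, (J, K) grow with
    the sample size in step with smoothness; here they are fixed."""
    psi = lambda x: np.vander(x, J, increasing=True)  # 1, x, ..., x^{J-1}
    b = lambda w: np.vander(w, K, increasing=True)
    Psi, B = psi(X), b(W)
    PB = B @ np.linalg.pinv(B.T @ B) @ B.T            # projection onto instrument sieve
    c = np.linalg.pinv(Psi.T @ PB @ Psi) @ (Psi.T @ PB @ Y)
    return lambda x: psi(x) @ c

# Toy usage: X is endogenous (correlated with the error through V).
rng = np.random.default_rng(1)
n = 2000
W = rng.uniform(-1, 1, n)
V = rng.standard_normal(n)
X = 0.8 * W + 0.3 * V                  # first stage
Y = np.sin(np.pi * X) + 0.5 * V        # h0(x) = sin(pi x), U = 0.5 V
h_hat = sieve_npiv(Y, X, W)
print(h_hat(np.array([0.0, 0.5])))     # compare with sin(pi * [0, 0.5])
```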

3. Minimax Rates for Uniform Approximation by Dictionaries

In deterministic approximation, optimal polynomial approximations to singular functions such as the checkmark function $|x-\alpha|$ achieve

$$E_n(\alpha) = \frac{\sigma \sqrt{1-\alpha^2}}{n} + o(n^{-1})$$

for large $n$, with $\sigma$ the Bernstein constant, and where the $E_n(\alpha)$ evolve in piecewise-analytic "V-shapes", as characterized in (Dragnev et al., 2021).
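Since the best uniform approximation requires a Remez-type algorithm, the sketch below uses Chebyshev interpolation as a near-best proxy for $E_n(\alpha)$ and compares it against the asymptotic prediction. The numerical value of the Bernstein constant is quoted from the literature, and the comparison is illustrative only:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def near_best_error(alpha, n, grid=20001):
    """Sup-norm error of the degree-n Chebyshev interpolant of |x - alpha|
    on [-1, 1]. Chebyshev interpolation is only near-best, so this is an
    upper proxy for the true best-approximation error E_n(alpha)."""
    k = np.arange(n + 1)
    nodes = np.cos((2 * k + 1) * np.pi / (2 * (n + 1)))  # Chebyshev nodes
    coeffs = C.chebfit(nodes, np.abs(nodes - alpha), n)  # interpolant
    xs = np.linspace(-1, 1, grid)
    return np.max(np.abs(C.chebval(xs, coeffs) - np.abs(xs - alpha)))

# Compare against sigma * sqrt(1 - alpha^2) / n with Bernstein's constant
# sigma ~ 0.2801694; the proxy should track the predicted 1/n decay.
sigma, alpha = 0.2801694, 0.3
for n in (50, 100, 200):
    print(n, near_best_error(alpha, n), sigma * np.sqrt(1 - alpha**2) / n)
```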

For variable dictionaries, such as shallow neural nets or ReLU activation dictionaries, uniform approximation rates on Barron-type (variation) classes are

$$\inf_{f_n \in \Sigma_n(\mathcal{P}_k^d)} \|f - f_n\|_{L^\infty} \le C\, n^{-1/2 - (2k+1)/(2d)}$$

for an $n$-term shallow ReLU$^k$ network, with $k$ the activation order and $d$ the ambient dimension. These exponents are minimax-optimal, matching lower bounds up to constants, without extraneous logarithmic penalties (Siegel, 2023).
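To make the approximant class concrete, here is a small NumPy sketch that fits the outer coefficients of an $n$-term shallow ReLU$^k$ network over randomly drawn inner weights. This random-feature shortcut only illustrates the form of $\Sigma_n$; the minimax-optimal rates above require nonlinear (adaptive) selection of the inner weights:

```python
import numpy as np

def relu_k_features(X, W, b, k=1):
    """Features max(0, X @ W + b)^k of a shallow ReLU^k network."""
    return np.maximum(X @ W + b, 0.0) ** k

def fit_shallow_relu(X, y, N=64, k=1, seed=0):
    """Fit the outer coefficients of an N-term shallow ReLU^k network
    with random inner weights by least squares (illustration only)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, N))
    W /= np.linalg.norm(W, axis=0)        # directions on the unit sphere
    b = rng.uniform(-1, 1, N)
    A = relu_k_features(X, W, b, k)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: relu_k_features(Z, W, b, k) @ coef

# Toy usage in d = 2: sup-error of the fit on a held-out sample.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (2000, 2))
y = np.cos(np.pi * X[:, 0]) * X[:, 1]
f_hat = fit_shallow_relu(X, y, N=128, k=2)
Z = rng.uniform(-1, 1, (5000, 2))
print(np.max(np.abs(f_hat(Z) - np.cos(np.pi * Z[:, 0]) * Z[:, 1])))
```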

For classical Hölder classes approximated by shallow ReLU$^k$ networks, a different regime appears: uniform approximation rates by width-$N$, weight-bounded networks satisfy

$$\|h - f_N\|_\infty = O(N^{-\alpha/d})$$

for $\alpha < (d + 2k + 1)/2$ (Yang et al., 2023); this $N^{-\alpha/d}$ exponent is minimax-optimal up to log factors. Nonlinear $n$-widths and VC-dimension arguments confirm this sharpness.

4. Noise-Level-Aware Minimax Uniform Rates and Optimal Recovery

Recent advances reconcile minimax estimation under noise with classical optimal recovery theory, yielding "noise-level-aware" (NLA) rates for the $L_\infty$ minimax risk over Besov balls $B^s_{p,r}$ in $d$ dimensions:

$$R_m^{(\infty)} = m^{-s/d + 1/p} + \min\left\{1, \left(\frac{\sigma^2}{m}\right)^{s/(2s+d)}\right\}$$

for $m$ samples and noise variance $\sigma^2$ (with $p \leq r \leq \infty$ and $s > d/p$). As $\sigma \to 0$, the rate collapses to the classical optimal recovery rate $m^{-s/d + 1/p}$ (DeVore et al., 24 Feb 2025). For fixed $\sigma > 0$, the two terms trade off: the deterministic term $m^{-s/d + 1/p}$ is the cost of recovering the class from $m$ point samples, the statistical term $(\sigma^2/m)^{s/(2s+d)}$ is the classical nonparametric price of noise, and whichever decays more slowly governs the rate for large $m$.

This interpolating rate robustly quantifies the transition between noise-limited statistical learning and noiseless approximation scenarios.
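A short computation makes the interpolation concrete; the parameter values below are arbitrary illustrations:

```python
def nla_rate_terms(m, s, d, p, sigma):
    """Deterministic and statistical terms of the noise-level-aware
    L-infinity minimax rate over a Besov ball B^s_{p,r}."""
    det = m ** (-(s / d - 1 / p))                         # optimal recovery term
    stat = min(1.0, (sigma ** 2 / m) ** (s / (2 * s + d)))  # noise term
    return det, stat

# As sigma -> 0 the statistical term vanishes and the classical optimal
# recovery exponent -(s/d - 1/p) takes over; for fixed sigma, the term
# with the slower decay governs the large-m rate.
s, d, p = 2.0, 1.0, 2.0
for sigma in (1.0, 1e-2, 0.0):
    det, stat = nla_rate_terms(m=1e6, s=s, d=d, p=p, sigma=sigma)
    print(f"sigma={sigma:5.0e}: deterministic {det:.1e}, statistical {stat:.1e}")
```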

5. Minimax Rates for Score Estimation and Transformers

For high-dimensional generative modeling, minimax uniform approximation rates for conditional score functions $\nabla_x \log p_t(x \mid y)$ by transformer architectures under Hölder smoothness are

$$\epsilon_N^* = \Theta\!\left(N^{-\beta/(d_x + d_y)}\right)$$

where $N$ denotes grid resolution, $\beta$ is the smoothness, and $d_x + d_y$ is the total input dimension. The phenomenon persists for conditional diffusion transformers (DiTs) under various assumptions: one-head, one-block transformers attain the minimax-optimal rates (up to polylogarithmic factors), matching scalar regression benchmarks (Hu et al., 2024).

The critical insight is that error exponents are governed strictly by composite smoothness and total input dimensionality, regardless of architecture (provided universal approximation is retained).
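As a concrete instance of the target object, the conditional score of a Gaussian has a closed form that can be verified by finite differences. This toy check is purely illustrative and is not part of the analysis in Hu et al. (2024):

```python
import numpy as np

def log_pt(x, mu_y, sigma_t):
    """Log-density of p_t(x | y) = N(mu(y), sigma_t^2 I)."""
    d = x.size
    return (-0.5 * np.sum((x - mu_y) ** 2) / sigma_t**2
            - 0.5 * d * np.log(2 * np.pi * sigma_t**2))

def score(x, mu_y, sigma_t):
    """Closed-form conditional score grad_x log p_t(x | y)."""
    return -(x - mu_y) / sigma_t**2

# Finite-difference check of the score at a random point.
rng = np.random.default_rng(0)
x, mu_y, sigma_t, h = rng.standard_normal(4), np.zeros(4), 0.7, 1e-6
fd = np.array([(log_pt(x + h * e, mu_y, sigma_t)
                - log_pt(x - h * e, mu_y, sigma_t)) / (2 * h)
               for e in np.eye(4)])
print(np.allclose(fd, score(x, mu_y, sigma_t), atol=1e-5))  # True
```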

6. Simultaneous Uniform Approximation of Functions and Derivatives

For variation classes of shallow neural networks (e.g., Barron-type ReLU$^k$ spaces), the analysis extends to simultaneous uniform sup-norm control over all derivatives up to order $m \leq k$, with exponents of the same form, $k$ replaced by $k - m$:

$$\sup_{|\alpha| \leq m} \|D^\alpha f - D^\alpha f_n\|_{L^\infty} \leq C\, n^{-1/2 - (2(k-m)+1)/(2d)}$$

so each derivative order costs $1/d$ in the exponent. This result is the first to give explicit uniform error rates for all derivatives in high dimension without superfluous logarithmic factors (Siegel, 2023).

7. Connections to Adaptive Estimators and Implementation

Adaptivity in minimax risk appears in estimators such as the Kozachenko–Leonenko nearest-neighbor entropy estimator, which attains the minimax rate for differential entropy over Hölder balls up to logarithmic factors and without knowledge of the smoothness parameter:

$$R_n \asymp n^{-s/(s+d)} (\ln n)^{-(s+2d)/(s+d)}$$

This adaptivity property, a nearly minimax-optimal rate across smoothness classes and dimensions, is particularly valuable in practical, high-dimensional contexts (Jiao et al., 2017).
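For reference, the textbook form of this estimator fits in a few lines of Python with SciPy. The $k = 1$ variant below is the standard Kozachenko–Leonenko construction; details of the version analyzed in Jiao et al. (2017) may differ:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(X, k=1):
    """Kozachenko-Leonenko k-nearest-neighbor estimate of differential
    entropy (in nats) from an (n, d) sample X. Standard form:
      H_hat = digamma(n) - digamma(k) + log c_d + (d/n) * sum_i log eps_i
    where eps_i is the distance from X_i to its k-th nearest neighbor
    and c_d is the volume of the d-dimensional unit Euclidean ball."""
    n, d = X.shape
    tree = cKDTree(X)
    # k + 1 because the nearest neighbor of X_i in the tree is X_i itself.
    eps = tree.query(X, k=k + 1)[0][:, k]
    log_cd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_cd + d * np.mean(np.log(eps))

# Sanity check: a standard normal in d dimensions has entropy (d/2) log(2*pi*e).
rng = np.random.default_rng(0)
d = 3
X = rng.standard_normal((100_000, d))
print(kl_entropy(X), 0.5 * d * np.log(2 * np.pi * np.e))
```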

| Function Class / Model | Minimax Sup-norm Rate | Reference |
| --- | --- | --- |
| Hölder (scalar regression) | $n^{-\alpha/(2\alpha+d)}$ | (Graham et al., 2020; Yang et al., 2023) |
| Dyadic regression | $N^{-\alpha/(2\alpha+d_X)}$ | (Graham et al., 2020) |
| NPIV, mildly ill-posed | $(n/\log n)^{-p/(2(p+a)+d)}$ | (Chen et al., 2015) |
| NPIV, severely ill-posed | $(\log n)^{-p/b}$ | (Chen et al., 2015) |
| ReLU$^k$ Barron class ($n$-width) | $n^{-1/2-(2k+1)/(2d)}$ | (Siegel, 2023) |
| Hölder via ReLU$^k$ nets | $N^{-\alpha/d}$ | (Yang et al., 2023; Siegel, 2023) |
| Besov, noise-level-aware (NLA) | $m^{-s/d+1/p} + \min\{1, (\sigma^2/m)^{s/(2s+d)}\}$ | (DeVore et al., 24 Feb 2025) |
| Conditional DiT score | $N^{-\beta/(d_x+d_y)}$ | (Hu et al., 2024) |
| Entropy over Hölder balls | $n^{-s/(s+d)}(\ln n)^{-(s+2d)/(s+d)}$ | (Jiao et al., 2017) |

These rates provide a unified framework for quantifying the best attainable uniform approximation in high-dimensional, nonparametric, and noise-limited regimes.

Further Reading

For detail on proofs, constants, and architectures, see the works cited above: Graham et al. (2020); Chen et al. (2015); Dragnev et al. (2021); Siegel (2023); Yang et al. (2023); Hu et al. (2024); DeVore et al. (2025); Jiao et al. (2017).
