Minimax-Optimal Uniform Approximation Rates

Updated 5 February 2026

The paper develops a rigorous framework for minimax-optimal uniform approximation rates that capture the sharp decay of sup-norm errors across diverse function classes and data regimes.
It leverages both classical nonparametric regression techniques and modern deep learning tools to derive rates influenced by smoothness, ambient dimension, sample size, and noise level.
The analysis unifies deterministic approximation and statistical estimation, offering insights into optimal recovery, derivative control, and adaptive estimator performance.

Minimax-optimal uniform approximation rates quantify the sharpest possible decay of approximation or estimation error, measured in the supremum norm, for a given function class and data regime. These rates capture the optimal tradeoff between function smoothness, ambient dimension, sample size, and, when relevant, noise level or model ill-posedness. This article develops the formulation, classical results, and recent advances in minimax-optimal uniform rates for a wide range of function classes and stochastic models.

1. Formulation and General Principles

For a normed function class $\mathcal{F}$ —typically a Hölder or Besov ball on a domain $\Omega \subset \mathbb{R}^d$ —the minimax uniform (sup-norm) risk over $n$ samples is defined as

$R_n^*(\mathcal{F}) = \inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} \mathbb{E} \|\hat{f}_n - f\|_{\infty}$

for nonparametric estimation, or

$E_N^*(\mathcal{F}) = \inf_{f_N \in \mathcal{A}_N} \sup_{f \in \mathcal{F}} \|f_N - f\|_{\infty}$

for non-statistical (deterministic) uniform approximation on $\mathcal{A}_N$ , a set of $N$ -term dictionaries (splines, polynomials, neural networks, etc).

Minimax-optimality means that (up to constants and possibly logarithmic factors) the attained rate of $R_n^*$ or $E_N^*$ cannot be improved for any algorithm or approximant sequence, uniformly over $\mathcal{F}$ .

2. Classical Sup-Norm Minimax Rates in Nonparametric Regression

For nonparametric regression, minimax uniform rates are governed by smoothness $\alpha$ and the effective dimension $d$ —with the archetypal result: $R_n^*(\mathcal{H}^{\alpha}) \asymp n^{-\alpha/(2\alpha + d)}$ holding for the Hölder class $\mathcal{H}^{\alpha}$ on $[0,1]^d$ , as established in kernel and spline approximation theory.

For dyadic-dependent outcomes (as in nonparametric dyadic regression), the effective sample size and dimensionality change fundamentally:

Although $N(N-1)$ outcomes are observed, dyadic dependence yields effective size $N$ and dimension $d_X$ (half the regressor pair dimension).
The minimax rate for sup-norm estimation in the dyadic model is

$R_*^{(\infty)}(N) \ge C \cdot N^{-{\alpha/(2\alpha + d_X)}}$

and is matched (up to a logarithmic factor) by a dyadic Nadaraya–Watson kernel estimator

$\|\hat{g}_N - g\|_\infty = O_p\left((\ln N / N)^{\alpha/(2\alpha + d_X)}\right)$

as proven in (Graham et al., 2020).

A similar paradigm applies to nonparametric IV regression; under mildly or severely ill-posed operators ( $T$ with polynomial or exponential singular value decay), the minimax sup-norm rate is

$\|\hat{h}_n - h_0\|_\infty = O_p\left((n/\log n)^{-p/(2(p+a)+d)}\right)$

for mildly ill-posed, and

$O_p\left((\log n)^{-p/b}\right)$

for severely ill-posed problems, as established in (Chen et al., 2015). The series 2SLS ("sieve NPIV") estimator attains these rates.

3. Minimax Rates for Uniform Approximation by Dictionaries

In deterministic approximation, optimal polynomial approximations to singular functions such as the checkmark $|x-\alpha|$ achieve

$E_n(\alpha) = \frac{\sigma \sqrt{1-\alpha^2}}{n} + o(n^{-1})$

for large $n$ , with $\sigma$ the Bernstein constant, and where the $E_n(\alpha)$ evolve in piecewise-analytic "V-shapes" as characterized in (Dragnev et al., 2021).

For variable dictionaries, such as shallow neural nets or ReLU activation dictionaries, uniform approximation rates on Barron-type (variation) classes are

$\inf_{f_N \in \Sigma_N(\mathcal{P}_k^d)} \|f - f_N\|_{L^\infty} \le C n^{-1/2 - (2k+1)/(2d)}$

for an $n$ -term shallow ReLU $^k$ net, with $k$ the activation order and $d$ the ambient dimension. These exponents are minimax-optimal, matching lower bounds up to constants—without extraneous logarithmic penalties (Siegel, 2023).

For classical Hölder classes mapped by shallow ReLU $^k$ networks, a different regime appears: uniform approximation rates by width- $N$ , weight-bounded networks satisfy

$\|h - f_N\|_\infty = O(N^{-\alpha / d})$

for $\alpha < (d + 2k + 1)/2$ (Yang et al., 2023); this $N^{-\alpha/d}$ exponent is minimax-optimal up to log factors. Nonlinear $n$ -widths and VC-dimension arguments confirm this sharpness.

4. Noise-Level-Aware Minimax Uniform Rates and Optimal Recovery

Recent advances reconcile minimax estimation under noise with classical optimal recovery theory, yielding "Noise-Level-Aware" (NLA) rates for $L_\infty$ minimax risk over Besov balls $B^s_{p, r}$ in $d$ dimensions: $R_m^{(\infty)} = m^{-s/d + 1/p} + \min\left\{1, \left(\frac{\sigma^2}{m}\right)^{s/(2s + d)} \right\}$ for $m$ samples and noise variance $\sigma^2$ (with $p \leq r \leq \infty$ , $s > d/p$ ). As $\sigma \to 0$ , the rate collapses to the classical optimal recovery exponent $m^{-s/d + 1/p}$ (DeVore et al., 24 Feb 2025). For fixed $\sigma > 0$ , the second term dominates until the sample size passes $m \gg \sigma^{-(2s+d)/s}$ , after which the error is limited primarily by the function class and dimension.

This interpolating rate robustly quantifies the transition between noise-limited statistical learning and noiseless approximation scenarios.

5. Minimax Rates for Score Estimation and Transformers

For high-dimensional generative modeling, minimax uniform approximation rates for conditional score functions $\nabla_x \log p_t(x | y)$ by transformer architectures under Hölder smoothness are

$\epsilon_N^* = \Theta(N^{-\beta / (d_x + d_y)})$

where $N$ denotes grid resolution, $\beta$ is smoothness, and $d_x + d_y$ is the total input dimension. The phenomenon persists for conditional diffusion transformers (DiTs) under various assumptions: one-head, one-block transformers attain the minimax-optimal rates (up to polylogarithmic factors), matching scalar regression benchmarks (Hu et al., 2024).

The critical insight is that error exponents are governed strictly by composite smoothness and total input dimensionality, regardless of architecture (provided universal approximation is retained).

6. Simultaneous Uniform Approximation of Functions and Derivatives

For variation classes of shallow neural networks (e.g., Barron-type ReLU $^k$ spaces), the analysis extends to simultaneous uniform sup-norm control over all derivatives up to order $m \leq k$ , with identical minimax exponents: $\sup_{|\alpha| \leq m}\| D^\alpha f - D^\alpha f_N \|_{L^\infty} \leq C n^{-1/2 - (2(k-m) + 1)/2d}$ This result is the first to give explicit uniform error rates for all derivatives in high dimension without superfluous logarithmic factors (Siegel, 2023).

7. Connections to Adaptive Estimators and Implementation

Adaptivity in minimax risk appears in estimators such as the Kozachenko-Leonenko nearest-neighbor entropy estimator, which attains the minimax rate for differential entropy over Hölder balls up to logarithmic factors and without knowledge of the smoothness parameter: $R_n \asymp n^{-s/(s+d)}(\ln n)^{-(s+2d)/(s+d)}$ This adaptivity property—having nearly minimax-optimal rate across smoothness classes and dimensions—is particularly valuable in practical, high-dimensional contexts (Jiao et al., 2017).

Function Class / Model	Minimax Sup-norm Rate	Reference
Hölder (scalar regression)	$n^{-\alpha/(2\alpha+d)}$	(Graham et al., 2020, Yang et al., 2023)
Dyadic regression	$N^{-\alpha/(2\alpha+d_X)}$	(Graham et al., 2020)
NPIV, mildly ill-posed	$(n/\log n)^{-p/(2(p+a)+d)}$	(Chen et al., 2015)
NPIV, severely ill-posed	$(\log n)^{-p/b}$	(Chen et al., 2015)
ReLU $^k$ Barron class ( $n$ -width)	$n^{-1/2-(2k+1)/(2d)}$	(Siegel, 2023)
Hölder via ReLU $^k$ nets	$N^{-\alpha/d}$	(Yang et al., 2023, Siegel, 2023)
Besov–NLA	$m^{-s/d+1/p} + \min\{1, (\sigma^2/m)^{s/(2s+d)}\}$	(DeVore et al., 24 Feb 2025)
Conditional DiT score	$N^{-\beta/(d_x+d_y)}$	(Hu et al., 2024)
Entropy over Hölder balls	$n^{-s/(s+d)}(\ln n)^{-(s+2d)/(s+d)}$	(Jiao et al., 2017)

These rates provide a unified framework for quantifying the best attainable uniform approximation in high-dimensional, nonparametric, and noise-limited regimes.

Minimax-Optimal Uniform Approximation Rates

1. Formulation and General Principles

2. Classical Sup-Norm Minimax Rates in Nonparametric Regression

3. Minimax Rates for Uniform Approximation by Dictionaries

4. Noise-Level-Aware Minimax Uniform Rates and Optimal Recovery

5. Minimax Rates for Score Estimation and Transformers

6. Simultaneous Uniform Approximation of Functions and Derivatives

7. Connections to Adaptive Estimators and Implementation

Further Reading

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Minimax-Optimal Uniform Approximation Rates

1. Formulation and General Principles

2. Classical Sup-Norm Minimax Rates in Nonparametric Regression

3. Minimax Rates for Uniform Approximation by Dictionaries

4. Noise-Level-Aware Minimax Uniform Rates and Optimal Recovery

5. Minimax Rates for Score Estimation and Transformers

6. Simultaneous Uniform Approximation of Functions and Derivatives

7. Connections to Adaptive Estimators and Implementation

Further Reading

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Related Topics