Error Decay in Network Approximation

Updated 4 March 2026

Error decay in network approximation is formalized via approximation spaces that measure how the best-achievable error decreases as network complexity (width, depth, parameters) increases, with regimes ranging from polynomial to geometric rates.
Methodologies leverage direct and inverse Jackson/Bernstein inequalities to relate network architecture and activation functions to the approximation error and target smoothness, grounding the analysis in classical function spaces.
Architectural and training regimes, including gradient descent dynamics, skip-connections, and Taylorized training, critically influence error decay, establishing limits and enabling exponential or root-exponential convergence in various network setups.

Error decay in network approximation quantifies how the best-achievable approximation error for a target function decreases as the complexity of a neural network, measured by its width, depth, or total number of nonzero parameters, increases. Recent research rigorously formalizes this using approximation spaces, establishes sharp decay rates for a wide variety of architectures and function classes, and connects these rates to the network's structural and training properties. The range of possible error-decay rates encompasses polynomial, root-exponential, exponential, and geometric rates, depending on architecture, activation functions, target regularity, width-depth allocations, and algorithmic considerations.

1. Formalization via Approximation Spaces

Error decay is naturally captured by approximation spaces $A^r_q(X, \Sigma)$ , which classify functions $f$ in a normed space $X$ by the rate at which the best-approximation error $E_n(f) = \inf_{g \in \Sigma_n} \|f - g\|_X$ decays as the network complexity $n$ increases, where $\Sigma_n$ is the set of functions that can be realized by networks with complexity at most $n$ and bounded depth (or depth growing as a prescribed function of $n$ ).

The space $A^r(X, \Sigma)$ consists of those $f$ for which $\sup_{n \ge 1} n^r E_n(f) < \infty$ , corresponding to error decay $E_n(f) = O(n^{-r})$ . One can refine this to $A^r_q(X, \Sigma)$ by requiring $(\sum_n [n^r E_n(f)]^q n^{-1})^{1/q} < \infty$ ; for $q = \infty$ this is the same as $A^r$ .
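The membership condition $\sup_{n \ge 1} n^r E_n(f) < \infty$ can be probed numerically when a sequence of best-approximation errors is available. The following is a minimal sketch, assuming synthetic errors $E_n = 3 n^{-1.5}$ rather than measured data; the helper names `estimate_rate` and `weighted_sup` are hypothetical and introduced only for illustration.

```python
import math

def estimate_rate(ns, errors):
    """Least-squares slope of log E_n against log n; returns the estimated exponent r."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errors]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return -sxy / sxx

def weighted_sup(ns, errors, r):
    """Finite-sample proxy for sup_n n^r E_n(f)."""
    return max(n ** r * e for n, e in zip(ns, errors))

# Synthetic errors E_n = 3 n^{-1.5} (an assumed example, not measured data).
ns = [2 ** k for k in range(1, 11)]
errors = [3.0 * n ** -1.5 for n in ns]

r_hat = estimate_rate(ns, errors)
print("estimated r:", round(r_hat, 6))
print("sup n^1.5 E_n:", weighted_sup(ns, errors, 1.5))
print("sup n^2.0 E_n:", weighted_sup(ns, errors, 2.0))
```

For an exponent above the true rate (here $r = 2$), the weighted supremum keeps growing as more values of $n$ are included, which is the finite-sample signature of $f \notin A^r$.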

Under mild hypotheses on the function spaces and network classes (e.g., nestedness, scale-invariance, density), $A^r_q$ are (quasi-)Banach spaces with inclusion relations mirroring those of classical interpolation spaces. Key direct ("Jackson") and inverse ("Bernstein") inequalities ensure that for $f \in A^r$ ,

$E_n(f) \leq C n^{-r} \|f\|_{A^r}, \quad \text{and conversely, } \|g\|_{A^r} \leq C n^r \|g\|_X \text{ for } g \in \Sigma_n.$

This precisely captures the tradeoff between network complexity and error decay (Gribonval et al., 2019).

2. Depth, Activation, and Structural Constraints

The achievable exponent $r$ in $E_n(f) = O(n^{-r})$ is fundamentally limited by depth and activation choice:

Depth Limitation: For ReLU activation, fixed depth $L$ severely restricts attainable $r$ . One cannot have a nontrivial function in $A^r$ for $r > 2\lfloor L/2\rfloor$ (or $r > 2(L-1)$ in neuron-count complexity). Constructing highly oscillatory "sawtooth" functions and counting the number of linear/polynomial pieces per depth provides tight lower bounds (Gribonval et al., 2019).
Activation Functions: For the power activations $\sigma_r(x) = (x_+)^r$ , the approximation spaces $A^r_q(X, \Sigma)$ remain nontrivial for any $r$ and $L \ge 2$ , and spline activations of degree $r$ (with at least one knot) realize the same spaces. In this respect the spaces behave like those of classical piecewise polynomial approximation.
Skip-connections: Allowing generalized layerwise activations (per-neuron "id" or activation) or skip-connections does not alter the resulting approximation spaces $A^r_q$ . Thus, the universality class is robust to such architecture modifications (Gribonval et al., 2019).
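The piece-counting argument behind the sawtooth lower bound can be sketched as follows. This is an illustrative example, not the construction from the cited paper: it composes a hat function (written with ReLU units) with itself and counts the linear pieces of the result, which double with each composition. The function names and the dyadic grid size are assumptions.

```python
def relu(x):
    return max(0.0, x)

def hat(x):
    """Hat function on [0, 1] built from ReLU units: 0 at the endpoints, 1 at x = 1/2."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)

def sawtooth(x, depth):
    """Compose the hat function `depth` times."""
    for _ in range(depth):
        x = hat(x)
    return x

def count_linear_pieces(f, grid_size=4096):
    """Count slope changes of f on a uniform dyadic grid of [0, 1]."""
    h = 1.0 / grid_size
    values = [f(i * h) for i in range(grid_size + 1)]
    slopes = [(values[i + 1] - values[i]) / h for i in range(grid_size)]
    pieces = 1
    for a, b in zip(slopes, slopes[1:]):
        if abs(a - b) > 1e-6:
            pieces += 1
    return pieces

for depth in range(1, 7):
    pieces = count_linear_pieces(lambda x, d=depth: sawtooth(x, d))
    print(depth, pieces)
```

The number of pieces grows like $2^{\text{depth}}$ while the number of units grows only linearly in depth, which is why depth, rather than width alone, governs how oscillatory a ReLU network of a given size can be.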

3. Direct and Inverse Estimates: Relation to Smoothness

For ReLU and higher-power activations, rigorous embeddings relate these network approximation spaces to classical smoothness (e.g., Besov spaces $B^s_{p, q}$ ):

Direct estimate: Using $\sigma_r(x)$ with sufficient depth, functions in $B^{d s}_{p, q}$ can be embedded into $A^s_q$ , providing

$E_n(f) \leq C n^{-s} \quad \text{for } s < \min\{r + 1, 1 + 1/p\}/d.$

For $d=1$ and $L \geq 2$ , $s < r + 1$ is admissible.

Inverse estimate: With finite depth $L$ and ReLU, any embedding $A^r_q(\sigma_1) \hookrightarrow B^s_{*, *}$ requires $s \leq 2\lfloor L/2\rfloor$ , sharply limiting recoverable Besov smoothness (Gribonval et al., 2019).

Approximation rates are thus bounded above (direct) and below (inverse) by the available network complexity and target smoothness, matching the best-known Jackson/Bernstein bounds for splines and wavelets, and linking deep network approximation directly to the classic theory of function spaces.
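As a one-dimensional illustration of a direct estimate, the sketch below builds a one-hidden-layer ReLU network that interpolates a smooth target at $n + 1$ uniform knots and measures how the sup-norm error shrinks as $n$ doubles. The target $\sin$ on $[0, 1]$, the helper names, and the sampling resolution are assumptions chosen for illustration. For such a smooth target the observed exponent is close to 2, the boundary value of the admissible range $s < r + 1$ with $r = 1$.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_interpolant(f, n):
    """One-hidden-layer ReLU network with n units interpolating f at k/n on [0, 1]."""
    knots = [k / n for k in range(n + 1)]
    slopes = [(f(knots[k + 1]) - f(knots[k])) * n for k in range(n)]
    # Output weights are the slope jumps at the knots; the first weight is the initial slope.
    weights = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, n)]
    bias = f(0.0)

    def g(x):
        return bias + sum(w * relu(x - t) for w, t in zip(weights, knots))

    return g

def sup_error(f, g, samples=2000):
    """Maximum absolute error on a uniform sample of [0, 1]."""
    return max(abs(f(i / samples) - g(i / samples)) for i in range(samples + 1))

f = math.sin
errors = {n: sup_error(f, relu_interpolant(f, n)) for n in (4, 8, 16, 32)}
# Observed exponent between consecutive doublings of n.
rates = [math.log2(errors[n] / errors[2 * n]) for n in (4, 8, 16)]
print(errors)
print(rates)
```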

4. Architectural and Training Regimes: Beyond Polynomial Decay

Network approximation rates exhibit various regimes beyond the classical algebraic decay:

Exponential Error Decay: For many classes of smooth or oscillatory functions (e.g., polynomials, sinusoids, certain $C^\infty$ functions, Besov or modulation space balls), deep networks of polylogarithmic depth and finite width achieve exponential decrease:

$\varepsilon \leq C e^{-cM}$

where $M$ is the number of nonzero parameters or connections. This aligns with Kolmogorov-Donoho optimality (Elbrächter et al., 2019).
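A standard example from the approximation literature (usually attributed to Yarotsky, and not taken from this article) shows the mechanism for the target $x^2$ on $[0, 1]$: subtracting scaled compositions of a hat function from $x$ gives an approximant whose sup-norm error is $4^{-(m+1)}$ after $m$ compositions. Each composition adds a fixed number of ReLU units, so the error is exponential in the parameter count for this particular target. The names below are hypothetical.

```python
def relu(x):
    return max(0.0, x)

def hat(x):
    """Hat function on [0, 1] written with ReLU units."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def square_approx(x, m):
    """f_m(x) = x - sum_{s=1}^m g_s(x) / 4^s, with g_s the s-fold composition of hat."""
    result = x
    g = x
    for s in range(1, m + 1):
        g = hat(g)
        result -= g / 4.0 ** s
    return result

def sup_error(m, samples=4096):
    """Maximum error against x^2 on a dyadic grid of [0, 1]."""
    return max(abs(square_approx(i / samples, m) - (i / samples) ** 2)
               for i in range(samples + 1))

for m in range(1, 7):
    print(m, sup_error(m), 4.0 ** -(m + 1))
```

The two printed columns agree, since the maximum error is attained at midpoints of the dyadic intervals, which lie on the sampling grid for these values of $m$.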

Root-exponential and superexponential rates: For Floor–ReLU architectures, uniform approximation of Hölder functions achieves

$\|f - f_N\|_\infty \leq 3\lambda d^{\alpha/2} N^{-\alpha \sqrt{L}}$

with $N$ width and $L$ depth, i.e., error is "reciprocal of width to the power $\sqrt{\text{depth}}$ ," yielding root-exponential convergence in depth and demonstrating substantial mitigation of the curse of dimensionality (Shen et al., 2020).
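To get a feel for the bound, its right-hand side can simply be evaluated for a few depths. This is arithmetic on the quoted formula only; the parameter values ($\lambda = 1$, $\alpha = 1$, $d = 4$, $N = 2$) are assumptions picked so that the numbers are easy to check by hand.

```python
import math

def floor_relu_bound(lam, alpha, d, width, depth):
    """Right-hand side 3 * lam * d^(alpha/2) * N^(-alpha * sqrt(L)) of the quoted bound."""
    return 3.0 * lam * d ** (alpha / 2.0) * width ** (-alpha * math.sqrt(depth))

for depth in (1, 4, 9, 16):
    print(depth, floor_relu_bound(lam=1.0, alpha=1.0, d=4, width=2, depth=depth))
```

With these values the bound halves each time $\sqrt{L}$ increases by one, so reaching a given accuracy requires depth that grows quadratically in the number of halvings.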

Geometric Error Decay: Multilevel neural network architectures (e.g., for AFEM-adapted solvers or extended Galerkin frameworks) can exhibit geometric error decay in the number of network levels/basis functions, e.g., $\|u-u_n\| \le (2\epsilon)^n \|u-u_0\|$ , given subspace enrichment that precisely targets singular structures and regular remainder (Schütte et al., 2024, Ainsworth et al., 2024).
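A geometric bound of the form $(2\epsilon)^n$ translates directly into a level count for a target relative accuracy. The sketch below, with a hypothetical helper `levels_needed`, finds the smallest $n$ with $(2\epsilon)^n \le \text{tol}$ under the assumption $0 < 2\epsilon < 1$.

```python
def levels_needed(eps, tol):
    """Smallest n with (2*eps)^n <= tol, assuming the contraction factor 2*eps is in (0, 1)."""
    rho = 2.0 * eps
    if not 0.0 < rho < 1.0:
        raise ValueError("contraction factor 2*eps must lie in (0, 1)")
    n = 0
    factor = 1.0
    # Multiply up the contraction factor until the relative error bound meets the tolerance.
    while factor > tol:
        factor *= rho
        n += 1
    return n

print(levels_needed(0.25, 1e-3))
print(levels_needed(0.125, 1e-3))
```

The level count grows only logarithmically in $1/\text{tol}$, which is the practical meaning of geometric decay.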

5. Algorithmic and Training Effects on Error Decay

Error decay is also heavily influenced by training algorithms and initialization regimes:

Gradient Descent Dynamics: In the over-parameterized (so-called “lazy” or NTK) regime, gradient descent iteratively applies (in function space) powers of an integral operator $T$ whose eigenvalues control the geometric decay:

$\|e(t)\|_2 \leq (1 - \eta\lambda_\ell)^t \|f^*\|_{L^2} + O(\epsilon(f^*, \ell)) + O(n^{-1/2})$

where $\epsilon(f^*, \ell)$ is the error projecting onto the top- $\ell$ eigenspaces of $T$ (Su et al., 2019).

Empirical Error Decay under Gradient Flow: For networks trained on Sobolev-smooth targets, with constant depth and increasing width, the $L^2$ -error converges at rates $O(m^{-r/[4(r-\beta)]})$ , with exponents determined by smoothness $r$ and the NTK coercivity gap $\beta$ (Welper, 2023).
Taylorized Training: The error of the $k$-th order Taylor expansion (centered at initialization) of a wide network of width $m$ decays exponentially with $k$: it is $O(m^{-k/2}) = O(e^{-\alpha k})$ with $\alpha = \tfrac{1}{2}\ln m$, so only a modest $k$ is needed to close the gap with exact network dynamics for wide architectures (Bai et al., 2020).
Random Initialization Ensemble Methods: In stochastic training with multiple restarts, the overall error can be decomposed into explicit approximation, generalization, and optimization components, with improved exponential rates in depth and logarithmic scaling in width (Jentzen et al., 2020).
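The spectral picture behind the gradient-descent bound can be illustrated in an idealized setting. The sketch below assumes the operator $T$ is already diagonalized, uses a made-up decaying spectrum and made-up target coefficients, and ignores the finite-width term $O(n^{-1/2})$ entirely. It iterates the residual update $r \leftarrow (I - \eta T) r$ and compares the residual norm with $(1 - \eta\lambda_\ell)^t \|f^*\|$ plus the norm of the part of $f^*$ outside the top-$\ell$ eigenspaces.

```python
import math

def residual_norm(eigenvalues, coeffs, eta, steps):
    """L2 norm of the residual after `steps` iterations of r <- (I - eta*T) r."""
    residual = list(coeffs)  # r_0 = f* when the initial prediction is zero
    for _ in range(steps):
        residual = [(1.0 - eta * lam) * r for lam, r in zip(eigenvalues, residual)]
    return math.sqrt(sum(r * r for r in residual))

def spectral_bound(eigenvalues, coeffs, eta, steps, ell):
    """(1 - eta*lambda_ell)^t * ||f*|| plus the norm of f* outside the top-ell eigenspaces."""
    norm_f = math.sqrt(sum(c * c for c in coeffs))
    tail = math.sqrt(sum(c * c for c in coeffs[ell:]))
    return (1.0 - eta * eigenvalues[ell - 1]) ** steps * norm_f + tail

# Assumed spectrum and target coefficients, sorted by decreasing eigenvalue.
eigenvalues = [1.0 / (i + 1) ** 2 for i in range(20)]
coeffs = [1.0 / (i + 1) for i in range(20)]
eta = 0.5

for steps in (0, 10, 100, 1000):
    print(steps,
          residual_norm(eigenvalues, coeffs, eta, steps),
          spectral_bound(eigenvalues, coeffs, eta, steps, ell=5))
```

Choosing a larger $\ell$ shrinks the projection term but slows the geometric factor, since $\lambda_\ell$ decreases with $\ell$; the bound reflects that trade-off.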

6. Limitations, Instabilities, and Lower Bounds

Fundamental limits and instabilities are intrinsic to network approximation:

Stability vs. Expressivity: The "space-filling" property of the manifold of network outputs under parameter optimization confers expressivity but can undermine numerical stability in finding best parameterizations. Stable parameter selection (Lipschitz maps) cannot outperform entropy-bounded rates $n^{-r}$ , while unstable/delicate methods could achieve superrates (log-factor improvements) in some regimes (DeVore et al., 2020).
Curse of Dimensionality and Activation Function: With general activation functions, dimension-independent rates are available: $O(n^{-1/2})$ for activations with polynomial decay and $O(n^{-1/4})$ for bounded, integrable activations. The improved rates obtained by stratified sampling under increased smoothness carry an explicit dependence on $d$ (Siegel et al., 2019).
Width-Depth Trade-offs and Limits: Sharp lower bounds prove that fixed-depth, wide networks cannot achieve the same error decay (for high-frequency or highly oscillatory targets) as deep, narrow networks. Polylogarithmic-depth, finite-width constructions match Kolmogorov-Donoho optimal exponents only if a matching dictionary with polylog-depth implementations exists (Elbrächter et al., 2019).

7. Summary Table: Representative Error Decay Regimes

Network/Function Class	Error Decay Rate	Limiting Factor / Notes	Reference
ReLU, depth $L$ , width $n$	$O(n^{-r}),\ r\leq 2\lfloor L/2\rfloor$	Depth $L$ (sawtooth lower bound)	(Gribonval et al., 2019)
Floor–ReLU, width $N$ , depth $L$	$O(N^{-\alpha \sqrt{L}})$	Root-exponential in depth $L$	(Shen et al., 2020)
Deep (polylog-depth), $M$ params	$O(e^{-cM})$	Exponential in parameter count	(Elbrächter et al., 2019)
Two-layer, poly-decay activation	$O(n^{-1/2})$	Polynomial decay of activation	(Siegel et al., 2019)
Bounded activation	$O(n^{-1/4})$	No decay; $L^\infty$ activation	(Siegel et al., 2019)
Super-convergence (deep, Besov-Sobolev)	$O(N^{-2s/d})$	Requires sufficient depth	(DeVore et al., 2020)
Gradient descent, NTK regime	$O((1-\eta\lambda_\ell)^t)$ (geometric in $t$ )	Controlled by the $\ell$-th eigenvalue of $T$	(Su et al., 2019)
Multilevel adaptive net	$O(\rho^L)$	Geometric in levels ( $\rho < 1$ )	(Schütte et al., 2024)

References

(Gribonval et al., 2019) Approximation spaces of deep neural networks
(Shen et al., 2020) Deep Network with Approximation Error Being Reciprocal of Width to Power of Square Root of Depth
(Elbrächter et al., 2019) Deep Neural Network Approximation Theory
(DeVore et al., 2020) Neural Network Approximation
(Jentzen et al., 2020) Strong overall error analysis for the training of artificial neural networks via random initializations
(Davis et al., 2024) Approximation Error and Complexity Bounds for ReLU Networks on Low-Regular Function Spaces
(Hutter et al., 2025) A Quantifier-Reversal Approximation Paradigm for Recurrent Neural Networks
(Schütte et al., 2024) Adaptive Multilevel Neural Networks for Parametric PDEs with Error Estimation
(Bai et al., 2020) Taylorized Training: Towards Better Approximation of Neural Network Training at Finite Width
(Welper, 2023) Approximation Results for Gradient Descent trained Neural Networks
(Siegel et al., 2019) Approximation Rates for Neural Networks with General Activation Functions
(Anastassiou, 2014) Univariate error function based neural network approximation

A comprehensive understanding of error decay in network approximation thus integrates sharp theoretical decay rates, robust structural principles, links to classical smoothness theory, algorithmic and training effects, as well as explicit limitations imposed by both network and problem class properties.