Error Decay in Network Approximation

Updated 4 March 2026
  • Error Decay in Network Approximation is defined via approximation spaces that measure how the best-achievable error decreases as network complexity (width, depth, parameters) increases, exhibiting regimes from polynomial to geometric rates.
  • Methodologies leverage direct and inverse Jackson/Bernstein inequalities to relate network architecture and activation functions to the approximation error and target smoothness, grounding the analysis in classical function spaces.
  • Architectural and training regimes, including gradient descent dynamics, skip-connections, and Taylorized training, critically influence error decay, establishing limits and enabling exponential or root-exponential convergence in various network setups.

Error decay in network approximation quantifies how the best-achievable approximation error for a target function decreases as the complexity of a neural network, measured by its width, depth, or total number of nonzero parameters, increases. Recent research formalizes this using approximation spaces, establishes sharp decay rates for a wide variety of architectures and function classes, and connects these rates to the network's structural and training properties. The attainable error-decay rates range over polynomial, root-exponential, exponential, and geometric regimes, depending on architecture, activation functions, target regularity, width-depth allocations, and algorithmic considerations.

1. Formalization via Approximation Spaces

Error decay is naturally captured by approximation spaces $A^r_q(X, \Sigma)$, which classify functions $f$ in a normed space $X$ by the rate at which the best-approximation error $E_n(f) = \inf_{g \in \Sigma_n} \|f - g\|_X$ decays as the network complexity $n$ increases, where $\Sigma_n$ is the set of functions that can be realized by networks with complexity at most $n$ and bounded depth (or depth growing as a prescribed function of $n$).

The space $A^r(X, \Sigma)$ consists of those $f$ for which $\sup_{n \ge 1} n^r E_n(f) < \infty$, corresponding to error decay $E_n(f) = O(n^{-r})$. One can refine this to $A^r_q(X, \Sigma)$ by requiring $(\sum_n [n^r E_n(f)]^q n^{-1})^{1/q} < \infty$; for $q = \infty$ this is the same as $A^r$.

Under mild hypotheses on the function spaces and network classes (e.g., nestedness, scale-invariance, density), $A^r_q$ are (quasi-)Banach spaces with inclusion relations mirroring those of classical interpolation spaces. Key direct ("Jackson") and inverse ("Bernstein") inequalities ensure that for $f \in A^r$,

$$E_n(f) \leq C n^{-r} \|f\|_{A^r}, \quad \text{and conversely, } \|g\|_{A^r} \leq C n^r \|g\|_X \text{ for } g \in \Sigma_n.$$

This precisely captures the tradeoff between network complexity and error decay (Gribonval et al., 2019).
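
As a rough illustration of how membership in $A^r$ can be probed numerically, the sketch below fits the exponent $r$ from a finite list of errors $E_n$ and evaluates a truncated version of the $A^r_q$ quantity. The function names and the synthetic error sequence are assumptions made for this example; they do not come from Gribonval et al., and a finite sample can only suggest, not prove, a decay rate.

```python
import math

def fit_decay_exponent(ns, errors):
    """Least-squares slope of log(E_n) against log(n); returns r with E_n ~ C * n**(-r)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return -cov / var

def approx_space_quasinorm(ns, errors, r, q=math.inf):
    """Truncated A^r_q quantity, summed only over the supplied values of n.

    q = inf gives max_n n**r * E_n; finite q gives
    (sum_n [n**r * E_n]**q / n)**(1/q) restricted to the given n.
    """
    weighted = [n ** r * e for n, e in zip(ns, errors)]
    if q == math.inf:
        return max(weighted)
    return sum(w ** q / n for w, n in zip(weighted, ns)) ** (1.0 / q)

# Synthetic errors with a known exponent r = 1.5 (assumed data, for illustration only).
ns = [2 ** k for k in range(1, 11)]
errors = [3.0 * n ** -1.5 for n in ns]
r_hat = fit_decay_exponent(ns, errors)
print("fitted exponent:", r_hat)
print("sup-type quantity:", approx_space_quasinorm(ns, errors, 1.5))
```

In practice the errors would come from trained or constructed networks of increasing size, and the fitted slope is only as reliable as the range of $n$ it covers.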

2. Depth, Activation, and Structural Constraints

The achievable exponent $r$ in $E_n(f) = O(n^{-r})$ is fundamentally limited by depth and activation choice:

  • Depth Limitation: For ReLU activation, fixed depth $L$ severely restricts attainable $r$. One cannot have a nontrivial function in $A^r$ for $r > 2\lfloor L/2\rfloor$ (or $r > 2(L-1)$ in neuron-count complexity). Constructing highly oscillatory "sawtooth" functions and counting the number of linear/polynomial pieces per depth provides tight lower bounds (Gribonval et al., 2019).
  • Activation Functions: For $\sigma_r(x) = (x_+)^r$, the approximation spaces $A^r_q(X, \Sigma)$ remain nontrivial for any $r$ and $L \ge 2$, and for spline activations of degree $r$ (with at least one knot), the same spaces are realized. These approximation spaces coincide with the universality properties of classical piecewise polynomial approximation.
  • Skip-connections: Allowing generalized layerwise activations (per-neuron "id" or activation) or skip-connections does not alter the resulting approximation spaces $A^r_q$. Thus, the universality class is robust to such architecture modifications (Gribonval et al., 2019).
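
The piece-counting argument behind the depth limitation can be made concrete with a small sketch. It builds a sawtooth by composing a hat function written with three ReLU units and counts its linear pieces on a grid. The helper names and the grid size are assumptions for this example; the construction is the standard one for such arguments, not code taken from the cited paper.

```python
def relu(x):
    return max(x, 0.0)

def hat(x):
    """Hat function on [0, 1] from three ReLU units: 2x on [0, 0.5], 2 - 2x on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)

def sawtooth(x, depth):
    """Compose the hat function `depth` times (one width-3 ReLU layer per composition)."""
    for _ in range(depth):
        x = hat(x)
    return x

def count_linear_pieces(depth, grid=4096):
    """Count linear pieces on [0, 1] by detecting slope changes between adjacent grid cells.

    The breakpoints are multiples of 2**(-depth), so a dyadic grid resolves them
    exactly as long as 2**depth <= grid.
    """
    h = 1.0 / grid
    values = [sawtooth(i * h, depth) for i in range(grid + 1)]
    slopes = [(values[i + 1] - values[i]) / h for i in range(grid)]
    changes = sum(1 for a, b in zip(slopes, slopes[1:]) if abs(a - b) > 1e-6)
    return changes + 1

for depth in range(1, 7):
    print(depth, count_linear_pieces(depth))
```

The number of pieces doubles with each added layer while the parameter count grows only linearly in depth, whereas a one-hidden-layer ReLU network on the line with $n$ neurons has at most $n + 1$ pieces. This gap is what the sawtooth lower bounds exploit.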

3. Direct and Inverse Estimates: Relation to Smoothness

For ReLU and higher-power activations, rigorous embeddings relate these network approximation spaces to classical smoothness (e.g., Besov spaces $B^s_{p, q}$):

  • Direct estimate: Using $\sigma_r(x)$ with sufficient depth, functions in $B^{d s}_{p, q}$ can be embedded into $A^s_q$, providing

$$E_n(f) \leq C n^{-s} \quad \text{for } s < \min\{r + 1, 1 + 1/p\}/d.$$

For $d=1$ and $L \geq 2$, $s < r + 1$ is admissible.

  • Inverse estimate: With finite depth $L$ and ReLU, any embedding $A^r_q(\sigma_1) \hookrightarrow B^s_{*, *}$ requires $s \leq 2\lfloor L/2\rfloor$, sharply limiting recoverable Besov smoothness (Gribonval et al., 2019).

Approximation rates are thus bounded above (direct) and below (inverse) by the available network complexity and target smoothness, matching the best-known Jackson/Bernstein bounds for splines and wavelets, and linking deep network approximation directly to the classic theory of function spaces.
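
To see a direct estimate at work in the simplest setting ($d = 1$, ReLU, so $r = 1$), the sketch below measures the error of piecewise-linear interpolation of a smooth function on $n$ equal cells. Such an interpolant can be realized by a one-hidden-layer ReLU network with on the order of $n$ neurons, so its error is an upper bound on $E_n(f)$ up to a constant factor in $n$. The test function, the sampling density, and the function names are assumptions for this example.

```python
import math

def interp_error(f, n, samples_per_cell=50):
    """Sampled max error of the piecewise-linear interpolant of f on n equal cells of [0, 1]."""
    h = 1.0 / n
    worst = 0.0
    for i in range(n):
        a = i * h
        fa, fb = f(a), f(a + h)
        for j in range(samples_per_cell + 1):
            t = j / samples_per_cell
            x = a + t * h
            worst = max(worst, abs(f(x) - ((1.0 - t) * fa + t * fb)))
    return worst

def f(x):
    # Smooth target chosen for illustration.
    return math.sin(2.0 * math.pi * x)

ns = [8, 16, 32, 64, 128]
errs = [interp_error(f, n) for n in ns]
# Observed exponent between consecutive doublings of n.
rates = [math.log(errs[i] / errs[i + 1], 2) for i in range(len(errs) - 1)]
for n, e in zip(ns, errs):
    print(n, e)
print("observed exponents:", rates)
```

The observed exponents approach 2, which matches the endpoint of the admissible range $s < r + 1$ for $r = 1$. The interpolant is only one admissible approximant, so the true best-approximation error can be smaller.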

4. Architectural and Training Regimes: Beyond Polynomial Decay

Network approximation rates exhibit various regimes beyond the classical algebraic decay:

  • Exponential Error Decay: For many classes of smooth or oscillatory functions (e.g., polynomials, sinusoids, certain $C^\infty$ functions, Besov or modulation space balls), deep networks of polylogarithmic depth and finite width achieve exponential decrease:

$$\varepsilon \leq C e^{-cM}$$

where $M$ is the number of nonzero parameters or connections. This aligns with Kolmogorov-Donoho optimality (Elbrächter et al., 2019).

  • Root-exponential and superexponential rates: For Floor–ReLU architectures, uniform approximation of Hölder functions achieves

$$\|f - f_N\|_\infty \leq 3\lambda d^{\alpha/2} N^{-\alpha \sqrt{L}}$$

with $N$ width and $L$ depth, i.e., error is "reciprocal of width to the power $\sqrt{\text{depth}}$," yielding root-exponential convergence in depth and demonstrating substantial mitigation of the curse of dimensionality (Shen et al., 2020).

  • Geometric Error Decay: Multilevel neural network architectures (e.g., for AFEM-adapted solvers or extended Galerkin frameworks) can exhibit geometric error decay in the number of network levels/basis functions, e.g., $\|u-u_n\| \le (2\epsilon)^n \|u-u_0\|$, given subspace enrichment that precisely targets singular structures and regular remainder (Schütte et al., 2024, Ainsworth et al., 2024).
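
The bounds quoted in this section are easy to evaluate directly, which helps in comparing how quickly each one shrinks. The sketch below simply plugs numbers into the Floor–ReLU bound and the geometric contraction bound as written above. The parameter values are arbitrary choices for illustration, and `width` and `depth` stand for the symbols $N$ and $L$ of the formula rather than for an exact architecture specification.

```python
import math

def floor_relu_bound(lam, d, alpha, width, depth):
    """Evaluate 3 * lam * d**(alpha / 2) * width**(-alpha * sqrt(depth))."""
    return 3.0 * lam * d ** (alpha / 2.0) * width ** (-alpha * math.sqrt(depth))

def geometric_bound(eps, level, initial_error=1.0):
    """Evaluate (2 * eps)**level * initial_error; decays only when 2 * eps < 1."""
    return (2.0 * eps) ** level * initial_error

# Assumed values: Hoelder constant 1, dimension 10, exponent alpha = 1, N = 2.
for depth in (1, 4, 16, 64):
    print("L =", depth, "bound =", floor_relu_bound(1.0, 10, 1.0, 2, depth))

for level in (1, 5, 10):
    print("n =", level, "bound =", geometric_bound(0.25, level))
```

Because $L$ enters through $\sqrt{L}$ in the exponent, quadrupling the depth doubles the exponent of $N$, which is the sense in which the rate is root-exponential in depth.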

5. Algorithmic and Training Effects on Error Decay

Error decay is also heavily influenced by training algorithms and initialization regimes:

  • Gradient Descent Dynamics: In the over-parameterized (so-called “lazy” or NTK) regime, gradient descent iteratively applies (in function space) powers of an integral operator $T$ whose eigenvalues control the geometric decay:

$$\|e(t)\|_2 \leq (1 - \eta\lambda_\ell)^t \|f^*\|_{L^2} + O(\epsilon(f^*, \ell)) + O(n^{-1/2})$$

where $\epsilon(f^*, \ell)$ is the error of projecting $f^*$ onto the top-$\ell$ eigenspaces of $T$ (Su et al., 2019).

  • Empirical Error Decay under Gradient Flow: For networks trained on Sobolev-smooth targets, with constant depth and increasing width, the $L^2$-error converges at rates $O(m^{-r/[4(r-\beta)]})$, with exponents determined by smoothness $r$ and the NTK coercivity gap $\beta$ (Welper, 2023).
  • Taylorized Training: The error of the $k$-th order Taylor expansion (centered at initialization) of a wide network decays exponentially with $k$, i.e., $O(m^{-k/2}) = O(e^{-\alpha k})$ with $\alpha = \tfrac{1}{2}\ln m$, showing that only modest $k$ is needed to close the gap with exact network dynamics for wide architectures (Bai et al., 2020).
  • Random Initialization Ensemble Methods: In stochastic training with multiple restarts, the overall error can be decomposed into explicit approximation, generalization, and optimization components, with improved exponential rates in depth and logarithmic scaling in width (Jentzen et al., 2020).
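
The structure of the gradient descent bound can be checked in an idealized setting where the residual is expanded in the eigenbasis of $T$ and each component contracts independently by $(1 - \eta\lambda_i)$ per step. The sketch below does this with a made-up spectrum and made-up target coefficients. It omits the finite-width term $O(n^{-1/2})$ entirely, so it illustrates the first two terms of the bound only and is not a simulation of an actual network.

```python
import math

def residual_norm(eigvals, coeffs, eta, t):
    """Norm of the residual after t steps when component i contracts by (1 - eta * lambda_i)."""
    return math.sqrt(sum(((1.0 - eta * lam) ** t * c) ** 2
                         for lam, c in zip(eigvals, coeffs)))

def bound(eigvals, coeffs, eta, t, ell):
    """(1 - eta * lambda_ell)**t * ||f*|| plus the norm of f* outside the top-ell eigenspaces."""
    full = math.sqrt(sum(c * c for c in coeffs))
    tail = math.sqrt(sum(c * c for c in coeffs[ell:]))
    return (1.0 - eta * eigvals[ell - 1]) ** t * full + tail

# Hypothetical data: eigenvalues sorted in decreasing order, eta * lambda_1 <= 1.
eigvals = [1.0 / (i + 1) ** 2 for i in range(20)]
coeffs = [1.0 / (i + 1) for i in range(20)]
eta = 0.5

for t in (0, 10, 100):
    print(t, residual_norm(eigvals, coeffs, eta, t), bound(eigvals, coeffs, eta, t, ell=5))
```

Choosing a larger $\ell$ shrinks the projection term but slows the geometric factor, since $\lambda_\ell$ decreases with $\ell$; the bound reflects that trade-off.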

6. Limitations, Instabilities, and Lower Bounds

Fundamental limits and instabilities are intrinsic to network approximation:

  • Stability vs. Expressivity: The "space-filling" property of the manifold of network outputs under parameter optimization confers expressivity but can undermine numerical stability in finding best parameterizations. Stable parameter selection (Lipschitz maps) cannot outperform entropy-bounded rates $n^{-r}$, while unstable/delicate methods could achieve superrates (log-factor improvements) in some regimes (DeVore et al., 2020).
  • Curse of Dimensionality and Activation Function: With general activation functions, error decay can be dimension-independent ($O(n^{-1/2})$ for polynomial decay activations or $O(n^{-1/4})$ for bounded, integrable activations), but with explicit $d$-dependence for stratified sampling under increased smoothness (Siegel et al., 2019).
  • Width-Depth Trade-offs and Limits: Sharp lower bounds prove that fixed-depth, wide networks cannot achieve the same error decay (for high-frequency or highly oscillatory targets) as deep, narrow networks. Polylogarithmic-depth, finite-width constructions match Kolmogorov-Donoho optimal exponents only if a matching dictionary with polylog-depth implementations exists (Elbrächter et al., 2019).

7. Summary Table: Representative Error Decay Regimes

| Network/Function Class | Error Decay Rate | Limiting Factor / Notes | Reference |
|---|---|---|---|
| ReLU, depth $L$, width $n$ | $O(n^{-r}),\ r\leq 2\lfloor L/2\rfloor$ | Depth $L$ (sawtooth lower bound) | (Gribonval et al., 2019) |
| Floor–ReLU, width $N$, depth $L$ | $O(N^{-\alpha \sqrt{L}})$ | "Root-exponential" in $\sqrt{L}$ | (Shen et al., 2020) |
| Deep (polylog-depth), $M$ params | $O(e^{-cM})$ | Exponential in parameter count | (Elbrächter et al., 2019) |
| Two-layer, poly-decay activation | $O(n^{-1/2})$ | Polynomial decay of activation | (Siegel et al., 2019) |
| Bounded activation | $O(n^{-1/4})$ | No decay; $L^\infty$ activation | (Siegel et al., 2019) |
| Super-convergence (deep, Besov-Sobolev) | $O(N^{-2s/d})$ | Requires sufficient depth | (DeVore et al., 2020) |
| Gradient descent, NTK regime | $O((1-\eta\lambda_i)^t)$ (geometric in $t$) | Controlled by leading eigenvalue | (Su et al., 2019) |
| Multilevel adaptive net | $O(\rho^L)$ | Geometric in levels ($\rho < 1$) | (Schütte et al., 2024) |
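
One way to read the table is to ask how much complexity each regime needs to reach a given target error. The sketch below inverts three of the rate forms under the simplifying assumption that the hidden constants equal 1 unless specified; the constants, the target error, and the function names are all hypothetical, so the resulting numbers are indicative only.

```python
import math

def size_for_polynomial(eps, r, C=1.0):
    """Smallest n with C * n**(-r) <= eps (polynomial regime)."""
    return math.ceil((C / eps) ** (1.0 / r))

def size_for_exponential(eps, c, C=1.0):
    """Smallest M with C * exp(-c * M) <= eps (exponential regime)."""
    return math.ceil(math.log(C / eps) / c)

def levels_for_geometric(eps, rho):
    """Smallest L with rho**L <= eps, for 0 < rho < 1 (geometric regime)."""
    return math.ceil(math.log(eps) / math.log(rho))

eps = 0.003  # assumed target error
print("polynomial, r = 2:   n >=", size_for_polynomial(eps, r=2.0))
print("polynomial, r = 0.5: n >=", size_for_polynomial(eps, r=0.5))
print("exponential, c = 1:  M >=", size_for_exponential(eps, c=1.0))
print("geometric, rho = .5: L >=", levels_for_geometric(eps, rho=0.5))
```

The polynomial regimes scale like a power of $1/\varepsilon$, while the exponential and geometric regimes scale like $\log(1/\varepsilon)$, which is the practical meaning of the difference between the rows.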

A comprehensive understanding of error decay in network approximation thus integrates sharp theoretical decay rates, robust structural principles, links to classical smoothness theory, algorithmic and training effects, as well as explicit limitations imposed by both network and problem class properties.
