Universal Approximation Theorem for NNs
- The Universal Approximation Theorem is a rigorous statement that neural networks with a single hidden layer and appropriate non-polynomial activations can approximate any continuous function on a compact set.
- Methodological foundations include functional-analytic proofs via the Stone–Weierstrass theorem and measure-theoretic approaches, with quantitative bounds relating network size and approximation accuracy.
- Extensions cover hypercomplex, quantum, dropout, and operator architectures, while addressing limitations like the curse of dimensionality and challenges in practical trainability.
A universal approximation theorem for neural networks is a rigorous mathematical statement asserting that a given neural network architecture—with mild constraints on its activation functions and suitable choice of width, depth, or algebraic structure—can approximate a broad class of target functions to arbitrary accuracy, in an appropriate function norm. This result provides the theoretical basis for much of deep learning, underpinning the expressive power and flexibility observed in practice.
1. Classical Foundations: Formulations and Proofs
The foundational universal approximation theorem (UAT) states that a feedforward neural network with a single hidden layer of sufficient width and a non-polynomial, continuous activation function (e.g., sigmoid, ReLU) is dense in $C(K)$, the space of continuous functions on a compact set $K \subset \mathbb{R}^n$. Explicitly, for any $f \in C(K)$ and $\varepsilon > 0$, there exists a neural network of the form
$$N(x) \;=\; \sum_{i=1}^{m} c_i \, \sigma\!\left(w_i^{\top} x + b_i\right),$$
with $w_i \in \mathbb{R}^n$ and $c_i, b_i \in \mathbb{R}$, such that
$$\sup_{x \in K} \, \lvert f(x) - N(x) \rvert \;<\; \varepsilon.$$
The minimal requirement on $\sigma$ is that it be non-polynomial; this criterion is both necessary and sufficient (Leshno–Pinkus–Schocken). Two mainstream proof techniques are used: (1) a functional-analytic argument using the Stone–Weierstrass theorem and “discriminatory” properties of the activation; (2) a measure-theoretic approach relying on the separation of measures by ridge functions (Nishijima, 2021, Chong, 2020, Augustine, 17 Jul 2024).
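Although the theorem is existential, the construction is easy to illustrate numerically. The following minimal sketch is not drawn from the cited proofs: the target function, the width $m$, and the random inner-weight distribution are arbitrary illustrative choices. It fixes random inner weights and biases, fits only the outer coefficients $c_i$ by least squares, and reports the sup-norm error on a dense grid.

```python
# Minimal numerical sketch of the shallow UAT: approximate f(x) = sin(2*pi*x)
# on [0, 1] by N(x) = sum_i c_i * sigma(w_i * x + b_i).  Inner weights are random
# and fixed; only the outer coefficients are fitted by ordinary least squares.
import numpy as np

rng = np.random.default_rng(0)

def sigma(z):                            # sigmoid: continuous and non-polynomial
    return 1.0 / (1.0 + np.exp(-z))

def f(x):                                # target continuous function on [0, 1]
    return np.sin(2 * np.pi * x)

m = 200                                  # hidden width
x = np.linspace(0.0, 1.0, 1000)          # dense grid standing in for the compact set K
w = rng.normal(scale=10.0, size=m)       # random inner weights w_i
b = rng.uniform(-10.0, 10.0, size=m)     # random biases b_i

Phi = sigma(np.outer(x, w) + b)          # feature matrix: Phi[j, i] = sigma(w_i * x_j + b_i)
c, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)   # outer coefficients c_i

sup_error = np.max(np.abs(Phi @ c - f(x)))
print(f"sup-norm error on the grid with m={m} hidden units: {sup_error:.2e}")
```

Increasing $m$ (and the density of the grid) drives the observed error toward zero, in line with the density statement above.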
Extension beyond real-valued functions is possible for hypercomplex-, vector-, and function-valued outputs, provided the appropriate algebraic conditions are satisfied (see §3).
2. Quantitative Approximation Rates and Parameter Complexity
The UAT is existential but can be strengthened with quantitative estimates for network architecture size:
- For polynomial target functions of degree $d$ in $n$ variables, an explicit finite number of hidden units, depending only on $d$ and $n$, suffices (Chong, 2020).
- For general continuous targets on a compact subset of $\mathbb{R}^n$, the width required to achieve a prescribed sup-norm error grows rapidly with the dimension $n$ (the curse of dimensionality).
- For functions in the Barron class (finite first Fourier moment), two-layer networks with $m$ hidden units achieve error $O(m^{-1/2})$, independent of the input dimension, so no curse of dimensionality appears (Nishijima, 2021); the bound is made precise below.
- For Hamiltonian Deep Neural Networks, the same approximation bounds are recovered; e.g., for a target $f$ with finite Fourier moment $C_f$, the error decays as $O(C_f\, m^{-1/2})$ in the number of units $m$ (Zakwan et al., 2023).
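For orientation, the Barron-class items above refer to bounds of the following classical form, stated here in a standard normalization; the exact constants and norms differ across the cited works. If the target $f$ has finite first Fourier moment
$$C_f \;=\; \int_{\mathbb{R}^n} \lVert \omega \rVert \,\lvert \hat f(\omega) \rvert \, d\omega \;<\; \infty,$$
then for every probability measure $\mu$ supported on the ball $B_r \subset \mathbb{R}^n$ there is a one-hidden-layer sigmoidal network $f_m(x) = \sum_{k=1}^{m} c_k\,\sigma(w_k^{\top} x + b_k)$ with
$$\lVert f - f_m \rVert_{L^2(\mu)} \;\le\; \frac{2 r\, C_f}{\sqrt{m}},$$
so the rate in $m$ does not depend on the input dimension.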
Constructive upper bounds have been established under strong weight constraints (e.g., last-layer weights arbitrarily small, first-layer randomly fixed, or only large weights allowed), and via explicit algebraic constructions (Chong, 2020).
3. Generalizations: Algebraic, Quantum, and Non-Euclidean Inputs
Universal approximation extends to much broader scenarios:
- Hypercomplex- and vector-valued networks: If the network is constructed over a non-degenerate finite-dimensional real algebra (e.g., complex numbers, quaternions, Clifford algebras, tessarines), the UAT extends verbatim. The only algebraic condition is that all multiplication bilinear forms are non-degenerate, ensuring the network can “steer” every output direction. Split-activation functions (component-wise scalar activations) guarantee density in the space of continuous algebra-valued functions on compact sets (Valle et al., 4 Jan 2024, Vital et al., 2022); a numerical sketch over the complex numbers appears at the end of this section.
- Topological vector space (TVS) inputs: For any TVS $X$ satisfying the Hahn–Banach extension property, a shallow network with hidden units of the form $\sigma(F(x) + b)$, with $F$ in the continuous dual $X^{*}$ and $\sigma$ non-polynomial, is dense in $C(K)$ for every compact $K \subset X$. This unifies classical multivariable, sequence-space, and function-space inputs under a single analytic framework (Ismailov, 19 Sep 2024).
| Context | Key Algebraic Condition | Required Activation |
|---|---|---|
| Real-valued networks on $\mathbb{R}^n$ | None beyond the field structure of $\mathbb{R}$ | Continuous, non-polynomial |
| Hypercomplex: $\mathbb{C}$, $\mathbb{H}$, Clifford, tessarines | Non-degenerate algebra | Split, non-polynomial |
| TVS inputs | Hahn–Banach extension property | Continuous, non-polynomial |
In each framework, the heart of universality is the richness of the ridge family in separating points and spanning the ambient space, together with algebraic non-degeneracy guaranteeing all directions can be realized.
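As a concrete instance of the hypercomplex case in the table, the sketch below works over the simplest non-degenerate algebra, the complex numbers, with a split (component-wise) activation. The target function, width, and weight distributions are arbitrary illustrative choices, not taken from the cited constructions.

```python
# Illustrative split-activation sketch over the complex numbers: approximate a
# continuous C -> C target with a shallow complex-valued network whose activation
# applies tanh separately to the real and imaginary parts ("split" activation).
import numpy as np

rng = np.random.default_rng(1)

def split_tanh(z):                        # split activation: component-wise tanh
    return np.tanh(z.real) + 1j * np.tanh(z.imag)

def f(z):                                 # target: continuous complex-valued function
    return z * np.exp(1j * np.abs(z))

m = 300                                   # hidden width
z = rng.uniform(-1, 1, 2000) + 1j * rng.uniform(-1, 1, 2000)   # points in a compact set
w = rng.normal(size=m) + 1j * rng.normal(size=m)               # complex inner weights
b = rng.normal(size=m) + 1j * rng.normal(size=m)               # complex biases

Phi = split_tanh(np.outer(z, w) + b)      # hidden-layer outputs, shape (2000, m)
c, *_ = np.linalg.lstsq(Phi, f(z), rcond=None)                 # complex outer weights

err = np.max(np.abs(Phi @ c - f(z)))
print(f"max error on sampled points with m={m} units: {err:.2e}")
```

Because the complex multiplication is non-degenerate and the activation acts component-wise, fitting only the outer weights already drives the error down as the width grows, mirroring the real-valued case.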
4. Architectural and Algorithmic Variants
A series of results demonstrate robustness of the universal approximation property across training regimes, architectures, function classes, and algebraic settings:
- Deep networks: Depth–width trade-offs clarify that networks of arbitrary depth and minimal width can also be universal, provided the width exceeds a threshold close to the input dimension $n$ (e.g., roughly $n+1$ for scalar-output ReLU nets on $\mathbb{R}^n$) (Augustine, 17 Jul 2024).
- Operator neural networks: Arbitrary-depth, width-$5$ NNs can approximate continuous nonlinear operators between function spaces when the activation is non-polynomial and continuously differentiable with non-zero derivative at some point (Yu et al., 2021).
- Quantum neural networks: Parameterized quantum circuits can approximate Barron-class functions using an explicitly bounded number of tunable parameters and qubits, with explicit error bounds (Gonon et al., 2023).
- Dropout and binarization: UATs extend to random dropout NNs (in both stochastic and expectation modes; a numerical sketch follows this list) and to Binarized NNs (BNNs). BNNs with a single hidden layer are universal for binary inputs, but two layers are required for real-valued, Lipschitz-continuous targets on compact sets (Manita et al., 2020, Yayla et al., 2021).
- One-bit (quantized) networks: For admissible target functions, one-bit quadratic or ReLU networks can approximate to any prescribed uniform accuracy (away from the domain boundary), with explicit parameter counts in the quadratic and ReLU cases (Güntürk et al., 2021).
- Floating-point computation: Floating-point NNs retain universal expressivity: for every rounded target function $f$, there exists a floating-point NN whose (interval) output matches the direct image map of $f$ over any box, exactly capturing program semantics at the bit level (Hwang et al., 19 Jun 2025).
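The dropout item above can be checked directly in its expectation mode: with inverted dropout, the mask-averaged network coincides with the underlying deterministic network, so approximation in expectation reduces to the classical UAT. The sketch below is a Monte Carlo check of that identity on a randomly initialized ReLU network; the widths, weights, and keep probability are arbitrary illustrative choices, not the cited construction.

```python
# Monte Carlo check: the expectation over inverted-dropout masks equals the
# deterministic network output, since E[mask]/p = 1 and the output layer is linear.
import numpy as np

rng = np.random.default_rng(2)

m, p = 64, 0.8                                  # hidden width, keep probability
W1 = rng.normal(size=(m, 3)); b1 = rng.normal(size=m)
W2 = rng.normal(size=m)

def hidden(x):
    return np.maximum(W1 @ x + b1, 0.0)         # ReLU hidden layer

def deterministic(x):
    return W2 @ hidden(x)

def dropout_forward(x):
    mask = rng.random(m) < p                    # Bernoulli(p) mask on hidden units
    return W2 @ (hidden(x) * mask / p)          # inverted-dropout rescaling by 1/p

x = rng.normal(size=3)
mc = np.mean([dropout_forward(x) for _ in range(200_000)])
print(f"deterministic output:              {deterministic(x):+.4f}")
print(f"Monte Carlo mean over dropout masks: {mc:+.4f}")
```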
5. Beyond Uniform Approximation: Uniformity over Measures and Probability Distributions
The classical UAT concerns approximation in sup-norm over compact domains. Recent results broaden this to:
- Robust UAT in Orlicz spaces: Neural networks are dense in Orlicz spaces, and the approximation property can be made uniform over weakly compact sets of probability measures (distributional robustness). This includes the standard $L^p$ spaces as special cases and permits uniform error control under model/dataset shifts, adversarial perturbations, or heavy-tailed data (Ceylan et al., 10 Oct 2025).
- Distributional UAT: Neural networks can approximate push-forward transformations between probability distributions. For any target law $\pi$ and source $\mu$ on $\mathbb{R}^d$, there exists a deep ReLU potential $u$ such that the push-forward $(\nabla u)_{\#}\mu$ is arbitrarily close to $\pi$ in Wasserstein-1 ($W_1$), MMD, or KSD metrics. Finite network width and depth bounds are explicit, with exponential scaling in $d$ for $W_1$ but only polynomial scaling for MMD/KSD (Lu et al., 2020). This elevates the classical UAT from function- to measure-approximation; a one-dimensional sketch follows this list.
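The sketch below illustrates the push-forward idea in the simplest possible setting; it is illustrative only, since the cited construction uses gradients of deep ReLU potentials in general dimension. A monotone piecewise-linear map, representable exactly by a small one-hidden-layer ReLU network, pushes a uniform source forward toward an exponential target, and the mismatch is measured by the empirical Wasserstein-1 distance of equal-size samples (the mean absolute difference of the sorted samples).

```python
# 1D push-forward sketch: a piecewise-linear approximation of the Exp(1) inverse CDF
# (realizable by an O(k)-unit ReLU network) maps Uniform(0,1) samples toward Exp(1),
# with the gap measured by the empirical Wasserstein-1 distance.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

k = 32                                           # number of knots of the PWL map
knots = np.linspace(0.0, 1.0 - 1e-3, k)          # truncate near 1 to keep values finite
values = -np.log(1.0 - knots)                    # inverse CDF of Exp(1) at the knots

u = rng.uniform(0.0, 1.0 - 1e-3, n)              # source: uniform samples
pushed = np.interp(u, knots, values)             # push-forward through the PWL map
target = rng.exponential(1.0, n)                 # samples from the target law

# Empirical W1 between two equal-size samples: mean |difference| of sorted samples.
w1 = np.mean(np.abs(np.sort(pushed) - np.sort(target)))
print(f"empirical W1 between push-forward and target with {k} knots: {w1:.4f}")
```

Refining the knot grid (i.e., widening the ReLU network representing the map) drives the empirical $W_1$ toward the residual error due to truncation and sampling.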
6. Limitations, Open Problems, and Structural Results
The UAT framework does not imply practical trainability or feasible network size for high-dimensional domains: the curse of dimensionality remains for generic continuous targets, and quantitative rates depend crucially on smoothness and function-class assumptions. Open areas include:
- Explicit rates for deep, narrow, or functional-input architectures.
- Comprehensive if-and-only-if conditions for activation functions in hypercomplex and function space cases (Valle et al., 4 Jan 2024).
- Sharp algebraic characterization of how and when quantization, dropout, or architectural constraints degrade expressivity or only increase parameter requirements, with nontrivial gaps persisting for, e.g., binarized or finite-alphabet networks (Güntürk et al., 2021, Yayla et al., 2021, Augustine, 17 Jul 2024).
- Analytical and constructive methods for measure-level (distributional) approximation, especially for non-compact domains and under adversarial/distributional shifts (Ceylan et al., 10 Oct 2025, Lu et al., 2020).
7. Summary and Impact
The universal approximation theorem has evolved into a comprehensive set of results covering classical shallow networks, algebraically structured models (hypercomplex, Clifford), operator and quantum circuits, robust and distributional approximation, and computation-constrained (binarized, floating-point, dropout) regimes. The theorems delineate the algebraic, analytic, and computational boundaries of representational power, guide the selection of architectures and activations for specific tasks, and anchor the expressive capacity of neural architectures in both theoretical and practical settings (Nishijima, 2021, Gonon et al., 2023, Valle et al., 4 Jan 2024, Lu et al., 2020, Ceylan et al., 10 Oct 2025, Güntürk et al., 2021).