Universal Approximation Theorems
- Universal Approximation Theorems are foundational results proving that neural, spin, and signature models can approximate any continuous function on compact domains given sufficient capacity.
- They extend classical theorems like Weierstrass and Stone–Weierstrass, applying advanced functional analysis and algebraic tools to modern architectures including operator and geometric networks.
- Quantitative bounds and architectural variants highlight both the potential and the limitations of model expressivity, addressing challenges such as the curse of dimensionality.
A universal approximation theorem asserts that a broad class of parameterizable function models—most notably neural networks, but also signature and spin models—can approximate any target in a prescribed function class (such as continuous functions on a compact domain) to arbitrary precision, given sufficient model capacity. These results are foundational for neural network theory, signature methods for time series, operator learning, spin-based probabilistic models, quantum circuits, and geometric deep learning. They also provide crucial bridges between applied learning architectures and the classical theory of function spaces.
1. Classical Universal Approximation Theorems
The archetypal universal approximation theorem for neural networks—in the form of the Cybenko, Hornik, and Leshno–Lin–Pinkus–Schocken theorems—states that shallow feedforward neural networks with non-polynomial activation functions are uniformly dense in the set of continuous functions on compact subsets of Euclidean space. Specifically, for $f \in C(K)$ with $K \subset \mathbb{R}^d$ compact and $\sigma$ continuous and non-polynomial, sums of the form $\sum_{i=1}^{N} c_i\,\sigma(w_i^\top x + b_i)$ can approximate $f$ in the sup-norm to arbitrary accuracy (Chong, 2020, Nishijima, 2021, Augustine, 17 Jul 2024). The necessity of non-polynomiality (i.e., polynomial activations cannot be universal) holds in both finite- and infinite-dimensional settings (Bilokopytov et al., 27 Jul 2025).
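As a purely numerical illustration of this density statement (not a construction from the cited papers), the following sketch fits the outer coefficients of a one-hidden-layer tanh network to a continuous target by least squares over randomly drawn inner weights; all parameter choices are illustrative.

```python
import numpy as np

# Minimal illustration of the classical UAT: approximate f(x) = sin(2*pi*x)
# on [0, 1] by sum_i c_i * tanh(w_i * x + b_i). Inner weights are random;
# outer coefficients c are fit by least squares.
rng = np.random.default_rng(0)
N = 50                                  # number of hidden units
x = np.linspace(0.0, 1.0, 200)          # compact domain
f = np.sin(2 * np.pi * x)               # continuous target

w = rng.normal(scale=10.0, size=N)      # random inner weights
b = rng.uniform(-10.0, 10.0, size=N)    # random biases
H = np.tanh(np.outer(x, w) + b)         # hidden-layer features, shape (200, N)

c, *_ = np.linalg.lstsq(H, f, rcond=None)
sup_error = np.max(np.abs(H @ c - f))
print(f"sup-norm error with {N} units: {sup_error:.3e}")
```

Increasing the number of hidden units drives the sup-norm error down, consistent with the theorem, although a random-feature least-squares fit is only a demonstration, not the proof technique used in the cited works.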
These results extend earlier classical approximation statements, including the Weierstrass theorem (polynomials are dense in $C([a,b])$), Stone–Weierstrass theorems (density of suitable subalgebras), and Kolmogorov–Arnold's superposition theorem. In modern settings, the algebraic structure of the parameterized families (e.g., ridge functions, shallow networks) is exploited to obtain density, often utilizing the Hahn–Banach theorem, the Riesz representation theorem, and moment-theoretic arguments (Augustine, 17 Jul 2024).
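For concreteness, the Hahn–Banach/Riesz step behind Cybenko-style arguments can be displayed as follows (notation chosen here for illustration):

```latex
% Duality step in Cybenko-style proofs (notation for illustration only).
% If the span of ridge units were not dense in C(K), Hahn--Banach would give a
% nonzero continuous functional annihilating it; by Riesz representation that
% functional is integration against a signed measure \mu on K, so the iterated
% condition below holds. A discriminatory (in particular, non-polynomial)
% \sigma forces \mu = 0, a contradiction; hence the span is dense.
\[
  \overline{\operatorname{span}}\{\sigma(w^\top x + b)\} \subsetneq C(K)
  \;\Longrightarrow\;
  \exists\, \mu \neq 0 \ \text{with}\
  \int_K \sigma(w^\top x + b)\, d\mu(x) = 0
  \quad \forall\, w \in \mathbb{R}^d,\ b \in \mathbb{R}.
\]
```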
Recent generalizations include universal approximation for neural nets with inputs from general topological vector spaces (Ismailov, 19 Sep 2024), manifolds (Kratsios et al., 2021), Banach and operator spaces (Bilokopytov et al., 27 Jul 2025), and even signature feature spaces of càdlàg paths (Cuchiero et al., 2022).
2. Structural Generalizations and New Domains
Universal approximation is now established for input domains and function spaces well beyond Euclidean space. The theorem of (Bilokopytov et al., 27 Jul 2025) establishes universality for the class of shallow networks $x \mapsto \sum_{i=1}^{n} c_i\,\sigma(\ell_i(x) + b_i)$, with $\ell_i$ continuous linear functionals, dense in $C(K)$ for compact $K \subseteq X$ (with $X$ a Hausdorff topological vector space) for non-polynomial $\sigma$, recovering and extending one-dimensional results.
In operator learning settings, bounded-width, arbitrary-depth operator neural networks are universal for continuous nonlinear operators between function spaces, provided the nonlinearity is non-polynomial and differentiable at a point with nonzero derivative (Yu et al., 2021).
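As a minimal sketch of how an operator network consumes a function-valued input, the following branch/trunk forward pass (in the general spirit of operator learning, not the specific bounded-width, arbitrary-depth construction of (Yu et al., 2021)) represents the input function by its values at fixed sensor points; all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(x, weights, act=np.tanh):
    # Simple fully connected forward pass; 'weights' is a list of (W, b) pairs.
    for W, b in weights[:-1]:
        x = act(x @ W + b)
    W, b = weights[-1]
    return x @ W + b

def init(sizes):
    return [(rng.normal(scale=0.5, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

# Operator G: u -> G(u), evaluated at query points y.
# Branch net encodes u via its values at m fixed sensors; trunk net encodes y.
m, p = 32, 16
branch = init([m, 64, p])
trunk = init([1, 64, p])

sensors = np.linspace(0, 1, m)
u = np.sin(3 * sensors)                 # one input function, sampled at sensors
y = np.linspace(0, 1, 50)[:, None]      # query locations

Gu_y = mlp(u[None, :], branch) @ mlp(y, trunk).T   # shape (1, 50)
print(Gu_y.shape)
```

The universality statement concerns the existence of weights realizing a given continuous operator to prescribed accuracy; the untrained forward pass above only shows the input/output plumbing.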
For vector-valued, hypercomplex, and Clifford-valued networks, UAT holds under the key algebraic condition of non-degeneracy (that each structural bilinear form is non-degenerate), showing that architectures over the complex numbers $\mathbb{C}$, the quaternions $\mathbb{H}$, tessarines, and Clifford algebras possess the same density properties as real-valued networks when activations are sufficiently rich (Valle et al., 4 Jan 2024, Vital et al., 2022).
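As a hypothetical illustration of the hypercomplex setting, the sketch below evaluates a single quaternion-valued neuron using the standard Hamilton product and a split (componentwise) activation; the layer design is illustrative and not taken from (Valle et al., 4 Jan 2024) or (Vital et al., 2022).

```python
import numpy as np

def hamilton(p, q):
    # Hamilton product of quaternions stored as arrays [..., 4] = (a, b, c, d).
    a1, b1, c1, d1 = np.moveaxis(p, -1, 0)
    a2, b2, c2, d2 = np.moveaxis(q, -1, 0)
    return np.stack([
        a1*a2 - b1*b2 - c1*c2 - d1*d2,
        a1*b2 + b1*a2 + c1*d2 - d1*c2,
        a1*c2 - b1*d2 + c1*a2 + d1*b2,
        a1*d2 + b1*c2 - c1*b2 + d1*a2,
    ], axis=-1)

def quaternion_neuron(x, w, bias):
    # x, w: (n, 4) quaternion inputs and weights; bias: (4,).
    # Split activation: tanh applied to each real component of the output.
    s = hamilton(w, x).sum(axis=0) + bias
    return np.tanh(s)

x = np.random.randn(3, 4)   # three quaternion-valued inputs
w = np.random.randn(3, 4)
print(quaternion_neuron(x, w, np.zeros(4)))
```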
Neural networks defined on infinite-dimensional or sequence spaces achieve universal approximation if the activation is non-polynomial on an open interval. This is proved using the density of exponential functionals and a Stone–Weierstrass theorem in the infinite-dimensional setting (Ismailov, 19 Sep 2024).
Geometric deep learning (GDL) models on manifolds achieve universality locally—with explicit dependencies on curvature, injectivity radii, and modulus of continuity. For Cartan–Hadamard manifolds (nonpositive curvature), universality holds globally, while positive-curvature manifolds or nontrivial topology impose topological obstructions unless the function is null-homotopic (Kratsios et al., 2021).
3. Architecture Variants and Activation Design
Universal approximation holds for a broad spectrum of model architectures and activation functions:
- Width and depth: Shallow, wide networks (single hidden layer, arbitrary width) achieve universality with non-polynomial activation (Augustine, 17 Jul 2024). Conversely, arbitrarily deep but narrow (bounded-width) ReLU networks achieve $L^1$-universality for integrable functions (Augustine, 17 Jul 2024). For operator neural networks, arbitrary depth with width as low as $5$ suffices (Yu et al., 2021).
- Binarized and quantized networks: Fully-connected Binarized Neural Networks (BNNs) are universal on binary inputs with one hidden layer, but for real-valued inputs require two layers for Lipschitz functions; this universality comes at the price of width exponential in the input dimension (Yayla et al., 2021). For one-bit neural networks (with each parameter restricted to one of two values), smooth functions can be approximated to any prescribed accuracy away from the domain boundary with explicitly bounded parameter counts, with explicit implementations for both quadratic and ReLU activations (Güntürk et al., 2021).
- Dropout networks: Universal approximation holds both in the random operation mode (approximating the target in probability) and in the deterministic mode in which each dropout mask is replaced by its expectation, establishing that dropout does not impede universal approximation (Manita et al., 2020); a small sketch of the two modes appears after this list.
- Mixture-of-experts: MoE models with softmax gating and affine experts are dense in the space of continuous functions on any compact domain, without requiring smoothness or special geometric structure on the domain (Nguyen et al., 2016); a minimal sketch appears after this list.
- Spin models and Boltzmann machines: Classes such as RBMs, DBMs, and DBNs are universal approximators for probability distributions over discrete spaces, with universality characterized by "flag completeness" and closure under nonnegative combinations (Reinhart et al., 10 Jul 2025).
- Signature models: For functionals of càdlàg or Lévy-type paths, any continuous functional can be uniformly approximated on compact sets (in the Skorokhod topology) by linear functionals of time-extended signatures, yielding universal signature models for path-dependent functionals (Cuchiero et al., 2022).
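To make the dropout item above concrete, the sketch below contrasts the random operation mode with the expectation-replacement mode for one layer; the inverted-dropout scaling and all names are illustrative conventions, not the exact formulation of (Manita et al., 2020).

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(x, W, b, p, mode):
    # One hidden layer with dropout probability p.
    # "random": multiply each activation by an independent Bernoulli(1-p)
    #           mask, rescaled by 1/(1-p) so the mask has expectation 1.
    # "expectation": replace the random mask by its expectation (no mask),
    #                giving the deterministic surrogate network.
    h = np.tanh(x @ W + b)
    if mode == "random":
        mask = rng.binomial(1, 1 - p, size=h.shape) / (1 - p)
        return h * mask
    return h

x = rng.normal(size=(4, 3))
W, b = rng.normal(size=(3, 5)), np.zeros(5)
print(dropout_layer(x, W, b, 0.5, "random").shape,
      dropout_layer(x, W, b, 0.5, "expectation").shape)
```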
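Similarly, a minimal sketch of a softmax-gated mixture of affine experts on scalar inputs (all parameter values are placeholders; the construction in (Nguyen et al., 2016) is more general):

```python
import numpy as np

def moe(x, gate_w, gate_b, exp_w, exp_b):
    # Softmax-gated mixture of K affine experts on scalar inputs x (shape (n,)).
    # gate_w, gate_b, exp_w, exp_b all have shape (K,).
    logits = np.outer(x, gate_w) + gate_b          # (n, K) gating logits
    gates = np.exp(logits - logits.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)      # softmax over experts
    experts = np.outer(x, exp_w) + exp_b           # (n, K) affine expert outputs
    return (gates * experts).sum(axis=1)           # gated convex combination

x = np.linspace(-1, 1, 5)
K = 3
print(moe(x, np.ones(K), np.zeros(K), np.arange(1.0, K + 1), np.zeros(K)))
```

Each output is a convex combination of affine expert predictions with input-dependent softmax weights; the density result concerns the existence of suitable gate and expert parameters.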
4. Quantitative Rates, Curse of Dimensionality, and Expressivity
Explicit rates for the number of hidden units or parameters required for a given accuracy have been established in various settings:
- For standard shallow networks approximating continuous functions on compact subsets of $\mathbb{R}^d$, explicit bounds on the number of hidden units sufficient for a prescribed accuracy are available (Chong, 2020).
- For targets of Sobolev regularity $s$, the best $n$-unit error scales as $O(n^{-s/d})$ in input dimension $d$ (Nishijima, 2021).
- For Barron-space targets (i.e., where $f$ can be written as an integral over parameterized ridge functions with finite total variation), the error rate is $O(n^{-1/2})$—independent of input dimension—making such functions "dimension-free" with respect to the curse of dimensionality (Nishijima, 2021); the contrast with the Sobolev rate is displayed after this list.
- One-bit or binary networks incur a penalty: the required network size for a given accuracy scales with both the input dimension and the smoothness of the target class (Güntürk et al., 2021).
- For quantum neural networks and quantum reservoirs, explicit bounds on the numbers of weights and qubits sufficient for $\varepsilon$-approximation of functions with integrable Fourier transform are available. These quantitative results mirror Barron-type classical neural network rates (Gonon et al., 2023).
- In the geometric manifold setting, the universality radius and required depth are controlled in terms of curvature, injectivity radius, and modulus of continuity; the curse of dimensionality is broken on finite (efficient) datasets (Kratsios et al., 2021).
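Schematically, suppressing norms and constants, the contrast between the generic Sobolev rate and the dimension-free Barron rate reads:

```latex
% Approximation by n-unit shallow networks, input dimension d, smoothness s
% (schematic: norms and constants suppressed).
\[
  \inf_{f_n \in \mathcal{N}_n} \; \| f - f_n \| \;\lesssim\;
  \begin{cases}
    n^{-s/d}, & f \ \text{of Sobolev regularity } s
      \quad \text{(curse of dimensionality)},\\[4pt]
    n^{-1/2}, & f \ \text{in the Barron class}
      \quad \text{(dimension-free)}.
  \end{cases}
\]
```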
In operator learning, arbitrary-depth, small-width NNs can achieve universality over operator spaces, and there are corresponding depth-vs-width separation theorems: for certain functionals, a deep network of constant width is exponentially more efficient than any shallow architecture (Yu et al., 2021).
5. Advanced and Specialized Universal Approximation Results
Universal approximation theorems have been extended or adapted to a variety of modern architectures and requirements:
- Equivariant networks: For group equivariant convolutional architectures (including CNNs for group actions, DeepSets, SE(3)-CNNs), universal approximation of continuous equivariant maps is achieved with as few as two layers for suitable activations and group choices, including infinite-dimensional and non-compact settings (Kumagai et al., 2020).
- Vector lattices and nonlinear algebraic structures: Non-polynomial activations yield density in continuous function spaces over infinite-dimensional vector lattices, with corollaries for positive-part maps and sublattice density in Banach lattices (Bilokopytov et al., 27 Jul 2025).
- Floating-point and interval semantics: Floating-point networks (using only finitely precise arithmetic) are provably universal for interval-abstract semantics, exactly matching the direct image map of any rounded target function. This universality persists even with minimal activation requirements (e.g., identity activation suffices), and encompasses interval-completeness for straight-line floating-point programs (Hwang et al., 19 Jun 2025); see the interval-propagation sketch after this list.
- Signature-based functionals: In path space, any continuous functional of the signature of a (time-augmented) càdlàg path can be uniformly approximated using linear functionals of the signature, extending universality to nonanticipative functionals over compact subsets of path spaces. Models based on signatures of Lévy processes retain explicit tractability and permit exact closed-form pricing and hedging in mathematical finance (Cuchiero et al., 2022).
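To make linear functionals of signatures concrete, the sketch below computes the depth-two signature of a sampled, time-augmented path via Chen's relation for piecewise-linear segments; it is illustrative only and does not reproduce the càdlàg/Lévy setting of (Cuchiero et al., 2022).

```python
import numpy as np

def signature_level2(path):
    # path: (T, d) array of sample points of a piecewise-linear path.
    # Returns the level-1 and level-2 signature terms
    #   S1[i]    = integral of dx_i,
    #   S2[i, j] = iterated integral of dx_i dx_j over s < t,
    # built up segment by segment using Chen's relation.
    d = path.shape[1]
    S1, S2 = np.zeros(d), np.zeros((d, d))
    for k in range(1, len(path)):
        dx = path[k] - path[k - 1]
        S2 += np.outer(S1, dx) + 0.5 * np.outer(dx, dx)
        S1 += dx
    return S1, S2

t = np.linspace(0.0, 1.0, 200)
path = np.stack([t, np.sin(2 * np.pi * t)], axis=1)   # time-augmented path
S1, S2 = signature_level2(path)
print(S1, S2)
```

A truncated signature model is then a linear functional on the features (1, S1, S2, ...), and the universality statement says such functionals are dense among continuous path functionals on compact sets.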
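To make the interval-abstract semantics item concrete as well, the following sketch propagates an axis-aligned box through one affine-plus-ReLU layer using standard real interval arithmetic; it is a generic illustration, not the floating-point rounding semantics analyzed in (Hwang et al., 19 Jun 2025).

```python
import numpy as np

def interval_affine_relu(lo, hi, W, b):
    # Propagate the box {x : lo <= x <= hi} through x -> relu(W @ x + b).
    # Splitting W into positive and negative parts gives tight per-coordinate
    # bounds for the affine map; ReLU is monotone, so it maps bounds to bounds.
    Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
    out_lo = Wp @ lo + Wn @ hi + b
    out_hi = Wp @ hi + Wn @ lo + b
    return np.maximum(out_lo, 0.0), np.maximum(out_hi, 0.0)

W = np.array([[1.0, -2.0], [0.5, 0.5]])
b = np.array([0.0, -0.1])
print(interval_affine_relu(np.array([-1.0, 0.0]), np.array([1.0, 2.0]), W, b))
```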
6. Technical Methods, Proof Tools, and Limitations
Central technical methodologies recurring in proofs of universal approximation theorems include:
- Stone–Weierstrass theorem: Constructing dense subalgebras generated by parameterized activation, exponential, or indicator functions, typically by verifying separation of points and closure under multiplication (Bilokopytov et al., 27 Jul 2025, Cuchiero et al., 2022, Ismailov, 19 Sep 2024).
- Algebraic independence and generalized Vandermonde/Wronskian criteria: Used to guarantee that a suitable set of neuron outputs can interpolate any polynomial target and hence bootstrap to dense approximation of continuous functions (Chong, 2020).
- Convolution-mollifier arguments: Used to pass from smooth (polynomial) approximation to general continuous targets (Bilokopytov et al., 27 Jul 2025, Ismailov, 19 Sep 2024); a one-dimensional sketch appears after this list.
- Slicing and functional reduction: Reduction of infinite-dimensional problems to families of finite-dimensional ones via "slicing" through affine subspaces (Bilokopytov et al., 27 Jul 2025).
- Encoding/decoding and truncation for operator nets: Efficiently encoding inputs to control dimension blow-up and achieving narrow depth-universal architectures (Yu et al., 2021).
- Probability and functional analysis: For dropout models, exploiting linearity of expectation, variance concentration, and matching-in-expectation algebraic identities, as well as layerwise law-of-large-numbers constructions (Manita et al., 2020).
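As a one-dimensional sketch of the mollification step (generic numerical smoothing with a Gaussian kernel; grid, width, and target below are illustrative, not the construction in the cited papers):

```python
import numpy as np

def mollify(x, f_vals, eps):
    # Convolve samples of a continuous f on the uniform grid x with a
    # normalized Gaussian of width eps; as eps -> 0 the result converges
    # uniformly to f on compact sets away from the grid boundary.
    dx = x[1] - x[0]
    u = np.arange(-4 * eps, 4 * eps + dx, dx)
    kernel = np.exp(-u**2 / (2 * eps**2))
    kernel /= kernel.sum()
    return np.convolve(f_vals, kernel, mode="same")

x = np.linspace(-1.0, 1.0, 401)
f = np.abs(x)                                  # continuous, not smooth at 0
err_interior = np.max(np.abs(mollify(x, f, 0.01) - f)[50:-50])
print(f"interior sup error: {err_interior:.3e}")
```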
While universal approximation is extremely general, its existential character limits direct quantitative insight except where explicit rates have been established; many results do not guarantee efficient or practical network sizes except for functions of substantial smoothness or special structure (e.g., Barron class or efficient datasets). In high dimensions, "curse of dimensionality" bounds often apply to generic continuous functions unless additional function-space structure is available. Topological or geometric obstructions in non-Euclidean settings can preclude universality globally, but are circumvented by suitable architecture adaptations or localization (Kratsios et al., 2021).
7. Broader Impact and Unifying Principles
Universal approximation theorems provide the formal mathematical underpinning for the use of neural, quantum, signature-based, and spin/energy-based models in arbitrary function learning, generative modeling, operator regression, and time-series modeling. They establish that—independent of optimization, training, and generalization—there exist configurations of model parameters capable of matching any function of practical interest to prescribed accuracy, modulo architectural and functional assumptions.
Unification across function classes, activation types, input and output domains, and architectural innovations shows the robustness of universal approximation: for virtually any space of "reasonable" functions, sufficiently expressive parameterized models can, in principle, recover its elements up to any desired precision (Augustine, 17 Jul 2024, Ismailov, 19 Sep 2024, Bilokopytov et al., 27 Jul 2025). Extensions to probabilistic, vector-valued, hypercomplex, geometric, operator, and path-function spaces further solidify universal approximation as a central and motivating theoretical result in machine learning and beyond.