Universal Approximation Property (UAP)
- UAP is defined as the ability of a neural or dynamical system to approximate any target function in a function space arbitrarily well.
- It has been established in various settings such as $L^p$, Sobolev, and Orlicz spaces, often requiring non-polynomial activations and tailored architectural designs.
- Modern extensions of UAP are seen in transformers, CNNs, residual networks, and neural ODEs, emphasizing its critical role in the expressive power of machine learning models.
The universal approximation property (UAP) asserts that a given class of neural or dynamical networks is dense in an appropriate function space, i.e., any target function in that space can be approximated arbitrarily well by members of the class. UAP is fundamental in both the theoretical analysis and practical design of machine learning architectures, as it underlies the expressive power of these models across a wide spectrum of function spaces, including $C(K)$ and $L^p$ spaces, Sobolev and Orlicz spaces, and equivariant or dynamical settings.
1. Definitions and Core Principles
UAP is formally defined relative to a topology or norm on a function space $\mathcal{F}$ (e.g., the uniform norm on $C(K)$ or the $L^p$-norm on $L^p(K)$). A hypothesis class $\mathcal{H} \subseteq \mathcal{F}$ has the UAP if for every $f \in \mathcal{F}$ and every $\varepsilon > 0$, there exists $h \in \mathcal{H}$ such that $\|f - h\| < \varepsilon$. For neural networks, standard results show that single-hidden-layer feedforward networks with non-polynomial activation functions possess the UAP on $C(K)$ and $L^p(K)$ spaces over compact $K \subset \mathbb{R}^n$ (Chong, 2020).
Recent developments extend this paradigm to a multitude of settings:
- Non-compact domains, via weighted $L^p$ and Sobolev spaces (Neufeld et al., 2024).
- Sequence-to-sequence and equivariant architectures (e.g., transformers under $L^p$ norms with group symmetry) (Cheng et al., 30 Jun 2025).
- Functional input, random feature, or Banach space-valued models (Neufeld et al., 2023).
- Dynamical systems and control-inspired families (including residual networks, neural ODEs, and invertible flows) (Duan et al., 2023, Ishikawa et al., 2022, Marinis et al., 19 Mar 2025).
The property is robust to numerous architectural modifications (depth, random initialization, skip connections, sparsity) and holds under surprisingly severe constraints on weights and layer norms (Chong, 2020, Ceylan et al., 10 Oct 2025).
2. Classical UAP Results and Generalizations
The classical UAP for shallow neural networks is well understood: for a continuous, non-polynomial activation $\sigma$, the class of one-hidden-layer networks $\{x \mapsto \sum_{j=1}^{N} a_j\, \sigma(w_j \cdot x + b_j)\}$ is dense in $C(K)$ for any compact $K \subset \mathbb{R}^n$ (Chong, 2020).
Explicit quantitative results are available:
- For any $f$ Lipschitz on a compact $K \subset \mathbb{R}^n$, $O(\varepsilon^{-n})$ hidden units suffice to achieve uniform error $\varepsilon$ (Chong, 2020).
- For polynomial targets of degree $d$, the number of units required is combinatorial in the input dimension $n$ and the degree $d$: $\binom{n+d}{d}$, independent of the output dimension (Chong, 2020).
- For $L^p$-spaces and variable-exponent Lebesgue norms $L^{p(\cdot)}$, UAP holds if and only if the exponent $p(\cdot)$ is essentially bounded (Capel et al., 2020).
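The flavor of such quantitative statements can be seen in one dimension with a ReLU activation. The following is a minimal sketch, not the construction used in the cited work: it builds a one-hidden-layer ReLU network that interpolates a Lipschitz target at equally spaced knots, so the uniform error is at most the Lipschitz constant times the knot spacing. The target `|sin(3x)|`, the interval, and all function names are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_interpolant(f, a, b, n_units):
    """One-hidden-layer ReLU network matching f at n_units + 1 equally spaced knots."""
    knots = np.linspace(a, b, n_units + 1)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)            # slope on each sub-interval
    # Output weights: initial slope, then the change of slope at each interior knot.
    coeffs = np.concatenate(([slopes[0]], np.diff(slopes)))
    biases = knots[:-1]                                  # one hidden unit per left knot

    def network(x):
        x = np.asarray(x, dtype=float)
        hidden = relu(x[..., None] - biases)             # shape (..., n_units)
        return values[0] + hidden @ coeffs

    return network

def sup_error(f, net, a, b, n_grid=20001):
    """Uniform error estimated on a fine grid."""
    xs = np.linspace(a, b, n_grid)
    return float(np.max(np.abs(f(xs) - net(xs))))

def target(x):
    return np.abs(np.sin(3.0 * x))                       # Lipschitz constant 3, not smooth

a, b, lipschitz = 0.0, 2.0, 3.0
errors = {}
for n_units in (4, 16, 64):
    net = relu_interpolant(target, a, b, n_units)
    errors[n_units] = sup_error(target, net, a, b)
    print(n_units, errors[n_units], "bound:", lipschitz * (b - a) / n_units)
```

In this one-dimensional sketch the unit count needed for error $\varepsilon$ scales like $1/\varepsilon$; the interpolation argument does not by itself say anything about higher input dimensions.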
Modern results sharpen these by establishing UAP on:
- Non-compact domains, using weighted $L^p$ and Sobolev norms, where arbitrary polynomial growth is controlled via weighted Banach topologies. A non-polynomial activation $\sigma$ remains necessary and sufficient (Neufeld et al., 2024).
- Orlicz spaces, particularly for distributionally robust learning under weakly compact classes of measures, including regimes beyond $L^p$ (Ceylan et al., 10 Oct 2025).
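For orientation, an Orlicz space replaces the power $t^p$ by a general convex Young function $\Phi$. One standard normalization is the Luxemburg norm sketched below; the symbols $\Phi$, $\mu$, and $\lambda$ are notation assumed here, and the cited work may use a different convention.

```latex
\[
  \|f\|_{L^{\Phi}(\mu)}
  \;=\;
  \inf\Bigl\{ \lambda > 0 \;:\; \int \Phi\!\Bigl(\frac{|f|}{\lambda}\Bigr)\, d\mu \le 1 \Bigr\}.
\]
```

Choosing $\Phi(t) = t^p$ recovers the usual $L^p$ norm, while faster-growing choices such as $\Phi(t) = e^{t} - 1$ give spaces outside the $L^p$ scale.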
3. UAP in Specialized and Symmetry-Constrained Architectures
Generalization to architectures with strong structural constraints has become an active research area.
3.1. Transformers and Equivariant Models
Universal approximation for transformer-type models (attention-based or residual) is characterized by two key criteria:
- Universal single-token nonlinearity: The class of feedforward maps must itself possess the classical UAP on the single-token input space.
- Token distinguishability of mixing layers: The attention or token-mixing mechanism must be able, after a finite sequence of layers, to scatter any pair of non-equivalent sequences to outputs with disjoint multisets of tokens, generalized to group equivariance under sequence symmetries (Cheng et al., 30 Jun 2025).
For softmax, RBF, and random-feature attention mechanisms, provided the kernel is real-analytic and satisfies certain scaling distinguishability conditions, the full transformer enjoys UAP in the $L^p$ norm, often with a minimal number of layers (Cheng et al., 30 Jun 2025).
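The sequence symmetry mentioned above can be checked numerically. The sketch below, under the assumption of a single-head softmax self-attention layer with no positional encoding, verifies that permuting the input tokens permutes the output tokens in the same way. The layer definition, dimensions, and weight names are illustrative and not taken from the cited framework.

```python
import numpy as np

def softmax(scores, axis=-1):
    shifted = scores - scores.max(axis=axis, keepdims=True)   # numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head softmax self-attention without positional encoding."""
    queries, keys, vals = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores) @ vals

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 4
tokens = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = rng.permutation(n_tokens)
out_then_perm = self_attention(tokens, w_q, w_k, w_v)[perm]   # permute the output rows
perm_then_out = self_attention(tokens[perm], w_q, w_k, w_v)   # permute the input rows
equivariant = bool(np.allclose(out_then_perm, perm_then_out))
print("permutation-equivariant:", equivariant)
```

Because such a layer cannot tell apart two sequences that differ only by a permutation, approximation statements for it are naturally formulated for equivariant targets, or positional information has to be added.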
3.2. Convolutional, Input-Connected, and Residual Architectures
Fully convolutional neural networks (CNNs) with zero padding have the UAP for tensor-to-tensor maps, provided the depth or intermediate channel width meets dimension-dependent lower bounds; translation equivariance is controlled via boundary effects enabled by zero padding (Hwang et al., 2022). Input-connected multilayer perceptrons (IC-MLPs), in which each neuron receives both the recursive signal from earlier layers and the raw input, achieve UAP for any continuous, non-affine activation (Ismailov, 20 Jan 2026). ResNet, ODE-Net, and related residual architectures have the UAP provided the width meets a model-dependent critical threshold (Aizawa et al., 2020).
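The role of zero padding can be illustrated in one dimension. The following sketch, with an assumed toy signal and kernel, compares a circular ("wrap") convolution, which commutes exactly with cyclic shifts, against a zero-padded one, which does not because the boundary positions see the padding. It only illustrates the boundary effect and is not the construction from the cited paper.

```python
import numpy as np

def conv_same(signal, kernel, mode):
    """Same-size 1D cross-correlation; `mode` is passed to np.pad ('wrap' or 'constant')."""
    pad = len(kernel) // 2
    padded = np.pad(signal, pad, mode=mode)        # 'constant' pads with zeros
    return np.array([padded[i:i + len(kernel)] @ kernel for i in range(len(signal))])

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
kernel = np.array([1.0, 0.0, -1.0])
shifted = np.roll(signal, 1)                       # cyclic shift by one position

def commutes_with_shift(mode):
    """Check conv(shift(x)) == shift(conv(x)) for the given padding mode."""
    return bool(np.array_equal(conv_same(shifted, kernel, mode),
                               np.roll(conv_same(signal, kernel, mode), 1)))

circular_equivariant = commutes_with_shift("wrap")
zero_pad_equivariant = commutes_with_shift("constant")
print("circular padding commutes with shifts:", circular_equivariant)
print("zero padding commutes with shifts:", zero_pad_equivariant)
```

With zero padding the two outputs still agree at interior positions and differ only near the boundary, which is the mechanism that lets a padded network treat positions differently.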
3.3. Minimum Width and Topological Constraints
Sharp minimum width requirements for UAP are governed by input and output dimension as well as topological embedding theory:
- $L^p$-UAP for leaky-ReLU nets requires width at least $\max(d_x, d_y)$, where $d_x$ and $d_y$ are the input and output dimensions (Cai, 2022).
- Uniform UAP ($C$-UAP) generally requires extra dimensions: with input dimension $d_x$ and output dimension $d_y$, the minimum width is $\max(d_x, d_y) + \Delta(d_x, d_y)$, where $\Delta(d_x, d_y)$ is the minimal number of additional coordinates required for the graph of any target function to be embedded as an orientation-preserving diffeomorphism (Li et al., 2023).
- These thresholds are necessary; sub-critical width precludes UAP due to invariance along invisible directions or inability to reach nontrivial submanifolds in the target space (Cai, 2022).
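A simple instance of a width obstruction occurs for scalar inputs and outputs. The sketch below assumes a hypothetical width-one leaky-ReLU network of random depth: every layer is a monotone scalar map, so the whole network is monotone and its uniform distance to the non-monotone target $|x|$ on $[-1, 1]$ cannot fall below $1/2$. The sampling scheme and names are illustrative only.

```python
import numpy as np

def leaky_relu(z, slope=0.1):
    return np.where(z >= 0, z, slope * z)

def width_one_network(x, weights, biases):
    """Scalar network: affine map + leaky-ReLU per hidden layer, then a final affine map."""
    h = x
    for w, b in zip(weights[:-1], biases[:-1]):
        h = leaky_relu(w * h + b)
    return weights[-1] * h + biases[-1]

rng = np.random.default_rng(1)
xs = np.linspace(-1.0, 1.0, 2001)
target = np.abs(xs)                                # non-monotone target

all_monotone = True
best_error = np.inf
for _ in range(500):
    depth = int(rng.integers(2, 8))
    weights = rng.normal(size=depth)
    biases = rng.normal(size=depth)
    ys = width_one_network(xs, weights, biases)
    steps = np.diff(ys)
    monotone = bool(np.all(steps >= -1e-12) or np.all(steps <= 1e-12))
    all_monotone = all_monotone and monotone
    best_error = min(best_error, float(np.max(np.abs(ys - target))))

print("all sampled networks monotone:", all_monotone)
print("best uniform error to |x|:", best_error)
```

Random sampling is used only to illustrate the point; the lower bound of $1/2$ holds for every monotone function, since it must be close to $1$ at one endpoint and close to $0$ at the origin while staying ordered.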
4. UAP in Dynamical Systems, ODEs, SDEs, and Control Families
UAP extends to flows of dynamical systems parameterized by neural networks, encompassing:
- Controlled ODEs with minimal control families. The family of flows generated by compositions of affine vector fields and a single nonlinear generator (e.g., ReLU) is minimal and sufficient for UAP on orientation-preserving diffeomorphisms of $\mathbb{R}^d$ (Duan et al., 2023).
- Residual networks and neural ODEs viewed as time-discretized flows. Networks with neural ODE activations have the UAP in $C(K)$ provided the underlying activation is non-polynomial and satisfies some mild regularity conditions; extending to SDEs is possible under linear growth constraints (Marinis et al., 19 Mar 2025, Kwossek et al., 20 Mar 2025).
- Neural DDEs (delay differential equations) introduce a “memory capacity” parameter $K\tau$ (Lipschitz constant $K$ times delay $\tau$). UAP for DDE-based models only holds above a memory threshold; below this, the model class is dynamically restricted and cannot approximate non-monotone maps (Kuehn et al., 12 May 2025).
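The view of residual networks as time-discretized flows can be made concrete with a short sketch. It assumes a hypothetical autonomous vector field $f(x) = \tanh(Wx + b)$ with weights shared across layers, so that a stack of residual blocks $x \leftarrow x + h f(x)$ is exactly forward Euler for $\dot{x} = f(x)$; a very deep stack serves as a stand-in for the exact flow. Weights, dimensions, and depths are illustrative.

```python
import numpy as np

def vector_field(x, w, b):
    """Hypothetical neural vector field f(x) = tanh(W x + b), applied row-wise."""
    return np.tanh(x @ w.T + b)

def resnet_flow(x0, w, b, n_layers, horizon=1.0):
    """n_layers residual blocks x <- x + h f(x): forward Euler up to time `horizon`."""
    h = horizon / n_layers
    x = np.array(x0, dtype=float)
    for _ in range(n_layers):
        x = x + h * vector_field(x, w, b)
    return x

rng = np.random.default_rng(2)
dim = 3
w = rng.normal(size=(dim, dim))
b = rng.normal(size=dim)
x0 = rng.normal(size=(10, dim))                    # ten initial points

reference = resnet_flow(x0, w, b, n_layers=4096)   # fine discretization as reference
gaps = {}
for n_layers in (4, 16, 64):
    gaps[n_layers] = float(np.max(np.abs(resnet_flow(x0, w, b, n_layers) - reference)))
    print(n_layers, gaps[n_layers])
```

As the depth grows, the residual network output approaches the flow map of the ODE, which is why density results for flows can be transferred to sufficiently deep residual architectures.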
The connection to control theory is explicit: UAP for flow-generated models follows from density results for the Lie algebra generated by the control family, augmented by affine invariance or Lie-bracket generation (Cai et al., 4 Oct 2025, Duan et al., 2023).
5. UAP Extensions: Random Features, Robust and Non-Compact Settings
The UAP generalizes naturally to:
- Random feature models and Banach-valued function spaces: Randomly initialized single-layer networks with only the output layer trained have UAP in any Bochner space $L^p(\Omega; X)$ over a separable Banach space $X$, provided the activation is non-polynomial and the feature distribution has full support (Neufeld et al., 2023). This result covers weighted $L^p$, Sobolev, and even path-space-valued functions.
- Distributionally robust approximation: Neural networks possess UAP in Orlicz spaces and, crucially, the approximation is uniform over any weakly compact family of measures, even far beyond the standard $L^p$ setting (Ceylan et al., 10 Oct 2025).
- Variable-exponent and weighted spaces: In $L^{p(\cdot)}$ spaces, UAP by shallow neural networks occurs if and only if the exponent $p(\cdot)$ is essentially bounded; when unbounded, only those functions converging at infinity (in a quotient-norm sense) are approximable (Capel et al., 2020).
Weighted and non-compact settings pose no barrier to UAP provided proper growth controls on the function class and corresponding polynomially-weighted norms are included (Neufeld et al., 2024).
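The random feature setting can be sketched in the simplest scalar case. The example below is an illustrative assumption, not the Banach-valued construction of the cited work: hidden weights and biases are drawn once from a Gaussian (full support) and frozen, a tanh activation is used, and only the linear output layer is fitted by least squares. The target function, scales, and feature counts are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
max_features = 128
# Hidden parameters are sampled once from a full-support Gaussian and never trained.
hidden_w = rng.normal(scale=4.0, size=max_features)
hidden_b = rng.normal(scale=4.0, size=max_features)

def features(x, n_features):
    """First n_features random tanh features plus a constant column."""
    phi = np.tanh(np.outer(x, hidden_w[:n_features]) + hidden_b[:n_features])
    return np.column_stack([np.ones_like(x), phi])

def fit_readout(x, y, n_features):
    """Train only the linear output layer by least squares."""
    coef, *_ = np.linalg.lstsq(features(x, n_features), y, rcond=None)
    return coef

x_train = np.linspace(-1.0, 1.0, 400)
y_train = np.sin(2 * np.pi * x_train) + 0.5 * x_train

rms_error = {}
for n in (4, 32, 128):
    coef = fit_readout(x_train, y_train, n)
    residual = features(x_train, n) @ coef - y_train
    rms_error[n] = float(np.sqrt(np.mean(residual ** 2)))
    print(n, rms_error[n])
```

The feature sets are nested, so the least-squares fit on the sample grid cannot get worse as features are added; how fast the error decays in general is a separate quantitative question.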
6. Theoretical Structures, Minimality, and Open Problems
A profound insight is that UAP is fundamentally a topological/dynamical property: any universal approximator can be characterized via topologically transitive (Birkhoff-hypercyclic) operators whose orbits densely cover the function space (Kratsios, 2019). This opens avenues for:
- Constructing minimal architectures with a single nonlinear “gate” and linear dynamical operator sufficient for UAP under mild barycentric conditions (Kratsios, 2019).
- Analyzing flows on diffeomorphism groups, where composition operators with a minimal control family (affine maps plus one nonlinearity) suffice for density in $C$ or $L^p$ on any compact domain (Duan et al., 2023).
- Investigating minimal sufficient symmetry or nonlinearity (e.g., necessity of non-polynomial, non-affine, nonmonotonic activation in the UAP context) (Chong, 2020, Ismailov, 20 Jan 2026).
- Quantitative estimation of rates and explicit complexity bounds in high-dimensional and Sobolev contexts (Neufeld et al., 2024, Neufeld et al., 2023).
Open directions include rates of approximation under dimensionality constraints, extension to deep models in variable-exponent or non-metrizable spaces, minimal activations and architectures for critical-width UAP, and the interplay with optimization and learnability constraints in random or quantized settings (Capel et al., 2020, Cai, 2022, Ceylan et al., 10 Oct 2025).
References
- "Approximation with Neural Networks in Variable Lebesgue Spaces" (Capel et al., 2020)
- "A unified framework on the universal approximation of transformer-type architectures" (Cheng et al., 30 Jun 2025)
- "A closer look at the approximation capabilities of neural networks" (Chong, 2020)
- "Achieve the Minimum Width of Neural Networks for Universal Approximation" (Cai, 2022)
- "Minimum Width of Leaky-ReLU Neural Networks for Uniform Universal Approximation" (Li et al., 2023)
- "Universal approximation results for neural networks with non-polynomial activation function over non-compact domains" (Neufeld et al., 2024)
- "Universal Approximation Property of Banach space-valued random feature models..." (Neufeld et al., 2023)
- "Distributionally robust approximation property of neural networks" (Ceylan et al., 10 Oct 2025)
- "Universal Approximation Property of Fully Convolutional Neural Networks with Zero Padding" (Hwang et al., 2022)
- "Universal Approximation Theorem for Input-Connected Multilayer Perceptrons" (Ismailov, 20 Jan 2026)
- "Universal approximation property of invertible neural networks" (Ishikawa et al., 2022)
- "Universal approximation property of neural stochastic differential equations" (Kwossek et al., 20 Mar 2025)
- "The Influence of the Memory Capacity of Neural DDEs on the Universal Approximation Property" (Kuehn et al., 12 May 2025)
- "A Minimal Control Family of Dynamical Systems for Universal Approximation" (Duan et al., 2023)
- "Achieving Universal Approximation and Universal Interpolation via Nonlinearity of Control Families" (Cai et al., 4 Oct 2025)
- "Approximation properties of neural ODEs" (Marinis et al., 19 Mar 2025)
- "NEU: A Meta-Algorithm for Universal UAP-Invariant Feature Representation" (Kratsios et al., 2018)
- "The Universal Approximation Property" (Kratsios, 2019)
- "Universal Approximation Properties for an ODENet and a ResNet: Mathematical Analysis and Numerical Experiments" (Aizawa et al., 2020)