
Kolmogorov–Arnold Theorem

Updated 7 December 2025
  • Kolmogorov–Arnold Theorem is a foundational result in multivariate function approximation, showing that any continuous function can be represented as a finite sum of univariate functions.
  • The theorem underpins neural network architectures like the Kolmogorov–Arnold Network (KAN), which efficiently mitigate the curse of dimensionality.
  • Modern refinements using higher-order B-splines and adaptive basis functions offer scalable error bounds and improved smoothness control in high-dimensional models.

The Kolmogorov–Arnold Theorem is a foundational result in the theory of multivariate function approximation, asserting that any continuous function of several variables can be represented as a finite superposition of univariate continuous functions. This superposition principle has profound consequences for both pure mathematics and applied fields such as neural network theory, function fitting, and high-dimensional modeling.

1. Mathematical Formulation and Statement

Let $f \in C([0,1]^n)$ be a continuous function on the $n$-dimensional unit cube. The Kolmogorov–Arnold representation theorem states that there exist continuous univariate “inner” functions $\psi_{p,q} : [0,1] \to \mathbb{R}$ for $p = 1, \dots, n$, $q = 0, \dots, 2n$, and continuous “outer” functions $\varphi_q : \mathbb{R} \to \mathbb{R}$, such that for all $x = (x_1, \ldots, x_n) \in [0,1]^n$,

$$f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \varphi_q \left( \sum_{p=1}^n \psi_{p,q}(x_p) \right).$$

Each inner function $\psi_{p,q}$ depends only on $x_p$, and for each $q$, the inner sum defines a univariate “ridge”, which is then transformed by the corresponding outer function $\varphi_q$. The minimality of $2n+1$ for the number of outer terms is a sharp result due to Arnold (Basina et al., 15 Nov 2024).
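
For concreteness, the double sum can be evaluated directly once inner and outer functions are supplied. The following minimal sketch (the callables passed as `psi` and `phi` are arbitrary placeholders, not the functions whose existence the theorem guarantees) evaluates the superposition at a point $x \in [0,1]^n$:

```python
import numpy as np

def ka_superposition(x, psi, phi):
    """Evaluate f(x) = sum_{q=0}^{2n} phi[q]( sum_{p=1}^{n} psi[p][q](x[p]) ).

    x   : array of shape (n,), a point in [0,1]^n
    psi : nested list, psi[p][q] is a univariate callable (inner function)
    phi : list of 2n+1 univariate callables (outer functions)
    """
    n = len(x)
    total = 0.0
    for q in range(2 * n + 1):
        ridge = sum(psi[p][q](x[p]) for p in range(n))  # inner ridge sum S_q(x)
        total += phi[q](ridge)                          # outer transformation
    return total

# Toy usage with arbitrary smooth placeholders (not a true K-A decomposition):
n = 2
psi = [[(lambda xp, p=p, q=q: np.sin((p + 1) * xp + q)) for q in range(2 * n + 1)]
       for p in range(n)]
phi = [(lambda s, q=q: np.tanh(s + 0.1 * q)) for q in range(2 * n + 1)]
print(ka_superposition(np.array([0.3, 0.7]), psi, phi))
```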

2. Constructive Proof and Basis Function Implementation

The classical proof constructs the representation via approximation of $f$ by step functions on a uniform grid in $[0,1]^n$. Each indicator function for a hypercube, $1_{[a_1, b_1] \times \cdots \times [a_n, b_n]}(x)$, can be written as a composition $\Phi\left(\sum_{p=1}^n H_{[a_p,b_p]}(x_p)\right)$, where each $H_{[a_p,b_p]}$ is a continuous hat function (first-order B-spline), and $\Phi$ is a univariate function mapping into $\{0,1\}$, mollified for continuity. Summing over all grid cells yields the ridge-sum form above (Basina et al., 15 Nov 2024).
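
The core trick—capturing a multivariate indicator with univariate pieces—can be illustrated numerically. In the sketch below (an informal illustration of the idea, not the mollified construction of the cited proof), each `hat` is a first-order B-spline that equals 1 on $[a_p, b_p]$, and the threshold `Phi` is close to 1 only when every coordinate contributes, so the composition approximates the hypercube indicator:

```python
import numpy as np

def hat(x, a, b, eps=0.05):
    """Piecewise-linear bump: 1 on [a, b], 0 outside [a - eps, b + eps]."""
    return np.clip(np.minimum((x - a + eps) / eps, (b + eps - x) / eps), 0.0, 1.0)

def Phi(s, n, eps=0.5):
    """Univariate threshold: ~1 only when s is close to n (all coordinates inside)."""
    return np.clip((s - (n - 1) - eps) / (1 - eps), 0.0, 1.0)

def indicator_approx(x, intervals):
    """Approximate 1_{[a_1,b_1] x ... x [a_n,b_n]}(x) via Phi(sum_p H_p(x_p))."""
    n = len(intervals)
    s = sum(hat(x[p], a, b) for p, (a, b) in enumerate(intervals))
    return Phi(s, n)

# A point inside the box [0.2,0.4] x [0.5,0.8] gives ~1, a point outside gives ~0.
box = [(0.2, 0.4), (0.5, 0.8)]
print(indicator_approx(np.array([0.3, 0.6]), box))  # ~1.0
print(indicator_approx(np.array([0.9, 0.6]), box))  # ~0.0
```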

Modern expositions refine this approach by using higher-order B-splines to parameterize both inner and outer functions. This confers explicit control over smoothness and error and ties the theorem to the theory of spline approximation (Basina et al., 15 Nov 2024, Liu et al., 29 Mar 2025).

3. Escape from the Curse of Dimensionality

A central implication of the Kolmogorov–Arnold theorem is its decoupling of approximation error from input dimension. If $f$ is sufficiently smooth—admitting $(k+1)$-times continuously differentiable inner and outer functions, each approximated by $k$-th order B-spline interpolants on a grid of mesh size $1/G$—the error satisfies

$$\|f - f_G\|_{C^m} \leq C \cdot G^{-(k+1-m)}$$

for $0 \leq m \leq k$, with $C$ independent of $n$. Doubling the 1D grid resolution $G$ reduces the error by a factor of $2^{k+1-m}$, in contrast to the exponentially many grid points that classical tensor-product approximation requires in $n$ dimensions. Thus, the K-A theorem provides true avoidance of the curse of dimensionality in this function representation context (Basina et al., 15 Nov 2024).
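
The underlying one-dimensional convergence rate is easy to verify empirically. The sketch below (a generic spline-interpolation experiment, not code from the cited work) interpolates a smooth univariate function with cubic splines ($k = 3$) on successively doubled grids; each doubling of $G$ should shrink the sup-norm error ($m = 0$) by roughly $2^{k+1} = 16$:

```python
import numpy as np
from scipy.interpolate import make_interp_spline

def sup_error(G, k=3):
    """Sup-norm error of a k-th order spline interpolant of sin(2*pi*x) on G+1 knots."""
    f = lambda x: np.sin(2 * np.pi * x)
    knots = np.linspace(0.0, 1.0, G + 1)
    spline = make_interp_spline(knots, f(knots), k=k)
    xs = np.linspace(0.0, 1.0, 10_000)
    return np.max(np.abs(f(xs) - spline(xs)))

for G in (8, 16, 32, 64):
    print(G, sup_error(G))   # each refinement reduces the error by roughly 16x
```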

4. Kolmogorov–Arnold Network (KAN) Architecture

The explicit constructive representation motivates the Kolmogorov–Arnold Network (KAN) architecture. Each univariate $\psi_{p,q}$ is realized as a neural subnetwork (e.g., a B-spline layer or univariate MLP), with $n$ such functions per ridge unit. For $q = 0, \ldots, 2n$, the sum $S_q(x) = \sum_p \psi_{p,q}(x_p)$ is computed, passed through a univariate network implementing $\varphi_q$, and the $2n+1$ outputs are summed as in the theorem (Basina et al., 15 Nov 2024, Liu et al., 29 Mar 2025).

Table: KAN Layer Structure

| Component | Mathematical Role | Realization |
| --- | --- | --- |
| Input layer | $x_1, \ldots, x_n$ | Inputs |
| Inner units | $\psi_{p,q}(x_p)$ for each $p, q$ | Univariate subnetworks |
| Ridge sums | $\sum_{p=1}^n \psi_{p,q}(x_p)$ | Ridge aggregation |
| Outer units | $\varphi_q(\cdot)$ | Univariate subnetworks |
| Output layer | $\sum_{q=0}^{2n} \varphi_q(\cdot)$ | Summation |

This architecture scales linearly in nn with respect to parameter count and empirically demonstrates dimension-independent error scaling for suitable spline order kk (Basina et al., 15 Nov 2024).
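
A minimal forward pass matching the table above can be written in a few lines of NumPy. The sketch below is a simplified illustration rather than a reference KAN implementation: each univariate function is a linear combination of fixed piecewise-linear (first-order B-spline) basis functions, so the trainable parameters are just the coefficient tensors `inner_coef` and `outer_coef`.

```python
import numpy as np

def pw_linear_basis(t, G):
    """Evaluate the G+1 piecewise-linear (hat) basis functions on [0, 1] at points t."""
    t = np.atleast_1d(t)
    centers = np.linspace(0.0, 1.0, G + 1)
    h = 1.0 / G
    return np.clip(1.0 - np.abs(t[:, None] - centers[None, :]) / h, 0.0, 1.0)  # (len(t), G+1)

class TinyKANBlock:
    """One K-A superposition block: f(x) = sum_q phi_q( sum_p psi_{p,q}(x_p) )."""
    def __init__(self, n, G=16, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        self.n, self.G = n, G
        # psi_{p,q} coefficients: (n, 2n+1, G+1); phi_q coefficients: (2n+1, G+1)
        self.inner_coef = 0.1 * rng.standard_normal((n, 2 * n + 1, G + 1))
        self.outer_coef = 0.1 * rng.standard_normal((2 * n + 1, G + 1))

    def __call__(self, x):
        """x: (batch, n) in [0,1]^n  ->  (batch,) outputs."""
        B = pw_linear_basis(x.reshape(-1), self.G).reshape(x.shape[0], self.n, self.G + 1)
        # Inner units: psi[b, p, q] = sum_j inner_coef[p, q, j] * B[b, p, j]
        psi = np.einsum('bpj,pqj->bpq', B, self.inner_coef)
        ridge = psi.sum(axis=1)                       # ridge sums S_q(x), shape (batch, 2n+1)
        # The outer basis lives on [0,1]; squash the ridge sums for illustration.
        ridge01 = 1.0 / (1.0 + np.exp(-ridge))
        Bout = pw_linear_basis(ridge01.reshape(-1), self.G)
        Bout = Bout.reshape(x.shape[0], 2 * self.n + 1, self.G + 1)
        phi = np.einsum('bqj,qj->bq', Bout, self.outer_coef)
        return phi.sum(axis=1)                        # final summation over the 2n+1 outputs

block = TinyKANBlock(n=3)
print(block(np.random.default_rng(1).random((5, 3))).shape)  # (5,)
```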

5. Extensions, Refinements, and Modern Realizations

Further refinements allow the inner and outer functions to be parameterized by various families, including sinusoidal bases (Gleyzer et al., 1 Aug 2025), orthogonal polynomials, or kernel methods (Liu et al., 29 Mar 2025). Variational approaches treat the number of basis functions as an adaptive latent variable optimized via variational inference, resulting in architectures such as the InfinityKAN (Alesiani et al., 3 Jul 2025). These adapt the basis complexity during training and enable practical universal function approximation without manual hyperparameter tuning.
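
As a small illustration of swapping the basis family, the sketch below parameterizes one univariate function as a truncated sine series; the coefficient vector plays the same role as the spline coefficients above (a generic Fourier-style parameterization, not the specific construction of the cited papers):

```python
import numpy as np

def sinusoidal_univariate(t, coef, omega0=np.pi):
    """Univariate function as a truncated sine series:
    g(t) = sum_k coef[k] * sin((k + 1) * omega0 * t), for t in [0, 1]."""
    k = np.arange(len(coef))
    return np.sin(np.outer(np.atleast_1d(t), (k + 1) * omega0)) @ coef

coef = np.array([1.0, -0.5, 0.25, 0.1])
print(sinusoidal_univariate(np.linspace(0, 1, 5), coef))
```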

A geometric extension develops symmetry-adapted KANs, enforcing invariance or equivariance under group actions such as $O(n)$ or $S_n$, critical for physically meaningful modeling in molecular dynamics or particle physics. These architectures operate on invariant features—such as inner products $\langle x_i, x_j \rangle$—in the univariate function blocks, preserving the group property by construction (Alesiani et al., 23 Feb 2025).
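
The invariance-by-construction idea can be sketched as a preprocessing step: instead of raw coordinates, the univariate blocks receive pairwise inner products, which are unchanged under any orthogonal transformation of the particle coordinates (an illustrative sketch under that assumption, not the cited architecture):

```python
import numpy as np

def invariant_features(X):
    """Map a point cloud X (m particles, d coordinates) to O(d)-invariant features:
    the upper-triangular entries of the Gram matrix <x_i, x_j>, including norms."""
    gram = X @ X.T                       # (m, m); invariant under X -> X @ R for orthogonal R
    iu = np.triu_indices(X.shape[0])
    return gram[iu]                      # flattened invariant feature vector

X = np.random.default_rng(0).standard_normal((4, 3))   # 4 particles in R^3
feats = invariant_features(X)
# Any downstream univariate-block network that consumes `feats` (rather than X itself)
# is O(3)-invariant by construction.
print(feats.shape)                                      # (10,)
```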

6. Theoretical Consequences and Comparison with Other Universal Approximation Results

Unlike the general universal approximation theorem for multilayer perceptrons, which asserts existence with no fixed network width, the K-A theorem gives an exact and finite decomposition for all continuous $f$. The number of terms scales linearly with input dimension, and the approximation error can be arbitrarily reduced by refining only 1D structures (splines, Fourier, or other bases). By contrast, shallow networks may require width growing exponentially with dimension to reach comparable accuracy.

For classes of smooth functions, error bounds for deep ReLU networks (using constructive K-A decompositions) exhibit polylogarithmic dependence on the dimension in the exponent, as opposed to the exponential scaling observed in non-structured network architectures (Montanelli et al., 2019). This establishes the K-A paradigm as a mathematically optimal blueprint for scalable, high-dimensional function fitting.

7. Broader Context and Impact

The Kolmogorov–Arnold theorem underpins a transformation in theoretical and applied function approximation. In machine learning, it justifies a class of architectures with provable dimension-robust performance. In approximation theory, it establishes a fundamental superposition principle for multivariate continuous functions. Modern neural architectures, including attention and kernel-based models, can be recast in the framework of linear combinations of kernelized univariate functions, unifying classical and deep learning approaches to high-dimensional approximation (Liu et al., 29 Mar 2025).

The explicit nature of the theorem and its network-inspired realizations continue to drive research into efficient, theoretically grounded learning architectures for domains requiring scalability, symmetry preservation, and interpretability in high-dimensional spaces (Basina et al., 15 Nov 2024, Alesiani et al., 23 Feb 2025, Alesiani et al., 3 Jul 2025).
