
Kolmogorov–Arnold Theorem

Updated 7 December 2025
  • Kolmogorov–Arnold Theorem is a foundational result in multivariate function approximation, showing that any continuous function can be represented as a finite sum of univariate functions.
  • The theorem underpins neural network architectures like the Kolmogorov–Arnold Network (KAN), which efficiently mitigate the curse of dimensionality.
  • Modern refinements using higher-order B-splines and adaptive basis functions offer scalable error bounds and improved smoothness control in high-dimensional models.

The Kolmogorov–Arnold Theorem is a foundational result in the theory of multivariate function approximation, asserting that any continuous function of several variables can be represented as a finite superposition of univariate continuous functions. This superposition principle has profound consequences for both pure mathematics and applied fields such as neural network theory, function fitting, and high-dimensional modeling.

1. Mathematical Formulation and Statement

Let $f \in C([0,1]^n)$ be a continuous function on the $n$-dimensional unit cube. The Kolmogorov–Arnold representation theorem states that there exist continuous univariate “inner” functions $\psi_{p,q} : [0,1] \to \mathbb{R}$ for $p = 1, \dots, n$, $q = 0, \dots, 2n$, and continuous “outer” functions $\varphi_q : \mathbb{R} \to \mathbb{R}$, such that for all $x = (x_1, \ldots, x_n) \in [0,1]^n$,

$$f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \varphi_q \left( \sum_{p=1}^n \psi_{p,q}(x_p) \right).$$

Each inner function $\psi_{p,q}$ depends only on $x_p$, and for each $q$, the inner sum defines a univariate “ridge”, which is then transformed by the corresponding outer function $\varphi_q$. The minimality of $2n+1$ for the number of outer terms is a sharp result due to Arnold (Basina et al., 15 Nov 2024).
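
For concreteness, the double sum can be evaluated directly once inner and outer functions are supplied. The following minimal sketch (the callables passed as `psi` and `phi` are arbitrary placeholders, not the functions whose existence the theorem guarantees) evaluates the superposition at a point $x \in [0,1]^n$:

```python
import numpy as np

def ka_superposition(x, psi, phi):
    """Evaluate f(x) = sum_{q=0}^{2n} phi[q]( sum_{p=1}^{n} psi[p][q](x[p]) ).

    x   : array of shape (n,), a point in [0,1]^n
    psi : nested list, psi[p][q] is a univariate callable (inner function)
    phi : list of 2n+1 univariate callables (outer functions)
    """
    n = len(x)
    total = 0.0
    for q in range(2 * n + 1):
        ridge = sum(psi[p][q](x[p]) for p in range(n))  # inner ridge sum S_q(x)
        total += phi[q](ridge)                          # outer transformation
    return total

# Toy usage with arbitrary smooth placeholders (not a true K-A decomposition):
n = 2
psi = [[(lambda xp, p=p, q=q: np.sin((p + 1) * xp + q)) for q in range(2 * n + 1)]
       for p in range(n)]
phi = [(lambda s, q=q: np.tanh(s + 0.1 * q)) for q in range(2 * n + 1)]
print(ka_superposition(np.array([0.3, 0.7]), psi, phi))
```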

2. Constructive Proof and Basis Function Implementation

The classical proof constructs the representation via approximation of $f$ by step functions on a uniform grid in $[0,1]^n$. Each indicator function for a hypercube, $1_{[a_1, b_1] \times \cdots \times [a_n, b_n]}(x)$, can be written as a composition $\Phi\left(\sum_{p=1}^n H_{[a_p,b_p]}(x_p)\right)$, where each $H_{[a_p,b_p]}$ is a continuous hat function (first-order B-spline), and $\Phi$ is a univariate function mapping into $\{0,1\}$, mollified for continuity. Summing over all grid cells yields the ridge-sum form above (Basina et al., 15 Nov 2024).
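
The core trick—capturing a multivariate indicator with univariate pieces—can be illustrated numerically. In the sketch below (an informal illustration of the idea, not the mollified construction of the cited proof), each `hat` is a first-order B-spline that equals 1 on $[a_p, b_p]$, and the threshold `Phi` is close to 1 only when every coordinate contributes, so the composition approximates the hypercube indicator:

```python
import numpy as np

def hat(x, a, b, eps=0.05):
    """Piecewise-linear bump: 1 on [a, b], 0 outside [a - eps, b + eps]."""
    return np.clip(np.minimum((x - a + eps) / eps, (b + eps - x) / eps), 0.0, 1.0)

def Phi(s, n, eps=0.5):
    """Univariate threshold: ~1 only when s is close to n (all coordinates inside)."""
    return np.clip((s - (n - 1) - eps) / (1 - eps), 0.0, 1.0)

def indicator_approx(x, intervals):
    """Approximate 1_{[a_1,b_1] x ... x [a_n,b_n]}(x) via Phi(sum_p H_p(x_p))."""
    n = len(intervals)
    s = sum(hat(x[p], a, b) for p, (a, b) in enumerate(intervals))
    return Phi(s, n)

# A point inside the box [0.2,0.4] x [0.5,0.8] gives ~1, a point outside gives ~0.
box = [(0.2, 0.4), (0.5, 0.8)]
print(indicator_approx(np.array([0.3, 0.6]), box))  # ~1.0
print(indicator_approx(np.array([0.9, 0.6]), box))  # ~0.0
```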

Modern expositions refine this approach by using higher-order B-splines to parameterize both inner and outer functions. This confers explicit control over smoothness and error and ties the theorem to the theory of spline approximation (Basina et al., 15 Nov 2024, Liu et al., 29 Mar 2025).

3. Escape from the Curse of Dimensionality

A central implication of the Kolmogorov–Arnold theorem is its decoupling of approximation error from input dimension. If $f$ is sufficiently smooth—admitting $(k+1)$-times continuously differentiable inner and outer functions, each approximated by $k$-th order B-spline interpolants on a grid of mesh size $1/G$—the error satisfies

$$\|f - f_G\|_{C^m} \leq C \cdot G^{-(k+1-m)}$$

for $0 \leq m \leq k$, with $C$ independent of $n$. Doubling the 1D grid resolution $G$ reduces the error by a factor of $2^{k+1-m}$, in contrast to the exponentially many grid points that classical tensor-product approximation requires in $n$ dimensions. Thus, the K-A theorem provides true avoidance of the curse of dimensionality in this function representation context (Basina et al., 15 Nov 2024).
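
The underlying one-dimensional convergence rate is easy to verify empirically. The sketch below (a generic spline-interpolation experiment, not code from the cited work) interpolates a smooth univariate function with cubic splines ($k = 3$) on successively doubled grids; each doubling of $G$ should shrink the sup-norm error ($m = 0$) by roughly $2^{k+1} = 16$:

```python
import numpy as np
from scipy.interpolate import make_interp_spline

def sup_error(G, k=3):
    """Sup-norm error of a k-th order spline interpolant of sin(2*pi*x) on G+1 knots."""
    f = lambda x: np.sin(2 * np.pi * x)
    knots = np.linspace(0.0, 1.0, G + 1)
    spline = make_interp_spline(knots, f(knots), k=k)
    xs = np.linspace(0.0, 1.0, 10_000)
    return np.max(np.abs(f(xs) - spline(xs)))

for G in (8, 16, 32, 64):
    print(G, sup_error(G))   # each refinement reduces the error by roughly 16x
```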

4. Kolmogorov–Arnold Network (KAN) Architecture

The explicit constructive representation motivates the Kolmogorov–Arnold Network (KAN) architecture. Each univariate $\psi_{p,q}$ is realized as a neural subnetwork (e.g., a B-spline layer or univariate MLP), with $n$ such functions per ridge unit. For $q = 0, \ldots, 2n$, the sum $S_q(x) = \sum_p \psi_{p,q}(x_p)$ is computed, passed through a univariate network implementing $\varphi_q$, and the $2n+1$ outputs are summed as in the theorem (Basina et al., 15 Nov 2024, Liu et al., 29 Mar 2025).

Table: KAN Layer Structure

| Component | Mathematical Role | Realization |
| --- | --- | --- |
| Input layer | $x_1, \ldots, x_n$ | Inputs |
| Inner units | $\psi_{p,q}(x_p)$ for each $p, q$ | Univariate subnetworks |
| Ridge sums | $\sum_{p=1}^n \psi_{p,q}(x_p)$ | Ridge aggregation |
| Outer units | $\varphi_q(\cdot)$ | Univariate subnetworks |
| Output layer | $\sum_{q=0}^{2n} \varphi_q(\cdot)$ | Summation |

This architecture scales linearly in nn with respect to parameter count and empirically demonstrates dimension-independent error scaling for suitable spline order kk (Basina et al., 15 Nov 2024).
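
A minimal forward pass matching the table above can be written in a few lines of NumPy. The sketch below is a simplified illustration rather than a reference KAN implementation: each univariate function is a linear combination of fixed piecewise-linear (first-order B-spline) basis functions, so the trainable parameters are just the coefficient tensors `inner_coef` and `outer_coef`.

```python
import numpy as np

def pw_linear_basis(t, G):
    """Evaluate the G+1 piecewise-linear (hat) basis functions on [0, 1] at points t."""
    t = np.atleast_1d(t)
    centers = np.linspace(0.0, 1.0, G + 1)
    h = 1.0 / G
    return np.clip(1.0 - np.abs(t[:, None] - centers[None, :]) / h, 0.0, 1.0)  # (len(t), G+1)

class TinyKANBlock:
    """One K-A superposition block: f(x) = sum_q phi_q( sum_p psi_{p,q}(x_p) )."""
    def __init__(self, n, G=16, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        self.n, self.G = n, G
        # psi_{p,q} coefficients: (n, 2n+1, G+1); phi_q coefficients: (2n+1, G+1)
        self.inner_coef = 0.1 * rng.standard_normal((n, 2 * n + 1, G + 1))
        self.outer_coef = 0.1 * rng.standard_normal((2 * n + 1, G + 1))

    def __call__(self, x):
        """x: (batch, n) in [0,1]^n  ->  (batch,) outputs."""
        B = pw_linear_basis(x.reshape(-1), self.G).reshape(x.shape[0], self.n, self.G + 1)
        # Inner units: psi[b, p, q] = sum_j inner_coef[p, q, j] * B[b, p, j]
        psi = np.einsum('bpj,pqj->bpq', B, self.inner_coef)
        ridge = psi.sum(axis=1)                       # ridge sums S_q(x), shape (batch, 2n+1)
        # The outer basis lives on [0,1]; squash the ridge sums for illustration.
        ridge01 = 1.0 / (1.0 + np.exp(-ridge))
        Bout = pw_linear_basis(ridge01.reshape(-1), self.G)
        Bout = Bout.reshape(x.shape[0], 2 * self.n + 1, self.G + 1)
        phi = np.einsum('bqj,qj->bq', Bout, self.outer_coef)
        return phi.sum(axis=1)                        # final summation over the 2n+1 outputs

block = TinyKANBlock(n=3)
print(block(np.random.default_rng(1).random((5, 3))).shape)  # (5,)
```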

5. Extensions, Refinements, and Modern Realizations

Further refinements allow the inner and outer functions to be parameterized by various families, including sinusoidal bases (Gleyzer et al., 1 Aug 2025), orthogonal polynomials, or kernel methods (Liu et al., 29 Mar 2025). Variational approaches treat the number of basis functions as an adaptive latent variable optimized via variational inference, resulting in architectures such as the InfinityKAN (Alesiani et al., 3 Jul 2025). These adapt the basis complexity during training and enable practical universal function approximation without manual hyperparameter tuning.
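
As a small illustration of swapping the basis family, the sketch below parameterizes one univariate function as a truncated sine series; the coefficient vector plays the same role as the spline coefficients above (a generic Fourier-style parameterization, not the specific construction of the cited papers):

```python
import numpy as np

def sinusoidal_univariate(t, coef, omega0=np.pi):
    """Univariate function as a truncated sine series:
    g(t) = sum_k coef[k] * sin((k + 1) * omega0 * t), for t in [0, 1]."""
    k = np.arange(len(coef))
    return np.sin(np.outer(np.atleast_1d(t), (k + 1) * omega0)) @ coef

coef = np.array([1.0, -0.5, 0.25, 0.1])
print(sinusoidal_univariate(np.linspace(0, 1, 5), coef))
```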

A geometric extension develops symmetry-adapted KANs, enforcing invariance or equivariance under group actions such as $O(n)$ or $S_n$, critical for physically meaningful modeling in molecular dynamics or particle physics. These architectures operate on invariant features—such as inner products $\langle x_i, x_j \rangle$—in the univariate function blocks, preserving the group property by construction (Alesiani et al., 23 Feb 2025).
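
The invariance-by-construction idea can be sketched as a preprocessing step: instead of raw coordinates, the univariate blocks receive pairwise inner products, which are unchanged under any orthogonal transformation of the particle coordinates (an illustrative sketch under that assumption, not the cited architecture):

```python
import numpy as np

def invariant_features(X):
    """Map a point cloud X (m particles, d coordinates) to O(d)-invariant features:
    the upper-triangular entries of the Gram matrix <x_i, x_j>, including norms."""
    gram = X @ X.T                       # (m, m); invariant under X -> X @ R for orthogonal R
    iu = np.triu_indices(X.shape[0])
    return gram[iu]                      # flattened invariant feature vector

X = np.random.default_rng(0).standard_normal((4, 3))   # 4 particles in R^3
feats = invariant_features(X)
# Any downstream univariate-block network that consumes `feats` (rather than X itself)
# is O(3)-invariant by construction.
print(feats.shape)                                      # (10,)
```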

6. Theoretical Consequences and Comparison with Other Universal Approximation Results

Unlike the general universal approximation theorem for multilayer perceptrons, which asserts existence with no fixed network width, the K-A theorem gives an exact and finite decomposition for all continuous $f$. The number of terms scales linearly with input dimension, and the approximation error can be arbitrarily reduced by refining only 1D structures (splines, Fourier, or other bases). By contrast, shallow networks may require width growing exponentially with dimension to reach comparable accuracy.

For classes of smooth functions, error bounds for deep ReLU networks (using constructive K-A decompositions) exhibit polylogarithmic dependence on the dimension in the exponent, as opposed to the exponential scaling observed in non-structured network architectures (Montanelli et al., 2019). This establishes the K-A paradigm as a mathematically optimal blueprint for scalable, high-dimensional function fitting.

7. Broader Context and Impact

The Kolmogorov–Arnold theorem underpins a transformation in theoretical and applied function approximation. In machine learning, it justifies a class of architectures with provable dimension-robust performance. In approximation theory, it establishes a fundamental superposition principle for multivariate continuous functions. Modern neural architectures, including attention and kernel-based models, can be recast in the framework of linear combinations of kernelized univariate functions, unifying classical and deep learning approaches to high-dimensional approximation (Liu et al., 29 Mar 2025).

The explicit nature of the theorem and its network-inspired realizations continue to drive research into efficient, theoretically grounded learning architectures for domains requiring scalability, symmetry preservation, and interpretability in high-dimensional spaces (Basina et al., 15 Nov 2024, Alesiani et al., 23 Feb 2025, Alesiani et al., 3 Jul 2025).
