- The paper demonstrates that KA geometry, marked by zero rows in the Jacobian, naturally emerges in shallow MLPs when trained on nonlinear functions.
- It employs statistical measures such as participation ratios, random rotation ratios, and column divergences to rigorously quantify the KA structure.
- Results show that KA geometry emerges sharply within a "Goldilocks" regime of function complexity and training, suggesting practical interventions for deep learning architectures.
Emergence of Kolmogorov-Arnold Geometry in Shallow MLPs
Introduction and Motivation
The paper investigates the spontaneous emergence of Kolmogorov-Arnold (KA) geometry in shallow multilayer perceptrons (MLPs), specifically single hidden layer networks trained via conventional optimization. The Kolmogorov-Arnold representation theorem guarantees that any continuous multivariate function can be represented by a single hidden layer network with sufficiently many neurons, but the required inner functions are highly non-smooth and their explicit construction is nontrivial. This work departs from engineered KA architectures (e.g., KANs) and instead empirically studies whether KA-like geometric structures arise naturally during training of standard MLPs.
The focus is on the local geometry of the first-layer map, characterized by the Jacobian J(x) as the input x varies. The central question is whether the distinctive KA geometry—marked by a majority of locally inactive coordinates and strong minor concentration in the Jacobian—emerges through gradient-based optimization, and under what conditions.
Theoretical Framework: KA Geometry
The KA theorem provides a universal representation for continuous $f: I^n \to \mathbb{R}$ as
$$ f(x_1, \ldots, x_n) = \sum_{j=1}^{m} g_j\!\left( \sum_{i=1}^{n} \phi_{ij}(x_i) \right) $$
with $m \ge 2n + 1$. The inner function $\Phi: I^n \to \mathbb{R}^m$ is constructed so that, at every input, the Jacobian $J(x)$ has at least $m - n$ zero rows, i.e., most hidden coordinates are locally constant. This property is essential for the iterative construction of the outer functions $g_j$ and underpins the convergence of the KA scheme.
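As a structural illustration only, the sketch below evaluates a function already expressed in KA form, with caller-supplied inner functions phi[i][j] and outer functions g[j]; it says nothing about how the theorem's highly non-smooth inner functions are actually constructed, and all names are illustrative.

```python
# Illustrative sketch (not the paper's construction): evaluate a function given in
# Kolmogorov-Arnold form f(x) = sum_j g_j( sum_i phi_ij(x_i) ) from caller-supplied
# inner functions phi[i][j] and outer functions g[j].
from typing import Callable, Sequence

def ka_eval(x: Sequence[float],
            phi: Sequence[Sequence[Callable[[float], float]]],  # phi[i][j]: inner functions
            g: Sequence[Callable[[float], float]]) -> float:
    n, m = len(phi), len(g)
    total = 0.0
    for j in range(m):
        inner = sum(phi[i][j](x[i]) for i in range(n))  # inner sum over inputs
        total += g[j](inner)                            # outer function for branch j
    return total
```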
In conventional MLPs, the first-layer map is an affine transformation followed by a nonlinearity (GeLU in this study), $x \mapsto \sigma(Ax + b)$. The Jacobian is given by
$$ J_{ji}(x) = \sigma'_j \, A_{ji}, \qquad \sigma'_j := \sigma'\!\big((Ax + b)_j\big), $$
where $\sigma'$ is the derivative of the activation and $A$, $b$ are the first-layer weights and biases. Zero rows in $J$ can arise either from vanishing $\sigma'_j$ or from sparsity in the weight matrix $A$.
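A minimal sketch of this quantity, assuming a single GeLU layer $h(x) = \mathrm{GeLU}(Ax + b)$ and an illustrative relative cutoff for "near zero" (the paper's data-driven threshold is not reproduced here):

```python
# Minimal sketch: Jacobian of one GeLU layer at a single input, and a count of
# (near-)zero rows, i.e. locally inactive hidden coordinates. The 1e-3 relative
# cutoff is an illustrative choice, not the paper's data-driven threshold.
import torch

n, m = 3, 16                      # example input dimension and hidden width
A = torch.randn(m, n)             # first-layer weights
b = torch.randn(m)                # first-layer biases

def first_layer(x: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.gelu(A @ x + b)

x = torch.rand(n)
J = torch.autograd.functional.jacobian(first_layer, x)          # shape (m, n)
row_norms = J.norm(dim=1)
zero_rows = (row_norms < 1e-3 * row_norms.max()).sum().item()   # near-zero rows
print(f"{zero_rows}/{m} rows of J(x) are near zero")
```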
Empirical Methodology
The study uses one-hidden-layer MLPs with $n = 3$ inputs and $m \in \{4, 8, 16, 32\}$ hidden neurons, trained on three function types: linear (easy), xor (nonlinear, "Goldilocks" regime), and random (unlearnable). Models are initialized identically across function types and trained with Adam and MSE loss. The critical batch size is used to avoid confounding effects from batch size variation.
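A minimal training sketch under these assumptions; the target function, learning rate, batch size, and step count below are placeholders rather than the paper's values (in particular, the continuous xor-like target is a hypothetical stand-in for the paper's xor task):

```python
# Sketch of the setup as described (one hidden layer, GeLU, Adam, MSE).
# Hyperparameters and the target definition are placeholders, not the paper's values.
import torch
import torch.nn as nn

n, m = 3, 16
model = nn.Sequential(nn.Linear(n, m), nn.GELU(), nn.Linear(m, 1))

def target_xor(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical 3-input xor-like target (sign product); a stand-in for the
    # paper's "xor" task, whose exact definition is not reproduced here.
    return torch.prod(torch.sign(x - 0.5), dim=-1, keepdim=True)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for step in range(5000):
    x = torch.rand(256, n)                      # batch size is a placeholder
    loss = loss_fn(model(x), target_xor(x))
    opt.zero_grad()
    loss.backward()
    opt.step()
```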
KA geometry is quantified via statistical analysis of the Jacobian and its exterior powers (minor matrices), focusing on the following four statistics (see the sketch after this list):
- Zero Rows: Fraction of rows in the k-th minor matrix below a data-driven threshold.
- Participation Ratio: $L_1/L_2$ norm ratio of the minor matrices, indicating concentration.
- Random Rotation Ratio: Ratio of maximal minor to its value under random orthogonal rotations, probing alignment.
- Column Divergence: KL divergence between trained and initial minor matrix columns, measuring distributional shift and alignment.
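The sketch below gives one plausible implementation of these four statistics for a single Jacobian; the thresholds, normalizations, number of random rotations, and function names are assumptions rather than the paper's definitions.

```python
# One plausible implementation of the four minor statistics for an m x n Jacobian J.
# Thresholds, normalizations, and the number of rotations are illustrative guesses.
import itertools
import numpy as np

def minor_matrix(J: np.ndarray, k: int) -> np.ndarray:
    """All k x k minors of J, arranged as (row subsets) x (column subsets)."""
    m, n = J.shape
    rows = list(itertools.combinations(range(m), k))
    cols = list(itertools.combinations(range(n), k))
    M = np.empty((len(rows), len(cols)))
    for a, r in enumerate(rows):
        for c_idx, c in enumerate(cols):
            M[a, c_idx] = np.linalg.det(J[np.ix_(r, c)])
    return M

def zero_row_fraction(M: np.ndarray, tol: float = 1e-3) -> float:
    """Fraction of rows whose largest |minor| falls below a (relative) threshold."""
    return float(np.mean(np.abs(M).max(axis=1) < tol * np.abs(M).max()))

def participation_ratio(M: np.ndarray) -> float:
    """L1/L2 norm ratio: near 1 when a few minors dominate, near sqrt(N) when flat."""
    v = np.abs(M).ravel()
    return float(v.sum() / np.linalg.norm(v))

def random_rotation_ratio(J: np.ndarray, k: int, trials: int = 32) -> float:
    """Max |minor| of J relative to its average under random input rotations J -> JQ."""
    baseline = np.abs(minor_matrix(J, k)).max()
    rotated = []
    for _ in range(trials):
        Q, _ = np.linalg.qr(np.random.randn(J.shape[1], J.shape[1]))
        rotated.append(np.abs(minor_matrix(J @ Q, k)).max())
    return float(baseline / np.mean(rotated))

def column_divergence(M_trained: np.ndarray, M_init: np.ndarray, eps: float = 1e-12) -> float:
    """Mean KL divergence between per-column |minor| distributions, trained vs. initial."""
    P = np.abs(M_trained) + eps
    Q = np.abs(M_init) + eps
    P /= P.sum(axis=0, keepdims=True)
    Q /= Q.sum(axis=0, keepdims=True)
    return float(np.mean(np.sum(P * np.log(P / Q), axis=0)))
```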
Results: Spontaneous KA Geometry
Zero Rows and Minor Concentration
Training on the xor function induces a statistically significant fraction of zero rows in the Jacobian and its higher minors, far exceeding the false-positive rate set by initialization. This is not observed for linear or random functions. The effect intensifies with increasing hidden dimension m and minor size k.
Figure 1: Participation ratios for size-k minors across hidden dimensions, showing pronounced concentration for xor models compared to linear and random baselines.
Participation ratios decrease markedly for xor models in higher minors, indicating that a few large minors dominate while most are near zero—consistent with KA geometry. Random rotation ratios exceed unity for xor models, confirming that large minors arise from structured alignment rather than chance. Column divergences also increase with k and m for xor, reflecting emergent alignment in the trained inner map.
Dynamics and Interpolation
KA geometry emerges sharply as training progresses and correlates with model performance ($R^2$). Interpolation experiments with the λ-xor family reveal a "Goldilocks regime" ($0.7 \lesssim \lambda \lesssim 1.2$) where KA metrics peak, coinciding with optimal learnability. For functions outside this regime (too simple or too complex), KA geometry does not develop.
Batch Size and Latent KA Geometry
Batch size modulates the emergence of KA geometry. At small batch sizes, consistently zero rows (dead neurons) appear, but these are not example-dependent and do not reflect true KA patterning. At full batch, even random functions can induce KA-like geometry in the inner map, but the linear outer map lacks capacity to exploit it. Bootstrapping with a second hidden layer enables memorization, confirming that KA geometry can arise latently even when not directly utilized.
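One way to operationalize the distinction between dead neurons and example-dependent zero rows, assuming a stack of per-example first-layer Jacobians and an illustrative threshold:

```python
# Sketch: separate "dead" hidden units (Jacobian row near zero for every input) from
# example-dependent zero rows (zero on some inputs only), the KA-style signature.
import torch

def row_activity(jacobians: torch.Tensor, tol: float = 1e-3):
    """jacobians: (batch, m, n) stack of first-layer Jacobians over a batch of inputs."""
    row_norms = jacobians.norm(dim=2)                    # (batch, m)
    near_zero = row_norms < tol * row_norms.max()
    dead = near_zero.all(dim=0)                          # zero for every example
    example_dependent = near_zero.any(dim=0) & ~dead     # zero on some, but not all, examples
    return dead, example_dependent
```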
Implications and Future Directions
The findings demonstrate that KA geometry—characterized by local inactivity and minor concentration in the Jacobian—can emerge spontaneously in shallow MLPs trained on sufficiently complex functions. This suggests that gradient-based optimization can discover highly expressive, KA-like representations without explicit architectural engineering.
Practically, the results motivate interventions to accelerate learning by promoting KA geometry, e.g., via oscillatory activation functions or targeted regularization. However, computational cost remains a concern for large models, where exhaustive analysis of Jacobians and minors is infeasible. Mapping the "KA phase diagram"—identifying when and where KA geometry emerges in deep architectures—will be essential for scalable application.
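To make the scaling concern concrete: the k-th minor matrix of an m × n Jacobian has C(m, k) · C(n, k) entries, each a k × k determinant, so exhaustive enumeration grows combinatorially. The model sizes below are arbitrary illustrations, not configurations from the paper.

```python
# Back-of-envelope count of k x k minors for a few (arbitrary) Jacobian shapes.
from math import comb

for m, n, k in [(32, 3, 2), (1024, 512, 8), (4096, 4096, 16)]:
    print(f"m={m:5d} n={n:5d} k={k:3d} -> {comb(m, k) * comb(n, k):.3e} minors")
```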
Theoretically, the work bridges classical approximation theory and modern deep learning, highlighting the relevance of fine-scale geometric patterning in neural representations. The observed coupling between large-scale learning and fine-scale KA patterning suggests new avenues for understanding abstraction and generalization in neural networks.
Conclusion
This study provides compelling evidence that KA geometry, as predicted by the Kolmogorov-Arnold theorem, can arise organically in shallow MLPs trained on nonlinear functions. The emergence of zero rows and minor concentration in the Jacobian is tightly linked to function complexity and model capacity, and is absent for trivial or unlearnable tasks. These results open the door to principled interventions for accelerating learning and abstraction in neural networks, and call for further exploration of KA geometry in large-scale and deep architectures.