
Kolmogorov–Arnold Attention (KAA)

Updated 6 January 2026
  • Kolmogorov–Arnold Attention is a neural mechanism that uses function compositions based on the Kolmogorov–Arnold theorem to generalize conventional dot-product attention.
  • It employs nonparametric kernels and basis expansions such as Fourier, spline, and Chebyshev to enable universal approximation and improved performance across vision, graph, and physics-informed models.
  • It achieves parameter efficiency through low-rank factorization while boosting expressivity, leading to measurable advances in accuracy and resource savings in various architectures.

Kolmogorov–Arnold Attention (KAA) refers to neural attention mechanisms constructed upon the Kolmogorov–Arnold superposition theorem, generalizing conventional dot-product attention and incorporating function-composition models such as Kolmogorov–Arnold Networks (KANs). KAA encodes both the similarity computation (scoring) and aggregation (weighted combination) in attention layers as learnable or kernel-based compositions of univariate functions, thereby potentially achieving universal approximation properties in the mapping from input sequences to output tokens or node features. Recent literature demonstrates KAA’s applicability in Vision Transformers, graph neural networks, scientific modeling, and efficient feature fitting architectures, leading to measurable advances in expressivity, parameter efficiency, and task performance (Liu et al., 29 Mar 2025, Fang et al., 23 Jan 2025, Marouani et al., 24 Dec 2025, Maity et al., 13 Mar 2025, Thrainer et al., 16 Jul 2025, Zhang et al., 13 May 2025).

1. Theoretical Foundations and Mathematical Formulation

Kolmogorov–Arnold Attention is grounded in the Kolmogorov–Arnold representation theorem, which states that for any continuous multivariate function $f:\mathbb{R}^D \rightarrow \mathbb{R}$, there exist univariate outer functions $\phi_h$ and inner functions $\psi_{h,d}$ such that

$$f(x) = \sum_{h=1}^{H} \phi_h \left( \sum_{d=1}^{D} \psi_{h,d}(x_d) \right),$$

with $H = 2D+1$ for compact domains (Liu et al., 29 Mar 2025, Maity et al., 13 Mar 2025). KAA interprets self-attention blocks, originally cast as dot-product similarities plus softmax aggregation, as instances of such representations, where both steps are realized via kernel expansions. In practice, the inner function (scoring or similarity) and the outer function (aggregation) can be implemented as linear combinations of kernel functions, B-splines, Fourier bases, wavelets, or Chebyshev polynomials (Liu et al., 29 Mar 2025, Maity et al., 13 Mar 2025, Zhang et al., 13 May 2025). This framework allows for arbitrary nonlinearity and high expressive capacity within attention.
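The superposition structure itself is straightforward to write down. Below is a minimal NumPy sketch (illustrative only, not an implementation from the cited papers) in which every univariate function $\phi_h$ and $\psi_{h,d}$ is parameterized as a truncated Fourier series; the parameter layout is an assumption made for this example.

```python
import numpy as np

def fourier_univariate(x, a, b, freqs):
    """Univariate learnable function as a truncated Fourier series:
    g(x) = sum_k a_k * sin(f_k * x) + b_k * cos(f_k * x)."""
    kx = x[:, None] * freqs[None, :]          # (N, K)
    return np.sin(kx) @ a + np.cos(kx) @ b    # (N,)

def ka_superposition(x, inner, outer):
    """f(x) = sum_h phi_h( sum_d psi_{h,d}(x_d) ) for x of shape (N, D).
    inner[h][d] and outer[h] each hold (a, b, freqs) for one univariate function."""
    n = x.shape[0]
    out = np.zeros(n)
    for h, phi in enumerate(outer):
        s = np.zeros(n)
        for d, psi in enumerate(inner[h]):
            s += fourier_univariate(x[:, d], *psi)  # inner sum over dimensions
        out += fourier_univariate(s, *phi)          # outer function per term h
    return out
```

With $D$ input dimensions the theorem calls for $H = 2D+1$ outer terms; in learned variants $H$ and the basis size are simply hyperparameters.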

Standard self-attention becomes a special case under this formulation by selecting the linear kernel $k(x,y) = x \cdot y$, along with inner-product scoring and linear aggregation. KAA extends this to nonparametric or nonlinear kernels, and may use low-rank factorization to improve computational efficiency (Liu et al., 29 Mar 2025, Maity et al., 13 Mar 2025).
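The reduction to standard attention can be checked directly. The sketch below (NumPy, illustrative) scores query/key pairs with a pluggable kernel; substituting the linear kernel reproduces scaled dot-product attention exactly.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kernel_attention(Q, K, V, kernel):
    """Attention where the pairwise score is an arbitrary kernel k(q_i, k_j)."""
    d = Q.shape[-1]
    scores = np.array([[kernel(q, k) for k in K] for q in Q])  # (S_q, S_k)
    return softmax(scores / np.sqrt(d)) @ V

linear_kernel = lambda x, y: x @ y  # k(x, y) = x . y recovers standard attention
```

Swapping `linear_kernel` for a nonlinear kernel (Gaussian, spline-based, etc.) leaves the rest of the computation unchanged, which is the sense in which KAA generalizes the standard mechanism.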

2. Architectures and Instantiations

KAA mechanisms have been instantiated in multiple neural architectures:

  • Vision Transformers (ViTs): The KArAt module replaces softmax normalization in each attention head with a learnable Kolmogorov–Arnold operator, often parameterized via Fourier, spline, or wavelet bases. The full N×N activation map is approximated by a low-rank factorization for scalability, with empirical results showing improvement on small ViT models (Maity et al., 13 Mar 2025).
  • Graph Neural Networks (GNNs): KAA replaces or augments linear/MLP-based scoring functions with KANs, leading to higher expressive power for neighbor ranking and node aggregation (Fang et al., 23 Jan 2025, Marouani et al., 24 Dec 2025). Message-passing and scoring steps both use KANs operating over spline or polynomial bases.
  • Structural Segmentation Networks: FORTRESS uses KAA blocks to merge spatial, channel, and KAN-enhanced attention streams, leveraging depthwise separable convolutions and tiny KANs at decoder levels, achieving substantial parameter and FLOP reductions (Thrainer et al., 16 Jul 2025).
  • Physics-Informed Networks: AC-PKAN interleaves Chebyshev polynomial-based KANs with feature-wise internal attention gates and external Residual Gradient Attention to address rank collapse and gradient balancing in PINNs (Zhang et al., 13 May 2025).

Component Overview:

| Domain | KAA Placement | Basis Functions |
|---|---|---|
| ViTs (KArAt) | Aggregation operator | Fourier/spline/etc. |
| GNNs (KAMP-Attn/KAA) | Score mapping | Spline/Chebyshev |
| Segmentation (FORTRESS) | Feature fusion | Spline/cubic B-spline |
| PINNs (AC-PKAN) | Internal/external attention | Chebyshev/wavelet |

3. Unified Kernel-Based Feature Fitting Framework

The kernel-superposition view is formalized as follows. For batched sequence inputs $X \in \mathbb{R}^{B \times S \times D}$, any mapping $f = \Phi \circ \Psi$ from inputs to outputs can be written as nested kernel expansions:

\begin{align*}
\Psi(X)_{b,s,h} &= \sum_{r,d_1,d_2} K_{\text{inner}}(X, \text{Ref}_{\text{inner}})_{b,s,r,d_1,d_2}\, W^{\text{inner}}_{h,r,d_1,d_2}, \\
\Phi(\Psi)_{b,s,e} &= \sum_{r',h_1,h_2} K_{\text{outer}}(\Psi, \text{Ref}_{\text{outer}})_{b,s,r',h_1,h_2}\, W^{\text{outer}}_{e,r',h_1,h_2},
\end{align*}

where $K(\cdot, \cdot)$ denotes a kernel function and $W$ are learnable mixing weights (Liu et al., 29 Mar 2025). By selecting appropriate kernels, KAA recovers various attention models:

  • Linear kernel yields standard MHSA;
  • Gaussian kernel realizes fully nonparametric attention;
  • Low-rank kernels provide parameter-efficient variants (Pseudo-MHSA) (Liu et al., 29 Mar 2025).
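As one concrete instance, a Gaussian-kernel variant can be sketched as follows (illustrative NumPy; the `gamma` bandwidth and plain row normalization are assumptions for this example, not the exact Gaussian-MHSA of Liu et al.):

```python
import numpy as np

def gaussian_attention(Q, K, V, gamma=1.0):
    """Nonparametric attention: pairwise weights from an RBF kernel
    k(q, k) = exp(-gamma * ||q - k||^2), row-normalized, then applied to V."""
    d2 = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(-1)  # (S_q, S_k) squared distances
    W = np.exp(-gamma * d2)
    return (W / W.sum(axis=-1, keepdims=True)) @ V
```

Note that the similarity here has no learnable projection at all, which is why such variants can operate at extreme parameter budgets; with a sharp kernel and `Q == K`, each query attends almost entirely to itself.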

4. Parameter Efficiency and Low-Rank Attention Variants

KAA supports reduced-parameter self-attention via low-rank approximations. In Pseudo-MHSA, the attention weight matrix $W^{\text{attn}}$ is factorized as $U A U^\top$, with $U$ orthonormal and $A$ small. The computation per head thus involves projecting inputs through $U$, applying $A$, and aggregating outputs through further projections (Liu et al., 29 Mar 2025, Maity et al., 13 Mar 2025).
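A minimal sketch of the factorization (sizes hypothetical): applying $W = U A U^\top$ without ever materializing the full $D \times D$ matrix, and comparing parameter counts.

```python
import numpy as np

D, r = 256, 16                                  # hypothetical feature dim and rank
rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.normal(size=(D, r)))    # (D, r), orthonormal columns
A = rng.normal(size=(r, r))                     # small learnable core

def lowrank_apply(X):
    """Compute X @ (U A U^T) without forming the full D x D weight matrix."""
    return X @ U @ A @ U.T                      # three skinny matmuls

full_params = D * D                             # dense attention weight
lowrank_params = D * r + r * r                  # U plus A
```

At these sizes the factorized form uses roughly 4.3K parameters against 65.5K for the dense weight, and the three skinny matrix products cost $O(NDr)$ rather than $O(ND^2)$.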

Benchmarks indicate Pseudo-MHSA retains nearly full performance compared to standard MHSA on CIFAR-10 (accuracy 0.8144 vs. 0.8162), but with approximately half the parameters. Gaussian-MHSA leverages a Gaussian kernel for nonparametric similarity calculation, operating at extreme parameter savings (0.256M params; 76.6% CIFAR-10 accuracy) (Liu et al., 29 Mar 2025).

In transformer regimes (KArAt), low-rank factorization is crucial: empirical SVD reveals that trained attention matrices are highly low-rank ($\approx$ 8–16 significant singular values), so restricting learned compositions to this subspace suffices; full $N \times N$ learnable operators are intractable due to memory constraints (Maity et al., 13 Mar 2025).
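The low-rank observation can be probed with a simple spectral-energy measure; the sketch below is illustrative, and the 99% energy threshold is an assumption, not a figure from the cited work.

```python
import numpy as np

def effective_rank(M, energy=0.99):
    """Smallest k such that the top-k singular values carry `energy`
    of the total squared spectral mass of M."""
    s = np.linalg.svd(M, compute_uv=False)
    c = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(c, energy) + 1)
```

Applied to trained attention maps, a measure of this kind is how a finding like "$\approx$ 8–16 significant singular values" would be quantified, motivating the choice of factorization rank.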

5. Expressivity Analysis and Maximum Ranking Distance

In GNNs, the ability of an attention-based scorer to rank neighbors arbitrarily is measured using Maximum Ranking Distance (MRD). Theoretically, linear scoring functions and shallow MLPs are provably limited in their capacity to approximate arbitrary rankings. In contrast, a single-layer KAN scorer (using zero-order B-splines) achieves MRD $\leq \delta$ for every $\delta > 0$, i.e., essentially unlimited expressivity under practical parameter constraints:

$$\text{MRD(KAA)} \leq \text{MRD(MLP)} \leq \text{MRD(Linear)}$$

Even with moderate parameterization, KAA thus matches or exceeds the expressivity attainable by much deeper/wider MLP architectures (Fang et al., 23 Jan 2025). Empirically, KAA-augmented GNNs achieve state-of-the-art gains in node classification and graph-level tasks (up to +20pp F1 on PPI), and consistent improvements for transformer-style architectures at node- and graph-level (+2–3pp accuracy) (Fang et al., 23 Jan 2025).
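The role of zero-order B-splines can be illustrated with a toy piecewise-constant scorer that realizes a ranking no linear score $w \cdot x$ on a scalar feature can produce; the knot grid and cell values below are assumptions chosen for the example.

```python
import numpy as np

def zero_order_spline(x, knots, values):
    """Zero-order B-spline scorer: piecewise constant, returning
    values[i] for x in the cell [knots[i], knots[i+1])."""
    idx = np.clip(np.searchsorted(knots, x, side="right") - 1,
                  0, len(values) - 1)
    return values[idx]

# Three scalar neighbor features; assign an arbitrary score per grid cell.
feats = np.array([0.1, 0.5, 0.9])
knots = np.array([0.0, 1/3, 2/3, 1.0])
scores = zero_order_spline(feats, knots, np.array([1.0, 2.0, 0.0]))
```

The induced ranking (neighbor 1 first, then 0, then 2) is non-monotone in the feature value; a linear scorer on a scalar input can only produce monotone rankings, which is the intuition behind the MRD separation.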

6. Empirical Performance and Implementation

KAA-augmented models have demonstrated tangible task-level advantages:

  • Vision: Small ViTs with KArAt yield +5–6% accuracy improvements; larger models show diminished returns or instability due to spiky loss landscapes (Maity et al., 13 Mar 2025).
  • Graph modeling: FlowKANet uses KAMP-Attn to deliver 5× parameter reductions and 6% R² improvements in flow delay prediction (Marouani et al., 24 Dec 2025).
  • Structural segmentation: FORTRESS achieves 91% parameter and FLOP reductions (31M→2.9M), 3× inference speedup, and SOTA segmentation (F1=0.771, mIoU=0.677) by fusing KAA with multi-scale attention and adaptive TiKAN integration (Thrainer et al., 16 Jul 2025).
  • Physics-Informed Learning: AC-PKAN integrates internal KAA and external Residual Gradient Attention (RGA), outperforming PINNs and other KAN variants on zero-data PDE benchmarks, achieving the lowest relative MAE (e.g., 0.0011 on 1D Wave) (Zhang et al., 13 May 2025).

Implementation details frequently feature spline or Fourier basis parameterization (GPU-friendly for ViT); low-rank modules; lightweight scoring/aggregation; and judicious activation placement. Pseudocode and architectural blueprints are available in open-source repositories (Thrainer et al., 16 Jul 2025, Marouani et al., 24 Dec 2025).

7. Limitations, Distinctions, and Future Directions

Not all models labeled as “Kolmogorov–Arnold” integrate KAA at the attention level. For example, TKAT introduces TKAN cells in place of LSTMs but retains standard attention; it adds no new query/key/value formulas or score-gate variants (Genet et al., 2024). The expressivity gains and parameter reductions of KAA hinge on efficient function-composition implementations and low-rank factorization; overhead and memory usage remain practical concerns for very large transformers (Maity et al., 13 Mar 2025). B-spline bases are also less GPU-friendly than Fourier or wavelet bases in large-scale deployment.

Benchmarks confirm significant representational and efficiency gains in domains including vision, graph modeling, structural segmentation, and PDE solution. Ongoing research addresses scalability, multimodal fusion, detection/segmentation extension, and memory-efficient learnability for high-dimensional inputs.

Kolmogorov–Arnold Attention generalizes self-attention mechanisms by allowing flexible, kernel-based function compositions in scoring and aggregation. This enables universal approximation, parameter-efficient architectures, and improved empirical performance in several machine learning domains, provided computational practicality and proper architectural design (Liu et al., 29 Mar 2025, Fang et al., 23 Jan 2025, Marouani et al., 24 Dec 2025, Maity et al., 13 Mar 2025, Thrainer et al., 16 Jul 2025, Zhang et al., 13 May 2025).
