Convolutional Kolmogorov-Arnold Networks

Updated 9 November 2025
  • Convolutional KANs are neural architectures that integrate the Kolmogorov–Arnold theorem with learnable univariate functions, replacing fixed convolutional weights.
  • They enhance model expressivity and parameter efficiency by leveraging basis expansions like splines, polynomials, or wavelets in convolutional operations.
  • Variants such as bottleneck and MoE designs optimize trade-offs between accuracy, computational cost, and robustness across vision, time series, and financial applications.

Convolutional Kolmogorov-Arnold Networks (Convolutional KANs) are neural architectures that integrate the Kolmogorov–Arnold superposition principle with convolutional and edge-based learnable nonlinearities. Central to these models is the idea that instead of using fixed kernel weights in convolutional layers, each “weight” is replaced by a univariate function—typically a spline, polynomial, or wavelet expansion—whose parameters are learned. This mechanism yields architectures with enhanced expressivity and parameter efficiency, offering an alternative to standard convolutional neural networks (CNNs) and multi-layer perceptrons. The theoretical backbone is the classical Kolmogorov–Arnold representation theorem, which asserts that any multivariate continuous function can be written as finite sums of univariate compositions and additions, a property rendered constructive in the Convolutional KAN setting.

1. Kolmogorov–Arnold Representation and Network Foundations

The Kolmogorov–Arnold representation theorem guarantees, for any continuous function $f : [0,1]^n \to \mathbb{R}$, a decomposition
$$f(x_1,\dots,x_n) = \sum_{q=1}^{2n+1} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)$$
with univariate continuous functions $\phi_{q,p}$ and $\Phi_q$. Kolmogorov–Arnold Networks (KANs) realize this by parameterizing the scalar “weights” between nodes as learnable nonlinear functions, often via splines, polynomials, or other local basis expansions.

A KAN layer with input $x \in \mathbb{R}^{n_l}$ and output $x' \in \mathbb{R}^{n_{l+1}}$ applies
$$(x')_i = \sum_{j=1}^{n_l} \phi_{i,j}(x_j)$$
for scalar functions $\phi_{i,j}$, which replace the conventional scalar weights. The activation functions $\phi_{i,j}$ are commonly implemented as B-splines,
$$\phi_{i,j}(x) = \sum_k c_k B_k(x),$$
where $B_k$ are spline basis functions and $c_k$ are learned coefficients.
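The layer equations above translate directly into a small amount of code. The following is a minimal sketch, assuming a degree-1 (hat-function) B-spline basis on a fixed uniform grid for brevity; the class and parameter names (KANLayer, num_knots, x_min, x_max) are illustrative, not taken from the cited papers.

```python
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    """Each edge (j -> i) carries a learnable univariate function
    phi_{i,j}(x) = sum_k c_{i,j,k} B_k(x); the layer output is
    (x')_i = sum_j phi_{i,j}(x_j)."""

    def __init__(self, in_dim, out_dim, num_knots=8, x_min=-2.0, x_max=2.0):
        super().__init__()
        self.register_buffer("knots", torch.linspace(x_min, x_max, num_knots))
        # one spline coefficient per (output unit, input unit, basis function)
        self.coef = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, num_knots))

    def _basis(self, x):
        # degree-1 B-spline (hat) responses: (batch, in_dim) -> (batch, in_dim, num_knots)
        h = self.knots[1] - self.knots[0]
        dist = (x.unsqueeze(-1) - self.knots).abs() / h
        return torch.clamp(1.0 - dist, min=0.0)

    def forward(self, x):
        B = self._basis(x)                                # (batch, in, K)
        return torch.einsum("bjk,ojk->bo", B, self.coef)  # sum over inputs j and basis k
```

Higher-order B-splines, Chebyshev or Gram polynomials, and wavelets would drop in by swapping the `_basis` method.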

2. Convolutional KAN Layer Formulation

The convolutional extension replaces the scalar weights of standard CNN convolutions with learnable univariate functions (e.g., splines). For an input tensor $Y \in \mathbb{R}^{C \times H \times W}$ and a Conv-KAN kernel of spatial size $k \times k$, the output is
$$X_{p,i,j} = \sum_{d=1}^{C} \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} \phi_{p,d,a,b}\big( Y_{d,\,i+a,\,j+b} \big),$$
where each $\phi_{p,d,a,b} : \mathbb{R} \to \mathbb{R}$ is a trainable univariate function for each output channel, input channel, and spatial offset (Drokin, 1 Jul 2024).
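Because each $\phi_{p,d,a,b}$ is a fixed basis expansion with learned coefficients, the whole Conv-KAN layer can be computed by expanding the input into basis channels and running one ordinary convolution over them. The sketch below makes that concrete under the same hat-function assumption as the KANLayer example above; KANConv2d and its hyperparameter names are illustrative.

```python
import torch
import torch.nn as nn

class KANConv2d(nn.Module):
    """Conv-KAN sketch: every kernel weight becomes phi(x) = sum_k c_k B_k(x).
    Expanding the input into K basis channels reduces the layer to a standard
    Conv2d whose weights are exactly the coefficients c_{p,d,a,b,k}."""

    def __init__(self, in_ch, out_ch, kernel_size=3, num_basis=6,
                 x_min=-2.0, x_max=2.0, **conv_kwargs):
        super().__init__()
        self.register_buffer("centers", torch.linspace(x_min, x_max, num_basis))
        self.h = (x_max - x_min) / (num_basis - 1)
        self.coef_conv = nn.Conv2d(in_ch * num_basis, out_ch, kernel_size, **conv_kwargs)

    def forward(self, y):                                        # y: (N, C, H, W)
        # degree-1 B-spline (hat) responses per basis center
        dist = (y.unsqueeze(2) - self.centers.view(1, 1, -1, 1, 1)).abs() / self.h
        basis = torch.clamp(1.0 - dist, min=0.0).flatten(1, 2)   # (N, C*K, H, W)
        return self.coef_conv(basis)
```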

Variants adopt polynomial (Chebyshev, Gram, Legendre), spline, or wavelet bases, and the convolution can be parameter-efficiently realized via 1×1 “bottleneck” convolutions that reduce the number of functional mappings per output channel (Drokin, 1 Jul 2024).

3. Architectural Designs and Hybrids

Direct Conv-KAN (Spline or Polynomial Basis)

A direct Conv-KAN layer replaces all kernel weights with learnable univariate functions, potentially increasing parameter count and complexity. This design is typically employed in small-scale tasks (e.g. MNIST, Fashion-MNIST), as in (Bodner et al., 19 Jun 2024), where competitive accuracy is achieved (e.g., 98.9% on MNIST with ~95K parameters).

Bottleneck and MoE Variants

To improve efficiency, the layer is factorized into a 1×1 channel-reducing convolution (squeeze), a spatial Conv-KAN applied on the reduced channel dimension, and a final 1×1 expansion. This design is markedly more parameter- and FLOPs-efficient (Drokin, 1 Jul 2024). Mixture-of-Experts (MoE) routing and dense skip connections are also used in deep architectures to facilitate gradient flow (Drokin, 1 Jul 2024).
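A hedged sketch of the bottleneck factorization described above, reusing the KANConv2d sketch from Section 2; the reduction ratio and layer names are assumptions chosen for illustration.

```python
import torch.nn as nn

class BottleneckKANConv(nn.Module):
    """1x1 channel squeeze -> spatial Conv-KAN on the reduced width -> 1x1 expansion."""

    def __init__(self, in_ch, out_ch, mid_ch=None, kernel_size=3):
        super().__init__()
        mid_ch = mid_ch or max(in_ch // 4, 8)                 # illustrative 4x reduction
        self.squeeze = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        self.kan_conv = KANConv2d(mid_ch, mid_ch, kernel_size, padding=kernel_size // 2)
        self.expand = nn.Conv2d(mid_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.expand(self.kan_conv(self.squeeze(x)))
```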

Residual and Interactive Blocks

RKAN (Residual Kolmogorov-Arnold Network) introduces Conv-KAN blocks as residual branches in ResNet or DenseNet pipelines. The RKAN module incorporates Chebyshev polynomial expansions, normalization, linear bypasses, and batch normalization, consistently improving accuracy, especially on vision benchmarks such as Tiny ImageNet and CIFAR-100 (Yu et al., 7 Oct 2024).
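A hedged sketch of an RKAN-style residual branch consistent with the description above: inputs are squashed with tanh, expanded in a Chebyshev basis, mixed by a convolution holding the coefficients, batch-normalized, and added back to the trunk. This is a reconstruction under stated assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

def chebyshev_basis(x, degree):
    """T_0..T_degree evaluated at tanh(x), via T_{k+1} = 2x T_k - T_{k-1}."""
    x = torch.tanh(x)                                 # squash inputs into [-1, 1]
    basis = [torch.ones_like(x), x]
    for _ in range(2, degree + 1):
        basis.append(2 * x * basis[-1] - basis[-2])
    return torch.stack(basis[:degree + 1], dim=2)     # (N, C, degree+1, H, W)

class RKANBranch(nn.Module):
    def __init__(self, channels, kernel_size=3, degree=3):
        super().__init__()
        self.degree = degree
        self.conv = nn.Conv2d(channels * (degree + 1), channels,
                              kernel_size, padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        basis = chebyshev_basis(x, self.degree).flatten(1, 2)
        return x + self.bn(self.conv(basis))          # residual (linear bypass) add
```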

KANICE (Ferdaus et al., 22 Oct 2024) inserts Interactive Convolutional Blocks (parallel 3×3 and 5×5 convolutions combined through a GELU-gated elementwise product), followed by KANLinear global superposition layers, significantly improving accuracy across the MNIST, EMNIST, and SVHN benchmarks.
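A minimal sketch of the interactive block as it is summarized above (parallel 3×3 and 5×5 convolutions fused by a GELU-gated product); channel counts, padding, and which branch is gated are assumptions.

```python
import torch.nn as nn

class InteractiveConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)
        self.act = nn.GELU()

    def forward(self, x):
        # elementwise product of a GELU-activated 3x3 branch with a 5x5 branch
        return self.act(self.conv3(x)) * self.conv5(x)
```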

4. Training Methodologies and Regularization

Loss and Optimization

Standard loss functions (cross-entropy for classification, MSE for regression/option pricing) are adopted. Parameterizations are optimized with Adam or SGD with momentum, using learning rates in the $10^{-3}$ to $10^{-5}$ range (Li et al., 2 Dec 2024, Drokin, 1 Jul 2024, Ferdaus et al., 22 Oct 2024).
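An illustrative training loop consistent with the setup quoted above; `model` can be any Conv-KAN classifier (for instance, a stack built from the KANConv2d sketch in Section 2) and `loader` yields (images, labels) batches. All names here are assumptions, not code from the cited papers.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3):
    """Cross-entropy training with Adam, matching the learning-rate range above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```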

Regularization Strategies

  • L1/L2 weight decay on spline coefficients and functional parameters (Cang et al., 11 Nov 2024, Drokin, 1 Jul 2024).
  • Spline smoothness regularization: penalizes large second derivatives of spline segments to curb overfitting to noise (Cang et al., 11 Nov 2024); a minimal sketch follows this list.
  • Segment deactivation: randomly linearizes spline segments during training (analogous to Dropout), reducing complexity and variance (Cang et al., 11 Nov 2024).
  • Dropout and noise injection tailored for polynomial-parameterized KAN layers, either before the Conv-KAN layer as a whole, pre-basis computation, or per polynomial degree (Drokin, 1 Jul 2024).
  • Early stopping on validation splits for data with severe nonstationarity, e.g., in financial time series (Li et al., 2 Dec 2024).
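As a concrete illustration of the smoothness term in the list above, the following minimal sketch penalizes squared second differences of the spline coefficients, a standard discrete proxy for the second derivative. It assumes coefficients with the basis index last, as in the KANLayer sketch of Section 1, and is not the implementation of (Cang et al., 11 Nov 2024).

```python
import torch

def spline_smoothness_penalty(coef, weight=1e-4):
    """Squared second difference along the basis dimension of `coef`
    (shape (..., K)), added to the task loss as a regularizer."""
    second_diff = coef[..., 2:] - 2 * coef[..., 1:-1] + coef[..., :-2]
    return weight * second_diff.pow(2).mean()

# usage with the earlier KANLayer sketch:
# loss = criterion(model(x), y) + spline_smoothness_penalty(layer.coef)
```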

Initialization and Stability

Spline control points are initialized to approximate linear or SiLU activations; knot grids are typically spaced uniformly over the expected activation range (Bodner et al., 19 Jun 2024). Batch normalization or dynamic grid extension is applied when inputs fall outside the support of the splines.
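With the hat-function basis of the earlier KANLayer sketch, initializing to SiLU is particularly simple: each coefficient is SiLU evaluated at its knot, so every edge function starts as a piecewise-linear approximation of SiLU. A hedged sketch:

```python
import torch
import torch.nn.functional as F

def init_coef_to_silu(layer):
    """Set the (out, in, K) coefficient tensor so that every edge function
    interpolates SiLU at the knot locations."""
    with torch.no_grad():
        layer.coef.copy_(F.silu(layer.knots).expand_as(layer.coef))
    return layer
```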

5. Empirical Performance Across Domains

Computer Vision

Conv-KANs have demonstrated competitive accuracy with up to 40–50% fewer parameters on MNIST and Fashion-MNIST (Bodner et al., 19 Jun 2024, Ferdaus et al., 22 Oct 2024). On Tiny ImageNet, the RKAN–ResNet-50 model achieves 64.40% top-1 accuracy (vs. 62.48% for the baseline), with moderate increases in GFLOPs and parameter count (Yu et al., 7 Oct 2024). Bottleneck Conv-KANs with Gram bases achieve 74.59% top-1 on ImageNet-1k, outperforming ResNet-34 (Drokin, 1 Jul 2024).

On more complex data (CIFAR-10, CIFAR-100, segmentation tasks), CKANs require deeper or wider networks to match standard CNNs, with correspondingly higher computational overhead (e.g., KConvKAN-8 needs 40.7M parameters to reach 78.8% accuracy on CIFAR-10 (Azam et al., 13 Jun 2024)). KANICE improves MNIST accuracy to 99.35% (outperforming CNN/ICB hybrids), and the KANICE-mini variant delivers comparable accuracy (99.13%) with less than 10% of the full parameter count (Ferdaus et al., 22 Oct 2024).

Time Series and Option Pricing

Conv-KANs achieve the lowest MSE, RMSE, and MAE for option pricing on the CSI 300 Index options dataset, outperforming LSTM, Conv-LSTM, dense KAN, and even the parametric Black–Scholes–Merton (B-S-M) formula (e.g., MSE = 0.00790 vs. 0.01552 for B-S-M) (Li et al., 2 Dec 2024). The architecture generalizes robustly to near-maturity options not seen during training while retaining a parameter advantage through kernel-level univariate mappings.

Semantic Segmentation

In U-Net and U²-Net configurations (with Conv-KAN replacements), KAGNet with bottlenecked Conv-KANs attains state-of-the-art performance on segmentation datasets (e.g., BUSI IoU/F1=63.45/77.64) (Drokin, 1 Jul 2024).

6. Expressivity, Efficiency, and Theoretical Properties

Conv-KANs introduce higher-order expressivity at each convolutional kernel position via spline or polynomial mappings, in principle allowing a single 3×3 degree-3 KAN filter to capture richer nonlinear functions than a stack of linear+ReLU layers (Yu et al., 7 Oct 2024). Bottleneck architectures and basis selection (particularly Gram polynomials) optimize the trade-off between accuracy and parameter count (Drokin, 1 Jul 2024). The edge-wise decomposition follows the Kolmogorov–Arnold superposition hierarchy, preserving universal approximation.

Parameter-efficient fine-tuning is enabled by selectively updating a subset of polynomial coefficients, yielding high accuracy with as little as 10% of parameters requiring adaptation (Drokin, 1 Jul 2024).
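One simple way to realize this kind of selective adaptation is to freeze everything except the KAN coefficient tensors; the name filter below is an illustrative assumption tied to the earlier sketches, not the scheme of (Drokin, 1 Jul 2024).

```python
def freeze_all_but_kan_coefficients(model, trainable_substring="coef"):
    """Mark only parameters whose name contains `trainable_substring` as trainable
    and return them, e.g. for passing to an optimizer."""
    for name, p in model.named_parameters():
        p.requires_grad = trainable_substring in name
    return [p for p in model.parameters() if p.requires_grad]
```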

However, increased memory footprint and computational cost are documented, especially as spline/functional kernel counts grow with channel and kernel dimensions. Naive CKANs are less efficient on high-complexity/natural image tasks without regularization or bottlenecking (Cang et al., 11 Nov 2024, Azam et al., 13 Jun 2024).

7. Limitations and Prospective Directions

Practical challenges include increased training time (~2–6× slower per epoch), dynamic grid management for splines, and tuning of basis type, spline order, and regularization weight. In image domains, Conv-KANs without regularization may overfit noise; smoothness constraints and segment deactivation are particularly effective for generalization (Cang et al., 11 Nov 2024).

Current research explores reducing the memory and compute overhead of functional kernels, alternative basis parameterizations, and scaling Conv-KAN designs to larger vision and sequence benchmarks.

In summary, Convolutional Kolmogorov–Arnold Networks establish a framework in which convolutional and fully connected layers are generalized to operate via edge-wise or kernel-wise learnable univariate functions, enabling high-order nonlinear approximation, parameter savings, and robustness to noise when regularized appropriately. Their realization in computer vision, financial modeling, and biomedical imaging illustrates both their practical value and the computational hurdles that remain in large-scale deployment.
