Kolmogorov-Arnold Network (KAN)
- KAN is a neural network architecture that employs learnable univariate functions instead of scalar weights, leveraging the Kolmogorov–Arnold theorem for function representation.
- KAN achieves high expressive power with orders-of-magnitude fewer parameters compared to conventional MLPs, using B-spline basis functions for efficient nonlinearity.
- KAN enables improved interpretability and hardware acceleration through direct extraction of learned function transformations and optimized LUT-based computation.
Kolmogorov–Arnold Network (KAN) is a neural network architectural paradigm that implements the Kolmogorov–Arnold superposition theorem in a deep, trainable framework. By replacing fixed scalar weights and activations with learnable univariate functions (most commonly parametrized by B-splines), KANs provide a parameter-efficient, highly expressive, and interpretable alternative to traditional multilayer perceptrons (MLPs). This structure achieves comparable or superior expressive power with orders-of-magnitude fewer parameters, while allowing direct extraction and analysis of the functional transformations learned on each edge. The design and hardware implementation of KANs present novel algorithmic and system-level challenges, especially in the context of large-scale deployment and efficient analog or in-memory acceleration.
1. Kolmogorov–Arnold Theoretical Foundations and Mathematical Formulation
KANs are directly inspired by the Kolmogorov–Arnold representation theorem, which states that any continuous multivariate function $f:[0,1]^n \to \mathbb{R}$ can be written as a finite superposition of continuous univariate functions and addition: $f(x_1,\dots,x_n) = \sum_{q=0}^{2n} \Phi_q\!\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right)$, where both the inner functions $\phi_{q,p}$ and the outer functions $\Phi_q$ are continuous. This effectively reduces the problem of high-dimensional function learning to learning one-dimensional functions.
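As a simple worked illustration (not drawn from the cited sources), the bivariate product admits such a superposition: $xy = \tfrac{1}{4}(x+y)^2 - \tfrac{1}{4}(x-y)^2$, i.e., two outer functions $\Phi_1(u) = u^2/4$ and $\Phi_2(u) = -u^2/4$ applied to the univariate sums $x + y$ and $x - y$.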
KAN generalizes this construction into a layered neural network where each edge carries a learnable univariate function rather than a scalar weight. The canonical KAN edge function is defined as $\phi(x) = w_b\, b(x) + w_s \sum_i c_i\, B_i(x)$, where $b(x)$ is a standard activation function (e.g., ReLU), the $B_i$ are B-spline basis functions of order $k$ over a knot grid of size $G$, and the $c_i$ are trainable coefficients (often quantized for hardware deployment) (Huang et al., 7 Sep 2025).
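To make this concrete, the following is a minimal NumPy/SciPy sketch of a KAN layer built from the edge formula above; the class name, initialization scheme, and grid bounds are illustrative assumptions, not the reference implementation.

```python
# Minimal sketch (assumptions, not the authors' code) of a KAN layer: every
# input->output edge carries its own univariate function, parameterized by
# cubic B-spline coefficients plus a weighted ReLU residual term.
import numpy as np
from scipy.interpolate import BSpline

class KANLayerSketch:
    def __init__(self, n_in, n_out, grid_size=8, degree=3, x_min=-1.0, x_max=1.0):
        self.n_in, self.n_out, self.degree = n_in, n_out, degree
        # Clamped uniform knot vector; G intervals -> G + degree basis functions.
        inner = np.linspace(x_min, x_max, grid_size + 1)
        self.knots = np.concatenate([[x_min] * degree, inner, [x_max] * degree])
        self.n_basis = grid_size + degree
        rng = np.random.default_rng(0)
        # One spline-coefficient vector per (input, output) edge, plus scalar residual weights.
        self.coef = rng.normal(0, 0.1, size=(n_in, n_out, self.n_basis))
        self.w_b = rng.normal(0, 0.1, size=(n_in, n_out))

    def forward(self, x):
        """x: (batch, n_in) -> (batch, n_out); each output node sums its incoming edges."""
        out = np.zeros((x.shape[0], self.n_out))
        for i in range(self.n_in):
            for j in range(self.n_out):
                spline = BSpline(self.knots, self.coef[i, j], self.degree)
                edge = self.w_b[i, j] * np.maximum(x[:, i], 0.0) + spline(x[:, i])
                out[:, j] += edge
        return out

layer = KANLayerSketch(n_in=4, n_out=3)
print(layer.forward(np.random.default_rng(1).uniform(-1, 1, (5, 4))).shape)  # (5, 3)
```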
This architecture admits substantial parameter reduction: KANs with B-splines of modest order and grid size achieve expressiveness equivalent to MLPs with far fewer parameters per layer (Noorizadegan et al., 28 Oct 2025). Theoretical and empirical analysis demonstrates that KANs enjoy faster scaling laws and approximation rates, on the order of $O(G^{-(k+1)})$ for degree-$k$ splines on a grid of size $G$, than classical MLPs, which are limited by their fixed pointwise nonlinearities (Liu et al., 30 Apr 2024, Noorizadegan et al., 28 Oct 2025).
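As a quick sanity check of this approximation behavior (an illustrative experiment, not one reported in the cited papers), one can interpolate a smooth 1D function with cubic splines on grids of increasing size and observe the error shrinking roughly like $G^{-(k+1)}$:

```python
# Hypothetical numerical check: cubic-spline (k = 3) interpolation error on a
# grid of G intervals should shrink roughly like G^(-4), i.e., G^-(k+1).
import numpy as np
from scipy.interpolate import make_interp_spline

def max_error(G, k=3):
    """Max interpolation error of f(x) = sin(2*pi*x) on [0, 1] with G intervals."""
    f = lambda x: np.sin(2 * np.pi * x)
    knots = np.linspace(0.0, 1.0, G + 1)
    spline = make_interp_spline(knots, f(knots), k=k)
    xs = np.linspace(0.0, 1.0, 10_000)
    return np.max(np.abs(spline(xs) - f(xs)))

for G in (8, 16, 32, 64):
    print(f"G={G:3d}  max error = {max_error(G):.2e}")
# Doubling G should reduce the error by roughly 2^4 = 16x for cubic splines.
```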
2. KAN Architecture, Basis Function Parameterization, and Variants
In KANs, each univariate function is typically expanded in a basis, and the choice of basis critically influences smoothness, locality, regularization, and computational cost (Noorizadegan et al., 28 Oct 2025). Common basis choices include cubic B-splines (for compact support and locality), Chebyshev or Jacobi polynomials (for spectral properties), ReLU polynomials, Gaussian RBFs, Fourier series, and bandlimited Sinc expansions. Advanced variants combine multiple basis types, allow grid/knot adaptation, or enable rational/fractional-order warping (e.g., rKAN, fKAN).
The most widely adopted form is the cubic B-spline KAN with a modest grid size, which serves as a robust baseline. For discontinuous or highly oscillatory functions, Sinc or rational Jacobi basis functions are preferred (Noorizadegan et al., 28 Oct 2025).
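The small SciPy/NumPy sketch below, with an arbitrarily chosen target function, contrasts two of the basis families mentioned above by least-squares fitting a single edge function; it is illustrative only and not tied to any particular KAN variant.

```python
# Illustrative comparison of two edge-function bases: a cubic B-spline basis
# vs. a Chebyshev polynomial basis of the same dimension, both fit by least
# squares to an arbitrary smooth target function on [-1, 1].
import numpy as np
from scipy.interpolate import BSpline
from numpy.polynomial import chebyshev

target = lambda x: np.tanh(5 * x) + 0.1 * np.sin(10 * x)
xs = np.linspace(-1.0, 1.0, 2000)

# (a) Cubic B-spline basis on a uniform grid of G = 12 intervals (G + k functions).
k, G = 3, 12
knots = np.concatenate([[-1.0] * k, np.linspace(-1.0, 1.0, G + 1), [1.0] * k])
n_basis = G + k
design = np.column_stack(
    [BSpline(knots, np.eye(n_basis)[i], k)(xs) for i in range(n_basis)]
)
coef_bs, *_ = np.linalg.lstsq(design, target(xs), rcond=None)
err_bs = np.max(np.abs(design @ coef_bs - target(xs)))

# (b) Chebyshev basis of the same dimension (degree G + k - 1).
cheb_fit = chebyshev.Chebyshev.fit(xs, target(xs), deg=G + k - 1)
err_ch = np.max(np.abs(cheb_fit(xs) - target(xs)))

print(f"B-spline basis  max error: {err_bs:.2e}")
print(f"Chebyshev basis max error: {err_ch:.2e}")
```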
KANs can be extended with residual connections, gating mechanisms, hybrid MLP-KAN stacks, or domain-decomposition strategies (such as FBKAN and SPIKAN). The architectural choices, including basis selection and adaptation, network depth, and parameter sharing, are application- and domain-specific.
3. Algorithm-Hardware Co-Design and Hardware Acceleration
The central hardware challenge in KAN deployment arises from the complexity of B-spline evaluation, which involves recursive Cox–de Boor computation with many divisions and multiplications. Standard practice for hardware mapping is LUT-based evaluation, where values of each edge function $\phi(x)$ are precomputed and accessed via address decoding and multiplexing, a strategy that, while fast, incurs large area and energy overheads (Huang et al., 7 Sep 2025).
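A behavioral sketch of this LUT strategy is shown below; the table size, value range, and nearest-entry addressing are assumptions for illustration, not the circuit described in the paper.

```python
# Conceptual sketch of LUT-based edge evaluation: the learned univariate
# function phi is sampled offline on a uniform grid, and inference replaces the
# recursive Cox-de Boor spline evaluation with address computation + table read.
import numpy as np

def build_lut(phi, x_min=-1.0, x_max=1.0, n_entries=256):
    """Precompute phi at n_entries points; this is what would sit in on-chip memory."""
    grid = np.linspace(x_min, x_max, n_entries)
    return phi(grid), x_min, x_max, n_entries

def lut_eval(lut, x):
    """Quantize x to an address, then read the stored value (nearest-entry lookup)."""
    table, x_min, x_max, n = lut
    addr = np.clip(np.round((x - x_min) / (x_max - x_min) * (n - 1)), 0, n - 1).astype(int)
    return table[addr]

phi = lambda x: np.maximum(x, 0) + 0.3 * np.sin(3 * x)   # stand-in learned edge function
lut = build_lut(phi)
x = np.linspace(-1, 1, 5)
print("exact :", np.round(phi(x), 4))
print("lookup:", np.round(lut_eval(lut, x), 4))
```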
To address this, algorithm–hardware co-design strategies have emerged:
- ASP-KAN-HAQ (Alignment-Symmetry & PowerGap KAN Hardware-Aware Quantization): Enforces knot-quantization alignment and leverages LUT symmetry, allowing storage sharing and significant reductions in decoder and multiplexer circuitry; this yields substantial area and energy reductions over naive post-training quantization for grid sizes up to 64.
- KAN-SAM (KAN Sparsity-Aware Mapping): Maps basis coefficients onto array rows in descending order of activation criticality (a function of activation statistics, such as mean and variance, and coefficient importance), optimizing placement for robustness to IR-drop in RRAM-ACIM arrays. This strategy substantially reduces accuracy loss compared to uniform mapping; a simplified sketch of this ordering appears after this list.
- N:1 Time-Modulation Dynamic-Voltage Input Generator (TM-DV-IG): A hybrid digital-analog input encoding scheme that combines voltage- and pulse-width-based quantization to balance speed, resolution, noise margin, and circuit area, achieving a significant joint improvement in area, power, and latency compared to pure voltage- or PWM-based schemes.
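The sketch below illustrates the sparsity-aware ordering idea behind KAN-SAM in simplified form; the criticality score used here is an assumed stand-in combining activation statistics and coefficient magnitude, not the paper's exact metric.

```python
# Simplified illustration of criticality-ordered row placement: rank rows of
# edge coefficients by an assumed criticality score and place the most critical
# rows closest to the array drivers, where IR-drop is smallest.
import numpy as np

def criticality_scores(coefs, act_mean, act_var):
    """One score per row: larger coefficients on busier inputs matter more."""
    return np.abs(coefs).sum(axis=1) * (np.abs(act_mean) + np.sqrt(act_var))

def sparsity_aware_row_order(coefs, act_mean, act_var):
    """Return row indices sorted from most to least critical (row 0 = nearest driver)."""
    return np.argsort(-criticality_scores(coefs, act_mean, act_var))

rng = np.random.default_rng(0)
coefs = rng.normal(0, 1, size=(8, 16))        # 8 rows of spline coefficients
act_mean = rng.uniform(0, 1, 8)               # per-input activation statistics
act_var = rng.uniform(0, 0.5, 8)
print("row placement order:", sparsity_aware_row_order(coefs, act_mean, act_var))
```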
At the circuit level, analog compute-in-memory multiply-accumulate (ACIM MAC) with RRAM cells encodes bit-sliced coefficients, and incorporates mitigation techniques for partial sum deviations due to process variation and IR-drop (Huang et al., 7 Sep 2025).
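The following purely digital simulation sketches the bit-slicing arithmetic behind such an ACIM MAC; no analog non-idealities are modeled, and the 8-bit two's-complement slicing with shift-add recombination is a standard assumption rather than the specific circuit of the paper.

```python
# Behavioral sketch of a bit-sliced MAC: each signed 8-bit coefficient is split
# into 1-bit slices stored in separate columns; every column computes a dot
# product with the input vector, and the periphery recombines partial sums with
# the appropriate powers of two (MSB weighted negatively for two's complement).
import numpy as np

def bit_slice(coefs_int8, n_bits=8):
    """Split signed int8 coefficients into n_bits binary slices (two's complement)."""
    u = coefs_int8.astype(np.int16) & 0xFF
    return np.stack([(u >> b) & 1 for b in range(n_bits)], axis=0)   # (n_bits, n)

def bit_sliced_mac(x, coefs_int8, n_bits=8):
    slices = bit_slice(coefs_int8, n_bits)
    partial = slices @ x                       # one column-wise MAC per bit slice
    weights = np.array([2**b for b in range(n_bits)], dtype=np.int64)
    weights[-1] = -weights[-1]                 # MSB carries negative weight
    return int(weights @ partial)

rng = np.random.default_rng(0)
coefs = rng.integers(-128, 128, size=32, dtype=np.int8)
x = rng.integers(0, 4, size=32)                # small unsigned inputs for the example
print(bit_sliced_mac(x, coefs), "vs reference", int(coefs.astype(np.int64) @ x))
```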
4. Empirical Evaluation, Scaling, and Model Efficiency
Large-scale validation of these algorithm–hardware co-design strategies is presented on commercial 22nm RRAM-ACIM technology (Huang et al., 7 Sep 2025). Key benchmarks include collaborative filtering (CF-KAN) recommendation models with parameter budgets of 39 MB and 63 MB, a substantial increase over previously studied tiny models.
Scaling metrics include:
- Area Overhead: Grows only modestly despite the large increase in parameter count.
- Power Consumption: Rises moderately with model size.
- Accuracy Degradation: Minimal, far less than typical scaling penalties in classical DNN frameworks.
- End-to-end Latency: Reported for complete inference on the target RRAM-ACIM platform.
These results establish the feasibility of scaling KANs to large inference workloads with disciplined co-design, while maintaining the parameter efficiency and interpretability intrinsic to the architecture (Huang et al., 7 Sep 2025).
5. Comparison with Deep Neural Network Architectures and Parameter Efficiency
In contrast to conventional DNNs, where each layer involves a large weight matrix followed by fixed nonlinearities, KAN replaces linear weights with a sparse, learnable ensemble of univariate nonlinear transformations. This results in:
- Dramatic Parameter Reduction: Each edge is described by a small set of trainable spline coefficients (on the order of $G + k$), and the narrow, shallow networks that suffice to match MLP accuracy yield typical reductions of three to five orders of magnitude in overall weight count (a rough parameter-count sketch follows this list).
- Interpretability: Each basis function is visible and can be visualized or locked to analytic forms, enabling post-training symbolic regression and human-driven discovery (Liu et al., 30 Apr 2024).
- Spectral and Scaling Behavior: KANs empirically achieve higher scaling exponents $\alpha$ in the error versus parameter-count power law $\ell \propto N^{-\alpha}$, and demonstrate weaker spectral bias (the ability to fit high-frequency modes efficiently), as confirmed by Neural Tangent Kernel analyses (Noorizadegan et al., 28 Oct 2025).
- Convergence and Robustness: KANs exhibit robust convergence properties with proper regularization, though high-degree splines demand careful training and numerical engineering, especially in large-scale or high-dimensional settings.
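A back-of-the-envelope parameter counter, under assumed layer widths and spline settings, makes the comparison in the first bullet concrete; the widths chosen here are arbitrary illustrations, not benchmark configurations.

```python
# Illustrative parameter counting: a KAN layer carries (G + k) spline
# coefficients per input->output edge, while an MLP layer carries one weight
# per edge plus biases. The efficiency claim rests on KANs reaching a target
# accuracy with much smaller widths/depths, which this helper lets you compare.
def kan_layer_params(n_in, n_out, grid_size=8, degree=3):
    return n_in * n_out * (grid_size + degree)

def mlp_layer_params(n_in, n_out):
    return n_in * n_out + n_out

def total_params(widths, layer_fn):
    return sum(layer_fn(a, b) for a, b in zip(widths[:-1], widths[1:]))

# Example: a small KAN vs. a much wider, deeper MLP targeting the same task.
kan_widths = [4, 8, 1]
mlp_widths = [4, 256, 256, 256, 1]
print("KAN params:", total_params(kan_widths, kan_layer_params))
print("MLP params:", total_params(mlp_widths, mlp_layer_params))
```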
6. Advanced Topics: Grid Adaptation, Quantization, and Practical Considerations
Optimal KAN deployment in hardware involves grid adaptation (dynamically increasing the number of knot intervals as validation loss decreases), grid alignment with quantization, LUT sharing, and selection of local/global decoder splits to minimize storage and logic (Huang et al., 7 Sep 2025). Quantization (commonly to 8 bits) is essential for deploying trained coefficients in memory arrays.
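As a minimal illustration of the coefficient quantization step, the sketch below applies a simple symmetric per-tensor 8-bit scheme; this is an assumed baseline, not the ASP-KAN-HAQ procedure described above.

```python
# Minimal sketch of quantizing trained spline coefficients to 8 bits before
# loading them into a memory array, with dequantization to check the error.
import numpy as np

def quantize_int8(coefs):
    scale = np.max(np.abs(coefs)) / 127.0
    q = np.clip(np.round(coefs / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
coefs = rng.normal(0, 0.2, size=(16, 11))   # e.g. 16 edges x (G + k) coefficients
q, scale = quantize_int8(coefs)
print("max abs quantization error:", np.max(np.abs(dequantize(q, scale) - coefs)))
```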
Mapping decisions in RRAM-based MAC arrays must account for the non-idealities of analog hardware, including device variability and IR-drop, necessitating data-driven, sparsity-aware placement (KAN-SAM) and architecture-aware optimization. All metric and mapping decisions require detailed empirical validation within the constraints of the target hardware platform.
7. Outlook and Continuing Research Directions
KANs are now an established direction for parameter- and power-efficient machine learning, with empirically validated scaling in large-scale hardware. Open research directions include further reduction of inference latency, integration with domain-decomposition and symbolic discovery workflows, formal theory of basis function selection and adaptation, and extension to more diverse analog and in-memory platforms.
Ongoing work aims to standardize robust component libraries for each major basis family, formalize the connections between basis complexity and generalization, and expand the class of smooth and non-smooth functions efficiently realizable within the KAN framework. Hardware-level research will continue to focus on LUT optimization, low-power quantization, and efficient mapping of KANs to hybrid digital-analog inference accelerators (Huang et al., 7 Sep 2025, Noorizadegan et al., 28 Oct 2025).