Lookup multivariate Kolmogorov-Arnold Networks (2509.07103v1)

Published 8 Sep 2025 in cs.LG, cs.AI, cs.PF, and cs.SE

Abstract: High-dimensional linear mappings, or linear layers, dominate both the parameter count and the computational cost of most modern deep-learning models. We introduce a general drop-in replacement, lookup multivariate Kolmogorov-Arnold Networks (lmKANs), which deliver a substantially better trade-off between capacity and inference cost. Our construction expresses a general high-dimensional mapping through trainable low-dimensional multivariate functions. These functions can carry dozens or hundreds of trainable parameters each, and yet it takes only a few multiplications to compute them because they are implemented as spline lookup tables. Empirically, lmKANs reduce inference FLOPs by up to 6.0x while matching the flexibility of MLPs in general high-dimensional function approximation. In another feedforward fully connected benchmark, on the tabular-like dataset of randomly displaced methane configurations, lmKANs enable more than 10x higher H100 throughput at equal accuracy. Within frameworks of Convolutional Neural Networks, lmKAN-based CNNs cut inference FLOPs at matched accuracy by 1.6-2.1x and by 1.7x on the CIFAR-10 and ImageNet-1k datasets, respectively. Our code, including dedicated CUDA kernels, is available online at https://github.com/schwallergroup/lmkan.

Summary

  • The paper introduces lmKANs as an efficient drop-in replacement for high-dimensional linear mappings using multivariate spline lookup tables.
  • It leverages the Kolmogorov-Arnold theorem and custom CUDA kernels to achieve up to 88× better per-parameter inference efficiency than traditional linear layers.
  • Empirical results show significant reductions in FLOPs and inference cost across function approximation, tabular regression, and CNN benchmarks.

Lookup Multivariate Kolmogorov-Arnold Networks: Architecture, Efficiency, and Empirical Analysis

Introduction and Motivation

The paper introduces Lookup Multivariate Kolmogorov-Arnold Networks (lmKANs) as a general-purpose, drop-in replacement for high-dimensional linear mappings in deep learning architectures. The motivation stems from the observation that linear layers dominate both parameter count and computational cost in modern models, with their $\mathcal{O}(N^2)$ scaling. lmKANs leverage trainable low-dimensional multivariate functions, implemented as spline lookup tables, to achieve a substantially improved trade-off between model capacity and inference cost. The approach is grounded in the Kolmogorov-Arnold Representation Theorem (KART), but extends the classical univariate KAN paradigm to multivariate settings, enabling richer parametrization and more efficient computation.

Figure 1: Performance summary of lmKANs across general function approximation, tabular regression, and CNN benchmarks, highlighting Pareto-optimality in the FLOPs-accuracy plane.

Architecture and Parametrization

Multivariate Spline Lookup Layers

lmKAN layers replace standard linear mappings with collections of trainable $d$-dimensional functions, each parameterized by second-order B-splines on a static percentile grid. For the 2D case, each function $f(x_1, x_2)$ is represented as

$$f(x_1, x_2) = \sum_{i_1, i_2} p_{i_1 i_2} B_{i_1 i_2}(x_1, x_2)$$

where $B_{i_1 i_2}(x_1, x_2)$ are tensor products of 1D B-splines and $p_{i_1 i_2}$ are trainable coefficients. The grid is constructed using a fast sigmoid-like function $\sigma(x)$ that approximates the standard Gaussian CDF, ensuring efficient $\mathcal{O}(1)$ lookup and balanced parameter utilization.

Figure 2: Schematic of a 2D lmKAN layer with 4 inputs and 3 outputs, each output computed as a sum of two trainable bivariate functions.

Figure 3: (a) Construction of the sigma grid; (b) example piecewise linear function; (c) second-order B-spline; (d) two-dimensional B-spline basis.
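
As a concrete illustration of the lookup mechanism, the following sketch evaluates one trainable bivariate function from its coefficient table. It is a deliberately simplified stand-in for the paper's layers rather than the released implementation: it uses a uniform grid instead of the sigma/percentile grid and bilinear interpolation (first-order tensor-product B-splines) instead of second-order splines, and the names (knots, coeffs, lookup_f) are illustrative.

```python
import numpy as np

# Simplified sketch of a single trainable bivariate lookup function f(x1, x2).
# Simplifications vs. the paper: uniform grid on [-3, 3] instead of the sigma
# grid, and bilinear interpolation instead of second-order B-splines.
G = 20                                    # number of grid intervals per axis
knots = np.linspace(-3.0, 3.0, G + 1)     # G + 1 knots per axis
coeffs = np.random.randn(G + 1, G + 1)    # trainable table, (G + 1)^2 parameters

def lookup_f(x1, x2):
    """Evaluate f(x1, x2) with an O(1) cell lookup plus a few multiplications."""
    # Locate the grid cell containing (x1, x2).
    i = np.clip(np.searchsorted(knots, x1) - 1, 0, G - 1)
    j = np.clip(np.searchsorted(knots, x2) - 1, 0, G - 1)
    # Local coordinates inside the cell, in [0, 1].
    h = knots[1] - knots[0]
    u = (x1 - knots[i]) / h
    v = (x2 - knots[j]) / h
    # Tensor-product combination of the four surrounding coefficients.
    return ((1 - u) * (1 - v) * coeffs[i, j]
            + u * (1 - v) * coeffs[i + 1, j]
            + (1 - u) * v * coeffs[i, j + 1]
            + u * v * coeffs[i + 1, j + 1])

print(lookup_f(0.3, -1.2))
```

Increasing G enlarges the table, and hence the capacity of each function, while the evaluation cost per call stays constant.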

Computational Complexity

The dominant cost of a 2D lmKAN layer is $2 N_{\text{in}} N_{\text{out}}$ fused multiply-adds, exactly twice that of a linear layer of the same shape. The number of trainable parameters per function scales as $(G+1)^2$, where $G$ is the number of grid intervals, so the layer can hold hundreds of times more parameters than a linear layer while its inference FLOPs remain independent of $G$. Custom CUDA kernels are implemented for efficient GPU inference, achieving up to 88× better inference time per parameter compared to linear layers on H100 GPUs.
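
To make the capacity-versus-cost trade-off concrete, the following count compares a 2D lmKAN layer with a dense linear layer of the same shape, assuming (as in the Figure 2 schematic) that each output sums $N_{\text{in}}/2$ bivariate functions; the concrete shape and grid size are illustrative.

```python
# Back-of-the-envelope counts for a 2D lmKAN layer vs. a dense linear layer of
# the same shape, assuming each output sums N_in / 2 bivariate functions.
N_in, N_out, G = 256, 256, 20

linear_params = N_in * N_out               # weights of the dense layer
linear_macs   = N_in * N_out               # multiply-adds per sample

lmkan_funcs  = (N_in // 2) * N_out         # number of bivariate functions
lmkan_params = lmkan_funcs * (G + 1) ** 2  # (G + 1)^2 coefficients per function
lmkan_macs   = 2 * N_in * N_out            # twice the linear layer's MACs

print(lmkan_macs / linear_macs)            # 2.0, independent of G
print(lmkan_params / linear_params)        # (G + 1)^2 / 2 = 220.5 at G = 20
```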

Regularization and Training Stability

Generalization Pitfalls

Direct fitting of high-resolution spline functions can lead to poor generalization: only a small subset of parameters receives gradient updates, while non-active coefficients remain at their random initialization or are zeroed by L2 regularization.

Figure 4: Illustration of generalization pitfalls in high-resolution spline fitting, showing poor extrapolation outside training points.
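
The failure mode is easy to reproduce in a toy setting, sketched below with a piecewise-constant 1D lookup table (much simpler than the paper's 2D second-order splines) fitted to a handful of points by plain gradient descent: only the cells that contain training points ever receive updates, while the rest keep their random initialization.

```python
import numpy as np

# Toy illustration of the pitfall: on a fine 1D lookup table, only the cells
# containing training points receive gradient updates; all other coefficients
# stay at their random initialization and extrapolate arbitrarily.
rng = np.random.default_rng(0)
G = 50
knots = np.linspace(-3, 3, G + 1)
coeffs = rng.normal(size=G)              # piecewise-constant table (one value per cell)

x_train = rng.normal(size=8)             # only 8 training points
y_train = np.sin(x_train)

def cell(x):
    return np.clip(np.searchsorted(knots, x) - 1, 0, G - 1)

for _ in range(200):                     # gradient descent on the MSE loss
    residual = coeffs[cell(x_train)] - y_train
    np.subtract.at(coeffs, cell(x_train), 0.1 * residual)

touched = np.zeros(G, dtype=bool)
touched[cell(x_train)] = True
print(f"cells ever updated: {touched.sum()} / {G}")   # a handful out of 50
```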

Hessian Regularization

To address this, the paper introduces an off-diagonal regularization term based on the squared Frobenius norm of the Hessian, which penalizes curvature and enforces smoothness. The regularization coefficient $\lambda$ interpolates between unconstrained lmKAN behavior and linear MLP behavior. A multi-staged fitting procedure is employed, starting with strong regularization and gradually decaying $\lambda$ to stabilize training.
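
A plausible finite-difference version of such a penalty, acting directly on a single $(G+1) \times (G+1)$ coefficient table, is sketched below. The paper's exact off-diagonal formulation may differ; the function name, grid spacing, and usage line are illustrative assumptions.

```python
import numpy as np

def hessian_penalty(coeffs, h=1.0):
    """Squared-Frobenius-norm Hessian penalty for one bivariate lookup table.

    A finite-difference sketch on the (G + 1) x (G + 1) grid of spline
    coefficients; the paper's exact off-diagonal variant may differ. Driving
    this penalty to zero pushes the table toward an affine function, i.e.
    toward plain linear-layer behavior.
    """
    f_xx = (coeffs[2:, :] - 2 * coeffs[1:-1, :] + coeffs[:-2, :]) / h**2
    f_yy = (coeffs[:, 2:] - 2 * coeffs[:, 1:-1] + coeffs[:, :-2]) / h**2
    f_xy = (coeffs[1:, 1:] - coeffs[1:, :-1]
            - coeffs[:-1, 1:] + coeffs[:-1, :-1]) / h**2
    return (f_xx**2).sum() + (f_yy**2).sum() + 2 * (f_xy**2).sum()

# Hypothetical use inside a training step, with lam decayed over the schedule:
# loss = task_loss + lam * sum(hessian_penalty(c) for c in coefficient_tables)
```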

Empirical Evaluation

General Function Approximation

lmKANs are benchmarked against MLPs in the task of distilling high-dimensional functions from large random teacher networks. Both architectures use two hidden layers and batch normalization. lmKANs consistently achieve up to 6× fewer inference FLOPs at matched accuracy, and 1.8× faster H100 wall-clock time for large hidden dimensions.

Figure 5: lmKAN vs MLP for general function approximation, showing superior FLOPs-accuracy trade-off for lmKANs.

Figure 6: Final MSE vs. grid resolution $G$ for the hidden_dim = 256 lmKAN, revealing an optimal $G$ due to convergence difficulties at high resolutions.

Tabular Regression: Methane Configurations

On the dataset of randomly displaced methane configurations, lmKANs outperform MLPs across multiple invariant representations. For the "Distances" modality, lmKANs deliver more than 10× higher H100 throughput at equal accuracy. The regularized variant further improves generalization, especially in symmetry-preserving representations.

Figure 7: lmKAN vs MLP on methane regression, with Hessian regularization mitigating overfitting and achieving Pareto-optimality.

Convolutional Neural Networks

lmKAN-based CNNs are evaluated on CIFAR-10 and ImageNet. Replacing standard convolutions with lmKAN-based ones reduces inference FLOPs by 1.6–2.1× on CIFAR-10 and by 1.7× on ImageNet, with matched or superior accuracy.

Figure 8: Comparison of MLP-based and lmKAN-based CNNs on CIFAR-10 and ImageNet, demonstrating FLOPs reduction at matched accuracy.

Comparison with FastKAN

lmKANs are compared to FastKANs in fully connected settings. lmKANs exhibit superior training stability and accuracy, especially at high parameter budgets, due to the coarser grid and multivariate parametrization, which restricts expressivity to lower-frequency function classes.

Figure 9: lmKAN vs FastKAN on CIFAR-10, showing lmKAN's robustness to grid resolution and superior accuracy.

Implementation and Scaling Considerations

CUDA Kernel Performance

lmKAN CUDA kernels are benchmarked on H100 SXM GPUs. For large dimensions, the kernels are 8× slower than dense linear layers, but per-parameter efficiency is 27.5× better at $G = 20$ and 88.5× better at $G = 40$ with smaller tiles. For small feature dimensions, the slowdown is only 2.5×.

Figure 10: CUDA kernel performance for large dimensions, showing time normalized by shape and by parameter count.

Figure 11: CUDA kernel performance for small dimensions, with improved relative efficiency.

Figure 12: Inference efficiency of lmKAN layers as a function of the number of grid intervals $G$, confirming that throughput is independent of $G$.
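
As a rough consistency check on the large-dimension numbers, combining the reported 8× wall-clock slowdown with the $(G+1)^2/2$ parameter ratio derived in the complexity section reproduces the quoted per-parameter advantage at $G = 20$; the $G = 40$ figure uses a different tile configuration, so this simple ratio is not expected to apply there.

```python
# Back-of-the-envelope check of the per-parameter claim at G = 20, combining
# the ~8x wall-clock slowdown with the (G + 1)^2 / 2 parameter ratio.
G, slowdown = 20, 8.0
param_ratio = (G + 1) ** 2 / 2            # ~220x more parameters than a linear layer
per_param_speedup = param_ratio / slowdown
print(per_param_speedup)                  # ~27.6, consistent with the reported 27.5x
```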

Training Pipeline

A multi-staged fitting procedure is used: training starts in pure MLP mode, the lmKAN contribution is increased gradually, and the Hessian regularization is decayed. Two preconditioning schemes (ReLU-first vs. ReLU-last) are compared; ReLU-first allows the activation to be absorbed into the lmKAN layers, yielding lower inference cost.

Figure 13: Multi-staged fitting procedure, showing loss dynamics and regularization decay.
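
A minimal sketch of such a staged schedule is given below; the blending form, number of stages, and decay rate are illustrative assumptions rather than the paper's exact settings.

```python
# Hypothetical staged schedule: start in pure MLP mode, ramp the lmKAN
# contribution alpha from 0 to 1, and decay the Hessian-penalty weight lam.
def staged_schedule(n_stages=5, lam0=1e-2, decay=0.1):
    for stage in range(n_stages):
        alpha = stage / (n_stages - 1)   # layer output: (1 - alpha) * linear + alpha * lmKAN
        lam = lam0 * decay ** stage      # strength of the Hessian regularization
        yield stage, alpha, lam

for stage, alpha, lam in staged_schedule():
    print(f"stage {stage}: alpha = {alpha:.2f}, lambda = {lam:.0e}")
```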

Figure 14: Preconditioning scheme comparison on CIFAR-10.

Figure 15: Preconditioning scheme comparison in general function approximation.

Limitations

lmKANs become harder to converge at excessively high grid resolutions $G$, and large $G$ increases memory requirements. The current CUDA kernels support only float32; lower-precision types (e.g., bfloat16) are not yet implemented. Although throughput is independent of $G$, latency can still be affected in practice.

Implications and Future Directions

lmKANs provide a principled and efficient alternative to high-dimensional linear mappings in deep learning, with strong empirical evidence for Pareto-optimality in the FLOPs-accuracy plane across diverse tasks. The approach is compatible with existing architectures and can be extended to higher-dimensional splines, though with increased computational cost. The use of multivariate spline lookup tables opens avenues for further research in conditional computation, hardware co-design, and efficient parametrization of complex function classes. Future work may focus on improved optimization schemes for high-resolution grids, support for mixed-precision inference, and integration with domain-specific architectures.

Conclusion

Lookup multivariate Kolmogorov-Arnold Networks (lmKANs) represent a significant advancement in the efficient parametrization and computation of high-dimensional mappings in deep learning. By leveraging multivariate spline lookup tables and custom GPU kernels, lmKANs achieve substantial reductions in inference FLOPs and wall-clock time while maintaining or improving accuracy across general function approximation, tabular regression, and convolutional settings. Theoretical and empirical analyses support the superiority of multivariate parametrization over univariate alternatives, and the proposed regularization and training schemes ensure robust generalization. lmKANs are well-positioned for broad adoption as a scalable, efficient replacement for linear layers in modern neural architectures.
