ReLU-KAN: GPU-Optimized Kolmogorov–Arnold Network

Updated 18 April 2026
  • ReLU-KAN is a neural architecture that adapts Kolmogorov–Arnold Networks by replacing B-spline basis functions with square-of-ReLU activations, retaining universal approximation while enabling GPU-efficient computation.
  • It leverages a matrix-based operation with localized activations to achieve up to 30× faster training and improved stability in function approximation and physics-informed neural network applications.
  • Despite its computational advantages, ReLU-KAN faces challenges like parameter inefficiency and hard gating, spurring extensions such as HRKAN and AF-KAN for enhanced performance.

ReLU-KAN is a neural architecture that adapts Kolmogorov–Arnold Networks (KANs) by replacing B-spline basis functions with “square-of-ReLU” units, thereby enabling fully matrix-based, GPU-optimized computation while retaining universal approximation power. ReLU-KAN rigorously follows the KAN superposition principle, employs ReLU-centered bell-shaped local activations, and achieves substantial acceleration and improved stability in practical deep learning applications, especially in physics-informed neural networks and function approximation contexts (Qiu et al., 2024, So et al., 2024, Ta et al., 8 Mar 2025).

1. Mathematical Foundation and Definition

ReLU-KAN is grounded in the Kolmogorov–Arnold representation theorem, which states that any continuous function $f: [a, b]^d \to \mathbb{R}$ can be written as a superposition

$$f(x_1, \ldots, x_d) = \sum_{q=0}^{2d} \Phi_{q}\left( \sum_{p=1}^{d} \phi_{q,p}(x_p) \right)$$

where $\phi_{q,p}$ and $\Phi_q$ are univariate continuous functions.
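As a small illustration (not taken from the cited papers), the product of two variables already has this form for $d = 2$, using only two of the $2d + 1 = 5$ outer terms. The particular choice of inner and outer functions below is one of many possible and is only meant to make the superposition concrete:

```latex
xy = \Phi_0(x + y) + \Phi_1(x - y),
\qquad \Phi_0(u) = \tfrac{1}{4}u^2, \quad \Phi_1(u) = -\tfrac{1}{4}u^2,
% inner functions: \phi_{0,1}(x) = x,\ \phi_{0,2}(y) = y,\ \phi_{1,1}(x) = x,\ \phi_{1,2}(y) = -y
% remaining outer functions: \Phi_2 = \Phi_3 = \Phi_4 = 0
% check: \tfrac{1}{4}(x + y)^2 - \tfrac{1}{4}(x - y)^2 = \tfrac{1}{4}(4xy) = xy
```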

A Kolmogorov–Arnold Network (KAN) implements this construction as a neural module, replacing each edge in a multilayer perceptron with a learnable, parameterized univariate function. The canonical KAN uses B-spline basis functions, with each hidden unit computing the expansion

$$v_i = \sum_{j=1}^{n} \alpha_{i,j}\, b_j(h_i)$$

where $b_j(\cdot)$ are B-spline basis functions.

ReLU-KAN departs from this by introducing a localized “square-of-ReLU” bump for each interval:

$$r_i(x) = \left[\mathrm{ReLU}(e_i - x) \cdot \mathrm{ReLU}(x - s_i)\right]^2 \cdot c$$

where $s_i$ and $e_i$ are the endpoints of the interval, and the normalization factor $c = 16/(e_i - s_i)^4$ ensures the bump peaks at 1. The univariate expansion becomes:

$$\phi(x) = \sum_{i=1}^{n} \alpha_{i}\, r_i(x)$$

This change enables all forward and backward passes to be implemented with pointwise elementary matrix operations and small convolutions, crucial for GPU efficiency (Qiu et al., 2024, Ta et al., 8 Mar 2025).
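The bump and the univariate expansion can be checked numerically with a few lines of plain Python. This is a minimal sketch: the function names, the three overlapping intervals, and the coefficients are illustrative assumptions, not values from the cited papers.

```python
def relu(z: float) -> float:
    return max(z, 0.0)


def bump(x: float, s: float, e: float) -> float:
    """Square-of-ReLU bump on [s, e], normalized to peak at 1."""
    c = 16.0 / (e - s) ** 4
    return (relu(e - x) * relu(x - s)) ** 2 * c


def phi(x: float, alphas, intervals) -> float:
    """Univariate expansion: weighted sum of bumps."""
    return sum(a * bump(x, s, e) for a, (s, e) in zip(alphas, intervals))


# Hypothetical overlapping intervals on [0, 1] and illustrative coefficients.
intervals = [(0.0, 0.5), (0.25, 0.75), (0.5, 1.0)]
alphas = [1.0, -2.0, 0.5]

peak = bump(0.25, 0.0, 0.5)      # midpoint of the first interval
outside = bump(0.6, 0.0, 0.5)    # outside the support of the first bump
value = phi(0.5, alphas, intervals)
```

At the midpoint the product of the two ReLU terms is $((e_i - s_i)/2)^2$, so its square is $(e_i - s_i)^4/16$ and the factor $c$ rescales it to exactly 1.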

2. Network Architecture and Computational Pipeline

A prototypical ReLU-KAN layer accepts an input vector $x$ and applies:

  1. Matrix Linear Map: the input is mapped linearly to pre-activations $h$ of the appropriate dimension.
  2. Componentwise “Square-of-ReLU” Expansion: For each channel,

$$v_i = \sum_{j=1}^{n} \alpha_{i,j}\, r_j(h_i)$$

where $r_j$ is defined as above.

  3. Channel Summation and Output: A convolution spanning the basis axis sums over that axis for each feature, returning a single scalar per output channel.

These steps reduce to a handful of broadcasted additions, elementwise ReLU, pointwise multiplies, and a grouped convolution, all of which map efficiently onto GPUs (Qiu et al., 2024).

The architecture can be stacked into multi-layer networks, with a specified grid size $G$ and span $k$ per layer (yielding $n = G + k$ basis functions per expansion).
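The expansion and summation steps can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the names `G` (grid size) and `k` (span), the uniform interval layout with starts at `arange(-k, G) / G` and width `(k + 1) / G`, and the use of `einsum` in place of a convolution are all choices made for this sketch. The initial linear map is omitted.

```python
import numpy as np


def relu_kan_layer(x, weights, bias, G, k):
    """Sketch of one layer.

    x: (batch, d_in); weights: (d_out, G + k, d_in); bias: (d_out,).
    """
    starts = np.arange(-k, G) / G                 # assumed s_i, shape (G + k,)
    ends = starts + (k + 1) / G                   # assumed e_i
    c = 16.0 / (ends - starts) ** 4               # normalization per bump
    xb = x[:, None, :]                            # (batch, 1, d_in)
    left = np.maximum(xb - starts[None, :, None], 0.0)
    right = np.maximum(ends[None, :, None] - xb, 0.0)
    basis = (left * right) ** 2 * c[None, :, None]    # (batch, G + k, d_in)
    # Weighted sum over the basis axis and the input features.
    return np.einsum("bnd,ond->bo", basis, weights) + bias


# Illustrative call: one sample, two input features, one output channel.
x = np.array([[0.5, 0.25]])
out = relu_kan_layer(x, np.ones((1, 3, 2)), np.zeros(1), G=2, k=1)
```

Every operation here is a broadcasted subtraction, a ReLU, an elementwise product, or a single tensor contraction, which is why the layer maps well onto GPU kernels.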

3. Training Procedures and Loss Formulations

ReLU-KANs have primarily been evaluated in function approximation and physics-informed neural network (PINN) settings. The canonical training loop involves:

  • Standard mini-batch SGD or Adam optimizer.
  • Loss is typically mean squared error (MSE) on sampled data points for basic function fitting; in PINNs, loss is a weighted sum of PDE residual, boundary, and (if applicable) initial condition terms.

For example, the PINN losses for the Poisson and Burgers' equations each combine a PDE-residual term with boundary (and, for Burgers', initial-condition) terms, balanced by a weighting hyperparameter; each term is a sum of squared residuals at collocation points (So et al., 2024).
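A weighted loss of this kind can be sketched for a 1-D Poisson problem $u''(x) = f(x)$ on $[0, 1]$ with $u(0) = u(1) = 0$. The problem setup, the name `alpha`, the choice to place the weight on the boundary term, and the finite-difference stand-in for automatic differentiation are all assumptions of this sketch, not details from So et al.

```python
import numpy as np


def pinn_loss(u, f, xs, alpha, h=1e-3):
    """Sum of squared PDE residuals plus alpha times squared boundary residuals."""
    # Central finite difference stands in for automatic differentiation.
    u_xx = (u(xs + h) - 2.0 * u(xs) + u(xs - h)) / h**2
    pde = np.sum((u_xx - f(xs)) ** 2)
    bc = u(0.0) ** 2 + u(1.0) ** 2
    return pde + alpha * bc


xs = np.linspace(0.05, 0.95, 19)                 # interior collocation points
f = lambda x: -np.pi**2 * np.sin(np.pi * x)      # source term

# The exact solution sin(pi x) should give a near-zero loss; x(1 - x) should not.
loss_exact = pinn_loss(lambda x: np.sin(np.pi * x), f, xs, alpha=1.0)
loss_wrong = pinn_loss(lambda x: x * (1.0 - x), f, xs, alpha=1.0)
```

In an actual PINN, `u` would be the network and the loss would be minimized over its parameters; here the two fixed candidates only show that the loss separates a solution from a non-solution.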

Typical training involves 1,000–3,000 epochs, random initialization (Xavier), and no additional regularization. Inputs to each KAN layer can be collocated across intervals, with fixed or learnable breakpoints/intervals (Qiu et al., 2024, So et al., 2024).

4. Empirical Performance and Comparative Analysis

ReLU-KAN consistently outperforms classical KANs (B-spline basis) in both convergence speed and final fit accuracy for low-to-moderate order tasks, while being significantly more computationally efficient:

  • Training speedup: 5×–30× faster per epoch on GPU due to removal of recursive B-spline evaluation (Qiu et al., 2024).
  • Function fitting accuracy: Achieves 1–2 orders of magnitude lower MSE on canonical tasks versus KANs with equivalent parameter counts (Qiu et al., 2024).
  • PINN application: On the Poisson equation, ReLU-KAN reduces mean test MSE to 2.18% from 6.8% (KAN), and mean training time to 21.2s from 109s. Similar order-of-magnitude improvements are observed for the Burgers’ equation (So et al., 2024).

However, on more challenging tasks (e.g., image classification), high parameter counts and “hard gating” of features across grid boundaries can degrade generalization relative to parameter-matched MLPs. Table 1 illustrates MNIST benchmarks (Ta et al., 8 Mar 2025):

Table 1. MNIST benchmarks.

Model        Params     FLOPs      Val Acc (%)
MLP (SiLU)   52,512     1.8 K      97.72
ReLU-KAN     315,146    630.3 K    96.74
ReLU-KAN achieves perfect training accuracy but lags in validation accuracy despite a much larger parameter and FLOP budget (Ta et al., 8 Mar 2025).

5. Theoretical Properties and Equivalence with ReLU Networks

ReLU-KANs are piecewise polynomial and locally supported, and they are closely related to standard ReLU networks in expressive power, at the cost of higher parameterization. Explicit constructions establish:

  • Any feed-forward ReLU network can be rewritten exactly as a piecewise-linear KAN of the same depth, by replacing each ReLU layer with a KAN expansion using affine-linear and ReLU inner functions.
  • Conversely, any KAN with piecewise-linear activations can be converted into a ReLU network whose depth is determined by the number of layers in the original KAN, with width and parameter count increased by at most a factor equal to the number of linear regions per univariate basis (Schoots et al., 3 Mar 2025).

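This establishes representational equivalence between piecewise-linear KANs and ReLU networks in the sense of function classes, but the parameter and compute requirements may differ greatly. Because the square-of-ReLU bump is piecewise quartic rather than piecewise linear, the correspondence carries over to ReLU-KAN itself only approximately.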
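The KAN-to-ReLU direction can be illustrated for a single piecewise-linear univariate function. The sketch below is a textbook identity, not the construction of Schoots et al.: a symmetric "hat" function on an interval equals a combination of three ReLU units. The function names and the test interval are illustrative.

```python
def relu(z: float) -> float:
    return max(z, 0.0)


def hat(x: float, s: float, e: float) -> float:
    """Piecewise-linear bump: 0 outside [s, e], peak 1 at the midpoint."""
    m = 0.5 * (s + e)
    if x <= s or x >= e:
        return 0.0
    return (x - s) / (m - s) if x <= m else (e - x) / (e - m)


def hat_as_relu_net(x: float, s: float, e: float) -> float:
    """The same function written as a one-hidden-layer ReLU network."""
    m = 0.5 * (s + e)
    slope = 1.0 / (m - s)
    return slope * (relu(x - s) - 2.0 * relu(x - m) + relu(x - e))


# Compare both forms on a grid that extends beyond the support [0, 1].
xs = [i / 20 for i in range(-10, 31)]
max_gap = max(abs(hat(x, 0.0, 1.0) - hat_as_relu_net(x, 0.0, 1.0)) for x in xs)
```

Each linear region of the univariate function costs one ReLU unit here, which matches the intuition that width grows with the number of linear regions per basis function.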

Significantly, for a fixed parameter budget, piecewise-linear KANs can realize a finer polyhedral decomposition of input space than ReLU nets. Strictly richer detail is therefore possible for tasks that benefit from local adaptivity, provided one can manage the complexity of the univariate $\phi$ expansions (Schoots et al., 3 Mar 2025).

6. Computational Complexity and Implementation Considerations

ReLU-KAN layers reduce entirely to matrix operations, comprising:

  • Two broadcast additions (offsetting the inputs by the interval endpoints).
  • Elementwise ReLU and multiplications.
  • A grouped 1-dimensional convolution per layer, mapping the input channels to the output channels.
  • A backward pass of comparable order.

By contrast, classical KANs with a B-spline basis must evaluate piecewise polynomials for every input-output pair, involving conditionals and local support that are not readily expressed as large matrix operations or mapped efficiently to GPUs (Qiu et al., 2024).
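For contrast, the standard Cox-de Boor recursion for a B-spline basis function is sketched below. It is the textbook recursion, not code from any KAN implementation, and the uniform knot vector is an illustrative assumption. The interval test in the base case and the recursive calls are the data-dependent branching that the text refers to.

```python
def bspline_basis(i: int, p: int, knots, x: float) -> float:
    """Cox-de Boor recursion for the i-th B-spline basis function of degree p."""
    if p == 0:
        # Data-dependent branch: is x inside this knot interval?
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left_den = knots[i + p] - knots[i]
    right_den = knots[i + p + 1] - knots[i + 1]
    left = 0.0
    if left_den != 0.0:
        left = (x - knots[i]) / left_den * bspline_basis(i, p - 1, knots, x)
    right = 0.0
    if right_den != 0.0:
        right = (knots[i + p + 1] - x) / right_den * bspline_basis(i + 1, p - 1, knots, x)
    return left + right


knots = [float(t) for t in range(-3, 8)]      # illustrative uniform knots -3..7
degree = 3
n_basis = len(knots) - degree - 1
values = [bspline_basis(i, degree, knots, 2.5) for i in range(n_basis)]
total = sum(values)                           # partition of unity inside [0, 4]
nonzero = sum(1 for v in values if v > 0.0)   # only degree + 1 bases are active
```

The square-of-ReLU bump replaces this recursion with a fixed sequence of subtractions, ReLUs, and products that is identical for every input.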

Parameter count typically increases by a factor equal to the number of basis functions per expansion relative to a width-matched MLP, due to one weight per “bump” per input-output connection (Ta et al., 8 Mar 2025).
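The scaling can be made concrete with a small counting sketch. The configuration is an assumption: the source does not state the architecture behind Table 1. A 784-64-10 network with six basis functions per edge (for example `G = 3`, `k = 3`), one bias per output channel, and learnable interval endpoints happens to reproduce the 315,146 figure reported for ReLU-KAN. The plain MLP count for the same widths (50,890) does not match the table's 52,512, so the MLP baseline evidently differs in some detail.

```python
def mlp_params(widths) -> int:
    """Weights plus biases of a plain fully connected network."""
    return sum(a * b + b for a, b in zip(widths, widths[1:]))


def relu_kan_params(widths, G: int, k: int, learnable_intervals: bool = True) -> int:
    """One weight per bump per input-output edge, plus one bias per output."""
    n = G + k                                  # assumed number of bumps per edge
    total = 0
    for d_in, d_out in zip(widths, widths[1:]):
        total += d_out * n * d_in + d_out
        if learnable_intervals:
            total += 2 * n * d_in              # interval starts and ends
    return total


widths = [784, 64, 10]                         # hypothetical MNIST configuration
kan = relu_kan_params(widths, G=3, k=3)
mlp = mlp_params(widths)
ratio = kan / mlp
```

The ratio comes out slightly above six, in line with one weight per bump per connection plus the small overhead of the interval endpoints.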

7. Limitations, Variants, and Extensions

While ReLU-KAN provides major speed and stability advantages, several limitations have been highlighted:

  • Discontinuous second and higher derivatives at bump boundaries ($r_i$ is only $C^1$), which hinders accuracy or convergence for high-order PDEs in PINNs.
  • Hard gating: The bell-shaped ReLU basis is identically zero outside its support, which can reduce the network’s ability to approximate features that cross interval boundaries.
  • Parameter inefficiency: On high-dimensional inputs, the requirement for a local basis per input-output edge can inflate both parameter and FLOP budgets by an order of magnitude over MLPs.
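The smoothness limitation follows directly from the bump definition. Inside its support, $r_i(x) = c\,(x - s_i)^2 (e_i - x)^2$, so the first derivative vanishes at the endpoints while the second does not. The sketch below evaluates the interior formulas at the left endpoint of a unit interval; the function names are illustrative.

```python
def interior_first_derivative(x: float, s: float, e: float) -> float:
    """d/dx of c (x - s)^2 (e - x)^2, valid for s <= x <= e."""
    c = 16.0 / (e - s) ** 4
    return c * (2.0 * (x - s) * (e - x) ** 2 - 2.0 * (x - s) ** 2 * (e - x))


def interior_second_derivative(x: float, s: float, e: float) -> float:
    """d^2/dx^2 of c (x - s)^2 (e - x)^2, valid for s <= x <= e."""
    c = 16.0 / (e - s) ** 4
    return c * (2.0 * (e - x) ** 2 - 8.0 * (x - s) * (e - x) + 2.0 * (x - s) ** 2)


s, e = 0.0, 1.0
# Outside the support the bump is identically zero, so both derivatives are 0 there.
first_at_edge = interior_first_derivative(s, s, e)    # matches 0 from outside
second_at_edge = interior_second_derivative(s, s, e)  # jumps away from 0
```

The second derivative approaches $32/(e_i - s_i)^2$ from inside and 0 from outside, so any PDE residual involving second derivatives sees a jump at every interval boundary.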

Extensions have emerged to address these issues:

  • Higher-order ReLU-KAN (HRKAN): Introduces higher-order ReLU activations with smooth, nonzero higher derivatives, improving performance on challenging differential problems and stabilizing PINN loss convergence (So et al., 2024).
  • AF-KAN: Generalizes the activation palette beyond ReLU (e.g., SiLU, GELU), replaces the convolutional summation with attention mechanisms, and applies normalization for better feature continuity and overall efficiency, particularly in vision tasks (Ta et al., 8 Mar 2025).

These variants have shown improved accuracy, generalization, and parameter efficiency, outperforming ReLU-KANs on a range of classification and regression benchmarks.
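One natural way to raise the order of the bump is to replace the exponent 2 with a general order $m$ and rescale so that the peak stays at 1. The form below is an assumption of this sketch, written to be consistent with the square-of-ReLU definition above (it reduces to it for $m = 2$); it should not be read as the exact basis used in the HRKAN paper.

```python
def higher_order_bump(x: float, s: float, e: float, m: int) -> float:
    """Assumed order-m bump: [ReLU(e - x) * ReLU(x - s)]^m, rescaled to peak at 1."""
    scale = (2.0 / (e - s)) ** (2 * m)
    return (max(e - x, 0.0) * max(x - s, 0.0)) ** m * scale


peak_m2 = higher_order_bump(0.5, 0.0, 1.0, m=2)   # same as the square-of-ReLU bump
peak_m4 = higher_order_bump(0.5, 0.0, 1.0, m=4)
edge_m4 = higher_order_bump(1.0, 0.0, 1.0, m=4)
```

Near an endpoint the bump behaves like $(x - s_i)^m$, so derivatives up to order $m - 1$ are continuous there, which is the property that matters for higher-order PDE residuals.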


References:

  • (Qiu et al., 2024) "ReLU-KAN: New Kolmogorov-Arnold Networks that Only Need Matrix Addition, Dot Multiplication, and ReLU"
  • (So et al., 2024) "Higher-order-ReLU-KANs (HRKANs) for solving physics-informed neural networks (PINNs) more accurately, robustly and faster"
  • (Schoots et al., 3 Mar 2025) "Relating Piecewise Linear Kolmogorov Arnold Networks to ReLU Networks"
  • (Ta et al., 8 Mar 2025) "AF-KAN: Activation Function-Based Kolmogorov-Arnold Networks for Efficient Representation Learning"