- The paper introduces Parameter-Reduced Kolmogorov-Arnold Networks (PRKANs) to significantly reduce the parameter count in KANs while achieving performance competitive with MLPs on MNIST and Fashion-MNIST.
- PRKANs utilize attention mechanisms, dimension summation, feature weight vectors, and convolutional or pooling layers to achieve parameter reduction in KAN layers.
- Experiments indicate that Gaussian Radial Basis Functions and layer normalization improve PRKAN performance; the resulting models achieve accuracy competitive with MLPs, albeit with slightly longer training times.
The paper introduces Parameter-Reduced Kolmogorov-Arnold Networks (PRKANs) as a method to reduce the number of parameters in Kolmogorov-Arnold Networks (KANs) to a level comparable with Multi-Layer Perceptrons (MLPs). The authors present experimental results on the MNIST and Fashion-MNIST datasets demonstrating that PRKANs with attention mechanisms rival the performance of MLPs, at the cost of slightly longer training times. The authors also highlight the advantages of Gaussian Radial Basis Functions (GRBFs) and layer normalization in KAN designs.
The paper begins by reviewing the Kolmogorov-Arnold Representation Theorem (KART), which states that any continuous function of multiple variables can be represented as a finite composition of continuous single-variable functions and addition. The authors note that while KANs have shown promise in various applications, they often require significantly more parameters than MLPs.
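For reference, in the notation most presentations of the theorem use (inner functions $\phi_{q,p}$ and outer functions $\Phi_q$), KART expresses an $n$-variable continuous function $f$ as

$$
f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right).
$$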
The core contributions of the paper include:
- The development of PRKANs, which employ attention mechanisms, dimension summation, feature weight vectors, and convolutional/pooling layers to reduce the parameter count in KAN layers.
- A demonstration of the competitive performance of PRKANs compared to MLPs on the MNIST and Fashion-MNIST datasets.
- An exploration of components, such as GRBFs and layer normalization, that can contribute to the output performance of PRKANs.
The paper details the methodology behind PRKANs, including:
- A review of KART and its implications for neural network design.
- A discussion of the design of KAN architectures, including the use of learnable activation functions such as B-splines.
- An analysis of the parameter requirements in KANs versus MLPs, highlighting the need for parameter reduction techniques: because each input-output connection in a KAN layer carries a set of learnable basis coefficients rather than a single scalar weight, a KAN layer typically uses several times as many parameters as an MLP layer of the same width.
- A description of the proposed PRKAN architecture, which incorporates attention mechanisms, dimension summation, feature weight vectors, and convolutional/pooling layers to reduce the number of parameters in KAN layers. The authors present equations defining the operation of each of these components, including:
- Attention Mechanism:
$X_{\text{spline}} \in \mathbb{R}^{B \times D \times (G+k)}$, which represents the spline data.
$B$ is the batch size, $D$ is the data dimension, $G$ is the grid size of a function, and $k$ is the spline order.
$X_{\text{linear}} = W_{\text{linear}} \times X_{\text{spline}} + b_{\text{linear}}, \quad X_{\text{linear}} \in \mathbb{R}^{B \times D \times 1}$ where $W_{\text{linear}}$ and $b_{\text{linear}}$ are the weight and bias of a linear transformation.
$W_{\text{att}} = \text{softmax}(X_{\text{linear}}, \text{dim}=-2), \quad W_{\text{att}} \in \mathbb{R}^{B \times D \times 1}$ where $W_{\text{att}}$ represents the attention weights.
$X' = X_{\text{spline}} \odot W_{\text{att}}, \quad X' \in \mathbb{R}^{B \times D \times (G+k)}$ where $\odot$ denotes element-wise multiplication.
$X'' = \sum_{\text{dim}=-1} X', \quad X'' \in \mathbb{R}^{B \times D}$ where $X''$ is the summation along the last dimension.
$X_{\text{out}} = W_{\text{out}} \times \sigma(X'') + b_{\text{out}}, \quad X_{\text{out}} \in \mathbb{R}^{B \times d_{\text{out}}}$ where $W_{\text{out}}$ and $b_{\text{out}}$ are the weight and bias of a linear transformation, $\sigma$ is an activation function, and $d_{\text{out}}$ is the output dimension.
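A minimal PyTorch sketch of this attention-based reduction follows. The Gaussian RBF expansion standing in for $X_{\text{spline}}$, the grid range, the number of basis functions, and the SiLU activation are illustrative assumptions; the paper's exact configuration may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPRKANLayer(nn.Module):
    """Sketch of a PRKAN-style layer that collapses the (G + k) basis
    dimension with attention weights before a final linear projection."""
    def __init__(self, d_in, d_out, num_basis=8, grid_min=-2.0, grid_max=2.0):
        super().__init__()
        # Illustrative Gaussian RBF grid standing in for the spline expansion.
        self.register_buffer("centers", torch.linspace(grid_min, grid_max, num_basis))
        self.gamma = (num_basis - 1) / (grid_max - grid_min)
        self.attn = nn.Linear(num_basis, 1)   # W_linear, b_linear
        self.out = nn.Linear(d_in, d_out)     # W_out, b_out

    def forward(self, x):                                  # x: (B, D)
        # X_spline: (B, D, G + k) -- here an RBF basis expansion.
        x_spline = torch.exp(-(self.gamma * (x.unsqueeze(-1) - self.centers)) ** 2)
        x_linear = self.attn(x_spline)                     # (B, D, 1)
        w_att = F.softmax(x_linear, dim=-2)                # softmax over D
        x_weighted = x_spline * w_att                      # (B, D, G + k)
        x_reduced = x_weighted.sum(dim=-1)                 # (B, D)
        return self.out(F.silu(x_reduced))                 # (B, d_out)

layer = AttentionPRKANLayer(d_in=784, d_out=10)
print(layer(torch.randn(32, 784)).shape)                   # torch.Size([32, 10])
```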
- Convolution Layers:
$X_{\text{spline}} \in \mathbb{R}^{B \times D \times (G+k)}$, which represents the spline data.
$X_{\text{perm}} = \text{permute}(X_{\text{spline}}, 0, 2, 1), \quad X_{\text{perm}} \in \mathbb{R}^{B \times (G+k) \times D}$ where $X_{\text{perm}}$ is the permuted tensor.
$X_{\text{conv}} = W_{\text{conv}} \times X_{\text{perm}} + b_{\text{conv}}, \quad X_{\text{conv}} \in \mathbb{R}^{B \times 1 \times D}$ where $W_{\text{conv}}$ and $b_{\text{conv}}$ are the weight and bias of the 1D convolution.
$X_{\text{squeeze}} = \text{squeeze}(X_{\text{conv}}, 1), \quad X_{\text{squeeze}} \in \mathbb{R}^{B \times D}$ where $X_{\text{squeeze}}$ is the squeezed tensor.
$X_{\text{out}} = W_{\text{out}} \times \sigma(X_{\text{squeeze}}) + b_{\text{out}}, \quad X_{\text{out}} \in \mathbb{R}^{B \times d_{\text{out}}}$ where $W_{\text{out}}$ and $b_{\text{out}}$ are the weight and bias of a linear transformation, $\sigma$ is an activation function, and $d_{\text{out}}$ is the output dimension.
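A corresponding sketch of the convolutional reduction, assuming the spline features have already been computed (the random tensor below is only a stand-in for $X_{\text{spline}}$, and the kernel size of 1 is an assumption made for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, D, G, k, d_out = 32, 784, 5, 3, 10
x_spline = torch.randn(B, D, G + k)          # stand-in for the spline features

# 1D convolution that maps the (G + k) channels down to a single channel,
# leaving the D positions untouched (kernel_size=1 preserves the length D).
conv = nn.Conv1d(in_channels=G + k, out_channels=1, kernel_size=1)
out = nn.Linear(D, d_out)                    # W_out, b_out

x_perm = x_spline.permute(0, 2, 1)           # (B, G + k, D)
x_conv = conv(x_perm)                        # (B, 1, D)
x_squeezed = x_conv.squeeze(1)               # (B, D)
x_out = out(F.silu(x_squeezed))              # (B, d_out)
print(x_out.shape)                           # torch.Size([32, 10])
```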
- Convolution Layers + Pooling Layers:
$X_{\text{spline}} \in \mathbb{R}^{B \times D \times (G+k)}$, which represents the spline data.
$X_{\text{perm}} = \text{permute}(X_{\text{spline}}, 0, 2, 1), \quad X_{\text{perm}} \in \mathbb{R}^{B \times (G+k) \times D}$ where $X_{\text{perm}}$ is the permuted tensor.
$X_{\text{conv}} = W_{\text{conv}} \times X_{\text{perm}} + b_{\text{conv}}, \quad X_{\text{conv}} \in \mathbb{R}^{B \times (G+k) \times D}$ where $W_{\text{conv}}$ and $b_{\text{conv}}$ are the weight and bias of the 1D convolution.
$X_{\text{pool}} = \text{pool}(X_{\text{conv}}), \quad X_{\text{pool}} \in \mathbb{R}^{B \times (G+k) \times \frac{D}{G+k}}$ where $X_{\text{pool}}$ is the result of max pooling.
$X_{\text{reshaped}} = \text{reshape}(X_{\text{pool}}), \quad X_{\text{reshaped}} \in \mathbb{R}^{B \times D}$ where $X_{\text{reshaped}}$ is the reshaped tensor.
$X_{\text{out}} = W_{\text{out}} \times X_{\text{reshaped}} + b_{\text{out}}, \quad X_{\text{out}} \in \mathbb{R}^{B \times d_{\text{out}}}$ where $W_{\text{out}}$ and $b_{\text{out}}$ are the weight and bias of a linear transformation, and $d_{\text{out}}$ is the output dimension.
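A sketch of the convolution-plus-pooling variant, again with a random stand-in for $X_{\text{spline}}$. Note that $D$ must be divisible by $G + k$ for the pooled tensor to flatten back to $D$ elements; the settings below satisfy that ($784 = 8 \times 98$) but are otherwise illustrative.

```python
import torch
import torch.nn as nn

B, D, G, k, d_out = 32, 784, 5, 3, 10        # G + k = 8 divides D = 784
x_spline = torch.randn(B, D, G + k)          # stand-in for the spline features

# Convolution keeps all (G + k) channels; max pooling then shrinks the length D
# by a factor of (G + k), so the flattened result has exactly D elements.
conv = nn.Conv1d(in_channels=G + k, out_channels=G + k, kernel_size=1)
pool = nn.MaxPool1d(kernel_size=G + k)
out = nn.Linear(D, d_out)                    # W_out, b_out

x_perm = x_spline.permute(0, 2, 1)           # (B, G + k, D)
x_conv = conv(x_perm)                        # (B, G + k, D)
x_pool = pool(x_conv)                        # (B, G + k, D // (G + k))
x_reshaped = x_pool.reshape(B, D)            # (B, D)
x_out = out(x_reshaped)                      # (B, d_out)
print(x_out.shape)                           # torch.Size([32, 10])
```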
- Dimension Summation:
$X_{\text{spline}} \in \mathbb{R}^{B \times D \times (G+k)}$, which represents the spline data.
$X' = \sum_{\text{dim}=-1} X_{\text{spline}}, \quad X' \in \mathbb{R}^{B \times D}$ where $X'$ is the summation along the last dimension.
$X_{\text{out}} = W_{\text{out}} \times \sigma(X') + b_{\text{out}}, \quad X_{\text{out}} \in \mathbb{R}^{B \times d_{\text{out}}}$ where $W_{\text{out}}$ and $b_{\text{out}}$ are the weight and bias of a linear transformation, $\sigma$ is an activation function, and $d_{\text{out}}$ is the output dimension.
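The corresponding sketch for dimension summation, which adds no reduction parameters at all and simply sums over the basis dimension (stand-in spline tensor and SiLU activation as before):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, D, G, k, d_out = 32, 784, 5, 3, 10
x_spline = torch.randn(B, D, G + k)          # stand-in for the spline features

out = nn.Linear(D, d_out)                    # W_out, b_out
x_reduced = x_spline.sum(dim=-1)             # (B, D), no extra parameters
x_out = out(F.silu(x_reduced))               # (B, d_out)
print(x_out.shape)                           # torch.Size([32, 10])
```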
- Feature Weight Vectors:
$X_{\text{spline}} \in \mathbb{R}^{B \times D \times (G+k)}$, which represents the spline data.
$X' = X_{\text{spline}} \times M, \quad M \in \mathbb{R}^{(G+k) \times 1}, \quad X' \in \mathbb{R}^{B \times D}$ where $M$ is the learnable feature weight vector.
$X_{\text{out}} = W_{\text{out}} \times \sigma(X') + b_{\text{out}}, \quad X_{\text{out}} \in \mathbb{R}^{B \times d_{\text{out}}}$ where $W_{\text{out}}$ and $b_{\text{out}}$ are the weight and bias of a linear transformation, $\sigma$ is an activation function, and $d_{\text{out}}$ is the output dimension.
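A sketch of the feature-weight-vector reduction, where a single learnable vector $M$ of length $G + k$ replaces the attention or convolution machinery (the initialization of $M$ below is an illustrative choice):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, D, G, k, d_out = 32, 784, 5, 3, 10
x_spline = torch.randn(B, D, G + k)          # stand-in for the spline features

M = nn.Parameter(torch.ones(G + k, 1))       # learnable feature weight vector
out = nn.Linear(D, d_out)                    # W_out, b_out

x_reduced = (x_spline @ M).squeeze(-1)       # (B, D, 1) -> (B, D)
x_out = out(F.silu(x_reduced))               # (B, d_out)
print(x_out.shape)                           # torch.Size([32, 10])
```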
- A discussion of data normalization techniques, such as batch normalization and layer normalization, and their impact on model performance.
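To make the normalization comparison concrete, here is a small sketch contrasting the two techniques on flattened image inputs; the placement (normalizing the input before the basis expansion) is an assumption for illustration, and the paper's exact positioning may differ:

```python
import torch
import torch.nn as nn

D = 784
x = torch.randn(32, D)                  # flattened MNIST-style input

layer_norm = nn.LayerNorm(D)            # normalizes across features, per sample
batch_norm = nn.BatchNorm1d(D)          # normalizes across the batch, per feature

x_ln = layer_norm(x)                    # (32, D)
x_bn = batch_norm(x)                    # (32, D)
print(x_ln.shape, x_bn.shape)
```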
The paper presents experimental results comparing PRKANs with MLPs on the MNIST and Fashion-MNIST datasets. The authors trained each model over 5 independent runs and reported average values for metrics such as training accuracy, validation accuracy, F1 score, and training time. The results showed that PRKANs with attention mechanisms achieved competitive performance compared to MLPs, with slightly longer training times. The authors also found that GRBFs and layer normalization generally provided more benefits when applied to PRKANs. For example, with batch normalization, PRKAN-attn models achieve improvements of 1.34% and 0.58% in validation accuracy on MNIST and Fashion-MNIST, respectively. With layer normalization, PRKAN-attn reaches a validation accuracy of 97.46% on MNIST, trailing the MLP by less than 0.26%.
The paper includes ablation studies on the activation functions used in PRKANs, a comparison between RBFs and B-splines, and recommendations on the positioning of data normalization in PRKANs. The ablation study on activation functions showed that SiLU achieved the best validation accuracy and F1 score on the MNIST dataset while delivering competitive performance on Fashion-MNIST. The comparison between RBFs and B-splines showed that RBFs were 11% to 13% faster than B-splines. The study of data normalization positioning showed that layer normalization was generally more effective than batch normalization.
The paper concludes by discussing the limitations of the research and suggesting directions for future work. The authors note that the PRKANs were tested on relatively simple datasets and that more research is needed to evaluate the scalability and efficiency of PRKANs in more complex models. The authors also suggest exploring other parameter reduction strategies, such as tensor decomposition, matrix factorization, or advanced pruning.