Grouped KAN Transform in Neural Networks
- Grouped KAN Transform is a neural module design that partitions channels into groups, reducing parameters and compute complexity through group-specific mappings.
- It employs learnable per-group nonlinear functions using basis expansions, balancing efficiency and expressive power in deep architectures.
- Empirical results demonstrate that GKT achieves lower parameter counts and FLOPs with improved accuracy in tasks like image segmentation and scientific modeling.
A Grouped KAN Transform refers to a class of neural network module architectures that generalize the Kolmogorov–Arnold Network (KAN) layer by partitioning its inputs and/or parameters into groups to reduce computational complexity and parameter count and to improve scalability—particularly as an alternative to dense spline-activated layers in deep models such as Transformers and U-Nets. These grouped variants preserve much of KAN’s expressive power while supporting large-scale or memory-constrained deployments, and are applicable in both vision and scientific domains (Yang et al., 16 Sep 2024, Li et al., 7 Nov 2025, Sapkota et al., 6 Nov 2025, Hu et al., 1 Oct 2024).
1. Mathematical Definition and Core Formulation
The traditional KAN layer maps an input $x \in \mathbb{R}^{d_{\text{in}}}$ to $y \in \mathbb{R}^{d_{\text{out}}}$ using a dense array of learnable univariate nonlinear functions:

$$y_j = \sum_{i=1}^{d_{\text{in}}} \phi_{i,j}(x_i),$$

with each $\phi_{i,j}$ (e.g., a spline or rational function) parameterized for every $(i,j)$ pair. Total parameter count and compute scale as $O(d_{\text{in}}\, d_{\text{out}}\, B)$, where $B$ is the number of basis parameters per $\phi_{i,j}$.
The Grouped KAN Transform (GKT) partitions the channel dimension ($d_{\text{in}}$ or $d_{\text{out}}$) into $G$ groups, each of size $d/G$. Each group $g$ applies a shared learnable mapping $\Phi^{(g)}$, where within the group, each output coordinate is a sum of group-specific univariate functions:

$$y_j^{(g)} = \sum_{i} \phi_{i,j}^{(g)}\big(x_i^{(g)}\big),$$

with each $\phi_{i,j}^{(g)}$ parameterized over $B$ basis functions.
Resulting complexity for GKT is $O(d_{\text{in}}\, d_{\text{out}}\, B / G)$ parameters and $O(N\, d_{\text{in}}\, d_{\text{out}}\, B / G)$ compute per forward pass (for batches of $N$ tokens). This represents a $G$-fold reduction in both memory and FLOPs compared to full-channel KAN (Li et al., 7 Nov 2025, Yang et al., 16 Sep 2024).
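The scaling argument above can be checked with a small counting sketch. The helper names below are illustrative, not from the cited papers; they simply tally one set of $B$ basis parameters per univariate function.

```python
def full_kan_params(d_in: int, d_out: int, B: int) -> int:
    """Full KAN: one univariate function per (input, output) pair,
    each with B basis parameters -> d_in * d_out * B parameters."""
    return d_in * d_out * B


def grouped_kan_params(d_in: int, d_out: int, B: int, G: int) -> int:
    """Grouped KAN: channels split into G groups of size d/G; each group
    owns its own (d_in/G) x (d_out/G) function array, shared within the group."""
    assert d_in % G == 0 and d_out % G == 0
    return G * (d_in // G) * (d_out // G) * B  # = d_in * d_out * B / G


full = full_kan_params(256, 256, 8)              # 524288
grouped = grouped_kan_params(256, 256, 8, G=8)   # 65536, an 8-fold reduction
```

With $d_{\text{in}} = d_{\text{out}} = 256$, $B = 8$, and $G = 8$, the grouped variant carries exactly $1/G$ of the full layer's parameters, matching the order-of-magnitude analysis.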
2. Mechanisms of Complexity Reduction
| Model | Parameter Count (Order) | Compute per Forward Pass (Order) |
|---|---|---|
| Full KAN | $O(d_{\text{in}}\, d_{\text{out}}\, B)$ | $O(N\, d_{\text{in}}\, d_{\text{out}}\, B)$ |
| Grouped KAN | $O(d_{\text{in}}\, d_{\text{out}}\, B / G)$ | $O(N\, d_{\text{in}}\, d_{\text{out}}\, B / G)$ |
Grouping enables significant reduction in memory and compute by substituting per-group spline matrices for the full dense array: each group learns its own $\Phi^{(g)}$, but shares this function for all inputs/outputs within that group. A plausible implication is that, for suitably chosen $G$, the model can retain key nonlinear interactions without incurring prohibitive cost.
Empirical evidence confirms near-linear savings: for example, on medical image segmentation tasks, GroupKAN with GKT achieves 3.02 M parameters (about 47% of U-KAN) and 7.72 GFLOPs (about 55% of U-KAN), while achieving higher IoU (79.80% vs. 78.69%) (Li et al., 7 Nov 2025).
3. Learnable Per-Group Nonlinearities
Grouped KAN Transform leverages group-specific nonlinear mappings, parameterized using basis expansions:

$$\phi^{(g)}(x) = \sum_{k=1}^{B} c_k^{(g)}\, b_k(x),$$

where the coefficients $c_k^{(g)}$ are learned and the bases $b_k$ may be splines or rational functions. These groupwise mappings ensure rich intra-group expressivity while reducing redundant global coupling.
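A minimal numpy sketch of such a basis expansion, using Gaussian RBF bases as a stand-in for splines; the function and array names (`rbf_bases`, `group_phi`, `coeffs`) are illustrative, and the coefficient array `coeffs` of shape `[G, K]` is the learned quantity.

```python
import numpy as np


def rbf_bases(x, centers, width=1.0):
    # b_k(x) = exp(-((x - t_k) / width)^2), evaluated for every basis center t_k
    return np.exp(-((x[..., None] - centers) / width) ** 2)


def group_phi(x, coeffs, centers):
    """Apply each group's nonlinearity to x of shape [..., G]
    (one scalar per group); coeffs has shape [G, K]."""
    bases = rbf_bases(x, centers)                    # [..., G, K]
    return np.einsum('...gk,gk->...g', bases, coeffs)
```

In a trained layer, `coeffs` would be updated by backpropagation while the centers may be fixed or learned, mirroring the fixed-vs-learnable knot choice discussed later.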
In rational-function-based formulations (e.g., the Safe Padé Activation Unit/PAU as in KAT), groupwise nonlinearity is applied via

$$F^{(g)}(x) = \frac{P^{(g)}(x)}{1 + \big|Q^{(g)}(x)\big|}$$

for channel group $g$, with the polynomial coefficients of $P^{(g)}$ and $Q^{(g)}$ shared within the group (Yang et al., 16 Sep 2024, Sapkota et al., 6 Nov 2025).
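A hedged sketch of the safe rational form above; the absolute value in the denominator keeps it strictly positive, so the function never divides by zero. The coefficient conventions (`a` ascending from the constant term, `b` starting at the linear term) are one reasonable choice, not necessarily the papers' exact parameterization.

```python
import numpy as np


def safe_pau(x, a, b):
    """Safe PAU-style rational: F(x) = P(x) / (1 + |Q(x)|).
    a = [a0, ..., am]: numerator coefficients, ascending order.
    b = [b1, ..., bn]: denominator coefficients (no constant term)."""
    P = np.polyval(a[::-1], x)                            # a0 + a1 x + ... + am x^m
    Q = np.polyval(np.concatenate(([0.0], b))[::-1], x)   # b1 x + ... + bn x^n
    return P / (1.0 + np.abs(Q))
```

Setting `a = [0, 1]` and `b = 0` recovers the identity map, which is exactly the near-identity initialization regime used for stable training.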
4. Integration Architectures and Implementation
Vision Transformers (KAT, UKAST)
Grouped KAN Transforms replace standard two-layer MLPs in transformer blocks with two successive GKT layers, typically structured as:
```
u = x + MSA(LayerNorm(x))
v = u + Linear(group_rational(LayerNorm(u)))
y = v + Linear(group_rational(LayerNorm(v)))
```
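The block structure can be rendered concretely in numpy. This is a sketch only: multi-head self-attention is stubbed out as the identity so the grouped-rational feed-forward path is the focus, and `group_rational` is shown in its identity-initialized state ($P(x)=x$, $Q(x)=0$); all names are illustrative.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)


def msa_stub(x):
    return x  # placeholder for multi-head self-attention


def group_rational(x):
    # identity-initialized safe rational: P(x) = x, Q(x) = 0 -> F(x) = x
    return x / (1.0 + np.abs(np.zeros_like(x)))


def kat_block(x, W1, W2):
    u = x + msa_stub(layer_norm(x))
    v = u + group_rational(layer_norm(u)) @ W1
    y = v + group_rational(layer_norm(v)) @ W2
    return y
```

In a real implementation `W1`/`W2` are the FFN projection matrices and `group_rational` would apply distinct rational coefficients per channel group.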
Medical Image Segmentation (GroupKAN)
GKT is deployed in a U-Net style architecture, interleaved with pointwise convolution for cross-group mixing, depthwise convolution for spatial mixing, and residual connections. The per-group transformation is followed by concatenation and normalization:
```
Input: X ∈ ℝ^{B×N×C}, G groups, per-group spline params
1. Split channels into {X^{(g)}}
2. For g in 1..G:
   a. Reshape, apply Φ^{(g)}, and reshape back
3. Concatenate over g
4. Pass through pointwise/dwconv, activation, residual + norm
```
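Steps 1–3 above can be sketched directly in numpy. The per-group maps are passed in as plain callables standing in for the spline-parameterized $\Phi^{(g)}$; the pointwise/depthwise mixing and residual/norm of step 4 are omitted to keep the sketch focused.

```python
import numpy as np


def grouped_kan_transform(X, group_fns):
    """X: array of shape [B, N, C]; group_fns: list of G per-group maps Phi^(g)."""
    G = len(group_fns)
    assert X.shape[-1] % G == 0, "channel count must divide evenly into groups"
    chunks = np.split(X, G, axis=-1)                       # 1. split channels
    mapped = [f(c) for f, c in zip(group_fns, chunks)]     # 2. apply each Phi^(g)
    return np.concatenate(mapped, axis=-1)                 # 3. concatenate over g
```

Because each chunk is processed independently, the loop over groups parallelizes trivially, which is why grouped CUDA kernels recover most of the theoretical FLOP savings in practice.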
Swin Transformer Variants (UKAST)
Every feed-forward MLP is replaced by a Group Rational KAN (GR-KAN) with rational base functions shared within groups. Parameters are initialized so that each activation is close to the identity at the start of training. Numerical stability is enforced with denominators of the form $1+|Q(x)|$ (Sapkota et al., 6 Nov 2025).
Equivariant Architectures (EKAN)
The Grouped KAN Transform can be adapted to enforce equivariance constraints under arbitrary matrix groups, with each group’s spline equipped with gating and equivariant linear mixing. The construction relies on SVD-based projections to maintain parameter matrices within the group-equivariant subspace at all times (Hu et al., 1 Oct 2024).
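The SVD-based projection idea can be illustrated in isolation. In this sketch, `constraint` is a hypothetical matrix whose nullspace is the equivariant parameter subspace; projecting the flattened weights onto that nullspace after each update keeps them equivariance-respecting. Names and the exact constraint encoding are assumptions, not the EKAN construction verbatim.

```python
import numpy as np


def project_to_nullspace(w, constraint, tol=1e-10):
    """Project flattened weights w onto null(constraint) via SVD."""
    U, S, Vt = np.linalg.svd(constraint)
    rank = int(np.sum(S > tol * S.max()))   # numerical rank of the constraint
    null_basis = Vt[rank:]                  # remaining rows span the nullspace
    return null_basis.T @ (null_basis @ w)  # orthogonal projection onto it
```

After the projection, `constraint @ w` is (numerically) zero, so the constraint holds exactly; calling this after every optimizer step keeps the parameters within the symmetry-respecting subspace at all times.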
5. Theoretical Foundation and Empirical Validation
The Kolmogorov–Arnold theorem underpins GKT, guaranteeing that sufficiently rich combinatorics of univariate functions can represent general multivariate functions; the grouping strategy thus achieves expressivity within groups while trading off global coupling for efficiency.
Ablation studies verify that GKT modules confer robust performance:
- Removing GKT from GroupKAN reduces mean IoU from 0.6766 to 0.6015 (Li et al., 7 Nov 2025).
- Varying the group count shows that a moderate number of groups achieves the optimal speed–accuracy trade-off in both transformer and segmentation settings (Yang et al., 16 Sep 2024, Sapkota et al., 6 Nov 2025).
- Across vision and scientific domains, grouped KAN methods consistently reduce parameter and FLOP requirements while matching or surpassing full KAN or MLP models in accuracy/MSE.
Statistical testing (Wilcoxon signed-rank) on multiple datasets confirms that the improvements in the efficiency–accuracy regime are significant (Li et al., 7 Nov 2025).
6. Comparative Performance and Applications
| Model | Params (M) | GFLOPs | Mean IoU (%) | Dataset | Remarks |
|---|---|---|---|---|---|
| U-KAN (full KAN) | 6.35 | 14.02 | 78.69 | BUSI/GlaS/CVC | Baseline, full spline matrix |
| GroupKAN (GKT) | 3.02 | 7.72 | 79.80 | same | +1.11% IoU, 47% params |
Grouped KAN Transforms have demonstrable impact in:
- Image classification: KAT-Tiny with GR-KAN achieves 74.6% ImageNet top-1 accuracy at the same parameter count as a ViT-Tiny with MLP blocks (72.7%) (Yang et al., 16 Sep 2024).
- Semantic segmentation: with UPerNet on ADE20K, KAT-Small attains 46.1% mIoU (vs. 43.5% for DeiT-Small at the same parameter count) (Yang et al., 16 Sep 2024).
- Medical image segmentation: GroupKAN improves over U-KAN in both efficiency and accuracy (Li et al., 7 Nov 2025).
- Scientific regression and classification tasks, especially where symmetry constraints are imposed (Hu et al., 1 Oct 2024).
7. Practical Guidelines and Implementation Considerations
- Tune the group size to balance expressiveness against parallel hardware efficiency (Yang et al., 16 Sep 2024, Sapkota et al., 6 Nov 2025).
- For rational-function GKT, low polynomial orders (cubic/quartic) with PAU initialization yield stable training.
- In grouped spline-based GKT, optimize spline knot positions (fixed or learnable) and coefficients via backpropagation.
- CUDA or grouped PyTorch kernels are recommended for high-throughput inference/training.
- When enforcing equivariance, project all weight parameters to the symmetry-respecting nullspace after each update via SVD (Hu et al., 1 Oct 2024).
- Residual connections and normalization layers should surround each GKT for stable and performant deep networks.
- Statistical ablation and dataset-specific hyperparameter searches (e.g., group counts) are advised for new applications.
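The residual-and-normalization guideline above can be sketched as a small pre-norm wrapper; `gkt` is any callable channel-wise transform standing in for a Grouped KAN Transform layer, and the names are illustrative.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)


def residual_gkt_block(x, gkt):
    """Pre-norm residual wrapper: x + GKT(LayerNorm(x))."""
    return x + gkt(layer_norm(x))
```

The pre-norm placement means the block defaults to a near-identity map when the GKT output is small, which is the same stability property the identity-at-init rational parameterization targets.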
The Grouped KAN Transform thus constitutes a principled, theoretically justified, and practically validated approach for embedding rich, efficient nonlinearity into modern neural architectures through groupwise parameter sharing, with direct empirical benefits for scalability, accuracy, and interpretability in large-scale vision, scientific, and equivariant models.