Grouped KAN Transform in Neural Networks
- Grouped KAN Transform is a neural module design that partitions channels into groups, reducing parameters and compute complexity through group-specific mappings.
- It employs learnable per-group nonlinear functions using basis expansions, balancing efficiency and expressive power in deep architectures.
- Empirical results demonstrate that GKT achieves lower parameter counts and FLOPs with improved accuracy in tasks like image segmentation and scientific modeling.
A Grouped KAN Transform refers to a class of neural network module architectures that generalize the Kolmogorov–Arnold Network (KAN) layer by partitioning its inputs and/or parameters into groups to reduce computational complexity and parameter count and to improve scalability—particularly as an alternative to dense spline-activated layers in deep models such as Transformers and U-Nets. These grouped variants preserve much of KAN’s expressive power while supporting large-scale or memory-constrained deployments, and are applicable in both vision and scientific domains (Yang et al., 16 Sep 2024, Li et al., 7 Nov 2025, Sapkota et al., 6 Nov 2025, Hu et al., 1 Oct 2024).
1. Mathematical Definition and Core Formulation
The traditional KAN layer maps an input $x \in \mathbb{R}^{d_{\text{in}}}$ to $y \in \mathbb{R}^{d_{\text{out}}}$ using a dense array of learnable univariate nonlinear functions:

$$y_j = \sum_{i=1}^{d_{\text{in}}} \phi_{i,j}(x_i),$$

with each $\phi_{i,j}$ (e.g., a spline or rational function) parameterized for every $(i,j)$ pair. Total parameter count and compute scale as $O(d_{\text{in}}\, d_{\text{out}}\, B)$, where $B$ is the number of basis parameters per $\phi_{i,j}$.
The Grouped KAN Transform (GKT) partitions the channel dimension ($d_{\text{in}}$ or $d_{\text{out}}$) into $G$ groups, each of size $d/G$. Each group $g$ applies a shared learnable mapping $\Phi^{(g)}$, where within the group, each output coordinate is a sum of group-specific univariate functions:

$$y_j^{(g)} = \sum_{i} \phi_{i,j}^{(g)}\big(x_i^{(g)}\big),$$

with each $\phi_{i,j}^{(g)}$ parameterized over $B$ basis functions.
Resulting complexity for GKT is $O(d_{\text{in}}\, d_{\text{out}}\, B / G)$ parameters and $O(N\, d_{\text{in}}\, d_{\text{out}}\, B / G)$ compute per forward pass (for batches of $N$ tokens). This represents a $G$-fold reduction in both memory and FLOPs compared to full-channel KAN (Li et al., 7 Nov 2025, Yang et al., 16 Sep 2024).
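The scaling argument above can be checked with a small counting sketch. The helper names below are illustrative, not from the cited papers; they simply tally one set of $B$ basis parameters per univariate function.

```python
def full_kan_params(d_in: int, d_out: int, B: int) -> int:
    """Full KAN: one univariate function per (input, output) pair,
    each with B basis parameters -> d_in * d_out * B parameters."""
    return d_in * d_out * B


def grouped_kan_params(d_in: int, d_out: int, B: int, G: int) -> int:
    """Grouped KAN: channels split into G groups of size d/G; each group
    owns its own (d_in/G) x (d_out/G) function array, shared within the group."""
    assert d_in % G == 0 and d_out % G == 0
    return G * (d_in // G) * (d_out // G) * B  # = d_in * d_out * B / G


full = full_kan_params(256, 256, 8)              # 524288
grouped = grouped_kan_params(256, 256, 8, G=8)   # 65536, an 8-fold reduction
```

With $d_{\text{in}} = d_{\text{out}} = 256$, $B = 8$, and $G = 8$, the grouped variant carries exactly $1/G$ of the full layer's parameters, matching the order-of-magnitude analysis.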
2. Mechanisms of Complexity Reduction
| Model | Parameter Count (Order) | Compute per Forward Pass (Order) |
|---|---|---|
| Full KAN | $O(d_{\text{in}}\, d_{\text{out}}\, B)$ | $O(N\, d_{\text{in}}\, d_{\text{out}}\, B)$ |
| Grouped KAN | $O(d_{\text{in}}\, d_{\text{out}}\, B / G)$ | $O(N\, d_{\text{in}}\, d_{\text{out}}\, B / G)$ |
Grouping enables significant reduction in memory and compute by substituting per-group spline matrices for the full dense array: each group learns its own $\Phi^{(g)}$, but shares this function for all inputs/outputs within that group. A plausible implication is that, for suitably chosen $G$, the model can retain key nonlinear interactions without incurring prohibitive cost.
Empirical evidence confirms near-linear savings: for example, on medical image segmentation tasks, GroupKAN with GKT achieves 3.02 M parameters (about 47% of U-KAN) and 7.72 GFLOPs (about 55% of U-KAN), while achieving higher IoU (79.80% vs. 78.69%) (Li et al., 7 Nov 2025).
3. Learnable Per-Group Nonlinearities
Grouped KAN Transform leverages group-specific nonlinear mappings, parameterized using basis expansions:

$$\phi^{(g)}(x) = \sum_{k=1}^{B} c_k^{(g)}\, b_k(x),$$

where the coefficients $c_k^{(g)}$ are learned and the bases $b_k$ may be splines or rational functions. These groupwise mappings ensure rich intra-group expressivity while reducing redundant global coupling.
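A minimal numpy sketch of such a basis expansion, using Gaussian RBF bases as a stand-in for splines; the function and array names (`rbf_bases`, `group_phi`, `coeffs`) are illustrative, and the coefficient array `coeffs` of shape `[G, K]` is the learned quantity.

```python
import numpy as np


def rbf_bases(x, centers, width=1.0):
    # b_k(x) = exp(-((x - t_k) / width)^2), evaluated for every basis center t_k
    return np.exp(-((x[..., None] - centers) / width) ** 2)


def group_phi(x, coeffs, centers):
    """Apply each group's nonlinearity to x of shape [..., G]
    (one scalar per group); coeffs has shape [G, K]."""
    bases = rbf_bases(x, centers)                    # [..., G, K]
    return np.einsum('...gk,gk->...g', bases, coeffs)
```

In a trained layer, `coeffs` would be updated by backpropagation while the centers may be fixed or learned, mirroring the fixed-vs-learnable knot choice discussed later.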
In rational-function-based formulations (e.g., the Safe Padé Activation Unit/PAU as in KAT), groupwise nonlinearity is applied via

$$F^{(g)}(x) = \frac{P^{(g)}(x)}{1 + \big|Q^{(g)}(x)\big|}$$

for channel group $g$, with the polynomial coefficients of $P^{(g)}$ and $Q^{(g)}$ shared within the group (Yang et al., 16 Sep 2024, Sapkota et al., 6 Nov 2025).
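A hedged sketch of the safe rational form above; the absolute value in the denominator keeps it strictly positive, so the function never divides by zero. The coefficient conventions (`a` ascending from the constant term, `b` starting at the linear term) are one reasonable choice, not necessarily the papers' exact parameterization.

```python
import numpy as np


def safe_pau(x, a, b):
    """Safe PAU-style rational: F(x) = P(x) / (1 + |Q(x)|).
    a = [a0, ..., am]: numerator coefficients, ascending order.
    b = [b1, ..., bn]: denominator coefficients (no constant term)."""
    P = np.polyval(a[::-1], x)                            # a0 + a1 x + ... + am x^m
    Q = np.polyval(np.concatenate(([0.0], b))[::-1], x)   # b1 x + ... + bn x^n
    return P / (1.0 + np.abs(Q))
```

Setting `a = [0, 1]` and `b = 0` recovers the identity map, which is exactly the near-identity initialization regime used for stable training.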
4. Integration Architectures and Implementation
Vision Transformers (KAT, UKAST)
Grouped KAN Transforms replace standard two-layer MLPs in transformer blocks with two successive GKT layers, typically structured as:
```
u = x + MSA(LayerNorm(x))
v = u + Linear(group_rational(LayerNorm(u)))
y = v + Linear(group_rational(LayerNorm(v)))
```
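The block structure can be rendered concretely in numpy. This is a sketch only: multi-head self-attention is stubbed out as the identity so the grouped-rational feed-forward path is the focus, and `group_rational` is shown in its identity-initialized state ($P(x)=x$, $Q(x)=0$); all names are illustrative.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)


def msa_stub(x):
    return x  # placeholder for multi-head self-attention


def group_rational(x):
    # identity-initialized safe rational: P(x) = x, Q(x) = 0 -> F(x) = x
    return x / (1.0 + np.abs(np.zeros_like(x)))


def kat_block(x, W1, W2):
    u = x + msa_stub(layer_norm(x))
    v = u + group_rational(layer_norm(u)) @ W1
    y = v + group_rational(layer_norm(v)) @ W2
    return y
```

In a real implementation `W1`/`W2` are the FFN projection matrices and `group_rational` would apply distinct rational coefficients per channel group.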
Medical Image Segmentation (GroupKAN)
GKT is deployed in a U-Net style architecture, interleaved with pointwise convolution for cross-group mixing, depthwise convolution for spatial mixing, and residual connections. The per-group transformation is followed by concatenation and normalization:
```
Input: X ∈ ℝ^{B×N×C}, G groups, per-group spline params
1. Split channels into {X^{(g)}}
2. For g in 1..G:
   a. Reshape, apply Φ^{(g)}, and reshape back
3. Concatenate over g
4. Pass through pointwise/dwconv, activation, residual + norm
```
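Steps 1–3 above can be sketched directly in numpy. The per-group maps are passed in as plain callables standing in for the spline-parameterized $\Phi^{(g)}$; the pointwise/depthwise mixing and residual/norm of step 4 are omitted to keep the sketch focused.

```python
import numpy as np


def grouped_kan_transform(X, group_fns):
    """X: array of shape [B, N, C]; group_fns: list of G per-group maps Phi^(g)."""
    G = len(group_fns)
    assert X.shape[-1] % G == 0, "channel count must divide evenly into groups"
    chunks = np.split(X, G, axis=-1)                       # 1. split channels
    mapped = [f(c) for f, c in zip(group_fns, chunks)]     # 2. apply each Phi^(g)
    return np.concatenate(mapped, axis=-1)                 # 3. concatenate over g
```

Because each chunk is processed independently, the loop over groups parallelizes trivially, which is why grouped CUDA kernels recover most of the theoretical FLOP savings in practice.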
Swin Transformer Variants (UKAST)
Every feed-forward MLP is replaced by a Group Rational KAN (GR-KAN) with rational base functions shared within groups. Parameters are initialized so that each activation is close to the identity at the start of training. Numerical stability is enforced with denominators of the form $1+|Q(x)|$ (Sapkota et al., 6 Nov 2025).
Equivariant Architectures (EKAN)
The Grouped KAN Transform can be adapted to enforce equivariance constraints under arbitrary matrix groups, with each group’s spline equipped with gating and equivariant linear mixing. The construction relies on SVD-based projections to maintain parameter matrices within the group-equivariant subspace at all times (Hu et al., 1 Oct 2024).
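The SVD-based projection idea can be illustrated in isolation. In this sketch, `constraint` is a hypothetical matrix whose nullspace is the equivariant parameter subspace; projecting the flattened weights onto that nullspace after each update keeps them equivariance-respecting. Names and the exact constraint encoding are assumptions, not the EKAN construction verbatim.

```python
import numpy as np


def project_to_nullspace(w, constraint, tol=1e-10):
    """Project flattened weights w onto null(constraint) via SVD."""
    U, S, Vt = np.linalg.svd(constraint)
    rank = int(np.sum(S > tol * S.max()))   # numerical rank of the constraint
    null_basis = Vt[rank:]                  # remaining rows span the nullspace
    return null_basis.T @ (null_basis @ w)  # orthogonal projection onto it
```

After the projection, `constraint @ w` is (numerically) zero, so the constraint holds exactly; calling this after every optimizer step keeps the parameters within the symmetry-respecting subspace at all times.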
5. Theoretical Foundation and Empirical Validation
The Kolmogorov–Arnold theorem underpins GKT, guaranteeing that sufficiently rich combinatorics of univariate functions can represent general multivariate functions; the grouping strategy thus achieves expressivity within groups while trading off global coupling for efficiency.
Ablation studies verify that GKT modules confer robust performance:
- Removing GKT from GroupKAN reduces mean IoU from 0.6766 to 0.6015 (Li et al., 7 Nov 2025).
- Varying the group count shows that a moderate number of groups achieves the optimal speed–accuracy trade-off in both transformer and segmentation settings (Yang et al., 16 Sep 2024, Sapkota et al., 6 Nov 2025).
- Across vision and scientific domains, grouped KAN methods consistently reduce parameter and FLOP requirements while matching or surpassing full KAN or MLP models in accuracy/MSE.
Statistical testing (Wilcoxon signed-rank) on multiple datasets confirms that the improvements in the efficiency–accuracy regime are significant (Li et al., 7 Nov 2025).
6. Comparative Performance and Applications
| Model | Params (M) | GFLOPs | Mean IoU (%) | Dataset | Remarks |
|---|---|---|---|---|---|
| U-KAN (full KAN) | 6.35 | 14.02 | 78.69 | BUSI/GlaS/CVC | Baseline, full spline matrix |
| GroupKAN (GKT) | 3.02 | 7.72 | 79.80 | same | +1.11% IoU, 47% params |
Grouped KAN Transforms have demonstrable impact in:
- Image classification: KAT-Tiny with GR-KAN achieves 74.6% ImageNet top-1 accuracy at the same parameter count as a ViT-Tiny with MLP blocks (72.7%) (Yang et al., 16 Sep 2024).
- Semantic segmentation: with UPerNet on ADE20K, KAT-Small attains 46.1% mIoU (vs. 43.5% for DeiT-Small at the same parameter count) (Yang et al., 16 Sep 2024).
- Medical image segmentation: GroupKAN improves over U-KAN in both efficiency and accuracy (Li et al., 7 Nov 2025).
- Scientific regression and classification tasks, especially where symmetry constraints are imposed (Hu et al., 1 Oct 2024).
7. Practical Guidelines and Implementation Considerations
- Tune the group size to balance expressiveness against parallel hardware efficiency (Yang et al., 16 Sep 2024, Sapkota et al., 6 Nov 2025).
- For rational-function GKT, low polynomial orders (cubic/quartic) with PAU initialization yield stable training.
- In grouped spline-based GKT, optimize spline knot positions (fixed or learnable) and coefficients via backpropagation.
- CUDA or grouped PyTorch kernels are recommended for high-throughput inference/training.
- When enforcing equivariance, project all weight parameters to the symmetry-respecting nullspace after each update via SVD (Hu et al., 1 Oct 2024).
- Residual connections and normalization layers should surround each GKT for stable and performant deep networks.
- Statistical ablation and dataset-specific hyperparameter searches (e.g., group counts) are advised for new applications.
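The residual-and-normalization guideline above can be sketched as a small pre-norm wrapper; `gkt` is any callable channel-wise transform standing in for a Grouped KAN Transform layer, and the names are illustrative.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)


def residual_gkt_block(x, gkt):
    """Pre-norm residual wrapper: x + GKT(LayerNorm(x))."""
    return x + gkt(layer_norm(x))
```

The pre-norm placement means the block defaults to a near-identity map when the GKT output is small, which is the same stability property the identity-at-init rational parameterization targets.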
The Grouped KAN Transform thus constitutes a principled, theoretically justified, and practically validated approach for embedding rich, efficient nonlinearity into modern neural architectures through groupwise parameter sharing, with direct empirical benefits for scalability, accuracy, and interpretability in large-scale vision, scientific, and equivariant models.