Grouped KAN Activation: Efficient Neural Nonlinearities
- Grouped KAN Activation is a scalable approach that partitions feature channels into groups, assigning each group a shared learnable nonlinearity (spline or rational) to maintain universal approximation while reducing complexity.
- It significantly lowers parameter counts and computational cost by tying activations across the channels within each group, enabling efficient deployment in high-dimensional models such as speech enhancement and medical image segmentation.
- Empirical results confirm that GKA improves model interpretability and performance, balancing expressivity and efficiency in applications ranging from equivariant learning to real-time inference.
Grouped KAN Activation (GKA) refers to a class of parameter-efficient, learnable nonlinearities within the Kolmogorov–Arnold Network (KAN) framework, wherein spline-based or rational activation functions are shared across defined channel groups rather than per-input or per-edge. This approach addresses the scalability bottleneck of naive KAN activations, maintaining universal approximation power while drastically reducing the parameter count and computational cost, making KAN-based models viable for high-dimensional architectures in speech enhancement, medical image segmentation, and equivariant learning.
1. Mathematical Formulation and Principles
Grouped KAN Activation operates by partitioning the feature channels of a neural layer into disjoint groups, each assigned a learnable activation function (spline-based or rational). Formally, for an input tensor $T \in \mathbb{R}^{B \times N \times C}$ (batch $B$, spatial tokens $N$, channels $C$), the channels are split into $G$ groups of size $C_g = C/G$. Each group is equipped with either:
- A groupwise collection of univariate spline activations $\{\phi_{g,j}\}_{j=0}^{C_g-1}$, or
- A single shared groupwise activation $\phi_g$ applied identically across all channels of the group (parameter tying).
For each group, the groupwise activation is applied channel-wise (diagonal map):

$$y_{b,n,c} = \phi_{g,j}(x_{b,n,c}), \qquad c = g\,C_g + j,$$

where $g \in \{0,\dots,G-1\}$ and $j \in \{0,\dots,C_g-1\}$. The functional form of each $\phi_{g,j}$ is typically a cubic B-spline expansion,

$$\phi_{g,j}(x) = \sum_{k} a_{g,j,k}\, B_k^d(x),$$

with control points $a_{g,j,k}$ and basis functions $B_k^d$, or a rational polynomial as in rational KAN variants:

$$\phi_g(x) = \frac{P_g(x)}{Q_g(x)} = \frac{\sum_{i=0}^{m} p_{g,i}\, x^i}{1 + \bigl|\sum_{i=1}^{n} q_{g,i}\, x^i\bigr|}.$$
All parameters are learned via backpropagation. Knot positions can be fixed (e.g., uniformly spaced over a bounded interval) or adapted during training.
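As a concrete illustration of the groupwise diagonal map, the following is a minimal NumPy/SciPy sketch that evaluates one shared cubic B-spline activation $\phi_g$ on every channel of a group. The clamped uniform knot vector on $[-1, 1]$ and the random stand-in coefficients are assumptions made for illustration; in practice the coefficients $a_{g,k}$ are learned by backpropagation.

```python
import numpy as np
from scipy.interpolate import BSpline

degree = 3                                    # cubic spline, d = 3
n_coef = 8                                    # number of control points a_{g,k}
# Clamped uniform knot vector on [-1, 1] (assumed domain)
interior = np.linspace(-1.0, 1.0, n_coef - degree + 1)
knots = np.concatenate([[-1.0] * degree, interior, [1.0] * degree])
assert len(knots) == n_coef + degree + 1      # B-spline knot-count identity

rng = np.random.default_rng(0)
coeffs = rng.normal(scale=0.1, size=n_coef)   # stand-ins for the learnable a_{g,k}
phi_g = BSpline(knots, coeffs, degree)        # one shared activation for the group

x_group = rng.standard_normal((4, 16))        # activations of one group: (tokens, C_g)
y_group = phi_g(np.clip(x_group, -1.0, 1.0))  # same phi_g applied to every channel
```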
2. Motivation and Comparison to Standard KAN
Vanilla KAN models assign an independent univariate activation to each input–output pair (edge) in a layer, resulting in $O(d_{\text{in}} \cdot d_{\text{out}} \cdot K)$ activation parameters (input dimension $d_{\text{in}}$, output dimension $d_{\text{out}}$, $K$ spline parameters per function), which causes substantial memory and compute overhead. In fully connected layers or high-dimensional convolutional blocks, this quadratic scaling is intractable.
Grouped KAN Activation reduces this burden by:
- Sharing each $\phi_g$ across all channels in a group, lowering parameter complexity from $O(d_{\text{in}}\, d_{\text{out}}\, K)$ to $O(G K)$ with aggressive tying.
- Encouraging parameter efficiency and regularity by constraining the diversity of channelwise nonlinearities, yielding improved robustness and interpretability.
This approach preserves KAN's universal function approximation ability while rendering large-scale deployments feasible in neural networks of substantial width.
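The scaling argument can be made concrete with a short parameter count; the layer sizes and coefficient count below are illustrative assumptions, not values taken from the cited papers.

```python
def activation_param_counts(d_in: int, d_out: int, C: int, G: int, K: int):
    """Count learnable activation parameters under different tying schemes (illustrative)."""
    edge_wise   = d_in * d_out * K   # vanilla KAN: one K-parameter spline per edge
    per_channel = C * K              # one spline per output channel
    grouped     = G * K              # GKA with full tying: one shared spline per group
    return edge_wise, per_channel, grouped

# Example: a width-512 layer with K = 8 coefficients per activation and G = 8 groups
print(activation_param_counts(d_in=512, d_out=512, C=512, G=8, K=8))
# -> (2097152, 4096, 64)
```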
3. Implementation: Layerwise Integration and Pseudocode
The GKA module is implemented as a post-linear (or post-convolutional) activation stage. For a batchwise tensor $T \in \mathbb{R}^{B \times N \times C}$:
- Channel grouping: Partition the $C$ channels into $G$ groups of size $C_g = C/G$.
- Vectorization: Flatten spatial dimensions for parallelized processing.
- Spline evaluation: For each scalar, identify the relevant knot interval and evaluate the spline expansion via the local basis.
- Concatenation: Concatenate groupwise outputs to reconstitute the original channel order.
Sample pseudocode for per-group, per-channel splines (from (Li et al., 7 Nov 2025)):
```python
C_g = C // G
for g in range(G):
    for b in range(B):
        for n in range(N):
            for j in range(C_g):
                c = g * C_g + j
                x = T[b][n][c]
                # Spline evaluation: y = sum_k a[g][j][k] * B_k^d(x)
                y = 0
                l = find_knot_interval(x, t)   # locate the knot interval containing x
                for m in range(l - d, l + 1):  # only d+1 basis functions are nonzero
                    y += a[g][j][m] * B_d(m, x, t)
                T_out[b][n][c] = y
```
For rational activations, the numerator and denominator coefficients are used in place of spline weights, with group index determining the subset.
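Below is a vectorized PyTorch sketch of the group-rational case in the spirit of the loop pseudocode above; the class name, polynomial degrees, and near-identity initialization are illustrative assumptions rather than a reference implementation of GR-KAN.

```python
import torch
import torch.nn as nn

class GroupedRationalActivation(nn.Module):
    """Sketch: one learnable rational activation P_g/Q_g shared by each channel group."""
    def __init__(self, channels: int, groups: int = 8, num_degree: int = 5, den_degree: int = 4):
        super().__init__()
        assert channels % groups == 0, "channels must be divisible by groups"
        self.groups, self.group_size = groups, channels // groups
        # Numerator coefficients p[g, 0..m] and denominator coefficients q[g, 1..n]
        self.p = nn.Parameter(0.01 * torch.randn(groups, num_degree + 1))
        self.q = nn.Parameter(torch.zeros(groups, den_degree))
        with torch.no_grad():
            self.p[:, 1] = 1.0  # start near the identity: phi(x) ~ x

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        xg = x.view(B, N, self.groups, self.group_size)        # split channels into groups
        m, n = self.p.shape[1] - 1, self.q.shape[1]
        powers = torch.stack([xg ** i for i in range(max(m, n) + 1)], dim=-1)
        num = (powers[..., : m + 1] * self.p.view(1, 1, self.groups, 1, -1)).sum(-1)
        # "Safe" rational form: denominator 1 + |q_1 x + ... + q_n x^n| never vanishes
        den = 1.0 + (powers[..., 1 : n + 1] * self.q.view(1, 1, self.groups, 1, -1)).sum(-1).abs()
        return (num / den).view(B, N, C)

# Usage: drop-in replacement for ReLU/GELU after a linear or convolutional layer
act = GroupedRationalActivation(channels=64, groups=8)
y = act(torch.randn(2, 10, 64))   # shape (B, N, C) is preserved
```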
4. Scalability and Parameter Efficiency
Grouped KAN Activation enables:
- Reduction in activation parameters by up to a factor of $C/G$, i.e., from $O(CK)$ to $O(GK)$.
- Lower memory footprint and FLOPs per token: full per-channel GKA carries an $O(CK)$ activation-parameter cost; with groupwise sharing, this drops to $O(GK)$.
- In models where further tying is feasible, memory and computational requirements decrease by an additional order of magnitude, facilitating deployment in resource-constrained or real-time settings.
A plausible implication is that tuning $G$ allows navigating between expressivity (large $G$) and compactness/regularization (small $G$).
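A short sweep over $G$ (with assumed sizes) illustrates this trade-off in terms of the activation parameter budget:

```python
# Illustrative sweep: activation parameter budget versus group count G
# for an assumed layer with C channels and K coefficients per activation.
C, K = 512, 8
for G in (1, 8, 16, 64, C):
    params = G * K              # grouped activation parameters, O(GK)
    fold = (C * K) / params     # reduction relative to per-channel activations, C/G
    print(f"G={G:4d}  activation params={params:5d}  reduction vs. per-channel={fold:6.1f}x")
```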
5. Empirical Applications and Observed Benefits
Speech Enhancement: In "From KAN to GR-KAN," Group-Rational KAN (GR-KAN) is inserted into DNN-based speech enhancement models (Demucs, MP-SENet), replacing ReLU or GELU activations with groupwise rational activations. Parameter counts decrease by up to 4x, inference cost rises only marginally (10 GFLOPs), and perceptual quality (PESQ) improves by 0.05–0.10 points. Regularized group sharing achieves near-identical capacity to vanilla KAN while remaining tractable (Li et al., 23 Dec 2024).
Medical Image Segmentation: In GroupKAN, GKA modules interleave learnable, groupwise spline activations into encoder–decoder Transformer architectures. On BUSI, GlaS, and CVC benchmarks, GroupKAN outperforms U-KAN (+1.11% IoU) and vanilla attention/convolution, while using only 47.6% of parameters (3.02M vs 6.35M) (Li et al., 7 Nov 2025). GKA also improves interpretability: learned activation curves can be directly inspected and often align with domain heuristics (e.g., suppressing negative activations, emphasizing small positives relevant to pathology signals).
Equivariant Learning: In tasks requiring matrix group-equivariance, grouping is exploited across tensor representation fibers, parameter sharing is enforced among equivalent features, and spline basis functions are gated and combined equivariantly. This parameter sharing further reduces redundancy while preserving symmetry constraints (Hu et al., 1 Oct 2024).
6. Architectural Variants and Hyperparameters
Key adjustable elements:
| Parameter | Effect | Typical Choices |
|---|---|---|
| Number of groups $G$ (or group size $C_g$) | Expressivity vs. efficiency | $16$, $24$ |
| Knots | Capacity of the spline basis | uniformly spaced grid |
| Spline degree $d$ | Smoothness/flexibility | $d = 3$ (cubic) |
| Regularization | Spline smoothness; prevents oscillation | second-finite-difference penalty on spline weights |
| Knot range | Domain coverage | fixed interval covering the expected activation range |
Empirical results suggest the "sweet spot" for $G$ balances between maximal sharing ($G = 1$) and maximal flexibility ($G = C$, one activation per channel).
7. Interpretability, Regularization, and Practical Considerations
Grouped KAN Activation provides structured, domain-adaptive nonlinearity that is both data-driven and easily visualized. End-of-training plots of spline shapes can reveal function regimes exploited by the network, aiding clinical or scientific interpretability. Regularization on spline weights (e.g., second finite differences) is recommended to avoid overfitting or oscillatory activations. In practice, linear initialization of knots and careful group size selection are effective. When used with convolutional or Transformer blocks, GKA can be implemented fully vectorized for minimal runtime overhead.
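A minimal sketch of the second-finite-difference regularizer mentioned above is given below; the attribute name `gka_layer.coeffs` and the penalty weight are assumptions for illustration.

```python
import torch

def spline_smoothness_penalty(coeffs: torch.Tensor, weight: float = 1e-3) -> torch.Tensor:
    """Second-finite-difference penalty on spline control points.

    coeffs: tensor of B-spline coefficients a_{g,j,k} with the spline index k last,
    e.g. shape (G, C_g, K). Penalizing a_{k+1} - 2 a_k + a_{k-1} discourages
    oscillatory activation curves.
    """
    second_diff = coeffs[..., 2:] - 2.0 * coeffs[..., 1:-1] + coeffs[..., :-2]
    return weight * second_diff.pow(2).mean()

# Added to the task loss during training, e.g. (hypothetical attribute name):
# loss = task_loss + spline_smoothness_penalty(gka_layer.coeffs)
```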
Grouped KAN Activation thus anchors recent advances in practical KAN deployment, providing a scalable, interpretable, and high-performance alternative to conventional activation functions and enabling plug-and-play integration into modern neural architectures for domains requiring both efficiency and expressive nonlinearity (Li et al., 23 Dec 2024; Li et al., 7 Nov 2025; Hu et al., 1 Oct 2024).