
Grouped KAN Activation: Efficient Neural Nonlinearities

Updated 14 November 2025
  • Grouped KAN Activation is a scalable approach that partitions feature channels into groups, using a shared learnable nonlinearity (spline or rational) to maintain universal approximation while reducing complexity.
  • It significantly lowers parameter counts and computational costs by tying activations across groups, enabling efficient deployment in high-dimensional models such as speech enhancement and medical image segmentation.
  • Empirical results confirm that GKA improves model interpretability and performance, balancing expressivity and efficiency in applications ranging from equivariant learning to real-time inference.

Grouped KAN Activation (GKA) refers to a class of parameter-efficient, learnable nonlinearities within the Kolmogorov–Arnold Network (KAN) framework, wherein spline-based or rational activation functions are shared across defined channel groups rather than assigned per input or per edge. This approach addresses the scalability bottleneck of naive KAN activations: universal approximation power is maintained while parameter count and computational cost drop drastically, making KAN-based models viable for high-dimensional architectures in speech enhancement, medical image segmentation, and equivariant learning.

1. Mathematical Formulation and Principles

Grouped KAN Activation operates by partitioning the feature channels of a neural layer into disjoint groups, each assigned a learnable activation function (spline-based or rational). Formally, for an input tensor $T \in \mathbb{R}^{B \times N \times C}$ (batch $B$, spatial tokens $N$, channels $C$), the channels are split into $G$ groups of size $C_g = C / G$. Each group $g$ is equipped with either:

  • A groupwise collection of univariate spline activations $\{\varphi_{g,j}(\cdot) \mid j = 1, \ldots, C_g\}$, or
  • A single shared groupwise activation $\varphi_g(\cdot)$ applying identically across all $C_g$ channels (parameter tying).

For each group, the groupwise activation is applied channel-wise (diagonal map):

$$T'_{b,n,c} = \varphi_{g(c),\,j(c)}(T_{b,n,c}),$$

where $g(c) = \lceil c / C_g \rceil$ and $j(c) = c - (g(c)-1)\,C_g$. The functional form of each $\varphi_{g,j}(x)$ is typically a cubic B-spline expansion,

$$\varphi_{g,j}(x) = \sum_{k=0}^{K} a_{g,j,k}\, B_k^3(x),$$

with $K$ control points and basis functions $B_k^3(x)$, or a rational polynomial as in rational KAN variants:

$$\varphi_{\ell}(x) = \frac{\sum_{m=0}^{M} a_{\ell,m}\, x^m}{1 + \sum_{n=1}^{N} b_{\ell,n}\, x^n}.$$

All parameters $\{a_{g,j,k}\}$ are learned via backpropagation. Knot positions can be fixed (e.g., uniform over $[-3,3]$) or adapted in training.
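
To make the index maps and the spline expansion above concrete, the following sketch evaluates one grouped cubic-spline activation for a single scalar using scipy.interpolate.BSpline. The dimensions (C = 8, G = 2), the knot grid, and the random coefficients are illustrative assumptions, not values from the cited papers:

import numpy as np
from scipy.interpolate import BSpline

# Illustrative sizes: C = 8 channels split into G = 2 groups of size C_g = 4.
C, G = 8, 2
C_g = C // G

def group_index(c):            # g(c) = ceil(c / C_g), with channels indexed 1..C
    return int(np.ceil(c / C_g))

def within_group_index(c):     # j(c) = c - (g(c) - 1) * C_g
    return c - (group_index(c) - 1) * C_g

# Cubic B-spline phi_{g,j}(x) = sum_k a_{g,j,k} B_k^3(x) with uniform knots over [-3, 3].
degree = 3
num_coeffs = 10                                  # number of control points (illustrative)
knots = np.linspace(-3, 3, num_coeffs + degree + 1)
a = np.random.randn(G, C_g, num_coeffs)          # coefficients a_{g,j,k}; random stand-ins here

def phi(g, j, x):
    return BSpline(knots, a[g - 1, j - 1], degree, extrapolate=True)(x)

c = 6                                            # channel 6 lands in group g = 2, position j = 2
print(group_index(c), within_group_index(c))     # -> 2 2
print(phi(group_index(c), within_group_index(c), 0.5))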

2. Motivation and Comparison to Standard KAN

Vanilla KAN models assign an independent univariate activation $\varphi_{i,j}$ to each input–output pair (edge) in a layer, resulting in $O(I \cdot O \cdot P)$ parameters (input dimension $I$, output dimension $O$, and $P$ spline parameters per function), causing memory and compute overhead. In fully connected layers or high-dimensional convolutional blocks, this quadratic scaling is intractable.

Grouped KAN Activation reduces this burden by:

  • Sharing each $\varphi_g$ across all channels in a group, lowering parameter complexity from $O(C \cdot K)$ to $O(G \cdot K)$ with aggressive tying.
  • Encouraging parameter efficiency and regularity by constraining the diversity of channelwise nonlinearities, yielding improved robustness and interpretability.

This approach preserves KAN's universal function approximation ability while rendering large-scale deployments feasible in neural networks with substantial width (e.g., $C \gg 100$).
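
As a rough, back-of-the-envelope comparison using the complexity expressions above (the concrete layer sizes are illustrative, not taken from the cited papers):

# Illustrative layer: I = O = 512, with P = K = 12 spline parameters per activation.
I = O = 512
P = K = 12
C = 512     # channels entering the activation stage
G = 16      # number of groups

edge_wise_kan   = I * O * P    # vanilla KAN, one activation per edge: 3,145,728 parameters
per_channel_gka = C * K        # one spline per channel:                   6,144 parameters
grouped_gka     = G * K        # one shared spline per group:                192 parameters
print(edge_wise_kan, per_channel_gka, grouped_gka)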

3. Implementation: Layerwise Integration and Pseudocode

The GKA module is implemented as a post-linear (or convolutional) activation stage. For a batchwise tensor $T$:

  1. Channel grouping: Partition channels into GG groups.
  2. Vectorization: Flatten spatial dimensions for parallelized processing.
  3. Spline evaluation: For each scalar, identify the relevant knot interval and evaluate the spline expansion via the local basis.
  4. Concatenation: Concatenate groupwise outputs to reconstitute the original channel order.

Sample pseudocode for per-group, per-channel splines (from Li et al., 7 Nov 2025):

# Naive loop-based reference: apply the group-g, channel-j spline to every scalar in T.
# T: input of shape (B, N, C); T_out: output of the same shape.
# a[g][j][k]: spline coefficients, t: knot vector, d: spline degree (3 for cubic).
# find_knot_interval and B_d (the B-spline basis function) are assumed helper routines.
C_g = C // G
for g in range(G):
    for b in range(B):
        for n in range(N):
            for j in range(C_g):
                c = g * C_g + j                    # global channel index
                x = T[b][n][c]
                # Spline evaluation: y = sum_k a[g][j][k] * B_k^d(x)
                y = 0.0
                l = find_knot_interval(x, t)       # knot interval containing x
                for m in range(l - d, l + 1):      # only d + 1 basis functions are nonzero at x
                    y += a[g][j][m] * B_d(m, x, t)
                T_out[b][n][c] = y

For rational activations, the numerator and denominator coefficients are used in place of spline weights, with group index determining the subset.
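
In practice the loops above are replaced by a fully vectorized implementation. The sketch below implements a grouped rational activation of the form shown in Section 1 in PyTorch; the module name, the default polynomial degrees, the near-identity initialization, and the absolute value in the denominator (added to avoid poles) are assumptions of this sketch rather than details taken from the cited papers:

import torch
import torch.nn as nn

class GroupedRationalActivation(nn.Module):
    # Channels are split into G groups; all channels in a group share one rational
    # function P_g(x) / Q_g(x). Illustrative sketch, not a reference implementation.
    def __init__(self, channels, groups=8, num_degree=5, den_degree=4):
        super().__init__()
        assert channels % groups == 0, "channels must be divisible by groups"
        self.groups = groups
        self.group_size = channels // groups
        # Numerator coefficients a_{g,m}, m = 0..M, initialized near the identity map x -> x.
        a = torch.zeros(groups, num_degree + 1)
        a[:, 1] = 1.0
        self.a = nn.Parameter(a)
        # Denominator coefficients b_{g,n}, n = 1..N; zero init gives Q_g(x) = 1.
        self.b = nn.Parameter(torch.zeros(groups, den_degree))

    def forward(self, x):
        # x: (B, N, C) -> (B, N, G, C_g) so each group broadcasts against its own coefficients.
        B, N, C = x.shape
        xg = x.view(B, N, self.groups, self.group_size)
        M = self.a.shape[1] - 1
        pows = torch.stack([xg ** m for m in range(M + 1)], dim=-1)          # (B, N, G, C_g, M+1)
        num = (pows * self.a.view(1, 1, self.groups, 1, M + 1)).sum(-1)
        Nd = self.b.shape[1]
        pows_d = torch.stack([xg ** n for n in range(1, Nd + 1)], dim=-1)    # (B, N, G, C_g, Nd)
        # Denominator 1 + |sum_n b_n x^n|; the absolute value is an assumption of this sketch.
        den = 1.0 + (pows_d * self.b.view(1, 1, self.groups, 1, Nd)).sum(-1).abs()
        return (num / den).view(B, N, C)

act = GroupedRationalActivation(channels=256, groups=8)
y = act(torch.randn(2, 100, 256))     # output has the same shape: (2, 100, 256)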

4. Scalability and Parameter Efficiency

Grouped KAN Activation enables:

  • Reduction in activation parameters by up to $C_g$-fold, i.e., from $O(C \cdot K)$ to $O(G \cdot K)$.
  • Lower memory footprint and FLOPs per token: full per-channel GKA has $O(C)$ cost; with sharing, this drops to $O(G)$.
  • In models where further tying is feasible, memory and computational requirements decrease by an additional order, facilitating deployment in resource-constrained or real-time settings.

A plausible implication is that tuning $G$ allows navigation between expressivity (large $G$) and compactness/regularization (small $G$).

5. Empirical Applications and Observed Benefits

Speech Enhancement: In "From KAN to GR-KAN," Group-Rational KAN (GR-KAN) is inserted into DNN-based speech enhancement models (Demucs, MP-SENet), replacing ReLU or GELU activations with groupwise rational activations. Parameter counts decrease by up to 4x, inference cost rises only marginally (<10 GFLOPs), and perceptual quality (PESQ) improves by 0.05–0.10 points. Regularized group sharing with $k = 16\ldots24$ achieves near-identical capacity to vanilla KAN while remaining tractable (Li et al., 23 Dec 2024).

Medical Image Segmentation: In GroupKAN, GKA modules interleave learnable, groupwise spline activations into encoder–decoder Transformer architectures. On BUSI, GlaS, and CVC benchmarks, GroupKAN outperforms U-KAN (+1.11% IoU) and vanilla attention/convolution, while using only 47.6% of parameters (3.02M vs 6.35M) (Li et al., 7 Nov 2025). GKA also improves interpretability: learned activation curves can be directly inspected and often align with domain heuristics (e.g., suppressing negative activations, emphasizing small positives relevant to pathology signals).

Equivariant Learning: In tasks requiring matrix group-equivariance, grouping is exploited across tensor representation fibers, parameter sharing is enforced among equivalent features, and spline basis functions are gated and combined equivariantly. This parameter sharing further reduces redundancy while preserving symmetry constraints (Hu et al., 1 Oct 2024).

6. Architectural Variants and Hyperparameters

Key adjustable elements:

  • Groups $G$ (or $k$): trades expressivity against efficiency; typical choices $G = 8$, $16$, $24$.
  • Knots $K$: capacity of the spline basis; typically $K = 8\ldots12$.
  • Spline degree $d$: smoothness/flexibility; typically $d = 3$ (cubic).
  • Regularization $\lambda$: enforces spline smoothness and prevents oscillation; typically $\lambda \in [10^{-4}, 10^{-2}]$.
  • Knot range: domain coverage; typically $[-3, 3]$.

Empirical results suggest the "sweet spot" for $G$ lies between maximal sharing ($G = 1$) and maximal flexibility ($G = C$).
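
For reference, these hyperparameters might be collected into a configuration such as the following; the key names are hypothetical and chosen only to mirror the ranges listed above:

# Illustrative GKA configuration; key names are hypothetical, values follow the typical ranges above.
gka_config = {
    "groups": 16,                 # G: expressivity vs. efficiency (8, 16, or 24 typical)
    "num_knots": 10,              # K: capacity of the spline basis (8-12 typical)
    "spline_degree": 3,           # d: cubic splines
    "smoothness_lambda": 1e-3,    # regularization strength, typically in [1e-4, 1e-2]
    "knot_range": (-3.0, 3.0),    # domain covered by the knot grid
}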

7. Interpretability, Regularization, and Practical Considerations

Grouped KAN Activation provides structured, domain-adaptive nonlinearity that is both data-driven and easily visualized. End-of-training plots of spline shapes can reveal function regimes exploited by the network, aiding clinical or scientific interpretability. Regularization on spline weights (e.g., second finite differences) is recommended to avoid overfitting or oscillatory activations. In practice, linear initialization of knots and careful group size selection are effective. When used with convolutional or Transformer blocks, GKA can be implemented fully vectorized for minimal runtime overhead.
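
A smoothness penalty of the kind mentioned above can be written as a squared second finite difference over each spline's coefficients along the knot axis; the sketch below assumes coefficients stored as a tensor of shape (G, C_g, K) and is illustrative rather than taken from the cited implementations:

import torch

def spline_smoothness_penalty(a, lam=1e-3):
    # a: spline coefficients of shape (G, C_g, K), with K indexing the control points.
    # Penalizes a[..., k+1] - 2*a[..., k] + a[..., k-1], discouraging oscillatory activations.
    second_diff = a[..., 2:] - 2.0 * a[..., 1:-1] + a[..., :-2]
    return lam * second_diff.pow(2).sum()

# During training this is added to the task loss, e.g.:
# loss = task_loss + spline_smoothness_penalty(spline_coeffs)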

Grouped KAN Activation thus anchors the recent advances in practical KAN deployment, providing a scalable, interpretable, and high-performance alternative to conventional activation functions, and enabling plug-and-play integration into modern neural architectures for domains requiring both efficiency and expressive nonlinearity (Li et al., 23 Dec 2024, Li et al., 7 Nov 2025, Hu et al., 1 Oct 2024).
