Rational KANs (rKANs): Efficient Function Approximation
- Rational KANs (rKANs) are deep learning architectures that replace spline bases with rational function units to achieve efficient, high-precision function approximation.
- The framework leverages Safe Padé units, using Padé approximants and rational Jacobi polynomials, to enhance expressivity and ensure numerical stability during training.
- Grouped Rational KANs (GR-KANs) reduce parameter overhead by sharing rational bases across channel groups, yielding robust performance in tasks such as medical image segmentation.
Rational Kolmogorov-Arnold Networks (rKANs) are a class of deep learning architectures that generalize the Kolmogorov-Arnold representation using rational function basis units, such as Padé approximants and rational Jacobi polynomials, to enhance expressive power and computational efficiency. Originating as an evolution of classical Kolmogorov-Arnold networks (KANs), rKANs address the implementation complexity and parameter inefficiency of earlier spline-based KANs, offering robust alternatives for high-precision function approximation and data-efficient representation learning in both classical and modern neural architectures (Aghaei, 2024, Sapkota et al., 6 Nov 2025).
1. The Kolmogorov-Arnold Framework and Rational Basis Functions
The Kolmogorov-Arnold superposition theorem states that any continuous multivariate function on a compact domain can, in principle, be decomposed into sums of univariate "edge" functions and an outer univariate function. KANs implement this concept by learning banks of univariate basis functions along each network channel.
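For reference, the classical form of the theorem for a continuous function $f : [0,1]^n \to \mathbb{R}$ can be written as below. The symbols $\Phi_q$ and $\phi_{q,p}$ are notation chosen for this illustration and are not taken from the cited papers.

$$
f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)
$$

Here the $\phi_{q,p}$ play the role of the inner univariate "edge" functions and the $\Phi_q$ that of the outer univariate functions; a KAN layer learns parameterized versions of both.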
To replace the B-spline bases historically used in KANs (which require many parameters and knots, making implementation and training complex), rKANs introduce rational function bases. The principal formulation is the "Safe Padé Unit", defined by:
$$
\phi(x) = w\,F(x) = w\,\frac{P(x)}{1 + |Q(x)|}
$$
where:
- $P(x) = a_0 + a_1 x + \dots + a_m x^m$ (degree-$m$ polynomial)
- $Q(x) = b_1 x + \dots + b_n x^n$ (degree-$n$ polynomial, no constant term)
- $w$ is a learnable scalar unique to each input–output channel edge
Empirical evidence suggests that suitable choices of the degrees $m$ and $n$ provide favorable expressivity-to-stability tradeoffs. Because the denominator $1+|Q(x)|$ is never smaller than 1, the unit has no poles, which helps prevent gradient explosions during training. Classical approximation theory guarantees that rational functions are dense in the space of continuous functions on a compact interval, and Padé-type approximants can represent functions with sharp transitions or near-singularities more compactly than pure polynomials (Aghaei, 2024, Sapkota et al., 6 Nov 2025).
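A minimal sketch of a single Safe Padé unit may make the role of the guarded denominator concrete. The function names `horner` and `safe_pade` and the coefficient values are illustrative assumptions, not taken from the cited papers.

```python
def horner(coeffs, x):
    """Evaluate sum(coeffs[k] * x**k) using Horner's scheme."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def safe_pade(x, a, b, w=1.0):
    """Safe Pade unit: w * P(x) / (1 + |Q(x)|).

    a = [a_0, ..., a_m] are the coefficients of the numerator P.
    b = [b_1, ..., b_n] are the coefficients of Q (no constant term).
    """
    p = horner(a, x)
    q = x * horner(b, x)  # multiplying by x removes the constant term
    return w * p / (1.0 + abs(q))


# Illustrative (not learned) coefficients with m = 2 and n = 1.
a = [0.0, 1.0, 0.5]
b = [-1.0]
for x in (-2.0, 0.0, 1.0, 3.0):
    print(x, safe_pade(x, a, b, w=2.0))
```

With these coefficients the unguarded form $P(x)/(1+Q(x))$ would divide by zero at $x = 1$, whereas the guarded denominator equals 2 there and the unit stays finite.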
2. Group Rational KANs (GR-KANs) and Parameter Efficiency
Vanilla spline-based KANs require a distinct basis for each input-to-output channel mapping, implying $d_{\text{in}} \times d_{\text{out}}$ univariate basis functions and their respective knot parameters, a strategy that incurs significant parameter and memory overhead.
The grouped parameterization of rKANs, termed "Group Rational KANs" (GR-KANs), partitions the $d_{\text{in}}$ input channels into $g$ groups. Within each group, all channels share a single rational base function $F(x)$, defined by a common set of coefficients $\{a_i\}, \{b_j\}$. Each input–output edge retains a learnable scalar $w$, ensuring per-edge adaptivity while vastly reducing the total number of parameters:
- Unique rational coefficients: $g \times (m + n + 1)$ (the $m + 1$ numerator and $n$ denominator coefficients of each group)
- Additional scalar weights: $d_{\text{in}} \times d_{\text{out}}$ (as in a linear layer)
This grouped strategy substantially compresses the parameter space compared to the $O(d_{\text{in}} d_{\text{out}})$ scaling of basis functions in vanilla KANs, yielding marked efficiency gains (Sapkota et al., 6 Nov 2025).
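The grouped layer described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names, the equal-sized contiguous grouping, the absence of a bias term, and the per-edge baseline used for comparison are all hypothetical.

```python
import random


def rational(x, a, b):
    """Safe rational F(x) = P(x) / (1 + |Q(x)|); Q has no constant term."""
    p = sum(a_k * x ** k for k, a_k in enumerate(a))
    q = sum(b_k * x ** (k + 1) for k, b_k in enumerate(b))
    return p / (1.0 + abs(q))


def gr_kan_forward(x, group_coeffs, weights):
    """Map a d_in vector to a d_out vector.

    group_coeffs: one (a, b) pair per group, shared by all channels in it.
    weights: d_out rows, each holding d_in per-edge scalars.
    """
    d_in, g = len(x), len(group_coeffs)
    assert d_in % g == 0, "this sketch assumes equal-sized contiguous groups"
    size = d_in // g
    # Apply each group's shared rational function to its channels.
    activated = [rational(x_i, *group_coeffs[i // size]) for i, x_i in enumerate(x)]
    # The per-edge scalars then mix channels like a linear layer.
    return [sum(w * h for w, h in zip(row, activated)) for row in weights]


def gr_kan_param_count(d_in, d_out, g, m, n):
    """g * (m + n + 1) shared coefficients plus d_in * d_out edge scalars."""
    return g * (m + n + 1) + d_in * d_out


def per_edge_param_count(d_in, d_out, m, n):
    """Hypothetical baseline: a separate rational basis on every edge."""
    return d_in * d_out * (m + n + 1)


# Hypothetical sizes, chosen only for illustration.
d_in, d_out, g, m, n = 8, 4, 2, 3, 2
rng = random.Random(0)
group_coeffs = [
    ([rng.uniform(-1, 1) for _ in range(m + 1)], [rng.uniform(-1, 1) for _ in range(n)])
    for _ in range(g)
]
weights = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
y = gr_kan_forward([0.1 * i for i in range(d_in)], group_coeffs, weights)
print(len(y))                                    # 4 outputs
print(gr_kan_param_count(d_in, d_out, g, m, n))  # 2 * 6 + 32 = 44
print(per_edge_param_count(d_in, d_out, m, n))   # 32 * 6 = 192
```

Even at this toy size, sharing coefficients per group brings the count close to that of a plain linear layer (32 weights), while the per-edge baseline grows with the number of coefficients on every edge.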
| Model | FFN Type | Residual Conv? | GFLOPs | # Params |
|---|---|---|---|---|
| SwinUNETR | MLP | No | 1.2500 | 6.302M |
| UKAST | GR-KAN | No | 1.2467 | 6.302M + 608 |
| SwinUNETR+RC | MLP | Yes | 1.4419 | 7.183M |
| UKAST+RC | GR-KAN | Yes | 1.4386 | 7.184M |
This table shows that substituting MLPs with GR-KANs reduces FLOPs by roughly 0.2–0.3%, at the cost of only several hundred additional parameters.
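The relative savings can be recomputed directly from the GFLOPs column of the table. The short check below uses only the tabulated values; the helper name `relative_reduction` is an illustrative choice.

```python
# GFLOPs taken from the table above: (MLP variant, GR-KAN variant).
pairs = {
    "without RC": (1.2500, 1.2467),
    "with RC": (1.4419, 1.4386),
}


def relative_reduction(baseline, variant):
    """Percentage by which `variant` is cheaper than `baseline`."""
    return 100.0 * (baseline - variant) / baseline


for name, (mlp, gr_kan) in pairs.items():
    print(f"{name}: {relative_reduction(mlp, gr_kan):.2f}% fewer GFLOPs")
# without RC: 0.26% fewer GFLOPs
# with RC: 0.23% fewer GFLOPs
```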
3. Network Integration and Architectural Usage
rKANs can function as core components in both classical and transformer-based neural architectures. In the UKAST architecture for medical image segmentation, GR-KANs are employed as drop-in replacements for the feedforward MLP+GELU blocks in the Swin Transformer stages. The architecture comprises:
- An encoder of four Swin Transformer stages, each stage using residual convolutions followed by two GR-KAN-augmented feedforward blocks.
- A CNN-style decoder with upsampling and skip connections for segmentation.
A single stage of the Swin+GR-KAN block operates as follows (suppressing stage superscripts):
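- $x_0 = x + \mathrm{Conv}(x)$ (residual convolution)
- $\hat{z}_1 = x_0 + \text{W-MSA}(\mathrm{LN}(x_0))$ (window self-attention)
- $z_1 = \hat{z}_1 + \text{GR-KAN}(\mathrm{LN}(\hat{z}_1))$ (feedforward block 1)
- $\hat{z}_2 = z_1 + \text{SW-MSA}(\mathrm{LN}(z_1))$ (shifted-window attention)
- $z_2 = \hat{z}_2 + \text{GR-KAN}(\mathrm{LN}(\hat{z}_2))$ (feedforward block 2)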
Every GR-KAN layer maintains normalization and residual pathways, substituting only the nonlinearity and basis encoding mechanisms (Sapkota et al., 6 Nov 2025).
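The order of operations in one stage can be sketched as a simple composition of callables. The step order follows the list above, while the pre-norm residual form, the function name `swin_gr_kan_stage`, and the use of one shared `norm` callable are assumptions of this sketch (a real block would typically hold a separate normalization layer per sublayer).

```python
def swin_gr_kan_stage(x, conv, w_msa, sw_msa, ffn1, ffn2, norm):
    """Wire the five steps of one stage with residual connections.

    Every argument after `x` is a callable supplied by the caller.
    """
    x = x + conv(x)          # residual convolution
    x = x + w_msa(norm(x))   # window self-attention
    x = x + ffn1(norm(x))    # feedforward block 1 (GR-KAN)
    x = x + sw_msa(norm(x))  # shifted-window attention
    x = x + ffn2(norm(x))    # feedforward block 2 (GR-KAN)
    return x


# Scalar stand-ins: with zero-valued sublayers the stage is the identity,
# which shows that the residual pathways pass the input straight through.
zero = lambda v: 0.0
same = lambda v: v
print(swin_gr_kan_stage(3.0, zero, zero, zero, zero, zero, norm=same))  # 3.0
```

Replacing the MLP with a GR-KAN changes only what `ffn1` and `ffn2` compute; the surrounding wiring stays the same.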
4. Training Protocols and Regularization
For rKAN-based models as in the UKAST system, the standard training regime includes:
- Optimizer: AdamW with learning rate and weight decay
- Training duration: 400 epochs with cosine annealing learning rate, batch size 24, single NVIDIA A10 GPU
- Loss: Combined Dice and cross-entropy
- Data augmentation: Random 320×320 crops, flips, rotations, Gaussian noise (masks matched to augmentations)
- Inference: Overlapping sliding windows (50% overlap)
- Regularization: The intrinsic Safe Padé denominator ($1+|Q(x)|$) provides gradient stability, precluding the need for additional dropout or weight clipping beyond routine weight decay (Sapkota et al., 6 Nov 2025).
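Three pieces of this protocol lend themselves to a short sketch: the cosine-annealed learning rate, a combined Dice and cross-entropy loss, and the placement of overlapping inference windows. Everything below is an illustration under assumptions: the loss is written for the binary case with equal weighting of the two terms, the base learning rate is a placeholder, and the function names are hypothetical. The source does not specify these details.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cos)


def dice_ce_loss(probs, targets, eps=1e-6):
    """Soft Dice loss plus binary cross-entropy over flattened pixels."""
    inter = sum(p * t for p, t in zip(probs, targets))
    dice = (2.0 * inter + eps) / (sum(probs) + sum(targets) + eps)
    ce = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() stays finite
        ce -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return (1.0 - dice) + ce / len(probs)


def window_starts(length, window, overlap=0.5):
    """Start offsets of sliding windows covering [0, length)."""
    if length <= window:
        return [0]
    stride = max(1, int(window * (1.0 - overlap)))
    starts = list(range(0, length - window, stride))
    starts.append(length - window)  # final window sits flush with the edge
    return starts


base_lr = 1e-3  # placeholder value, not the setting used in the source
print([round(cosine_lr(e, 400, base_lr), 6) for e in (0, 100, 200, 300, 400)])
print(dice_ce_loss([0.9, 0.1, 0.8], [1, 0, 1]))
print(window_starts(640, 320))  # [0, 160, 320]
```

A 50% overlap with 320-pixel windows gives a stride of 160 pixels, so each interior pixel is covered by two windows along each axis and the overlapping predictions can be averaged.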
5. Empirical Evaluation and Ablation Analyses
Extensive benchmarking across diverse datasets substantiates the performance advantages of rKANs and GR-KANs, especially regarding expressivity and data efficiency. Selected quantitative results are outlined below:
Full-data Dice Score (%)
| Dataset | U-Net | UNETR | SwinUNETR+RC | UKAST (Δ vs. SwinUNETR+RC) |
|---|---|---|---|---|
| Kvasir2D | 82.3 | 63.7 | 81.9 | 81.7 (–0.2) |
| ISIC2D | 79.3 | 74.7 | 78.9 | 79.9 (+1.0) |
| BCV3D | 70.0 | 52.7 | 68.9 | 71.2 (+2.3) |
| MMWHS3D | 73.4 | 70.3 | 80.4 | 80.8 (+0.4) |
In scarce-annotation regimes (10% of training data), UKAST shows enhanced robustness:
- BCV3D@10%: SwinUNETR+RC 60.1 → UKAST 63.9 (+3.8)
- ISIC2D@10%: SwinUNETR+RC 69.8 → UKAST 71.4 (+1.6)
Ablation Across Architectures (Average Dice %):
| Configuration | 2D | 3D |
|---|---|---|
| ViT + MLP | 69.2 | 61.5 |
| ViT + GR-KAN | 72.7 | 66.5 |
| SwinT + MLP | 77.1 | 70.5 |
| SwinT + GR-KAN | 79.2 | 71.3 |
| SwinT + RC + MLP | 80.4 | 74.7 |
| SwinT + RC + GR-KAN | 80.8 | 76.0 |
These results indicate that substituting MLPs with GR-KANs improves average Dice in every configuration, on both 2D and 3D segmentation, with gains ranging from +0.4 to +5.0 points and the largest gains on the ViT backbone. This is consistent with the added expressivity of rational function bases (Sapkota et al., 6 Nov 2025).
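The per-configuration gains can be read off the ablation table with a few lines of arithmetic. The snippet below uses only the tabulated averages; the dictionary layout is an illustrative choice.

```python
# Average Dice (%) from the ablation table: (MLP, GR-KAN) per backbone.
ablation = {
    "ViT": {"2D": (69.2, 72.7), "3D": (61.5, 66.5)},
    "SwinT": {"2D": (77.1, 79.2), "3D": (70.5, 71.3)},
    "SwinT + RC": {"2D": (80.4, 80.8), "3D": (74.7, 76.0)},
}

gains = {
    (backbone, dim): round(gr_kan - mlp, 1)
    for backbone, dims in ablation.items()
    for dim, (mlp, gr_kan) in dims.items()
}
for key, gain in gains.items():
    print(key, f"{gain:+.1f}")
print("min gain:", min(gains.values()), "max gain:", max(gains.values()))
# min gain: 0.4 max gain: 5.0
```

The gain shrinks as the backbone becomes stronger: it is largest for ViT and smallest for SwinT with residual convolutions in 2D.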
6. Expressivity, Data Efficiency, and Practical Implications
The use of rational activation units, especially Padé-type, affords adaptable nonlinearities: these can stretch, saturate, or invert per channel, facilitating the modeling of complex geometries and sharp transitions with reduced parameter budgets. Theoretical guarantees for the density of rational functions in the space of continuous functions on compact domains, alongside their compact representation of challenging function classes, underpin the efficiency and generalizability seen in empirical evaluation.
The group-sharing (GR-KAN) architecture balances parameter efficiency with channel-wise adaptivity, allowing high modeling capacity even with small training datasets. The training stability of rKANs, which benefits from the $1+|Q(x)|$ denominator, supports deeper stacking than previous spline-based implementations.
These advances collectively establish rKANs as a practical and powerful general-purpose module for dense prediction tasks, especially suited to regimes with limited labeled data or complex target structures, such as those encountered in medical image segmentation (Aghaei, 2024, Sapkota et al., 6 Nov 2025).