
Rational KANs (rKANs): Efficient Function Approximation

Updated 25 February 2026
  • Rational KANs (rKANs) are deep learning architectures that replace spline bases with rational function units to achieve efficient, high-precision function approximation.
  • The framework leverages Safe Padé units, using Padé approximants and rational Jacobi polynomials, to enhance expressivity and ensure numerical stability during training.
  • Group Rational KANs (GR-KANs) reduce parameter overhead by sharing rational bases across channel groups, yielding robust performance in tasks such as medical image segmentation.

Rational Kolmogorov-Arnold Networks (rKANs) are a class of deep learning architectures that generalize the Kolmogorov-Arnold representation using rational function basis units, such as Padé approximants and rational Jacobi polynomials, to enhance expressive power and computational efficiency. Developed as an evolution of classical Kolmogorov-Arnold networks (KANs), rKANs address the implementation complexity and parameter inefficiency of earlier spline-based KANs, offering robust alternatives for high-precision function approximation and data-efficient representation learning in both classical and modern neural architectures (Aghaei, 2024; Sapkota et al., 6 Nov 2025).

1. The Kolmogorov-Arnold Framework and Rational Basis Functions

The Kolmogorov-Arnold superposition theorem states that any continuous multivariate function $f(x_1,\dots,x_d)$ on a compact domain can, in principle, be decomposed into sums of univariate "edge" functions and an outer univariate function. KANs implement this concept by learning banks of univariate basis functions $\varphi(x)$ along each network channel.
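For reference, the classical statement for a continuous $f$ on $[0,1]^d$ is usually written as below. The symbols $\Phi_q$ and $\phi_{q,p}$ are notation introduced here for illustration and are not used elsewhere in this article.

```latex
f(x_1,\dots,x_d) \;=\; \sum_{q=0}^{2d} \Phi_q\!\left( \sum_{p=1}^{d} \phi_{q,p}(x_p) \right)
```

Each $\phi_{q,p}$ (inner) and $\Phi_q$ (outer) is a continuous function of a single variable, so the only genuinely multivariate operation is addition.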

To replace the B-spline bases historically used in KANs (which require a high number of parameters and knots, making implementation and training complex), rKANs introduce rational function bases. The principal formulation is the "Safe Padé Unit", defined by:

$$\varphi(x) = w \cdot F(x) = w \cdot \frac{P(x)}{1 + |Q(x)|}$$

where:

  • $P(x) = a_0 + a_1 x + \cdots + a_m x^m$ (degree-$m$ polynomial)
  • $Q(x) = b_1 x + \cdots + b_n x^n$ (degree-$n$ polynomial, no constant term)
  • $w$ is a learnable scalar, unique to each input–output channel edge

Empirical evidence suggests that $m=3$, $n=4$ provides a favorable expressivity-to-stability tradeoff. Because the denominator satisfies $1+|Q(x)| \ge 1$ for every $x$, $F$ has no poles, which guards against the exploding activations and gradients that an unconstrained rational function can produce during training. Rational functions are dense in $\mathcal{C}[a, b]$ (they include the polynomials, so the Weierstrass approximation theorem applies), and classical rational approximation theory shows that they can represent functions with sharp transitions or near-singularities with fewer parameters than pure polynomials (Aghaei, 2024; Sapkota et al., 6 Nov 2025).
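The Safe Padé unit defined above can be sketched in a few lines of plain Python. This is a minimal scalar illustration, not the reference implementation from the cited papers; the function names `horner` and `safe_pade` and the coefficient values are hypothetical.

```python
def horner(coeffs, x):
    """Evaluate c0 + c1*x + ... + ck*x^k using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def safe_pade(x, a, b, w=1.0):
    """Safe Pade unit: w * P(x) / (1 + |Q(x)|).

    a = [a0, ..., am] are the numerator coefficients.
    b = [b1, ..., bn] are the denominator coefficients (Q has no constant term).
    """
    p = horner(a, x)
    q = x * horner(b, x)  # b1*x + b2*x^2 + ... + bn*x^n
    return w * p / (1.0 + abs(q))  # denominator is always >= 1


# Hypothetical coefficients with m = 3 and n = 4.
a = [0.5, 1.0, 0.0, -0.2]
b = [0.0, 1.0, 0.0, 0.5]
for x in (-2.0, 0.0, 2.0):
    print(x, safe_pade(x, a, b))
```

Since $Q$ has no constant term, $Q(0) = 0$ and the unit evaluates to $w \cdot a_0$ at the origin.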

2. Group Rational KANs (GR-KANs) and Parameter Efficiency

Vanilla spline-based KANs require a distinct basis function $\varphi_{ij}(\cdot)$ for each input-to-output channel mapping, implying $d_\mathrm{in}\times d_\mathrm{out}$ univariate basis functions, each with its own knot parameters. This strategy incurs significant parameter and memory overhead.

The grouped parameterization of rKANs, termed "Group Rational KANs" (GR-KANs), partitions input channels into $g$ groups (e.g., $g=8$). Within each group, all channels share a single rational basis function $F_g(x)$, defined by a common set of coefficients $\{a_0, \ldots, a_m, b_1, \ldots, b_n\}$. Each input–output edge retains a learnable scalar $w_{ij}$, ensuring per-edge adaptivity while vastly reducing the total number of parameters:

  • Unique rational coefficients: $g\cdot(m+1+n)$
  • Additional scalar weights: $d_\mathrm{in} \cdot d_\mathrm{out}$ (as in a linear layer)

This grouped strategy substantially compresses the parameter space compared to the $d_\mathrm{in}\cdot d_\mathrm{out}\cdot\#(\mathrm{spline~knots})$ scaling of vanilla KANs, yielding marked efficiency gains (Sapkota et al., 6 Nov 2025).
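A minimal sketch of a GR-KAN layer's forward pass and parameter count follows, in plain Python. It assumes that channels are split into $g$ contiguous, equally sized groups and that there is no bias term; neither detail is specified above. All function names, the layer width of 96, and the figure of 8 parameters per spline edge are hypothetical.

```python
def rational(x, a, b):
    """Shared rational function F(x) = P(x) / (1 + |Q(x)|)."""
    p = 0.0
    for c in reversed(a):  # Horner's rule for P
        p = p * x + c
    q = 0.0
    for c in reversed(b):  # Horner's rule for Q(x) / x
        q = q * x + c
    q *= x                 # Q has no constant term
    return p / (1.0 + abs(q))


def gr_kan_forward(x, group_coeffs, weights):
    """y_j = sum_i w_ji * F_group(i)(x_i).

    x: list of d_in inputs.
    group_coeffs: list of g (a, b) coefficient pairs, one per group.
    weights: d_out rows of d_in scalars (one scalar per edge).
    Assumption: channel i belongs to group i // (d_in // g).
    """
    d_in, g = len(x), len(group_coeffs)
    assert d_in % g == 0, "d_in must be divisible by the number of groups"
    size = d_in // g
    activated = [rational(xi, *group_coeffs[i // size]) for i, xi in enumerate(x)]
    return [sum(w * v for w, v in zip(row, activated)) for row in weights]


def gr_kan_param_count(d_in, d_out, g, m, n):
    """g * (m + 1 + n) shared coefficients plus one scalar per edge."""
    return g * (m + 1 + n) + d_in * d_out


def spline_kan_param_count(d_in, d_out, params_per_edge):
    """Every edge owns its own set of spline parameters."""
    return d_in * d_out * params_per_edge


# Two groups over four channels: group 0 is F(x) = x, group 1 is F(x) = 1 / (1 + |x|).
coeffs = [([0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]),
          ([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])]
y = gr_kan_forward([1.0, 2.0, 3.0, -1.0], coeffs,
                   [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 4.0, 2.0]])
print(y)
print(gr_kan_param_count(96, 96, g=8, m=3, n=4))
print(spline_kan_param_count(96, 96, params_per_edge=8))
```

Because the rational function is shared within a group, it is evaluated once per input channel and the per-edge scalars then act like an ordinary linear layer.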

| Model | FFN type | Residual conv.? | GFLOPs | # Params |
|---|---|---|---|---|
| SwinUNETR | MLP | No | 1.2500 | 6.302M |
| UKAST | GR-KAN | No | 1.2467 | 6.302M + 608 |
| SwinUNETR+RC | MLP | Yes | 1.4419 | 7.183M |
| UKAST+RC | GR-KAN | Yes | 1.4386 | 7.184M |

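This table shows that substituting MLPs with GR-KANs lowers compute by 0.0033 GFLOPs in both configurations, a relative reduction of approximately 0.2–0.3%, while adding 608 parameters without residual convolutions and roughly 1K with them (at the reported precision).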
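The relative change in compute can be recomputed directly from the GFLOPs column of the table. The snippet below is only this arithmetic; the helper name `relative_reduction_percent` is hypothetical.

```python
# (baseline GFLOPs, GR-KAN variant GFLOPs), copied from the table above.
rows = {
    "SwinUNETR -> UKAST": (1.2500, 1.2467),
    "SwinUNETR+RC -> UKAST+RC": (1.4419, 1.4386),
}


def relative_reduction_percent(baseline, variant):
    """Percentage by which `variant` is cheaper than `baseline`."""
    return 100.0 * (baseline - variant) / baseline


for name, (baseline, variant) in rows.items():
    print(f"{name}: {relative_reduction_percent(baseline, variant):.3f}%")
```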

3. Network Integration and Architectural Usage

rKANs can function as core components in both classical and transformer-based neural architectures. In the UKAST architecture for medical image segmentation, GR-KANs are employed as drop-in replacements for the feedforward MLP+GELU blocks in the Swin Transformer stages. The architecture comprises:

  • An encoder of four Swin Transformer stages, each stage using residual convolutions followed by two GR-KAN-augmented feedforward blocks.
  • A CNN-style decoder with upsampling and skip connections for segmentation.

A single stage of the Swin+GR-KAN block operates as follows (suppressing stage superscripts):

  1. $v_0 = \mathrm{RC}(z_{in})$ (residual convolution)
  2. $\hat{h}_1 = \text{W-MSA}(\text{LN}(v_0)) + v_0$ (window self-attention)
  3. $z_1 = \text{GR-KAN}(\text{LN}(\hat{h}_1)) + \hat{h}_1$ (feedforward block 1)
  4. $\hat{h}_2 = \text{SW-MSA}(\text{LN}(z_1)) + z_1$ (shifted-window attention)
  5. $z_{out} = \text{GR-KAN}(\text{LN}(\hat{h}_2)) + \hat{h}_2$ (feedforward block 2)

Every GR-KAN layer maintains normalization and residual pathways, substituting only the nonlinearity and basis encoding mechanisms (Sapkota et al., 6 Nov 2025).
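The five steps above can be sketched as a plain data-flow function. This is only an illustration of the residual wiring: every sublayer is passed in as a callable, the function and argument names are hypothetical, and the use of separate `gr_kan_1` and `gr_kan_2` callables (rather than one shared module) is an assumption.

```python
def swin_gr_kan_stage(z_in, rc, ln, w_msa, sw_msa, gr_kan_1, gr_kan_2):
    """Residual wiring of one Swin + GR-KAN block pair.

    Each argument after z_in is a callable standing in for a sublayer.
    In a real model each LN would have its own parameters; a single `ln`
    callable is used here for brevity.
    """
    v0 = rc(z_in)                  # 1. residual convolution
    h1 = w_msa(ln(v0)) + v0        # 2. window self-attention + skip
    z1 = gr_kan_1(ln(h1)) + h1     # 3. feedforward block 1 + skip
    h2 = sw_msa(ln(z1)) + z1       # 4. shifted-window attention + skip
    z_out = gr_kan_2(ln(h2)) + h2  # 5. feedforward block 2 + skip
    return z_out


# Toy check with scalar "tensors" and identity sublayers:
# each residual step doubles the value, so 1.0 passes through four doublings.
identity = lambda t: t
print(swin_gr_kan_stage(1.0, identity, identity, identity, identity, identity, identity))
```

With sublayers that return zero, the output equals `rc(z_in)`, which makes the skip connections easy to verify in isolation.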

4. Training Protocols and Regularization

For rKAN-based models as in the UKAST system, the standard training regime includes:

  • Optimizer: AdamW with learning rate $2 \times 10^{-4}$ and weight decay $1\times 10^{-3}$
  • Training duration: 400 epochs with cosine annealing learning rate, batch size 24, single NVIDIA A10 GPU
  • Loss: Combined Dice and cross-entropy
  • Data augmentation: Random 320×320 crops, flips, $90^\circ$ rotations, Gaussian noise (masks matched to augmentations)
  • Inference with overlapping sliding windows (50% overlap)
  • Regularization: The intrinsic Safe Padé denominator ($1+|Q(x)|$) provides gradient stability, precluding the need for additional dropout or weight clipping beyond routine weight decay (Sapkota et al., 6 Nov 2025).
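Two items from the list above, the cosine-annealed learning rate and the combined Dice and cross-entropy loss, can be sketched in plain Python. Several details are assumptions not stated in the source: a minimum learning rate of 0, no warmup, per-epoch updates, a binary (single foreground class) formulation, and equal weighting of the two loss terms. The function names are hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs=400, base_lr=2e-4, min_lr=0.0):
    """Cosine annealing from base_lr down to min_lr (min_lr = 0 is an assumption)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


def dice_ce_loss(probs, targets, eps=1e-6):
    """Binary soft-Dice loss plus mean binary cross-entropy, equally weighted.

    probs: predicted foreground probabilities, one per pixel.
    targets: ground-truth labels in {0, 1}, one per pixel.
    """
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)
    ce = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        ce -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    ce /= len(probs)
    return (1.0 - dice) + ce


print(cosine_lr(0), cosine_lr(200), cosine_lr(400))
print(dice_ce_loss([0.9, 0.1], [1, 0]))
```

In a framework such as PyTorch, the same pieces would normally come from the library's optimizer, scheduler, and loss classes rather than hand-written functions.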

5. Empirical Evaluation and Ablation Analyses

Benchmarks on four medical image segmentation datasets (two 2D, two 3D) compare UKAST against convolutional and transformer baselines, with an emphasis on expressivity and data efficiency. Selected quantitative results are outlined below:

Full-data Dice Score (%)

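| Dataset | U-Net | UNETR | SwinUNETR+RC | UKAST (Δ vs. SwinUNETR+RC) |
|---|---|---|---|---|
| Kvasir2D | 82.3 | 63.7 | 81.9 | 81.7 (-0.2) |
| ISIC2D | 79.3 | 74.7 | 78.9 | 79.9 (+1.0) |
| BCV3D | 70.0 | 52.7 | 68.9 | 71.2 (+2.3) |
| MMWHS3D | 73.4 | 70.3 | 80.4 | 80.8 (+0.4) |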

In scarce-annotation regimes (10% of training data), UKAST's margin over SwinUNETR+RC widens:

  • BCV3D@10%: SwinUNETR+RC 60.1 → UKAST 63.9 (+3.8)
  • ISIC2D@10%: SwinUNETR+RC 69.8 → UKAST 71.4 (+1.6)

Ablation Across Architectures (Average Dice %):

| Configuration | 2D | 3D |
|---|---|---|
| ViT + MLP | 69.2 | 61.5 |
| ViT + GR-KAN | 72.7 | 66.5 |
| SwinT + MLP | 77.1 | 70.5 |
| SwinT + GR-KAN | 79.2 | 71.3 |
| SwinT + RC + MLP | 80.4 | 74.7 |
| SwinT + RC + GR-KAN | 80.8 | 76.0 |

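These results indicate that substituting MLPs with GR-KANs improves average Dice in every configuration tested, on both 2D and 3D segmentation. The gains range from +0.4 to +5.0 Dice points: they are largest for the ViT backbone (+3.5 in 2D, +5.0 in 3D) and smallest for SwinT with residual convolutions (+0.4 in 2D, +1.3 in 3D). This pattern is consistent with the added expressivity of rational function bases (Sapkota et al., 6 Nov 2025).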
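The per-backbone gains can be recomputed from the ablation table with a few lines of Python. The values are copied from the table; the dictionary layout and the helper name `gains` are hypothetical.

```python
# Average Dice (%) as (2D, 3D), copied from the ablation table.
ablation = {
    "ViT": {"MLP": (69.2, 61.5), "GR-KAN": (72.7, 66.5)},
    "SwinT": {"MLP": (77.1, 70.5), "GR-KAN": (79.2, 71.3)},
    "SwinT + RC": {"MLP": (80.4, 74.7), "GR-KAN": (80.8, 76.0)},
}


def gains(table):
    """Dice-point improvement of GR-KAN over MLP, per backbone, as (2D, 3D)."""
    result = {}
    for backbone, rows in table.items():
        mlp_2d, mlp_3d = rows["MLP"]
        kan_2d, kan_3d = rows["GR-KAN"]
        # Round to one decimal to match the precision of the table.
        result[backbone] = (round(kan_2d - mlp_2d, 1), round(kan_3d - mlp_3d, 1))
    return result


for backbone, (gain_2d, gain_3d) in gains(ablation).items():
    print(f"{backbone}: 2D {gain_2d:+.1f}, 3D {gain_3d:+.1f}")
```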

6. Expressivity, Data Efficiency, and Practical Implications

The use of rational activation units, especially Padé-type, affords adaptable nonlinearities: these can stretch, saturate, or invert per channel, facilitating the modeling of complex geometries and sharp transitions with reduced parameter budgets. Theoretical guarantees for the density of rational functions in $\mathcal{C}[a, b]$, alongside their ability to represent challenging function classes compactly, underwrite the efficiency and generalizability seen in empirical evaluation.

The group-sharing (GR-KAN) architecture balances parameter efficiency with channel-wise adaptivity, allowing high modeling capacity even with small training datasets. rKAN training stability, which benefits from the $1+|Q(x)|$ denominator, supports deeper stacking than previous spline-based implementations.
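The stabilizing role of the denominator can be made concrete with a short derivation. This is a sketch that is not taken from the cited papers; it uses only the definition $F(x) = P(x) / (1 + |Q(x)|)$ and holds wherever $Q(x) \neq 0$, so that $|Q|$ is differentiable.

```latex
F'(x) = \frac{P'(x)}{1 + |Q(x)|}
      - \frac{P(x)\,\operatorname{sgn}\!\big(Q(x)\big)\,Q'(x)}{\big(1 + |Q(x)|\big)^{2}},
\qquad
1 + |Q(x)| \ge 1
\;\Longrightarrow\;
|F(x)| \le |P(x)|,
\quad
|F'(x)| \le |P'(x)| + |P(x)|\,|Q'(x)|.
```

Both the unit and its derivative are therefore bounded by polynomial expressions on any bounded input range, whereas an unconstrained rational function $P/Q$ diverges near the zeros of $Q$.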

These advances collectively establish rKANs as a practical and powerful module for dense prediction tasks, especially suited to regimes with limited labeled data or complex target structures, such as those encountered in medical image segmentation (Aghaei, 2024; Sapkota et al., 6 Nov 2025).

