Rational KANs (rKANs): Efficient Function Approximation

Updated 25 February 2026

Rational KANs (rKANs) are deep learning architectures that replace spline bases with rational function units to achieve efficient, high-precision function approximation.
The framework leverages Safe Padé units, using Padé approximants and rational Jacobi polynomials, to enhance expressivity and ensure numerical stability during training.
Group Rational KANs (GR-KANs) reduce parameter overhead by sharing rational bases across channel groups, yielding robust performance in tasks such as medical image segmentation.

Rational Kolmogorov-Arnold Networks (rKANs) are a class of deep learning architectures that generalize the Kolmogorov-Arnold representation using rational function basis units, such as Padé approximants and rational Jacobi polynomials, to enhance expressive power and computational efficiency. Originating as an evolution of classical Kolmogorov-Arnold networks (KANs), rKANs address the implementation complexity and parameter inefficiency of earlier spline-based KANs, offering robust alternatives for high-precision function approximation and data-efficient representation learning in both classical and modern neural architectures (Aghaei, 2024, Sapkota et al., 6 Nov 2025).

1. The Kolmogorov-Arnold Framework and Rational Basis Functions

The Kolmogorov-Arnold superposition theorem states that any continuous multivariate function $f(x_1,\dots,x_d)$ on a compact domain can, in principle, be decomposed into sums of univariate "edge" functions and an outer univariate function. KANs implement this concept by learning banks of univariate basis functions $\varphi(x)$ along each network channel.
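For reference, the theorem is commonly written in the following form for a continuous $f$ on $[0,1]^d$. The symbols $\Phi_q$ (outer functions) and $\phi_{q,p}$ (inner functions) are notation introduced here for illustration and are not taken from the cited papers:

$f(x_1,\dots,x_d) = \sum_{q=0}^{2d} \Phi_q\left(\sum_{p=1}^{d} \phi_{q,p}(x_p)\right)$

Each $\phi_{q,p}$ is a continuous univariate function attached to one input coordinate, which is the role played by the learned edge functions $\varphi(x)$ in a KAN.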

To replace the B-spline bases historically used in KANs (which require a large number of parameters and knots, making implementation and training complex), rKANs introduce rational function bases. The principal formulation is the "Safe Padé Unit", defined by:

$\varphi(x) = w \cdot F(x) = w \cdot \frac{P(x)}{1 + |Q(x)|}$

where:

$P(x) = a_0 + a_1 x + \cdots + a_m x^m$ (degree-$m$ polynomial)
$Q(x) = b_1 x + \cdots + b_n x^n$ (degree-$n$ polynomial, no constant term)
$w$ is a learnable scalar, unique to each input–output channel edge

Empirical evidence suggests that $m=3$, $n=4$ provides a favorable expressivity-to-stability tradeoff. Because the denominator $1+|Q(x)|$ is always at least 1, the unit has no poles on the real line, which helps avoid gradient explosions during training. Rational functions are dense in $\mathcal{C}[a, b]$ (they include the polynomials, so this follows from the Weierstrass approximation theorem), and classical rational approximation theory shows that they can represent functions with sharp transitions or near-singularities more compactly than pure polynomials (Aghaei, 2024, Sapkota et al., 6 Nov 2025).
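The Safe Padé Unit can be sketched in a few lines of plain Python. This is a minimal scalar illustration of the formula above, not the authors' implementation; the helper names `poly` and `safe_pade` and the coefficient values are assumptions chosen only for demonstration.

```python
def poly(coeffs, x, start_power=0):
    """Evaluate sum_k coeffs[k] * x**(start_power + k) using Horner's rule."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc * x ** start_power


def safe_pade(x, a, b, w=1.0):
    """phi(x) = w * P(x) / (1 + |Q(x)|).

    a = [a_0, ..., a_m]  numerator coefficients
    b = [b_1, ..., b_n]  denominator coefficients (no constant term)
    """
    p = poly(a, x)                 # P(x), degree m
    q = poly(b, x, start_power=1)  # Q(x) = b_1 x + ... + b_n x^n
    return w * p / (1.0 + abs(q))  # denominator is always >= 1, so no poles


# Illustrative coefficients only (m = 3, n = 4); not taken from the cited papers.
a = [0.0, 1.0, 0.5, -0.1]
b = [0.2, -0.3, 0.0, 0.05]
for x in (-2.0, 0.0, 2.0):
    print(x, safe_pade(x, a, b, w=0.8))
```

Even when $Q(x)$ crosses $-1$, where a plain Padé form $P/(1+Q)$ would have a pole, the output stays finite because the absolute value keeps the denominator bounded below by 1.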

2. Group Rational KANs (GR-KANs) and Parameter Efficiency

Vanilla spline-based KANs require a distinct basis $\varphi_{ij}(\cdot)$ for each input-to-output channel mapping, implying $d_\mathrm{in}\times d_\mathrm{out}$ univariate basis functions and their respective knot parameters. This strategy incurs significant parameter and memory overhead.

The grouped parameterization of rKANs, termed "Group Rational KANs" (GR-KANs), partitions input channels into $g$ groups (e.g., $g=8$). Within each group, all channels share a single rational base function $F_g(x)$, defined by a common set of coefficients $\{a_0 \ldots a_m, b_1 \ldots b_n\}$. Each input–output edge retains a learnable scalar $w_{ij}$, ensuring per-edge adaptivity while vastly reducing the total number of parameters:

Unique rational coefficients: $g\cdot(m+1+n)$
Additional scalar weights: $d_\mathrm{in} \cdot d_\mathrm{out}$ (as in a linear layer)

This grouped strategy substantially compresses the parameter space compared to the $d_\mathrm{in}\cdot d_\mathrm{out}\cdot\#(\mathrm{spline~knots})$ scaling of vanilla KANs, yielding marked efficiency gains (Sapkota et al., 6 Nov 2025).
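The grouping scheme and its parameter count can be sketched as follows. This is a hedged, pure-Python illustration: the function names, the contiguous equal-size grouping, the layer width of 96, and the figure of 8 spline parameters per edge are all assumptions made for the example, not values from the cited papers.

```python
def rational(x, a, b):
    """Shared base F_g(x) = P(x) / (1 + |Q(x)|); Q has no constant term."""
    p = sum(a_k * x ** k for k, a_k in enumerate(a))
    q = sum(b_k * x ** (k + 1) for k, b_k in enumerate(b))
    return p / (1.0 + abs(q))


def gr_kan_forward(x, W, group_coeffs):
    """y_j = sum_i W[j][i] * F_{g(i)}(x_i).

    x            : list of d_in inputs
    W            : d_out x d_in per-edge scalars (W[j][i] weights input i for output j)
    group_coeffs : list of g pairs (a, b), one shared base per channel group
    Assumes contiguous, equally sized groups (d_in divisible by g).
    """
    size = len(x) // len(group_coeffs)
    # Apply each group's shared rational function to the channels in that group.
    activated = [rational(x_i, *group_coeffs[i // size]) for i, x_i in enumerate(x)]
    # Mix channels with the per-edge scalars, as in a linear layer.
    return [sum(w * v for w, v in zip(row, activated)) for row in W]


def gr_kan_params(d_in, d_out, g, m, n):
    """Shared rational coefficients plus one scalar per edge."""
    return g * (m + 1 + n) + d_in * d_out


def spline_kan_params(d_in, d_out, params_per_edge):
    """Every edge carries its own spline parameters."""
    return d_in * d_out * params_per_edge


# Hypothetical layer: width 96, g = 8, m = 3, n = 4, 8 spline parameters per edge.
print(gr_kan_params(96, 96, g=8, m=3, n=4))          # 64 + 9216 = 9280
print(spline_kan_params(96, 96, params_per_edge=8))  # 73728
```

Under these assumed sizes, the rational coefficients contribute only 64 parameters on top of the 9216 weights that an ordinary linear layer of the same shape would already have.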

Model	FFN Type	Residual Conv?	GFLOPs	# Params
SwinUNETR	MLP	No	1.2500	6.302M
UKAST	GR-KAN	No	1.2467	6.302M + 608
SwinUNETR+RC	MLP	Yes	1.4419	7.183M
UKAST+RC	GR-KAN	Yes	1.4386	7.184M

This table shows that substituting MLPs with GR-KANs reduces FLOPs by approximately 0.2–0.3% (0.0033 GFLOPs in both configurations), while adding only a negligible number of parameters (608 in the configuration without residual convolutions).

3. Network Integration and Architectural Usage

rKANs can function as core components in both classical and transformer-based neural architectures. In the UKAST architecture for medical image segmentation, GR-KANs are employed as drop-in replacements for the feedforward MLP+GELU blocks in the Swin Transformer stages. The architecture comprises:

An encoder of four Swin Transformer stages, each stage using residual convolutions followed by two GR-KAN-augmented feedforward blocks.
A CNN-style decoder with upsampling and skip connections for segmentation.

A single stage of the Swin+GR-KAN block operates as follows (suppressing stage superscripts):

$v_0 = RC(z_{in})$ (residual convolution)
$\hat{h}_1 = \text{W-MSA}(\text{LN}(v_0)) + v_0$ (window self-attention)
$z_1 = \text{GR-KAN}(\text{LN}(\hat{h}_1)) + \hat{h}_1$ (feedforward block 1)
$\hat{h}_2 = \text{SW-MSA}(\text{LN}(z_1)) + z_1$ (shifted-window attention)
$z_{out} = \text{GR-KAN}(\text{LN}(\hat{h}_2)) + \hat{h}_2$ (feedforward block 2)

Every GR-KAN layer maintains normalization and residual pathways, substituting only the nonlinearity and basis encoding mechanisms (Sapkota et al., 6 Nov 2025).
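The data flow of one stage can be traced with a small sketch in which every sub-module is passed in as a callable. The function names are hypothetical, features are plain lists of floats, and a single shared `ln` stands in for what would normally be separate learned normalization layers; only the residual wiring from the equations above is reproduced.

```python
def add(u, v):
    """Elementwise residual sum of two equally long feature vectors."""
    return [a + b for a, b in zip(u, v)]


def swin_gr_kan_stage(z_in, rc, ln, w_msa, sw_msa, gr_kan_1, gr_kan_2):
    v0 = rc(z_in)                     # residual convolution
    h1 = add(w_msa(ln(v0)), v0)       # window self-attention + skip
    z1 = add(gr_kan_1(ln(h1)), h1)    # feedforward block 1 + skip
    h2 = add(sw_msa(ln(z1)), z1)      # shifted-window attention + skip
    return add(gr_kan_2(ln(h2)), h2)  # feedforward block 2 + skip


# Placeholder sub-modules: identity maps, used only to trace the residual wiring.
identity = lambda t: list(t)
out = swin_gr_kan_stage([1.0, -2.0], identity, identity, identity,
                        identity, identity, identity)
print(out)  # each of the four residual sums doubles the signal: [16.0, -32.0]
```

Replacing `gr_kan_1` and `gr_kan_2` with MLP callables recovers the wiring of a standard Swin block, which is what makes the GR-KAN a drop-in substitution.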

4. Training Protocols and Regularization

For rKAN-based models as in the UKAST system, the standard training regime includes:

Optimizer: AdamW with learning rate $2 \times 10^{-4}$ and weight decay $1\times 10^{-3}$
Training duration: 400 epochs with cosine annealing learning rate, batch size 24, single NVIDIA A10 GPU
Loss: Combined Dice and cross-entropy
Data augmentation: Random 320×320 crops, flips, $90^\circ$ rotations, Gaussian noise (masks matched to augmentations)
Inference: Overlapping sliding windows (50% overlap)
Regularization: The intrinsic Safe Padé denominator ($1+|Q(x)|$) provides gradient stability, precluding the need for additional dropout or weight clipping beyond routine weight decay (Sapkota et al., 6 Nov 2025).
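Two of the ingredients listed above, the cosine-annealed learning rate and the combined Dice and cross-entropy loss, can be sketched in plain Python. This is an illustration under stated assumptions rather than the training code of the cited work: a final learning rate of zero, no warmup, a single foreground class, equal weighting of the two loss terms, and the smoothing constant `eps` are all choices made here.

```python
import math


def cosine_lr(epoch, total_epochs=400, base_lr=2e-4, min_lr=0.0):
    """Cosine annealing from base_lr to min_lr (min_lr = 0 and no warmup are assumptions)."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cos)


def dice_ce_loss(probs, targets, eps=1e-6):
    """Soft Dice loss plus binary cross-entropy for one foreground class.

    probs   : predicted foreground probabilities in [0, 1]
    targets : ground-truth labels, 0 or 1
    Equal weighting of the two terms is an assumption.
    """
    inter = sum(p * t for p, t in zip(probs, targets))
    dice = (2.0 * inter + eps) / (sum(probs) + sum(targets) + eps)
    ce = -sum(t * math.log(p + eps) + (1 - t) * math.log(1.0 - p + eps)
              for p, t in zip(probs, targets)) / len(probs)
    return (1.0 - dice) + ce


# Learning rate at the start, midpoint, and end of the 400-epoch schedule.
print(cosine_lr(0), cosine_lr(200), cosine_lr(400))
# Loss for a mostly correct three-pixel prediction.
print(dice_ce_loss([0.9, 0.1, 0.8], [1, 0, 1]))
```

In a framework such as PyTorch the same schedule would normally be delegated to a built-in scheduler; the explicit formula is shown here only to make the decay shape visible.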

5. Empirical Evaluation and Ablation Analyses

Extensive benchmarking across diverse datasets substantiates the performance advantages of rKANs and GR-KANs, especially regarding expressivity and data efficiency. Selected quantitative results are outlined below:

Full-data Dice Score (%); values in parentheses are differences relative to SwinUNETR+RC

Dataset	U-Net	UNETR	SwinUNETR+RC	UKAST
Kvasir2D	82.3	63.7	81.9	81.7 (–0.2)
ISIC2D	79.3	74.7	78.9	79.9 (+1.0)
BCV3D	70.0	52.7	68.9	71.2 (+2.3)
MMWHS3D	73.4	70.3	80.4	80.8 (+0.4)

In scarce-annotation regimes (10% of training data), UKAST shows enhanced robustness:

BCV3D@10%: SwinUNETR+RC 60.1 → UKAST 63.9 (+3.8)
ISIC2D@10%: SwinUNETR+RC 69.8 → UKAST 71.4 (+1.6)

Ablation Across Architectures (Average Dice %):

Configuration	2D	3D
ViT + MLP	69.2	61.5
ViT + GR-KAN	72.7	66.5
SwinT + MLP	77.1	70.5
SwinT + GR-KAN	79.2	71.3
SwinT + RC + MLP	80.4	74.7
SwinT + RC + GR-KAN	80.8	76.0

These results indicate that substituting MLPs with GR-KANs improves average Dice in every configuration on both 2D and 3D segmentation, with gains ranging from +0.4 to +5.0 points (largest for the ViT backbone in 3D, smallest for SwinT with residual convolutions in 2D), consistent with the added expressivity of rational function bases (Sapkota et al., 6 Nov 2025).

6. Expressivity, Data Efficiency, and Practical Implications

The use of rational activation units, especially Padé-type, affords adaptable nonlinearities: these can stretch, saturate, or invert per channel, facilitating the modeling of complex geometries and sharp transitions with reduced parameter budgets. Theoretical guarantees for the density of rational functions in $\mathcal{C}[a, b]$, alongside their compactness for challenging function classes, underwrite the efficiency and generalizability seen in empirical evaluation.

The group-sharing (GR-KAN) architecture balances parameter efficiency with channel-wise adaptivity, allowing high modeling capacity even from small training datasets. rKAN training stability, which benefits from the $1+|Q(x)|$ denominator, supports deeper stacking than previous spline-based implementations.

These advances collectively establish rKANs as a practical and powerful general-purpose module for dense prediction tasks, especially suited to regimes with limited labeled data or complex target structures, such as those encountered in medical image segmentation (Aghaei, 2024, Sapkota et al., 6 Nov 2025).


References (2)

Aghaei (2024). rKAN: Rational Kolmogorov-Arnold Networks.

Sapkota et al. (6 Nov 2025). When Swin Transformer Meets KANs: An Improved Transformer Architecture for Medical Image Segmentation.
