UKAST: KAN-Enhanced Swin Transformer

Updated 11 November 2025
  • UKAST is a novel neural architecture for medical image segmentation that unifies a Swin Transformer encoder with rational-function-based GR-KANs.
  • It employs a U-Net-style encoder-decoder design where GR-KAN blocks replace traditional MLPs to enhance long-range dependency modeling and data efficiency.
  • Empirical evaluations show UKAST achieves state-of-the-art or comparable Dice scores across multiple benchmarks with minimal computational overhead, with the largest gains in low-data scenarios.

UKAST (U-Net-KAN-Enhanced Swin Transformer) is a neural architecture for medical image segmentation that unifies a Swin Transformer encoder with rational-function-based Kolmogorov-Arnold Networks (KANs) in its feed-forward layers. The architecture integrates Group Rational KANs (GR-KANs) for expressive and data-efficient modeling, addressing the challenges of long-range dependency modeling, computational cost, and data efficiency in segmentation tasks with limited annotated data (Sapkota et al., 6 Nov 2025).

1. Architectural Overview

UKAST employs a U-Net-style encoder-decoder design tailored for dense segmentation prediction. The encoder consists of a four-stage Swin Transformer backbone incorporating shifted-windowed self-attention, residual convolutional (RC) projections, and GR-KANs. The input $\mathcal{X}\in\mathbb{R}^{C\times H\times W}$ is partitioned into non-overlapping patches, which are embedded into tokens via a learned linear projection. Each encoder stage operates at progressively reduced spatial resolution and increased channel depth.
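
As a brief illustration of the patch-embedding step, here is a minimal PyTorch sketch; the patch size and embedding dimension are illustrative assumptions, not values reported in the paper.

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Non-overlapping patch partition plus learned linear projection,
    expressed as a strided convolution (illustrative values)."""
    def __init__(self, in_ch=3, embed_dim=96, patch=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, N, D) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 320, 320))  # -> (1, 6400, 96)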

The encoder stage structure is as follows:

  • Residual convolution (RC) projection: $v^{(s)}_0 = \mathrm{RC}\bigl(z^{(s)}_{\text{in}}\bigr)$
  • Windowed multi-head self-attention (W-MSA) + residual: $\hat z^{(s)}_1 = \mathrm{W\!-\!MSA}\bigl(\mathrm{LN}(v^{(s)}_0)\bigr) + v^{(s)}_0$
  • First GR-KAN feed-forward + residual: $z^{(s)}_1 = \mathrm{GR\!-\!KAN}\bigl(\mathrm{LN}(\hat z^{(s)}_1)\bigr) + \hat z^{(s)}_1$
  • Shifted windowed MSA (SW-MSA) + residual: $\hat z^{(s)}_2 = \mathrm{SW\!-\!MSA}\bigl(\mathrm{LN}(z^{(s)}_1)\bigr) + z^{(s)}_1$
  • Second GR-KAN feed-forward + residual: $z^{(s)}_{\text{out}} = \mathrm{GR\!-\!KAN}\bigl(\mathrm{LN}(\hat z^{(s)}_2)\bigr) + \hat z^{(s)}_2$

Intermediate features $z^{(s)}_{\text{out}}$ are supplied via lateral skip connections to a symmetric CNN-based decoder. Each decoder stage performs deconvolution (upsampling), a Conv–BatchNorm–ReLU block, and concatenation with encoder-derived features. A final $1\times1$ convolution projects the output to segmentation logits.
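
As a 2D illustration of one decoder stage, the following is a minimal PyTorch sketch; the class name and channel sizes are assumptions for exposition, not the authors' code.

import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    """Deconvolution upsampling, skip concatenation, Conv-BatchNorm-ReLU fusion."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                    # deconvolution: 2x upsampling
        x = torch.cat([x, skip], dim=1)   # concatenate encoder skip feature
        return self.fuse(x)

head = nn.Conv2d(64, 2, kernel_size=1)    # final 1x1 conv to segmentation logits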

2. Rational-Function KANs and GR-KAN Integration

KANs serve as the feed-forward component in UKAST, replacing standard MLP blocks. Unlike conventional fixed activations (e.g., ReLU, GELU), UKAST leverages rational base functions of the form

$$\phi(x) = w\,F(x),\qquad F(x) = \frac{P(x)}{1 + |Q(x)|}$$

where $P(x) = a_0 + a_1 x + \cdots + a_m x^m$ and $Q(x) = b_1 x + \cdots + b_n x^n$ are polynomials with empirically chosen degrees $m=3$, $n=4$. The denominator ensures numerical stability, and the construction is termed the “Safe Padé Activation Unit”.
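
A minimal PyTorch sketch of this rational activation follows. The coefficient initialization here is an illustrative assumption; practical implementations typically initialize the coefficients so that $F$ approximates a standard activation such as GELU.

import torch
import torch.nn as nn

class SafePade(nn.Module):
    """F(x) = P(x) / (1 + |Q(x)|) with learnable coefficients (m=3, n=4)."""
    def __init__(self, m=3, n=4):
        super().__init__()
        self.a = nn.Parameter(torch.randn(m + 1) * 0.1)  # a_0 .. a_m of P
        self.b = nn.Parameter(torch.randn(n) * 0.1)      # b_1 .. b_n of Q

    def forward(self, x):
        P = sum(a_i * x**i for i, a_i in enumerate(self.a))
        Q = sum(b_j * x**(j + 1) for j, b_j in enumerate(self.b))
        return P / (1.0 + Q.abs())  # denominator >= 1 keeps the unit stable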

For computational tractability, the Group Rational KAN (GR-KAN) blocks partition the $d_{\text{in}}$ input channels into $g$ groups ($g=8$ in experiments), sharing the rational polynomial parameters $\{a_i, b_j\}$ within each group while maintaining independent scalar weights $w$ per edge. Formally,

$$\mathrm{GR\!-\!KAN}(\mathbf{x}) = W\left[F(\mathbf{x}_{(1)}) \oplus \cdots \oplus F(\mathbf{x}_{(g)})\right] + b,$$

with $\oplus$ denoting channel-group concatenation. This structure reduces the number of unique polynomial parameter sets from $d_{\text{in}} \times d_{\text{out}}$ to $g$, yielding lower FLOPs and memory cost.
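
A sketch of the grouped layer, reusing the SafePade module from the previous snippet, is shown below; the per-edge scalar weights $w$ and the bias correspond to the final linear map. Apart from $g=8$, the shapes are illustrative assumptions.

import torch
import torch.nn as nn

class GRKAN(nn.Module):
    """Split channels into g groups, apply one shared rational function per
    group, then a linear map providing the per-edge weights W and bias b."""
    def __init__(self, d_in, d_out, g=8):
        super().__init__()
        assert d_in % g == 0, "channels must divide evenly into groups"
        self.g = g
        self.rationals = nn.ModuleList(SafePade() for _ in range(g))
        self.linear = nn.Linear(d_in, d_out)

    def forward(self, x):                  # x: (..., d_in)
        groups = x.chunk(self.g, dim=-1)   # channel-group split
        y = torch.cat([f(c) for f, c in zip(self.rationals, groups)], dim=-1)
        return self.linear(y)              # W[F(x_(1)) (+) ... (+) F(x_(g))] + b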

3. Computational Characteristics

A quantitative assessment of UKAST’s efficiency relative to SwinUNETR (the immediate baseline) is given in the table below:

Model          | FFN    | RC? | GFLOPs | #Params
---------------|--------|-----|--------|----------
SwinUNETR      | MLP    | No  | 1.2500 | 6.302 M
UKAST          | GR-KAN | No  | 1.2467 | 6.3028 M
SwinUNETR + RC | MLP    | Yes | 1.4419 | 7.1835 M
UKAST + RC     | GR-KAN | Yes | 1.4386 | 7.1841 M

Replacing the MLP with GR-KAN in the feed-forward network reduces total GFLOPs by roughly 0.2–0.3% (e.g., 1.4419 → 1.4386 in the RC-augmented variants) and increases the parameter count by only ~600 parameters, leaving UKAST+RC at ~1.44 GFLOPs and 7.18 M parameters.
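
For intuition, a back-of-the-envelope estimate (ours, not a breakdown given in the paper): each group carries $(m+1)+n = 4+4 = 8$ shared rational coefficients, so a single GR-KAN block with $g=8$ adds $8 \times 8 = 64$ coefficients on top of its linear weights; summed over the GR-KAN blocks in the network, this is consistent in order of magnitude with the observed ~600-parameter increase.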

4. Empirical Evaluation on Medical Segmentation Benchmarks

UKAST was evaluated on four benchmarks: Kvasir-SEG and ISIC-2017 (2D datasets), and BCV (CT) and MMWHS (MRI) for 3D tasks. Dice scores, reported for both fully supervised and limited-data regimes, are as follows:

  • 100% Data (Dice Score)
    • Kvasir: SwinUNETR+RC 81.9 vs UKAST 81.7 (–0.2)
    • ISIC: 78.9 vs 79.9 (+1.0)
    • BCV: 68.9 vs 71.2 (+2.3)
    • MMWHS: 80.4 vs 80.8 (+0.4)
  • Limited Data Regimes (Dice gain of UKAST over SwinUNETR+RC)
    • ISIC (10%, 25%, 50%, 100%): +1.6 / +0.3 / +1.2 / +1.0
    • BCV: +3.8 / +4.9 / +4.1 / +2.3

Performance in low-data scenarios—especially on 3D volumes—demonstrates that KAN-enhanced Transformers deliver significant data-efficiency improvements over MLP-based counterparts.

5. Comparative Analysis with Other Architectures

The following table summarizes key results against contemporary CNN and Transformer baselines (parameter counts and Dice scores on Kvasir, ISIC, BCV, and MMWHS):

Model        | Params | Kvasir | ISIC | BCV  | MMWHS
-------------|--------|--------|------|------|------
U-Net        | 2.6 M  | 71.8   | 77.3 | 59.3 | 71.9
UNETR        | 8.3 M  | 63.7   | 74.7 | 52.7 | 70.3
SwinUNETR+RC | 7.2 M  | 81.9   | 78.9 | 68.9 | 80.4
UKAST (Ours) | 7.2 M  | 81.7   | 79.9 | 71.2 | 80.8

UKAST matches or surpasses these baselines on all listed tasks, with the added advantages of improved accuracy in data-scarce regimes and comparable computational load.

6. Implementation Details

UKAST is implemented in PyTorch with the MONAI imaging toolkit. Training uses AdamW (learning rate $2\times10^{-4}$, weight decay $1\times10^{-3}$) with cosine annealing over 400 epochs and batch size 24. Data augmentations include random $320\times320$ crops, horizontal/vertical flips, 90° rotations, and Gaussian noise, applied consistently to input images and masks. Testing employs overlapping patch-based sliding-window inference with 50% overlap.
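
A minimal sketch of this setup follows; UKAST, train_loader, loss_fn, val_volume, the ROI size, and the sliding-window batch size are placeholders or assumptions, while the optimizer settings, schedule length, and overlap come from the description above.

import torch
from monai.inferers import sliding_window_inference

model = UKAST()  # hypothetical model class standing in for the architecture
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=400)

for epoch in range(400):
    for images, masks in train_loader:        # placeholder dataloader
        optimizer.zero_grad()
        loss = loss_fn(model(images), masks)  # placeholder segmentation loss
        loss.backward()
        optimizer.step()
    scheduler.step()

with torch.no_grad():  # overlapping sliding-window inference, 50% overlap
    logits = sliding_window_inference(
        inputs=val_volume, roi_size=(320, 320),
        sw_batch_size=4, predictor=model, overlap=0.5,
    )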

The encoder pseudocode (one iteration per stage $s$) is:

skips = []
x = patch_embeddings                 # tokens from the patch-embedding layer
for s in range(1, 5):                # four encoder stages
    v0 = ResidualConv(x)             # RC projection
    a1 = W_MSA(LN(v0)) + v0          # windowed attention + residual
    b1 = GR_KAN(LN(a1)) + a1         # first GR-KAN feed-forward + residual
    a2 = SW_MSA(LN(b1)) + b1         # shifted-window attention + residual
    z_out = GR_KAN(LN(a2)) + a2      # second GR-KAN feed-forward + residual
    skips.append(z_out)              # lateral skip connection to the decoder
    x = Downsample(z_out)            # patch merging for the next stage

The decoder mirrors this structure, upsampling features and fusing them via concatenation and a Conv–BN–ReLU sequence before a final $1\times1$ convolution.

7. Significance and Outlook

UKAST establishes that integrating rational-function KANs as the feed-forward mechanism in hierarchical Swin Transformer encoders yields a model that is not only competitive with existing vision Transformers but also robust to scarce data scenarios, especially for 3D segmentation. The approach incurs negligible additional computational overhead, challenging the notion that expressivity increases must trade off against efficiency. This suggests broader applicability of KAN-augmented attention architectures for other data-efficient vision problems in the biomedical domain and beyond. Future research may further optimize group sizes within GR-KANs or investigate other rational-function parameterizations for even greater flexibility and compactness.
