MD-RWKV-UNet: Efficient Medical Segmentation

Updated 26 May 2026

The paper introduces MD-RWKV-UNet, a U-Net style architecture that integrates RWKV attention for efficient, linear-cost global context aggregation.
It features modules such as IR-RWKV blocks and dynamic scale-aware encoding, alongside pure RWKV variants, to model long-range dependencies across multi-organ segmentation tasks.
Experimental results show state-of-the-art performance with reduced parameters and FLOPs, outperforming traditional CNNs and transformer-based methods on diverse benchmarks.

MD-RWKV-UNet refers to a family of U-Net–style medical image segmentation architectures that integrate Receptance Weighted Key Value (RWKV) attention into the encoder-decoder framework, leveraging the unique properties of RWKV for linear-cost global context aggregation. Several variants of MD-RWKV-UNet have appeared since 2025, with each introducing architectural advances targeting long-range dependency modeling, scale awareness, dynamic adaptation, and computational efficiency. These models consistently achieve or surpass state-of-the-art performance on benchmarks across multi-organ and binary segmentation tasks in diverse imaging modalities.

1. Motivation and RWKV Foundations

Conventional U-Net architectures, while effective for local detail extraction, struggle to model long-range pixel dependencies due to the inherent locality of convolutional kernels. Transformer-based networks offer global receptive fields but are hindered by the quadratic complexity $O(N^2)$ in spatial token sequence length $N=H \cdot W$, which is particularly problematic for high-resolution medical image segmentation. RWKV attention mechanisms combine a per-token transform (as in Transformers) with an RNN-style recurrence that can be evaluated as a linear scan, yielding a global receptive field at $O(ND)$ cost, linear in both the sequence length $N$ and the feature dimension $D$ (Jiang et al., 14 Jan 2025, Fang, 28 Mar 2026, Ye et al., 15 Jul 2025, Zhou, 12 Jun 2025).

The RWKV recurrence, at each spatial token $t$, computes

$k_t = W_k x_t + b_k, \quad v_t = W_v x_t + b_v, \quad r_t = \sigma(W_r x_t + b_r)$

while maintaining the scan-state accumulators

$S_t^{\exp} = \lambda S_{t-1}^{\exp} + e^{k_t}, \quad S_t^{\mathrm{kv}} = \lambda S_{t-1}^{\mathrm{kv}} + e^{k_t} v_t$

so that

$y_t = r_t \odot \frac{S_t^{\mathrm{kv}}}{S_t^{\exp}}$

where $\lambda$ is a decay hyperparameter and $r_t$ acts as a per-token gate (Jiang et al., 14 Jan 2025, Zhou, 12 Jun 2025). Because the accumulators summarize all previously scanned tokens, each output has access to global context, and the scan can be parallelized efficiently.
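
The recurrence above can be written out as a short reference loop. This is a minimal sketch in plain Python, not the authors' implementation: the function name `rwkv_scan` is hypothetical, the keys, values, and gates are assumed to be precomputed (with the sigmoid already applied to the gates), and $\lambda$ is treated as a fixed scalar.

```python
import math


def rwkv_scan(k, v, r, lam=0.9):
    """Sequential form of the decayed scan.

    k, v, r: lists of N tokens, each a list of D floats.
             r is assumed to already be sigmoid-activated.
    lam:     scalar decay (assumed fixed for this sketch).
    Returns a list of N output tokens, each a list of D floats.
    """
    dim = len(k[0])
    s_exp = [0.0] * dim  # running decayed sum of exp(k)
    s_kv = [0.0] * dim   # running decayed sum of exp(k) * v
    outputs = []
    for k_t, v_t, r_t in zip(k, v, r):
        y_t = []
        for d in range(dim):
            weight = math.exp(k_t[d])
            s_exp[d] = lam * s_exp[d] + weight
            s_kv[d] = lam * s_kv[d] + weight * v_t[d]
            # Gate the normalized weighted average of past values.
            y_t.append(r_t[d] * s_kv[d] / s_exp[d])
        outputs.append(y_t)
    return outputs


# With zero keys, unit gates, and no decay, the output is a running mean.
keys = [[0.0], [0.0], [0.0]]
values = [[1.0], [3.0], [5.0]]
gates = [[1.0], [1.0], [1.0]]
print(rwkv_scan(keys, values, gates, lam=1.0))  # [[1.0], [2.0], [3.0]]
```

The loop does constant work per token and channel, which is where the $O(ND)$ cost comes from. Practical implementations typically also stabilize the exponentials numerically, which this sketch omits.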

2. Architectural Variants and Core Modules

2.1. Early Hybrid Approaches

The initial "RWKV-UNet" implemented IR-RWKV blocks, inserting spatial RWKV attention between layers of inverted residual convolutions in the encoder. This hybridization allowed early encoder stages to specialize in local detail, with later stages capturing long-range dependencies. IR-RWKV blocks expand channels, apply normalization, unfold to tokens, and perform SpatialMix followed by local depthwise convolutions, with Cross-Channel Mix (CCM) enhancing multi-level skip connections (Jiang et al., 14 Jan 2025).

2.2. Dynamic and Scale-Aware Extensions

Focusing on adaptive modeling of anatomical variability, "MD-RWKV-UNet: Scale-Aware Anatomical Encoding with Cross-Stage Fusion" introduced:

MD-RWKV blocks: Dual-path modules with (i) standard depthwise separable convolutions and (ii) dynamic branches employing deformable spatial shift operations with learned, content-dependent offsets, followed by RWKV recurrence. This enables the network’s receptive field to adapt spatially to organ morphology (Fang, 28 Mar 2026).
Selective Kernel Attention (SKA): Adaptive selection among convolutional kernels of different sizes based on learned attention, enhancing robustness across scales (Fang, 28 Mar 2026).
Cross-Stage Dual-Attention Fusion: Aggregates features from multiple encoder stages, fusing spatial and semantic information using both channel-wise and spatial attention mechanisms (Fang, 28 Mar 2026).
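
The kernel-selection idea behind SKA can be illustrated with a simplified fusion step. The sketch below is an assumption-laden stand-in rather than the published module: `selective_kernel_fuse` and the `proj` weights are hypothetical, the branch feature maps are assumed to come from convolutions with different kernel sizes, and a single per-channel weight replaces the learned projection layers.

```python
import math


def selective_kernel_fuse(branches, proj):
    """Fuse B branch feature maps with per-channel softmax attention.

    branches: list of B maps, each [C][N] (channels x flattened pixels),
              assumed to come from kernels of different sizes.
    proj:     list of B weight vectors [C]; a hypothetical stand-in
              for the learned projection layers.
    Returns the fused map with shape [C][N].
    """
    num_b = len(branches)
    num_c = len(branches[0])
    num_n = len(branches[0][0])
    fused = []
    for c in range(num_c):
        # Global descriptor: spatial mean of the summed branches.
        pooled = sum(sum(branches[b][c]) for b in range(num_b)) / num_n
        logits = [proj[b][c] * pooled for b in range(num_b)]
        # Numerically stable softmax across branches.
        peak = max(logits)
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        attn = [e / total for e in exps]
        fused.append([
            sum(attn[b] * branches[b][c][n] for b in range(num_b))
            for n in range(num_n)
        ])
    return fused


small_kernel = [[1.0, 1.0]]  # one channel, two pixels
large_kernel = [[3.0, 3.0]]
# Zero projection weights give equal attention, so the result is the average.
print(selective_kernel_fuse([small_kernel, large_kernel], [[0.0], [0.0]]))  # [[2.0, 2.0]]
```

The softmax across branches is what makes the selection "soft": each channel blends kernel sizes in proportion to its attention weights instead of committing to one.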

2.3. Pure RWKV Architectures

"Med-URWKV" advanced a pure RWKV U-Net that replaces all convolutions and transformers with RWKV attention. The encoder leverages a large-scale, ImageNet-pretrained Vision-RWKV, ensuring full global context at every scale and dramatic parameter reductions (e.g., 14.3M vs. 24.9M for UNet) (Zhou, 12 Jun 2025). Channel-mix (FFN-like) sub-blocks complement the spatial-mix, all retained within the RWKV paradigm.

2.4. Multi-Dimensional and Directional Extensions

The U-RWKV framework emphasizes direction-adaptive and multi-dimensional RWKV:

QuadScan and Dual-RWKV: The Direction-Adaptive RWKV Module (DARM) scans feature maps along multiple axes (horizontal, vertical, and, conceptually for MD-RWKV, volumetric axes), fusing independent forward and reverse passes to mitigate scan-order bias (Ye et al., 15 Jul 2025).
Stage-Adaptive Squeeze-and-Excitation (SASE): Channel re-weighting is dynamically adapted based on feature map depth/stage, optimizing channel capacity across varying field densities (Ye et al., 15 Jul 2025). Extension to a 3D MD-RWKV-UNet involves per-volume multi-directional scans, channel squeezing, and memory management strategies.
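
To see why fusing forward and reverse passes reduces scan-order bias, consider the toy sketch below. It is not the DARM module: the function names are hypothetical, a plain decayed running mean stands in for the full RWKV recurrence, and the directions are fused by simple averaging.

```python
def decayed_mean(seq, lam):
    """Causal decayed running mean; a stand-in for the RWKV scan."""
    num = den = 0.0
    out = []
    for x in seq:
        num = lam * num + x
        den = lam * den + 1.0
        out.append(num / den)
    return out


def bidirectional(seq, lam):
    """Average a forward pass with a reverse pass mapped back to input order."""
    fwd = decayed_mean(seq, lam)
    bwd = decayed_mean(seq[::-1], lam)[::-1]
    return [(f + b) / 2.0 for f, b in zip(fwd, bwd)]


def four_direction_scan(grid, lam=0.9):
    """Fuse left-right, right-left, top-down, and bottom-up scans of a 2D grid."""
    rows = [bidirectional(list(row), lam) for row in grid]
    cols = [bidirectional(list(col), lam) for col in zip(*grid)]
    height, width = len(grid), len(grid[0])
    return [
        [(rows[i][j] + cols[j][i]) / 2.0 for j in range(width)]
        for i in range(height)
    ]


# A forward-only scan leaves the first token with no context;
# the bidirectional version lets it see the rest of the sequence.
print(decayed_mean([1.0, 2.0, 3.0], 1.0))   # [1.0, 1.5, 2.0]
print(bidirectional([1.0, 2.0, 3.0], 1.0))  # [1.5, 2.0, 2.5]
```

Because every position receives context from both ends of each axis, mirroring the input mirrors the output, which a single causal scan does not guarantee.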

3. Training Protocols and Implementation Details

MD-RWKV-UNet studies utilize standard segmentation training pipelines:

Losses: Weighted sum of pixel-wise cross-entropy and Dice losses. The weighting varies by study, with some settings using a higher Dice weight for binary tasks (Jiang et al., 14 Jan 2025, Fang, 28 Mar 2026, Zhou, 12 Jun 2025).
Optimizer: AdamW, often with cosine annealing schedules and image augmentations (flips, rotations, intensity scaling, additive noise) (Jiang et al., 14 Jan 2025, Fang, 28 Mar 2026, Zhou, 12 Jun 2025).
Pretraining: Pure RWKV variants benefit from ImageNet classification pretraining, providing substantial boosts (e.g., encoder pretraining increases Synapse DSC from ~78.7% to 84.0%) (Jiang et al., 14 Jan 2025, Zhou, 12 Jun 2025).
Data: Benchmarks include Synapse, ACDC, BUSI, CVC-ClinicDB, CVC-ColonDB, Kvasir-SEG, ISIC, GLAS, TDD, and NKUT datasets with variable train/val/test splits (Fang, 28 Mar 2026, Zhou, 12 Jun 2025).
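
The combined loss in the list above can be sketched for the binary case as follows. This is an illustrative sketch only: the function name `ce_dice_loss` is hypothetical, and the default weights `w_ce=0.5` and `w_dice=0.5` are placeholder assumptions, not the values used in the cited studies.

```python
import math


def ce_dice_loss(probs, targets, w_ce=0.5, w_dice=0.5, eps=1e-6):
    """Weighted sum of binary cross-entropy and soft Dice loss.

    probs:   flat list of predicted foreground probabilities in [0, 1].
    targets: flat list of ground-truth labels (0 or 1).
    w_ce, w_dice: loss weights (placeholder values, assumed here).
    """
    # Clip probabilities so the logarithms stay finite.
    clipped = [min(max(p, eps), 1.0 - eps) for p in probs]
    ce = -sum(
        t * math.log(p) + (1 - t) * math.log(1.0 - p)
        for p, t in zip(clipped, targets)
    ) / len(probs)
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)
    return w_ce * ce + w_dice * (1.0 - dice)


perfect = ce_dice_loss([1.0, 0.0, 1.0], [1, 0, 1])
inverted = ce_dice_loss([0.0, 1.0, 0.0], [1, 0, 1])
print(perfect < inverted)  # True

# Emphasizing the Dice term, as some binary-task settings do.
dice_heavy = ce_dice_loss([0.5, 0.5], [1, 0], w_ce=0.3, w_dice=0.7)
```

Cross-entropy scores each pixel independently, while the Dice term scores region overlap, so raising `w_dice` shifts emphasis toward small foreground structures that contribute few pixels.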

4. Quantitative Results and Ablation Analyses

Reported performance across MD-RWKV-UNet variants exceeds U-Net, nnU-Net, TransUNet, and hybrid transformer baselines in both average Dice similarity coefficient (DSC) and boundary-based metrics such as the 95th-percentile Hausdorff distance (HD95).
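
For reference, the two metrics can be computed as in the sketch below. The helper names are hypothetical, and the HD95 variant shown (nearest-rank percentile, maximum over the two directed distances, brute-force search over boundary points) is one common convention; evaluation toolkits differ in these details, so values may not match published numbers exactly.

```python
import math


def dice_score(pred, gt):
    """Dice coefficient for two flat binary masks (lists of 0/1)."""
    intersection = sum(p * g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 1.0 if total == 0 else 2.0 * intersection / total


def _percentile(values, q):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def hd95(points_a, points_b):
    """95th-percentile Hausdorff distance between two non-empty point sets.

    points_a, points_b: lists of boundary coordinates, e.g. (row, col).
    """
    def directed(src, dst):
        # Distance from each source point to its nearest destination point.
        return [min(math.dist(p, q) for q in dst) for p in src]

    return max(
        _percentile(directed(points_a, points_b), 95),
        _percentile(directed(points_b, points_a), 95),
    )


print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))  # 2 * 1 / (2 + 1)
print(hd95([(0, 0)], [(3, 4)]))                # 5.0
```

DSC rewards overlap and is higher-is-better, whereas HD95 measures boundary disagreement in pixels or millimetres and is lower-is-better, which is why the two are reported together.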

Example comparative results:

Model	Synapse DSC ↑	Synapse HD95 ↓	ACDC DSC ↑	ACDC HD95 ↓	Params	FLOPs
U-Net	76.85	39.70	85.98	19.31	24.9M	—
nnU-Net	82.35	18.76	—	—	—	—
RWKV-UNet	84.02	16.09	92.17	2.75	16.7M	9.0G
MD-RWKV-UNet (full)	85.07	14.67	93.15	1.77	—	—
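Med-URWKV (pure RWKV)	90.93 (Kvasir-SEG)	—	91.02 (GLAS)	—	14.3M	—
Note: the Med-URWKV entries are DSC values reported on Kvasir-SEG and GLAS, not on Synapse or ACDC.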

Performance gains on difficult categories are notable: for example, RWKV-UNet reaches a pancreas segmentation DSC of 69.38% on Synapse, the best reported at publication (Jiang et al., 14 Jan 2025), and HD95 is low on small or irregular organs.

Ablation studies show that:

Adding SKA, DeformableShift, and CrossStageFusion incrementally improves DSC and HD95.
Pretraining consistently increases performance.
ChannelMix in IR-RWKV can be omitted with minimal loss, significantly reducing FLOPs (Jiang et al., 14 Jan 2025).
Adaptive skip/fusion modules (CCM, CrossStageFusion) provide measurable boosts over naïve concatenation.

5. Computational and Practical Considerations

The RWKV mechanism enables global modeling at $O(ND)$ cost per block, dramatically reducing memory and runtime compared to $O(N^2)$ self-attention (Jiang et al., 14 Jan 2025, Ye et al., 15 Jul 2025, Zhou, 12 Jun 2025). Empirically, models such as U-RWKV achieve 2.97M parameters and 7.28 GFLOPs with average Dice of 82.27 across five datasets, and even pure RWKV U-Nets (Med-URWKV) with 14.3M parameters outperform comparably sized transformer and CNN baselines (Zhou, 12 Jun 2025). Volumetric MD-RWKV-UNet is proposed as feasible for large 3D inputs by performing multi-axis scans in linear time (Ye et al., 15 Jul 2025). Lightweight variants (T/S/Base) are documented as suitable for resource-limited deployments and edge inference (Jiang et al., 14 Jan 2025).
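
A back-of-the-envelope comparison makes the scaling gap concrete. The sketch below counts only leading-order terms and ignores constants: it assumes roughly $N^2 D$ multiply-adds for the pairwise score computation in self-attention and $N D$ accumulator updates for a linear scan. The function name and the example sizes are illustrative assumptions, not figures from the cited papers.

```python
def scan_vs_attention_cost(height, width, dim):
    """Leading-order operation counts for one block (constants ignored)."""
    tokens = height * width
    return {
        "tokens": tokens,
        "attention": tokens * tokens * dim,  # pairwise scores over all token pairs
        "scan": tokens * dim,                # one accumulator update per token and channel
    }


# Example: a 64 x 64 feature map with 256 channels.
cost = scan_vs_attention_cost(64, 64, 256)
print(cost["tokens"])                     # 4096
print(cost["attention"] // cost["scan"])  # 4096
```

Under these assumptions the ratio between the two counts equals the token count $N$, so the advantage of the scan grows with resolution.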

6. Limitations, Failure Cases, and Future Directions

Residual limitations include reduced accuracy for extremely small lesion structures (under 50 pixels) and over-smoothed boundaries where the learnable decay parameter of the global scan is too low, leading to excessive spread of attention (Zhou, 12 Jun 2025). Highly irregular shapes (e.g., in GLAS) can elicit "soft" segmentations. Future proposals include extending MD-RWKV-UNet to full 3D segmentation, refining dynamic attention mechanisms, and integrating more robust loss weighting for small targets. The use of large-scale pretraining and low-cost, directionally comprehensive RWKV scanning is likely to further advance segmentation on heterogeneous, high-resolution, and multi-modality medical images.

7. Significance and Impact

MD-RWKV-UNet bridges the gap between locality-focused CNNs and costly global-context transformers, offering a scalable, efficient approach for 2D and, prospectively, 3D clinical segmentation. It provides mechanisms for multi-scale information fusion, adaptive receptive field adjustment, and directional context propagation, with reported results at or above the state of the art on the benchmarks listed above. The design rationale, mathematical basis, and empirical benchmarks are documented across multiple research groups, making MD-RWKV-UNet a reference point in contemporary medical image analysis (Jiang et al., 14 Jan 2025, Fang, 28 Mar 2026, Ye et al., 15 Jul 2025, Zhou, 12 Jun 2025).