Rich-U-Net: A medical image segmentation model for fusing spatial depth features and capturing minute structural details

Published 31 Mar 2026 in eess.IV | (2603.29404v1)

Abstract: Medical image segmentation is of great significance in analysis of illness. The use of deep neural networks in medical image segmentation can help doctors extract regions of interest from complex medical images, thereby improving diagnostic accuracy and enabling better assessment of the condition to formulate treatment plans. However, most current medical image segmentation methods underperform in accurately extracting spatial information from medical images and mining potential complex structures and variations. In this article, we introduce the Rich-U-Net model, which effectively integrates both spatial and depth features. This fusion enhances the model's capability to detect fine structures and intricate details within complex medical images. Our multi-level and multi-dimensional feature fusion and optimization strategies enable our model to achieve fine structure localization and accurate segmentation results in medical image segmentation. Experiments on the ISIC2018, BUSI, GLAS, and CVC datasets show that Rich-U-Net surpasses other state-of-the-art models in Dice, IoU, and HD95 metrics.


Authors (4)

Summary

The paper introduces Rich-U-Net, a novel model that fuses spatial context with LSTM-based depth modeling via Sparse K-Attention to capture minute anatomical details.
The paper employs a Multi-Scale Adaptive Gated Fusion module to combine multi-resolution features and improve boundary recovery in complex clinical images.
The paper demonstrates superior performance with higher Dice scores and lower HD95 on benchmarks, outperforming current transformer-based and multi-scale methods.

Rich-U-Net: A Model for Fusing Spatial Depth Features and Capturing Structural Detail in Medical Image Segmentation

Introduction

Rich-U-Net addresses limitations common to most U-Net variants in medical image segmentation, specifically the underutilization of spatial context and the insufficient modeling of minute and complex anatomical structures. The model synthesizes multiple advanced mechanisms, including Sparse K-Nearest Neighbor Attention (K-Attention), a novel Fusion-Layer that amalgamates LSTM-based depth modeling with spatial convolutions, and a Multi-Scale Adaptive Gated Fusion (MSAGF) module, to yield state-of-the-art results across several medical segmentation benchmarks. The architecture is explicitly optimized for complex clinical images where fine-grained boundary recovery and contextual understanding are critical.

Figure 1: Rich-U-Net architectural overview, integrating K-Attention, Fusion-Layer, and MSAGF modules within a U-shaped topology.

Model Architecture

Rich-U-Net preserves the encoder-bottleneck-decoder macro-structure typical of U-Net while embedding three major architectural innovations:

K-Attention Module: Introduced post-encoding, this mechanism sparsifies self-attention using a kNN-like top-k relationship mask, focusing computational and modeling capacity on the most relevant neighborhoods for each token. This facilitates robust local and global context modeling, particularly for boundaries and small structures with ambiguous contrast.
Fusion-Layer: Combines an LSTM, which models sequential context (depth or slice-level information), with depthwise separable convolutions, which extract spatial detail. A gating mechanism dynamically weights the two components, allowing selective information flow along both the sequential and the spatial dimension. This matters for medical images, where the relevant context can be heterogeneous.
MSAGF Module: In the decoder, multi-scale features are aggregated through parallel global (channel) and spatial (pixel-wise) paths, both regulated by learned attention gates. This dual-path aggregation exploits low- and high-resolution cues for improved segmentation robustness, notably when lesion scale and appearance vary considerably.
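The top-k masking idea behind K-Attention can be illustrated with a minimal sketch. It assumes single-head scaled dot-product attention over plain Python lists; the function name `topk_sparse_attention` and its parameters are hypothetical, and the paper's actual module may differ in how neighbors are chosen and projected. Note that this toy version still scores every query-key pair before masking, so it shows the sparsity pattern rather than the reduced cost.

```python
import math

def topk_sparse_attention(queries, keys, values, k):
    """Single-head attention where each query attends only to its k best-scoring keys."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(dim) for key in keys]
        # Keep the indices of the k largest scores; all other keys are masked out.
        keep = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
        # Softmax over the kept scores only (max subtracted for numerical stability).
        top = max(scores[j] for j in keep)
        weights = {j: math.exp(scores[j] - top) for j in keep}
        total = sum(weights.values())
        # Weighted sum of the selected value vectors.
        outputs.append([sum(weights[j] / total * values[j][d] for j in keep)
                        for d in range(len(values[0]))])
    return outputs

# Four toy tokens forming two clusters; self-attention with k=2 neighbors per token.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
sparse_out = topk_sparse_attention(tokens, tokens, tokens, k=2)
```

With `k=2`, each token mixes only with the two tokens it scores highest against, so the two clusters do not exchange information; setting `k` to the number of tokens recovers dense attention.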

These components are orchestrated with skip connections and residual operations to maximize feature reuse and maintain gradient flow during optimization.
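The gating described for the Fusion-Layer can be sketched as an element-wise convex blend of two feature vectors. This is a hedged illustration only: the summary does not give the gate's exact form, so the per-element sigmoid gate, the function name `gated_fusion`, and the weight parameters below are all assumptions. The LSTM and convolution outputs are stood in for by plain lists.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(depth_feat, spatial_feat, w_depth, w_spatial, bias):
    """Blend two equally long feature vectors with a per-element sigmoid gate."""
    fused = []
    for d, s, wd, ws, b in zip(depth_feat, spatial_feat, w_depth, w_spatial, bias):
        g = sigmoid(wd * d + ws * s + b)    # gate value in [0, 1]
        fused.append(g * d + (1.0 - g) * s)  # convex combination of the two inputs
    return fused

# Stand-ins for an LSTM output and a convolution output (hypothetical values).
depth_feat = [2.0, 0.0]
spatial_feat = [0.0, 4.0]

# Zero weights and bias give g = 0.5, i.e. a plain average of the two inputs.
balanced = gated_fusion(depth_feat, spatial_feat, [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])

# A large positive bias pushes g toward 1, so the depth branch dominates.
depth_heavy = gated_fusion(depth_feat, spatial_feat, [0.0, 0.0], [0.0, 0.0], [50.0, 50.0])
```

In a trained network the gate weights would be learned, so the blend can favor the sequential branch in some positions and the spatial branch in others.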

Experimental Results

Experiments were conducted on the ISIC2018 (skin lesions), BUSI (breast ultrasound), GLAS (colon histology), and CVC (colonoscopy polyp) benchmarks. Dice, IoU, and the 95th-percentile Hausdorff Distance (HD95) were used for evaluation. Across all four datasets, Rich-U-Net outperforms competitive baselines, including transformer-based architectures, lightweight variants, and recent multi-scale decoders.
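For readers less familiar with the overlap metrics, the following sketch computes Dice and IoU for binary masks using their standard definitions. The function name `dice_and_iou` and the empty-mask convention are assumptions made for illustration; the paper's evaluation code is not shown in this summary.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equally sized binary masks given as nested 0/1 lists."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    inter = sum(a * b for a, b in zip(p, t))   # pixels predicted AND labeled foreground
    p_sum, t_sum = sum(p), sum(t)
    union = p_sum + t_sum - inter
    # Assumed convention: two empty masks count as a perfect match.
    dice = 1.0 if p_sum + t_sum == 0 else 2.0 * inter / (p_sum + t_sum)
    iou = 1.0 if union == 0 else inter / union
    return dice, iou

pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)   # 2 shared pixels, 3 predicted, 3 labeled
```

The two metrics are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank individual predictions the same way, although dataset-level averages can still differ.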

On ISIC2018, Rich-U-Net achieves a Dice of 0.9116, surpassing UNeXt (0.9030), EMCAD (0.9096), and Trans-U-Net (0.8891).
On GLAS, the Dice is 0.9184, compared to UNeXt (0.8883) and EMCAD (0.9097).
The Dice gains are consistently accompanied by lower HD95, indicating more accurate boundaries.

Figure 2: Dice coefficient comparison across models on standardized benchmarks, highlighting Rich-U-Net’s lead.

Visualizations of the segmentation masks show that Rich-U-Net delineates minute structures more cleanly than competing models, particularly in small regions and regions of ambiguous intensity.

Figure 3: Qualitative Dice performance examples on multiple datasets, showcasing segmentation accuracy at various scales.
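HD95, the boundary metric reported in this section, can be sketched as follows. Several conventions exist in practice; this illustration assumes point sets of boundary coordinates, a nearest-rank percentile, and the maximum of the two directed percentiles. The function names are hypothetical and not taken from the paper.

```python
import math

def percentile_nearest_rank(values, q):
    """q-th percentile (0 < q <= 100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def directed_distances(src, dst):
    """Distance from each point in src to its nearest point in dst."""
    return [min(math.dist(a, b) for b in dst) for a in src]

def hausdorff_percentile(points_a, points_b, q=95):
    """Symmetric percentile Hausdorff distance; q=100 gives the classic maximum."""
    d_ab = directed_distances(points_a, points_b)
    d_ba = directed_distances(points_b, points_a)
    return max(percentile_nearest_rank(d_ab, q), percentile_nearest_rank(d_ba, q))

# Two identical contours, except that the second has one stray outlier pixel.
contour_a = [(i, 0) for i in range(20)]
contour_b = contour_a + [(0, 10)]

hd95 = hausdorff_percentile(contour_a, contour_b, q=95)    # ignores the single outlier
hd100 = hausdorff_percentile(contour_a, contour_b, q=100)  # dominated by the outlier
```

Here the classic Hausdorff distance is driven entirely by the single stray pixel, whereas the 95th percentile discards it, which is why HD95 is generally preferred for comparing segmentation boundaries.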

Ablation Study

Ablation experiments isolate the contributions of K-Attention, Fusion-Layer, and MSAGF:

Removing any single module degrades performance; for instance, omitting MSAGF lowers Dice on ISIC2018 from 0.9116 to 0.9023, an absolute drop of 0.0093.
The complete architecture, with all modules active, achieves the highest and most consistent results across domains.
The ablation boxplots indicate that combining all three modules yields the narrowest variance and the best median and mean metrics.

Figure 4: Ablation boxplot illustrating the impact of each architectural contribution on segmentation performance.
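To put the ablation numbers in perspective, the short calculation below compares the MSAGF ablation drop on ISIC2018 with Rich-U-Net's margins over the baselines reported earlier. All Dice values are the ones quoted in this summary; the variable names are illustrative only.

```python
full_dice = 0.9116        # ISIC2018, all modules active
without_msagf = 0.9023    # ISIC2018, MSAGF removed

abs_drop = full_dice - without_msagf
rel_drop_pct = 100.0 * abs_drop / full_dice
print(f"MSAGF ablation: -{abs_drop:.4f} Dice ({rel_drop_pct:.2f}% relative)")

# ISIC2018 baseline Dice scores as quoted in the results above.
baselines = {"UNeXt": 0.9030, "EMCAD": 0.9096, "Trans-U-Net": 0.8891}
margins = {name: full_dice - score for name, score in baselines.items()}
for name, margin in margins.items():
    print(f"margin over {name}: +{margin:.4f} Dice")

# Baselines that the MSAGF-ablated variant no longer beats.
not_beaten = [name for name, score in baselines.items() if without_msagf < score]
print("ablated variant falls below:", ", ".join(not_beaten))
```

On these figures, the drop from removing MSAGF (0.0093) is larger than the full model's margin over UNeXt (0.0086) and EMCAD (0.0020), so the ablated variant would fall behind both of those baselines on ISIC2018.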

Methodological Implications

K-Attention imposes structure on self-attention, reducing computational complexity from $O(N^2)$ to $O(kN)$ and enhancing feature selectivity.
Fusion-Layer bridges spatial and depth-wise (or inter-slice/temporal) dependencies; such integration can be readily extended to 3D volumetric data or multimodal temporal imaging.
MSAGF generalizes the notion of multi-scale fusion, making it agnostic to resolution and thus resilient to differences in imaging protocol, scale, or anatomical context.
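As a rough sense of scale for the K-Attention complexity claim above, consider a worked count of attended query-key pairs. The values of $N$ and $k$ here are illustrative assumptions, not settings reported for the paper:

$$
N = 64 \times 64 = 4096, \quad k = 16: \qquad N^2 = 16{,}777{,}216, \qquad kN = 65{,}536, \qquad \frac{N^2}{kN} = \frac{N}{k} = 256.
$$

Under these assumed values, each token aggregates over 256 times fewer pairs than in dense attention. The count covers only the aggregation step; how much of the saving is realized end to end depends on how the top-$k$ neighbors are selected.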

These findings support the value of contextually selective attention, scale-agnostic fusion, and explicit depth modeling in clinical segmentation. The modular design also makes the components straightforward to incorporate into other frameworks, including hybrid Transformer-CNN backbones.
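The dual-path gating attributed to MSAGF can be sketched in a few lines. This is a simplified stand-in under stated assumptions: the learned projections are omitted, the two inputs are assumed to be already resized to the same `[C][H][W]` shape, and the two gates are combined by averaging. None of these choices, nor the function name `dual_path_gated_fusion`, is confirmed by the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dual_path_gated_fusion(feat_a, feat_b):
    """Fuse two [C][H][W] feature maps using a channel gate and a spatial gate."""
    channels, height, width = len(feat_a), len(feat_a[0]), len(feat_a[0][0])
    summed = [[[feat_a[c][y][x] + feat_b[c][y][x] for x in range(width)]
               for y in range(height)] for c in range(channels)]
    # Global (channel) path: one gate per channel from global average pooling.
    channel_gate = [sigmoid(sum(map(sum, summed[c])) / (height * width))
                    for c in range(channels)]
    # Spatial (pixel-wise) path: one gate per pixel from the mean over channels.
    spatial_gate = [[sigmoid(sum(summed[c][y][x] for c in range(channels)) / channels)
                     for x in range(width)] for y in range(height)]
    fused = []
    for c in range(channels):
        plane = []
        for y in range(height):
            row = []
            for x in range(width):
                # Assumed combination rule: average the two gates.
                g = 0.5 * (channel_gate[c] + spatial_gate[y][x])
                row.append(g * feat_a[c][y][x] + (1.0 - g) * feat_b[c][y][x])
            plane.append(row)
        fused.append(plane)
    return fused

# Two channels on a 1x2 grid; hypothetical coarse and fine decoder features.
coarse = [[[1.0, 0.0]], [[0.5, 0.5]]]
fine = [[[0.0, 2.0]], [[0.5, 1.5]]]
fused = dual_path_gated_fusion(coarse, fine)
```

Because each output is a convex combination of the two inputs, every fused value lies between the corresponding coarse and fine values; the gates only decide which source to trust more per channel and per pixel.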

Future Directions

Potential research directions include:

Fusing Rich-U-Net with unsupervised or semi-supervised objectives to leverage the wealth of unlabeled clinical imagery.
Extending MSAGF to cross-modal fusion (e.g., MRI and CT).
AutoML-based discovery of optimal scale fusion strategies for new imaging modalities.
Investigating interpretability, especially which module primarily contributes to error reduction in challenging scenarios (e.g., rare pathologies).

Conclusion

Rich-U-Net sets a new performance baseline for medical image segmentation by explicitly fusing spatial and depth features and emphasizing minute anatomical structure. Its results on the four benchmarks support the importance of sparsity-aware attention, gated multi-scale fusion, and spatio-temporal context modeling in medical vision applications, and they motivate further exploration of context-dependent modeling and adaptive fusion strategies for broader biomedical imaging challenges.
