U-Net Transformer: Self and Cross Attention for Medical Image Segmentation
The paper "U-Net Transformer: Self and Cross Attention for Medical Image Segmentation" presents an advancement in the field of medical image segmentation by integrating the capabilities of transformers into the established U-Net architecture. Traditionally, U-Nets have stood as a benchmark due to their efficacy in biomedical image segmentation tasks. However, their limited ability to handle long-range dependencies and contextual information often poses a challenge, particularly in complex anatomical structures with low contrast.
The U-Transformer model extends the U-Net framework by incorporating both self-attention and cross-attention mechanisms derived from the transformer architecture. This integration addresses U-Net's inherent limitation of capturing only local interactions by allowing the model to leverage global contextual information, which is crucial for precise image segmentation.
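To make "global contextual information" concrete, the sketch below flattens a convolutional feature map into one token per spatial position and applies scaled dot-product self-attention, so that every position can attend to every other. It illustrates only the underlying primitive, not the paper's module: the single-head formulation, the projection matrices w_q, w_k, w_v, and the tensor sizes are placeholder assumptions.

```python
import torch

def global_self_attention(feat, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all spatial positions.

    feat: (B, C, H, W) convolutional feature map.
    w_q, w_k, w_v: (C, C) projection matrices (placeholders for illustration).
    Each output position is a weighted mix of every input position,
    i.e. it sees the whole image rather than a local neighbourhood.
    """
    B, C, H, W = feat.shape
    x = feat.flatten(2).transpose(1, 2)                 # (B, H*W, C): one token per pixel
    q, k, v = x @ w_q, x @ w_k, x @ w_v                 # linear projections
    attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)  # (B, HW, HW) weights
    out = attn @ v                                      # global mixing of positions
    return out.transpose(1, 2).reshape(B, C, H, W)

# Toy usage with a small 16-channel feature map.
f = torch.randn(1, 16, 8, 8)
w = [torch.randn(16, 16) * 0.1 for _ in range(3)]
print(global_self_attention(f, *w).shape)               # torch.Size([1, 16, 8, 8])
```

Note that the attention matrix has size (H·W) × (H·W), which is why global attention of this kind is typically applied only where feature maps are small, as in the bottleneck placement described next.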
The model introduces attention at two points in the network: a Multi-Head Self-Attention (MHSA) module at the end of the encoder (the bottleneck, where feature maps are smallest and attention over all positions is most affordable) and a Multi-Head Cross-Attention (MHCA) module applied to the skip connections feeding the decoder. The self-attention module encodes the full spatial context of the input image by letting every feature position interact with every other, while cross-attention filters non-semantic features out of the skip connections during decoding, improving the recovery of spatial resolution without sacrificing semantic consistency.
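A rough sketch of how these two modules could slot into a U-Net is given below, using PyTorch's nn.MultiheadAttention. It is a simplified reading under stated assumptions rather than the authors' implementation: positional encodings and normalization are omitted, the channel counts are placeholders, and the MHCA details (queries from the skip connection, keys and values from the upsampled decoder features, followed by a sigmoid gate) follow a common cross-attention convention that may differ from the paper's exact formulation.

```python
import torch
import torch.nn as nn

class MHSABottleneck(nn.Module):
    """Self-attention over the encoder's deepest feature map (smallest H*W)."""
    def __init__(self, channels, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                          # x: (B, C, H, W)
        B, C, H, W = x.shape
        seq = x.flatten(2).transpose(1, 2)         # (B, H*W, C) tokens
        out, _ = self.attn(seq, seq, seq)          # every position attends to all others
        return out.transpose(1, 2).reshape(B, C, H, W)

class MHCASkip(nn.Module):
    """Cross-attention gate on a skip connection (simplified reading, not the paper's exact design)."""
    def __init__(self, channels, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, skip, decoder):              # both: (B, C, H, W), same spatial size
        B, C, H, W = skip.shape
        s = skip.flatten(2).transpose(1, 2)        # queries: high-resolution skip features
        d = decoder.flatten(2).transpose(1, 2)     # keys/values: semantic decoder features
        gate, _ = self.attn(s, d, d)               # assumption: Q from skip, K/V from decoder
        gated = skip * torch.sigmoid(gate.transpose(1, 2).reshape(B, C, H, W))
        return torch.cat([gated, decoder], dim=1)  # fed to the next decoder conv block

# Toy shapes: one bottleneck and one decoder stage.
mhsa, mhca = MHSABottleneck(64), MHCASkip(32)
print(mhsa(torch.randn(1, 64, 8, 8)).shape)        # torch.Size([1, 64, 8, 8])
print(mhca(torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16)).shape)  # torch.Size([1, 64, 16, 16])
```

In this reading, MHSA is applied once at the bottleneck to keep the quadratic attention cost manageable, while an MHCA gate sits at each decoder stage where a skip connection is merged.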
Empirical evaluation of U-Transformer was conducted on two datasets: the TCIA pancreas dataset and an internal multi-organ dataset. The U-Transformer showed notable gains in segmentation accuracy over the conventional U-Net and over Attention U-Net, which augments skip connections with local attention gates, reaching Dice similarity coefficients of 78.50% on TCIA and 88.08% on the multi-organ dataset. These results underscore the model's strength in challenging cases such as the pancreas, whose intricate, variable shape and low contrast with neighboring structures make it notoriously difficult to segment.
Furthermore, the paper shows that the two attention mechanisms are complementary: each module yields gains on its own, and combining them improves segmentation further. Positional encoding adds to these gains by injecting absolute position information into the attention layers, allowing the model to exploit the roughly consistent spatial arrangement of anatomical structures across patients.
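The effect of positional encoding can be illustrated with a standard 2D sinusoidal encoding added to a feature map before attention. The function below uses the familiar transformer-style formulation split between row and column coordinates; this is a generic choice made here for illustration and is not necessarily the paper's exact scheme.

```python
import math
import torch

def sinusoidal_encoding_2d(channels, height, width):
    """Transformer-style sinusoidal encoding, half the channels for rows, half for columns.

    Returns a (channels, height, width) tensor to add to a feature map so that
    attention can distinguish absolute positions (e.g. the typical layout of organs).
    """
    assert channels % 4 == 0, "channels must be divisible by 4"
    pe = torch.zeros(channels, height, width)
    half = channels // 2
    div = torch.exp(torch.arange(0, half, 2) * (-math.log(10000.0) / half))
    ys = torch.arange(height).unsqueeze(1) * div            # (H, half/2) row frequencies
    xs = torch.arange(width).unsqueeze(1) * div              # (W, half/2) column frequencies
    pe[0:half:2]     = torch.sin(ys).T.unsqueeze(2).expand(-1, -1, width)
    pe[1:half:2]     = torch.cos(ys).T.unsqueeze(2).expand(-1, -1, width)
    pe[half::2]      = torch.sin(xs).T.unsqueeze(1).expand(-1, height, -1)
    pe[half + 1::2]  = torch.cos(xs).T.unsqueeze(1).expand(-1, height, -1)
    return pe

# Toy usage: inject absolute position into a bottleneck feature map before attention.
feat = torch.randn(1, 64, 8, 8)
feat_with_pos = feat + sinusoidal_encoding_2d(64, 8, 8)
print(feat_with_pos.shape)                                    # torch.Size([1, 64, 8, 8])
```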
The implications of this research are substantial for medical imaging, where precise segmentation directly influences diagnostic and therapeutic decisions. By better modeling intricate contextual and spatial relationships, the U-Transformer can aid radiologists and healthcare systems, supporting more accurate assessments and better-informed interventions.
Future prospects of this work could explore the adaptation of U-Transformer to 3D medical imaging modalities such as MRI and ultrasound, potentially extending its applicability across a wider range of image-based diagnostic tasks. Additionally, computational optimizations to manage the extra parameters and the quadratic memory and compute cost of attention will be crucial for practical deployment in clinical settings.
This paper contributes meaningfully to the ongoing evolution of medical image segmentation technology by enhancing the capacity to represent global interactions, marking a step forward in bridging the gap between deep learning architectures and their application in healthcare.