Mixed Transformer U-Net For Medical Image Segmentation (2111.04734v2)

Published 8 Nov 2021 in eess.IV, cs.AI, and cs.CV

Abstract: Though U-Net has achieved tremendous success in medical image segmentation tasks, it lacks the ability to explicitly model long-range dependencies. Therefore, Vision Transformers have emerged as alternative segmentation structures recently, for their innate ability of capturing long-range correlations through Self-Attention (SA). However, Transformers usually rely on large-scale pre-training and have high computational complexity. Furthermore, SA can only model self-affinities within a single sample, ignoring the potential correlations of the overall dataset. To address these problems, we propose a novel Transformer module named Mixed Transformer Module (MTM) for simultaneous inter- and intra- affinities learning. MTM first calculates self-affinities efficiently through our well-designed Local-Global Gaussian-Weighted Self-Attention (LGG-SA). Then, it mines inter-connections between data samples through External Attention (EA). By using MTM, we construct a U-shaped model named Mixed Transformer U-Net (MT-UNet) for accurate medical image segmentation. We test our method on two different public datasets, and the experimental results show that the proposed method achieves better performance over other state-of-the-art methods. The code is available at: https://github.com/Dootmaan/MT-UNet.

Authors (7)
  1. Hongyi Wang (62 papers)
  2. Shiao Xie (4 papers)
  3. Lanfen Lin (36 papers)
  4. Yutaro Iwamoto (12 papers)
  5. Xian-Hua Han (6 papers)
  6. Yen-Wei Chen (36 papers)
  7. Ruofeng Tong (25 papers)
Citations (203)

Summary

Evaluation of the Mixed Transformer U-Net for Medical Image Segmentation

The paper presents an approach to medical image segmentation built around the Mixed Transformer U-Net (MT-UNet), which integrates a newly proposed Mixed Transformer Module (MTM). The architecture aims to combine the strengths of U-Net and Transformers, modeling long-range dependencies while keeping computational cost and pre-training requirements manageable.

Key Components and Methodological Innovations

The research outlines several key innovations:

  • Mixed Transformer Module (MTM): The MTM is central to the architecture, enabling simultaneous learning of inter- and intra-sample affinities. It pairs Local-Global Gaussian-Weighted Self-Attention (LGG-SA) with External Attention (EA), capturing both localized and global context within each image while also mining correlations across samples (see the sketches after this list).
  • Local-Global Gaussian-Weighted Self-Attention (LGG-SA): This mechanism extends standard self-attention by Gaussian-weighting the attention logits so that regions near each query receive greater weight. It applies fine-grained self-attention to the local context and coarse-grained self-attention to the global context, reducing computational cost while improving performance on medical images.
  • Convolution Stem Integration: Unlike standard Transformers, which lack inherent structural priors, MT-UNet retains convolutional operations in its shallow layers, improving feature extraction and supplying useful inductive priors for segmentation, especially on the small datasets typical of medical imaging.
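
To make the two attention mechanisms concrete, below is a minimal PyTorch sketch of each idea: a Gaussian-weighted self-attention that biases attention logits by spatial distance, and External Attention in the style of Guo et al. (2021), which uses two small linear memories shared across samples. The class names, single-head setup, and fixed `sigma` are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class GaussianWeightedSelfAttention(nn.Module):
    """Single-head self-attention whose logits are biased by a Gaussian of
    the spatial distance between query and key positions, so proximal
    regions receive higher weight (the core idea behind LGG-SA)."""

    def __init__(self, dim: int, sigma: float = 3.0):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        self.sigma = sigma

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) with N = h * w flattened spatial tokens.
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Pairwise squared distances between token positions on the h x w grid.
        ys, xs = torch.meshgrid(torch.arange(h, device=x.device),
                                torch.arange(w, device=x.device), indexing="ij")
        pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)
        dist2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)       # (N, N)

        # Gaussian bias: nearby keys get larger (less negative) logits.
        attn = (q @ k.transpose(-2, -1)) * self.scale - dist2 / (2 * self.sigma ** 2)
        return self.proj(attn.softmax(dim=-1) @ v)

class ExternalAttention(nn.Module):
    """External Attention (Guo et al., 2021): two small linear memories
    shared across all samples replace per-sample keys/values, so the layer
    can pick up dataset-level correlations at linear cost in N."""

    def __init__(self, dim: int, mem: int = 64):
        super().__init__()
        self.mk = nn.Linear(dim, mem, bias=False)   # external key memory
        self.mv = nn.Linear(mem, dim, bias=False)   # external value memory

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.mk(x).softmax(dim=1)                       # normalize over tokens
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)  # double normalization
        return self.mv(attn)
```

In the paper, these two mechanisms are combined inside the MTM; the sketch above isolates each one for clarity.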

Performance Evaluation

The model is evaluated on two public datasets: the Synapse multi-organ segmentation dataset and the ACDC cardiac MRI dataset. MT-UNet outperforms other state-of-the-art methods on both the Dice Similarity Coefficient (DSC, higher is better) and the 95th-percentile Hausdorff Distance (HD95, lower is better); a worked DSC example follows the results below.

  • On the Synapse dataset, MT-UNet achieved a DSC of 78.59% and an HD95 of 26.59 mm, outperforming baseline architectures including TransUNet and CNN variants.
  • On the ACDC dataset, MT-UNet reported a DSC of 90.43%, a notable improvement in segmentation accuracy.
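
As a concrete reference for the headline metric, here is a minimal sketch of how DSC is computed for a binary mask (multi-organ DSC applies the same computation per label and averages); the toy masks and epsilon smoothing are illustrative, not the paper's evaluation code. HD95 is typically computed with a dedicated library (e.g., MedPy's `hd95`).

```python
import torch

def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice Similarity Coefficient for binary masks: 2|A ∩ B| / (|A| + |B|)."""
    pred, target = pred.flatten().float(), target.flatten().float()
    inter = (pred * target).sum()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)

# Toy example: masks with 4 and 3 foreground pixels, 3 of them overlapping.
a = torch.zeros(4, 4); a[1, 1:3] = 1; a[2, 1:3] = 1   # pixels (1,1),(1,2),(2,1),(2,2)
b = torch.zeros(4, 4); b[1, 1:3] = 1; b[2, 1] = 1     # pixels (1,1),(1,2),(2,1)
print(dice_score(a, b).item())  # 2*3 / (4+3) ≈ 0.8571
```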

Technical Implications and Future Directions

MT-UNet strikes an effective balance between computational overhead and model performance by employing Transformer blocks only in the deeper, lower-resolution layers while retaining convolutional operations in the initial layers (a schematic of this layout is sketched below). This design improves segmentation performance without requiring extensive pre-training or large-scale data, sidestepping a common limitation of Transformer models in medical imaging.
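
To illustrate that layout, here is a schematic hybrid U-shape in PyTorch: convolutions operate at high resolution, and a Transformer block runs only at the downsampled bottleneck where the token count is small. The depth, channel sizes, class count, and the stock `nn.TransformerEncoderLayer` (standing in for the paper's MTM) are all illustrative assumptions, not the published MT-UNet.

```python
import torch
import torch.nn as nn

def conv_block(cin: int, cout: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
    )

class HybridUNet(nn.Module):
    """Schematic hybrid U-shape: conv encoder/decoder at full resolution,
    attention only at the downsampled bottleneck."""

    def __init__(self, in_ch: int = 1, num_classes: int = 9, base: int = 32):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)              # shallow: conv only
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = nn.TransformerEncoderLayer(    # deep: attention
            d_model=base * 2, nhead=4, dim_feedforward=base * 4, batch_first=True)
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, num_classes, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s1 = self.enc1(x)                                # (B, base, H, W)
        s2 = self.enc2(self.pool(s1))                    # (B, 2*base, H/2, W/2)
        b, c, h, w = s2.shape
        tokens = self.bottleneck(s2.flatten(2).transpose(1, 2))  # (B, HW/4, C)
        s2 = tokens.transpose(1, 2).reshape(b, c, h, w)
        d1 = self.dec1(torch.cat([self.up(s2), s1], dim=1))      # skip connection
        return self.head(d1)

logits = HybridUNet()(torch.randn(1, 1, 64, 64))  # -> (1, 9, 64, 64)
```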

Future directions include extending the model to a broader range of segmentation tasks and integrating unsupervised pre-training to mitigate the high cost of labeled medical data. The adaptability of the MT-UNet structure to other datasets, including those outside medical imaging, could also be explored.

Overall, MT-UNet marks a meaningful step in applying Transformer mechanisms to improve the accuracy and efficiency of medical image segmentation, paving the way for use in real-world clinical applications.