UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation
The paper "UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation" introduces a novel architecture that integrates Transformer self-attention mechanisms within a convolutional neural network (CNN) framework to address medical image segmentation tasks. This hybrid model, termed UTNet, aims to harness the extensive sequence modeling capabilities of Transformers together with the locality-specific advantages of CNNs.
Summary of Contributions
UTNet provides an efficient architecture for medical imaging applications by embedding Transformer self-attention modules in both the encoder and decoder of a U-shaped network. The paper tackles two challenges inherent in applying conventional Transformers directly to image data: the quadratic computational cost of self-attention and the large amounts of training data required to learn inductive biases that convolutions encode by design.
Key contributions of the UTNet architecture are:
- Efficient Self-Attention with Linear Complexity: The authors propose a self-attention mechanism that reduces complexity from O(n²) to approximately O(n), making the model feasible for high-resolution medical images (a minimal sketch follows this list).
- Hybrid Architecture Integration: Following standard UNet conventions, the model replaces the last convolution of the building block at multiple resolution levels with a Transformer module, allowing the network to capture long-range dependencies (see the second sketch below).
- Relative Positional Encoding: A relative positional encoding is injected into the attention computation so that self-attention accounts for the spatial relationships between pixels, improving the mechanism's fit to structured image data.
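To make the first bullet concrete, below is a minimal PyTorch sketch of the linear-complexity idea: keys and values are computed from a feature map pooled down to a small fixed grid, so the attention matrix is n × k with k ≪ n rather than n × n. The module and parameter names (`EfficientSelfAttention`, `reduced_size`, the pooling operator, and the simplified learned positional bias standing in for the paper's 2D relative positional encoding) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EfficientSelfAttention(nn.Module):
    """Self-attention whose cost grows linearly in the n = H*W pixels:
    keys/values come from a feature map pooled to a fixed reduced_size x
    reduced_size grid (k tokens), so the attention map is n x k, not n x n."""

    def __init__(self, channels, heads=4, reduced_size=8):
        super().__init__()
        assert channels % heads == 0
        self.heads = heads
        self.scale = (channels // heads) ** -0.5
        self.reduced_size = reduced_size
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_kv = nn.Conv2d(channels, 2 * channels, kernel_size=1)
        self.out = nn.Conv2d(channels, channels, kernel_size=1)
        # Simplified learned bias over the k key tokens; the paper instead
        # uses a 2D *relative* positional encoding inside the logits.
        self.pos_bias = nn.Parameter(torch.zeros(heads, 1, reduced_size * reduced_size))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.to_q(x)
        # Down-sample before computing keys/values: k = reduced_size**2 tokens.
        kv = self.to_kv(F.adaptive_avg_pool2d(x, self.reduced_size))
        k, v = kv.chunk(2, dim=1)

        def split_heads(t):  # (b, c, H', W') -> (b, heads, d, H'*W')
            return t.flatten(2).reshape(b, self.heads, c // self.heads, -1)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        logits = q.transpose(-2, -1) @ k * self.scale + self.pos_bias  # (b, heads, n, k)
        attn = logits.softmax(dim=-1)
        out = attn @ v.transpose(-2, -1)                               # (b, heads, n, d)
        return self.out(out.transpose(-2, -1).reshape(b, c, h, w))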
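For a 256×256 feature map with `reduced_size` 8, the attention matrix has 65,536 × 64 entries rather than 65,536², which is what makes the module tractable at high resolution.

The second sketch illustrates the integration pattern from the second bullet: a UNet-style building block whose last convolution is swapped for the attention module above. The residual wiring and normalization placement here are assumptions for illustration, not the paper's verbatim block design; it reuses the imports and `EfficientSelfAttention` class from the sketch above.

```python
class HybridBlock(nn.Module):
    """UNet-style building block in which the second 3x3 convolution is
    replaced by self-attention, so the block mixes local context (conv)
    with global context (attention). Layout is illustrative."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Where a plain UNet block would stack a second convolution,
        # the hybrid block applies efficient self-attention instead.
        self.attn = EfficientSelfAttention(channels)
        self.norm = nn.BatchNorm2d(channels)

    def forward(self, x):
        x = self.conv(x)
        return torch.relu(self.norm(x + self.attn(x)))  # residual around attention
```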
Experimental Results
The efficacy of UTNet is demonstrated on a multi-label, multi-vendor cardiac MRI segmentation dataset, where it outperforms state-of-the-art models, including UNet, ResUNet, and several attention-based variants, with significant improvements in Dice scores. Notably, UTNet achieves this with fewer parameters and substantially faster execution than comparable attention-based models.
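For reference, the Dice score used in these comparisons measures overlap between predicted and ground-truth masks, 2|A ∩ B| / (|A| + |B|). A minimal soft-Dice implementation (function name and smoothing constant are illustrative):

```python
import torch

def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice coefficient 2|A ∩ B| / (|A| + |B|), averaged over the batch.
    pred holds probabilities in [0, 1]; target holds binary ground truth.
    Both are shaped (batch, H, W); eps avoids division by zero on empty masks."""
    pred, target = pred.flatten(1), target.flatten(1)
    intersection = (pred * target).sum(dim=1)
    return ((2 * intersection + eps) / (pred.sum(dim=1) + target.sum(dim=1) + eps)).mean()
```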
A robustness evaluation across vendors shows that UTNet generalizes well, with reduced performance variance on data from vendors unseen during training. The authors attribute this to the model's ability to combine local features with global contextual information.
Implications and Future Directions
UTNet points toward more adaptable and efficient integration of Transformers in domains with limited data, such as medical imaging. By reducing the reliance on pre-training with large datasets and using the hybrid design to overcome the limited receptive field of convolutional layers, UTNet represents a significant advance in image segmentation.
Further research could explore adaptive hybrid architectures in which the mix of neural components varies with data characteristics or task requirements. Extending the approach to other computer-vision domains could yield improvements in tasks requiring precise boundary detection and complex feature representation.
Additionally, future work might incorporate this hybrid approach into other areas of medical imaging, such as anomaly detection or image reconstruction, to enhance diagnostic accuracy and efficiency.
The UTNet model, as proposed, underscores the value of combining attention architectures originating in NLP with traditional vision models, charting a path for novel hybrid architectures in digital image processing and beyond.