TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation
TransUNet, introduced by Jieneng Chen et al., presents a novel architecture that leverages transformers for medical image segmentation. The prevalent U-Net architecture struggles to model long-range dependencies because of the inherent locality of convolution operations; the paper addresses this limitation by integrating transformers, which excel at capturing global context, into the medical image segmentation framework.
Background
Medical image segmentation is pivotal for many healthcare applications, aiding disease diagnosis and treatment planning. CNN-based architectures, U-Net in particular, have dominated this domain thanks to their symmetric encoder-decoder structure and skip connections that preserve fine detail. However, U-Net is weak at modeling long-range dependencies, a critical capability for accurately segmenting medical images that can show significant inter-patient variability.
Transformers, initially designed for sequence-to-sequence tasks in NLP, inherently possess a global self-attention mechanism, making them suitable for capturing long-range dependencies. Despite this strength, transformers can struggle with localization, often yielding low-resolution features that miss finer details necessary for precise segmentation.
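For reference, this global interaction comes from the standard scaled dot-product self-attention used in ViT-style encoders; the formulation below is the textbook one rather than anything specific to TransUNet, with queries, keys, and values obtained as linear projections of the token sequence:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

Because every token attends to every other token, a single transformer layer can relate arbitrarily distant image regions.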
Contributions
TransUNet proposes a hybrid framework that marries the strengths of both transformers and U-Net:
- Transformer Encoder: Image patches are tokenized into a sequence and fed to a transformer encoder, whose self-attention extracts global context.
- CNN-Transformer Hybrid Encoder: Rather than tokenizing raw image patches directly, a CNN first extracts feature maps, and the transformer then encodes patches of those feature maps, keeping intermediate high-resolution CNN features available for the decoder.
- Cascaded Upsampler (CUP) with Skip-Connections: A cascaded upsampler decodes the transformer output while integrating the intermediate high-resolution CNN features via skip-connections, enabling detailed, full-resolution segmentation (sketched in the code example after this list).
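The following PyTorch sketch illustrates how these three pieces fit together. It is a minimal, assumed implementation for intuition only: the toy CNN backbone, layer widths, downsampling scales, and class names (ConvBlock, TransUNetSketch) are illustrative choices, not the authors' reference model, which uses a ResNet-50 + ViT hybrid encoder and adds learnable positional embeddings to the patch tokens.

```python
# Illustrative TransUNet-style model: CNN features -> patch tokens -> transformer
# -> cascaded upsampling with CNN skip connections. Sizes and names are assumptions.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Two 3x3 convs; the assumed building block of the toy CNN encoder/decoder."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class TransUNetSketch(nn.Module):
    def __init__(self, in_ch=1, num_classes=9, widths=(64, 128, 256),
                 d_model=256, depth=4, heads=8):
        super().__init__()
        # --- CNN part of the hybrid encoder: keeps high-resolution features for skips ---
        self.enc1 = ConvBlock(in_ch, widths[0])        # full resolution
        self.enc2 = ConvBlock(widths[0], widths[1])    # 1/2
        self.enc3 = ConvBlock(widths[1], widths[2])    # 1/4
        self.pool = nn.MaxPool2d(2)

        # --- Transformer part: tokenize the deepest CNN feature map into patches ---
        # (the real model also adds learnable positional embeddings; omitted for brevity)
        self.patch_embed = nn.Conv2d(widths[2], d_model, kernel_size=2, stride=2)  # 1/8 tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

        # --- Cascaded upsampler (CUP-style): upsample and concatenate CNN skip features ---
        self.up3 = nn.ConvTranspose2d(d_model, widths[2], 2, stride=2)
        self.dec3 = ConvBlock(widths[2] + widths[2], widths[2])
        self.up2 = nn.ConvTranspose2d(widths[2], widths[1], 2, stride=2)
        self.dec2 = ConvBlock(widths[1] + widths[1], widths[1])
        self.up1 = nn.ConvTranspose2d(widths[1], widths[0], 2, stride=2)
        self.dec1 = ConvBlock(widths[0] + widths[0], widths[0])
        self.head = nn.Conv2d(widths[0], num_classes, 1)

    def forward(self, x):
        s1 = self.enc1(x)                      # skip at full resolution
        s2 = self.enc2(self.pool(s1))          # skip at 1/2
        s3 = self.enc3(self.pool(s2))          # skip at 1/4

        tokens = self.patch_embed(s3)          # (B, d_model, H/8, W/8)
        b, c, h, w = tokens.shape
        seq = tokens.flatten(2).transpose(1, 2)         # (B, H/8 * W/8, d_model)
        seq = self.transformer(seq)                     # global self-attention over patches
        feat = seq.transpose(1, 2).reshape(b, c, h, w)  # back to a 2-D feature map

        d3 = self.dec3(torch.cat([self.up3(feat), s3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), s2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), s1], dim=1))
        return self.head(d1)                   # per-pixel class logits


if __name__ == "__main__":
    model = TransUNetSketch()
    logits = model(torch.randn(1, 1, 224, 224))
    print(logits.shape)  # torch.Size([1, 9, 224, 224])
```

The key design point the sketch preserves is the division of labor: the CNN branch supplies high-resolution skip features, the transformer supplies global context over the tokenized low-resolution feature map, and the cascaded upsampler fuses the two on the way back to full resolution.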
Performance and Results
TransUNet was evaluated on multiple medical segmentation datasets, including the Synapse multi-organ segmentation dataset and the ACDC cardiac segmentation dataset. The paper reports notable improvements over existing methods, validated by detailed experiments. Key findings include:
- Synapse Dataset:
  - TransUNet significantly outperformed models such as V-Net, DARR, and variants of U-Net and AttnUNet.
  - The model achieved an average DSC of 77.48% and an average Hausdorff Distance of 31.69 mm across the evaluated organs (the DSC metric is defined in the sketch after this list).
- ACDC Dataset:
  - Consistent improvements were observed, with TransUNet achieving an average DSC of 89.71%.
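For completeness, the Dice similarity coefficient (DSC) reported above is the standard overlap measure between a predicted mask and the reference annotation, DSC = 2|A ∩ B| / (|A| + |B|), averaged over organs and cases in the paper. The small sketch below is just this textbook definition; the function name and smoothing term are illustrative, not taken from the TransUNet codebase:

```python
# Standard Dice similarity coefficient between two binary masks.
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    # DSC = 2|A ∩ B| / (|A| + |B|); eps avoids division by zero on empty masks
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))

# Example with two toy masks: 2 overlapping pixels, 3 pixels in each mask
pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
print(dice_coefficient(pred, target))  # 2*2 / (3 + 3) ≈ 0.667
```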
The analysis underscores the efficacy of the hybrid approach: combining transformers for global context extraction with CNNs' retention of high-resolution features yields superior segmentation performance.
Implications and Future Directions
TransUNet's architecture showcases the potential of transformers in medical segmentation tasks. The method capitalizes on transformers' ability to model global context while mitigating their weakness in preserving fine, low-level detail through hybrid integration with CNNs. The framework opens avenues for further research into hybrid architectures that can be extended to other imaging modalities and diseases, potentially streamlining and improving diagnostic and treatment workflows in clinical settings.
Future developments may involve exploring more sophisticated skip-connection mechanisms or investigating the scalability of the model to even higher input resolutions and larger datasets. Additionally, integrating transformers more deeply into various stages of the architectural pipeline could yield further improvements.
In conclusion, TransUNet stands as a robust alternative for medical image segmentation, setting a new benchmark by combining the strengths of CNN and transformer architectures. By effectively leveraging both global and local contextual features, TransUNet moves the field toward more accurate and reliable segmentation outcomes.