Swin-Unet: A Transformer-Based Approach for Medical Image Segmentation
The research paper "Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation" by Hu Cao et al. presents a compelling approach that leverages Transformer models for medical image segmentation. The proposed method, Swin-Unet, integrates a pure Transformer architecture into the U-shaped encoder-decoder framework traditionally dominated by Convolutional Neural Networks (CNNs).
The primary limitation of CNNs in medical image segmentation is their local receptive field, which impedes the ability to capture global and long-range semantic dependencies. This paper addresses this limitation by replacing the convolutional components with Swin Transformer blocks, capable of modeling both local and global contexts effectively.
Architectural Overview
Swin-Unet employs a hierarchical Swin Transformer with shifted windows as its principal encoding mechanism. The architecture maintains the U-shaped configuration with encoder, bottleneck, decoder, and skip connections:
- Encoder and Bottleneck: The encoder consists of multiple Swin Transformer blocks interspersed with patch merging layers, which downsample the spatial resolution while increasing the feature dimension. The resulting deep features are then processed by the bottleneck, constructed from additional Swin Transformer blocks.
- Decoder: The decoder mirrors the encoder but employs patch expanding layers for upsampling the features and restoring spatial resolution.
- Skip Connections: Similar to the traditional U-Net, multi-scale features from the encoder are concatenated with the decoder features to mitigate the loss of spatial resolution during downsampling.
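To make the downsampling step concrete, the patch merging operation can be sketched as follows. This is a minimal NumPy illustration of the mechanism, not the authors' implementation (the real Swin Transformer layer also applies layer normalization, and the projection weights are learned); `patch_merge` and its argument names are hypothetical:

```python
import numpy as np

def patch_merge(x, w):
    """Downsample an (H, W, C) feature map to (H/2, W/2, 2C).

    x: feature map with even H and W; w: (4C, 2C) linear projection.
    LayerNorm from the real Swin Transformer layer is omitted here.
    """
    # group each 2x2 spatial neighborhood into the channel dimension
    x0 = x[0::2, 0::2, :]  # top-left of each 2x2 block
    x1 = x[1::2, 0::2, :]  # bottom-left
    x2 = x[0::2, 1::2, :]  # top-right
    x3 = x[1::2, 1::2, :]  # bottom-right
    merged = np.concatenate([x0, x1, x2, x3], axis=-1)  # (H/2, W/2, 4C)
    return merged @ w  # linear projection down to 2C channels
```

The net effect mirrors strided-convolution downsampling in a CNN encoder: spatial resolution is halved while the channel dimension doubles.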
Methodological Contributions
- Pure Transformer Framework: Swin-Unet eschews convolutions entirely, utilizing Swin Transformer blocks for both encoding and decoding. By doing so, it successfully captures local to global semantic representations, addressing the intrinsic locality shortfall of CNNs.
- Patch Expanding Layer: This novel upsampling mechanism avoids the transposed convolutions or interpolation typically used in other architectures; instead, a linear layer expands the feature dimension and a rearrangement operation trades channels for spatial resolution.
- Skip Connection Efficacy: The paper validates the effectiveness of skip connections in a Transformer-based model, ensuring that spatial information is preserved effectively.
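The patch expanding mechanism can be sketched in the same spirit: a linear layer doubles the channel dimension, then a pixel-shuffle-style rearrangement converts channels into a 2x larger spatial grid. Again a hypothetical NumPy sketch of the idea, not the paper's code:

```python
import numpy as np

def patch_expand(x, w):
    """Upsample an (H, W, C) feature map to (2H, 2W, C/2).

    x: feature map; w: (C, 2C) linear expansion weights.
    """
    H, W, C = x.shape
    x = x @ w  # expand features: (H, W, 2C)
    # split the 2C channels into a 2x2 spatial block of C/2 channels each
    x = x.reshape(H, W, 2, 2, C // 2)
    # interleave the blocks back into a grid twice the original resolution
    x = x.transpose(0, 2, 1, 3, 4).reshape(2 * H, 2 * W, C // 2)
    return x
```

This inverts the encoder's patch merging: resolution doubles while the channel dimension halves, without any convolution or interpolation.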
Empirical Results
The performance of Swin-Unet is evaluated on two datasets: the Synapse multi-organ segmentation dataset and the ACDC automated cardiac diagnosis challenge dataset. The model achieves a Dice Similarity Coefficient (DSC) of 79.13% on the Synapse dataset and 90.00% on the ACDC dataset, outperforming existing methods, including CNN-based U-Net variants and hybrid architectures combining CNNs with Transformers.
Implications and Future Work
The success demonstrated by Swin-Unet has substantial implications for medical image segmentation. Primarily, it indicates that Transformers, with their capacity for modeling long-range dependencies, can outperform traditional convolutions even in the dense pixel-level prediction tasks characteristic of medical imaging.
Practical Implications
- Improved Accuracy: The model's ability to capture global context leads to more accurate segmentations, crucial for applications like computer-aided diagnosis and image-guided surgery.
- Edge Prediction: A lower Hausdorff Distance (HD) reflects more accurate boundary predictions, which is important for delineating organ contours in medical images.
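The HD between two boundary point sets is the largest distance from a point in one set to its nearest point in the other, taken symmetrically over both directions. A brute-force NumPy sketch (illustrative only; `hausdorff_distance` is a hypothetical helper):

```python
import numpy as np

def hausdorff_distance(a, b):
    """Symmetric Hausdorff Distance between point sets a (N, D) and b (M, D)."""
    # pairwise Euclidean distances between all boundary points
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # worst-case nearest-neighbor distance, taken in both directions
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

In practice, libraries such as SciPy provide `scipy.spatial.distance.directed_hausdorff`, and segmentation papers often report the 95th-percentile variant (HD95) to reduce sensitivity to outlier points.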
Future Directions
- Transformer Pre-training: Exploring pre-training strategies specific to medical imaging can potentially improve performance further by leveraging domain-specific characteristics.
- 3D Medical Imaging: The current implementation deals with 2D images; however, application to 3D volumetric data remains an intriguing and practical extension, given the predominance of 3D data in clinical practice.
Conclusions
Swin-Unet's architecture establishes a benchmark for utilizing pure Transformer models in medical image segmentation. Through extensive experimentation and strategic architectural choices, it sets a path forward for integrating global context-awareness into medical image analysis tasks. The paper opens avenues for future research in pre-training Transformers for medical imaging and expanding the methodology to 3D data, furthering the integration of advanced machine learning techniques in clinical diagnostics.