
DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation (2106.06716v1)

Published 12 Jun 2021 in cs.CV

Abstract: Automatic medical image segmentation has made great progress, benefiting from the development of deep learning. However, most existing methods are based on convolutional neural networks (CNNs), which fail to build long-range dependencies and global context connections due to the limited receptive field of the convolution operation. Inspired by the success of the Transformer in modeling long-range contextual information, some researchers have expended considerable effort in designing robust variants of Transformer-based U-Net. Moreover, the patch division used in vision transformers usually ignores the pixel-level intrinsic structural features inside each patch. To alleviate these problems, we propose a novel deep medical image segmentation framework called Dual Swin Transformer U-Net (DS-TransUNet), which may be the first attempt to concurrently incorporate the advantages of the hierarchical Swin Transformer into both the encoder and decoder of the standard U-shaped architecture, enhancing semantic segmentation quality across varied medical images. Unlike many prior Transformer-based solutions, the proposed DS-TransUNet adopts dual-scale encoder subnetworks based on the Swin Transformer to extract coarse- and fine-grained feature representations at different semantic scales. As the core component of DS-TransUNet, a well-designed Transformer Interactive Fusion (TIF) module is proposed to effectively establish global dependencies between features of different scales through the self-attention mechanism. Furthermore, we also introduce the Swin Transformer block into the decoder to further explore long-range contextual information during the up-sampling process. Extensive experiments across four typical medical image segmentation tasks demonstrate the effectiveness of DS-TransUNet and show that our approach significantly outperforms state-of-the-art methods.

Authors (5)
  1. Ailiang Lin (1 paper)
  2. Bingzhi Chen (5 papers)
  3. Jiayu Xu (4 papers)
  4. Zheng Zhang (488 papers)
  5. Guangming Lu (49 papers)
Citations (500)

Summary

DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation

This paper introduces DS-TransUNet, an encoder-decoder architecture that leverages dual Swin Transformers to improve medical image segmentation. The primary aim is to address the limitation of convolutional neural networks (CNNs) in capturing long-range dependencies by incorporating advances from Transformers, in particular the Swin Transformer. A sketch of the overall data flow appears below.
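
As a rough orientation, the following minimal PyTorch-style sketch shows the data flow implied by the summary: two parallel encoders at different patch scales, a fusion step, and a decoder. All module names here are hypothetical placeholders rather than the paper's actual classes; in DS-TransUNet the encoders are hierarchical Swin Transformers and the fusion is the TIF module.

```python
import torch.nn as nn

class DualScaleSegmenter(nn.Module):
    """Illustrative skeleton of the dual-branch design: two parallel
    encoders at different patch scales, a fusion step, and a decoder.
    All sub-modules are placeholders supplied by the caller."""

    def __init__(self, coarse_encoder, fine_encoder, fuse, decoder):
        super().__init__()
        self.coarse_encoder = coarse_encoder  # larger patches -> global context
        self.fine_encoder = fine_encoder      # smaller patches -> fine detail
        self.fuse = fuse                      # cross-scale feature fusion
        self.decoder = decoder                # upsampling path to pixel logits

    def forward(self, image):
        coarse = self.coarse_encoder(image)
        fine = self.fine_encoder(image)
        fused = self.fuse(coarse, fine)       # interaction between the scales
        return self.decoder(fused)            # per-pixel segmentation map
```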

Key Contributions

  1. Integration of the Swin Transformer: The framework uses Swin Transformer blocks in both the encoder and the decoder. The Swin Transformer, noted for its hierarchical architecture and efficient computation through window-based multi-head self-attention, provides a robust way to embed long-range contextual information into the segmentation process (a minimal window-partitioning sketch follows this list).
  2. Dual-Scale Encoder: A dual-scale encoder lets the model capture multi-scale feature representations, extracting both coarse and fine-grained semantic information and thereby improving segmentation quality across varied medical images.
  3. Transformer Interactive Fusion (TIF) Module: The TIF module establishes global dependencies between features at different semantic scales. It uses the self-attention mechanism to fuse multi-scale features effectively, yielding coherent semantic representations (a hedged fusion sketch also appears after this list).
  4. Improved Performance on Segmentation Tasks: Empirical evaluation across multiple datasets, including polyp segmentation benchmarks, ISIC 2018, GlaS, and the 2018 Data Science Bowl, demonstrates the model's advantage over existing state-of-the-art architectures. DS-TransUNet shows marked improvements particularly on polyp segmentation, underscoring the effectiveness of integrating Swin Transformers.
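
To make the window-based attention in item 1 concrete, here is a minimal sketch of Swin-style window partitioning in PyTorch. The function reshapes a feature map into per-window token sequences; each window is then processed by ordinary multi-head self-attention, which keeps the cost linear in image size rather than quadratic. This is a generic illustration of the mechanism, not code from the paper.

```python
import torch

def window_partition(x, window_size):
    """Split a feature map into non-overlapping windows so self-attention
    can be computed locally within each window (as in Swin Transformer).
    x: (B, H, W, C); H and W are assumed divisible by window_size."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # -> (num_windows * B, window_size * window_size, C) token sequences
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

x = torch.randn(1, 8, 8, 96)      # toy feature map
windows = window_partition(x, 4)  # -> (4, 16, 96): 4 windows of 16 tokens each
```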

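For item 3, the summary only states that TIF fuses the two scales via self-attention. One plausible reading is cross-attention in which fine-scale tokens attend to coarse-scale tokens; the sketch below illustrates that idea with standard PyTorch modules and should not be taken as the paper's exact TIF design.

```python
import torch
import torch.nn as nn

class CrossScaleFusion(nn.Module):
    """Illustrative stand-in for the TIF idea: tokens from the fine branch
    attend to tokens from the coarse branch, so each fine token is
    re-weighted by global context from the other scale."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fine_tokens, coarse_tokens):
        # Queries from the fine branch; keys/values from the coarse branch.
        fused, _ = self.attn(fine_tokens, coarse_tokens, coarse_tokens)
        return self.norm(fine_tokens + fused)  # residual connection + norm

fine = torch.randn(1, 64, 96)    # 64 fine-scale tokens, dim 96
coarse = torch.randn(1, 16, 96)  # 16 coarse-scale tokens, same dim
out = CrossScaleFusion(96)(fine, coarse)  # -> (1, 64, 96)
```
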
Numerical Results

The experiments report consistent gains in mDice and mIoU over state-of-the-art methods. Notably, DS-TransUNet also improves segmentation accuracy on unseen datasets, highlighting its generalization capability.
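
For reference, mDice and mIoU are the Dice coefficient and intersection-over-union averaged over the test images. A minimal NumPy sketch for binary masks (illustrative only, not the paper's evaluation code):

```python
import numpy as np

def dice_and_iou(pred, target, eps=1e-7):
    """Binary-mask Dice and IoU; pred and target are boolean arrays.
    Dice = 2|A∩B| / (|A| + |B|),  IoU = |A∩B| / |A∪B|."""
    inter = np.logical_and(pred, target).sum()
    dice = (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    iou = (inter + eps) / (np.logical_or(pred, target).sum() + eps)
    return dice, iou

# mDice / mIoU average these scores over all test images.
pred = np.array([[1, 1], [0, 0]], dtype=bool)
target = np.array([[1, 0], [0, 0]], dtype=bool)
print(dice_and_iou(pred, target))  # (~0.667, ~0.5)
```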

Theoretical and Practical Implications

Theoretically, the paper suggests that self-attention in both the encoder and decoder stages models long-range dependencies more effectively than stacked convolutions. Practically, this could inform the design of future segmentation models, particularly in domains requiring high precision, such as medical diagnostics.

Future Directions

Potential future research includes lightweight Transformer-based models that preserve segmentation performance while reducing computational cost. Recovering the pixel-level structural features lost during patch division also remains an area for further improvement.

In summary, DS-TransUNet is a notable contribution to medical image segmentation, demonstrating that Transformer architectures can substantially improve representational capacity and segmentation accuracy over conventional CNN-based approaches.