LeViT-UNet: Enhancing Medical Image Segmentation with a Transformer-Based Architecture
The paper "LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation" introduces LeViT-UNet, a novel architecture that incorporates LeViT Transformers into the traditional U-Net framework for medical image segmentation tasks. The work addresses a well-known limitation of CNNs: the locality of convolution operations restricts their ability to capture global context. Existing encoder-decoder architectures like U-Net, while effective, struggle to model long-range spatial dependencies. Transformers, in contrast, excel at long-range context modeling, but typically at the cost of increased computational complexity.
LeViT-UNet innovatively integrates LeViT, a hybrid model that combines convolutional and Transformer blocks, as the encoder within a U-Net configuration, optimizing the accuracy-efficiency trade-off pivotal for medical image segmentation. This design utilizes multi-scale feature maps from both transformer and convolutional blocks, linking them to the decoders through skip-connections, thereby enhancing both spatial comprehension and segmentation performance.
Key Findings
The authors conducted extensive experiments on medical image segmentation benchmarks, specifically Synapse and ACDC datasets. Results revealed that LeViT-UNet-384 outperforms competing methods, achieving the highest accuracy with a Dice Similarity Coefficient (DSC) of 78.53% and a Hausdorff Distance (HD) of 16.84 mm on the Synapse dataset. These metrics reflect a combination of improved boundary delineation and spatial accuracy over previous state-of-the-art methods, both CNN and Transformer-based, including TransUNet and Swin-UNet.
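The Dice Similarity Coefficient reported above measures the overlap between a predicted mask and the ground-truth mask, ranging from 0 (no overlap) to 1 (perfect agreement). A minimal sketch of the standard binary-mask formulation (not code from the paper) in NumPy:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-6) -> float:
    """Dice Similarity Coefficient between two binary masks.

    DSC = 2 * |pred AND target| / (|pred| + |target|); eps avoids 0/0
    when both masks are empty.
    """
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Toy example: a 4-pixel prediction against a 6-pixel ground truth,
# overlapping in 4 pixels -> DSC = 2*4 / (4 + 6) = 0.8.
pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:3] = True
target = np.zeros((4, 4), dtype=bool); target[1:3, 1:4] = True
print(round(dice_coefficient(pred, target), 3))  # 0.8
```

The Hausdorff Distance complements DSC by measuring the worst-case distance between predicted and true boundaries, which is why the paper reports both: DSC captures region overlap while HD penalizes stray boundary errors.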
Furthermore, LeViT-UNet delivers fast and accurate segmentation while remaining computationally efficient. It outperforms existing models at predicting organ boundaries, producing smoother and more precise segmentations, and LeViT-UNet-384 substantially reduces the HD compared with its CNN counterparts.
Architectural Innovations
The LeViT-UNet architecture combines the benefits of CNNs and Transformers within a U-shaped design, optimizing for both global and local feature representation. The LeViT encoder, adapted here, comprises convolutional layers followed by transformer blocks, thereby allowing a significant reduction in floating-point operations (FLOPs). The novel feature map fusion strategy ensures the effective integration of spatial information across various resolution scales via skip-connections.
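The wiring described above (convolutional downsampling, transformer blocks for global context, and skip-connections carrying multi-scale features into an upsampling decoder) can be illustrated with a minimal PyTorch sketch. This is a simplified illustration of the U-shaped hybrid pattern, not the authors' implementation: the module names, channel widths, and single transformer stage are assumptions for brevity.

```python
import torch
import torch.nn as nn

def conv_bn_relu(c_in: int, c_out: int, stride: int = 1) -> nn.Sequential:
    """3x3 conv + BatchNorm + ReLU; stride=2 downsamples."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class TransformerStage(nn.Module):
    """Flattens the feature map into tokens, applies self-attention
    for global context, then restores the 2D spatial layout."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        tokens = self.attn(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class LeViTUNetSketch(nn.Module):
    """U-shaped hybrid: conv stem downsamples, a transformer stage adds
    global context, and the decoder fuses encoder features at each
    resolution via concatenation skip-connections."""
    def __init__(self, in_ch: int = 1, base: int = 16, n_classes: int = 2):
        super().__init__()
        self.enc1 = conv_bn_relu(in_ch, base, stride=2)       # H/2
        self.enc2 = conv_bn_relu(base, base * 2, stride=2)    # H/4
        self.enc3 = conv_bn_relu(base * 2, base * 4, stride=2)  # H/8
        self.trans = TransformerStage(base * 4)               # global context
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = conv_bn_relu(base * 4, base * 2)          # cat with enc2
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_bn_relu(base * 2, base)              # cat with enc1
        self.up0 = nn.ConvTranspose2d(base, base, 2, stride=2)
        self.head = nn.Conv2d(base, n_classes, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s1 = self.enc1(x)                                     # (B, b,  H/2)
        s2 = self.enc2(s1)                                    # (B, 2b, H/4)
        s3 = self.enc3(s2)                                    # (B, 4b, H/8)
        g = self.trans(s3)                                    # same shape as s3
        d2 = self.dec2(torch.cat([self.up2(g), s2], dim=1))   # skip from enc2
        d1 = self.dec1(torch.cat([self.up1(d2), s1], dim=1))  # skip from enc1
        return self.head(self.up0(d1))                        # full resolution

# Shape check: a 64x64 single-channel input yields per-pixel class logits.
model = LeViTUNetSketch().eval()
with torch.no_grad():
    out = model(torch.randn(2, 1, 64, 64))
print(out.shape)  # torch.Size([2, 2, 64, 64])
```

Placing the transformer stage only at the lowest resolution is what keeps the FLOP count down: self-attention cost grows quadratically in the number of tokens, so applying it after aggressive convolutional downsampling captures global context cheaply while the skip-connections preserve the fine spatial detail that segmentation needs.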
Implications and Future Directions
LeViT-UNet sets a new benchmark in the field of medical image segmentation, promising a balance of high accuracy and real-time processing capabilities crucial for clinical applications. It opens avenues for further exploration of fast and accurate architectures incorporating Transformers, particularly in 3D medical imaging contexts.
Future work may focus on refining multi-scale feature fusion strategies and investigating more efficient architectural designs that sustain the trade-off between speed and precision. There is also significant potential in extending LeViT-UNet to other domain-specific segmentation challenges that involve complex image characteristics and segmentation variability.
The contribution of LeViT-UNet reflects a promising stride towards utilizing advanced Transformer-based models in real-world medical deployments, with the potential for significant improvements in automated diagnostic systems.