TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation
Overview
The paper "TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation" proposes a novel architecture, TransFuse, which aims to optimize the task of medical image segmentation by combining two different types of neural network architectures: Convolutional Neural Networks (CNNs) and Transformers. This hybrid model is designed to leverage the strengths of both CNNs and Transformers by integrating them in a parallel-in-branch fashion, thereby addressing the limitations inherent to each architecture when used independently.
Introduction
CNNs have achieved considerable success in medical image segmentation owing to their ability to extract hierarchical, task-specific features. However, they often struggle to capture global context: convolution operations are inherently local, and aggressive downsampling can discard localized details and reduce feature reuse. Transformers, originally developed for natural language processing, excel at capturing long-range dependencies via self-attention. Despite this strength in modeling global context, Transformers alone are less effective at recovering the fine-grained details essential to dense prediction tasks such as image segmentation.
Proposed Methodology
TransFuse employs a dual-branch architecture in which a CNN branch and a Transformer branch run in parallel. The CNN branch extracts local features through progressive downsampling, while the Transformer branch models global context over the whole image. The information from the two branches is integrated by a novel BiFusion module, which combines self-attention and the Hadamard product to selectively fuse multi-level features.
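As a rough illustration of this parallel-in-branch design, the sketch below (hypothetical module and parameter names, PyTorch-style) shows how two such branches could be wired together and fused at multiple scales; it is a minimal sketch under these assumptions, not the authors' implementation.

```python
# High-level sketch of a parallel-in-branch segmentation model (hypothetical names).
# Both branches see the same image; their multi-scale outputs are fused by
# BiFusion-style modules and the fused maps are combined into the prediction.
import torch.nn as nn

class TransFuseLike(nn.Module):
    def __init__(self, cnn_branch, transformer_branch, bifusion_modules, head):
        super().__init__()
        self.cnn = cnn_branch                      # returns a list of multi-scale feature maps
        self.vit = transformer_branch              # returns a matching list of feature maps
        self.fuse = nn.ModuleList(bifusion_modules)
        self.head = head                           # turns fused maps into segmentation logits

    def forward(self, x):
        cnn_feats = self.cnn(x)                    # local features, fine spatial detail
        vit_feats = self.vit(x)                    # global context from self-attention
        fused = [f(t, c) for f, t, c in zip(self.fuse, vit_feats, cnn_feats)]
        return self.head(fused)                    # e.g. gated skip connections + upsampling
```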
Transformer Branch
The Transformer branch follows an encoder-decoder design. The input image is divided into patches, which are flattened and passed through a linear embedding layer. The resulting token sequence is processed by multiple Transformer layers, each consisting of Multi-Head Self-Attention (MSA) and a Multi-Layer Perceptron (MLP) with layer normalization. For decoding, the progressive upsampling (PUP) method converts the encoded sequence back into a higher-resolution spatial map.
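A minimal sketch of such a Transformer branch is given below, assuming a ViT-style patch embedding and a PUP-style decoder; the sizes, depths, and module choices are illustrative assumptions, not the paper's configuration (requires a recent PyTorch for the pre-norm encoder layer).

```python
# Minimal sketch of a Transformer branch (hypothetical sizes; PyTorch).
# Patches are linearly embedded, processed by Transformer layers (MSA + MLP
# with layer normalization), then decoded back to a spatial map by
# progressive upsampling.
import torch
import torch.nn as nn

class TransformerBranch(nn.Module):
    def __init__(self, img_size=224, patch_size=16, embed_dim=384, depth=8, heads=6):
        super().__init__()
        self.grid = img_size // patch_size                      # e.g. 14 x 14 patches
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.grid ** 2, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)                  # LayerNorm before MSA / MLP
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Progressive upsampling: interleave 2x upsampling with convolutions.
        self.decode = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(embed_dim, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):                                       # x: (B, 3, H, W)
        t = self.patch_embed(x)                                 # (B, D, H/16, W/16)
        t = t.flatten(2).transpose(1, 2) + self.pos_embed       # (B, N, D) token sequence
        t = self.encoder(t)                                     # MSA + MLP blocks
        t = t.transpose(1, 2).reshape(x.size(0), -1, self.grid, self.grid)
        return self.decode(t)                                   # upsampled spatial feature map
```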
CNN Branch
The shallow CNN branch captures low-level features and gradually encodes information from local to global as its receptive field grows. The outputs of its intermediate convolutional stages are forwarded for fusion with the Transformer features at corresponding scales.
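The sketch below illustrates a shallow CNN branch of this kind; the three-stage layout and channel counts are assumptions for illustration (the paper builds its branch on a standard CNN backbone), and only the idea of returning intermediate feature maps for fusion is the point.

```python
# Minimal sketch of a shallow CNN branch (hypothetical; PyTorch).
# Three convolutional stages progressively downsample the image; the
# intermediate outputs are kept so they can be fused with Transformer
# features at matching resolutions.
import torch.nn as nn

def conv_stage(in_ch, out_ch):
    # Conv -> BN -> ReLU with stride-2 downsampling.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

class CNNBranch(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = conv_stage(3, 64)      # 1/2 resolution, low-level detail
        self.stage2 = conv_stage(64, 128)    # 1/4 resolution
        self.stage3 = conv_stage(128, 256)   # 1/8 resolution, larger receptive field

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        return [f1, f2, f3]                  # multi-scale features passed to fusion
```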
BiFusion Module
The BiFusion module combines channel attention, spatial attention, and a bilinear Hadamard product to exploit the complementary characteristics of the two branches. The attended CNN and Transformer features, together with their interaction term, are merged into fused features at each scale, which are then combined across scales through gated skip connections to generate the final segmentation map.
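A BiFusion-style block could be sketched as follows; the SE-style channel attention, the 7x7 spatial-attention convolution, and the channel widths are assumptions chosen for illustration, not the paper's exact design.

```python
# Minimal sketch of a BiFusion-style block (hypothetical; PyTorch).
# Channel attention re-weights the Transformer features, spatial attention
# re-weights the CNN features, a Hadamard (element-wise) product models their
# bilinear interaction, and the three results are concatenated and projected.
import torch
import torch.nn as nn

class BiFusion(nn.Module):
    def __init__(self, ch_t, ch_c, ch_out, reduction=4):
        super().__init__()
        # Channel attention on Transformer features (squeeze-and-excitation style).
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch_t, ch_t // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch_t // reduction, ch_t, 1), nn.Sigmoid())
        # Spatial attention on CNN features.
        self.spatial = nn.Sequential(nn.Conv2d(ch_c, 1, kernel_size=7, padding=3), nn.Sigmoid())
        # Project both inputs to a common width before the Hadamard product.
        self.proj_t = nn.Conv2d(ch_t, ch_out, 1)
        self.proj_c = nn.Conv2d(ch_c, ch_out, 1)
        # Fuse the attended features and the bilinear interaction term.
        self.fuse = nn.Sequential(
            nn.Conv2d(ch_t + ch_c + ch_out, ch_out, 3, padding=1),
            nn.BatchNorm2d(ch_out), nn.ReLU(inplace=True))

    def forward(self, t, c):                      # t: Transformer feats, c: CNN feats (same H, W)
        t_att = t * self.se(t)                    # channel-attended Transformer features
        c_att = c * self.spatial(c)               # spatially-attended CNN features
        b = self.proj_t(t) * self.proj_c(c)       # Hadamard product (bilinear interaction)
        return self.fuse(torch.cat([t_att, c_att, b], dim=1))
```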
Experimental Results
The efficacy of TransFuse was demonstrated on several medical image segmentation tasks, including polyp, skin lesion, hip, and prostate segmentation. TransFuse achieved superior performance, as evidenced by significant improvements in key metrics such as the mean Dice coefficient (mDice) and mean Intersection-over-Union (mIoU).
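For reference, both headline metrics can be computed per image from binary masks as in the sketch below (a hypothetical helper, not the paper's evaluation code); mDice and mIoU are then the averages of these per-image scores over a dataset.

```python
# Minimal sketch of the Dice and IoU metrics on binary segmentation masks.
import numpy as np

def dice_and_iou(pred, target, eps=1e-7):
    """pred, target: boolean or {0, 1} arrays of the same shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)  # Dice coefficient
    iou = (inter + eps) / (union + eps)                             # Intersection-over-Union
    return dice, iou
```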
Polyp Segmentation
TransFuse outperformed state-of-the-art models such as HarDNet-MSEG and PraNet, achieving a higher mDice score. The smaller variant, TransFuse-S, improved both computational efficiency and segmentation accuracy, running at 98.7 FPS with only 26.3M parameters.
Skin Lesion Segmentation
The model also performed strongly on the ISIC 2017 dataset, achieving a higher Jaccard index than leading methods such as SLSDeep, underscoring its ability to generalize to different segmentation tasks.
Hip Segmentation
On the hip segmentation task, TransFuse-S reduced the Hausdorff Distance (HD) and Average Surface Distance (ASD) substantially, demonstrating its proficiency in capturing precise spatial details essential for clinical applications.
Prostate Segmentation
In 3D prostate MRI segmentation, TransFuse outperformed the nnU-Net framework, revealing the potential of the parallel-in-branch design for volumetric medical data.
Conclusion and Future Work
By integrating CNNs and Transformers, the TransFuse architecture provides a robust solution for medical image segmentation, balancing the capture of global dependencies with the preservation of localized details. The parallel-in-branch design, coupled with the BiFusion module, offers a new design paradigm for medical imaging models and improves both segmentation accuracy and computational efficiency.
Future research may explore enhancing the efficiency of Transformer layers further and applying the TransFuse architecture to other medical tasks, including landmark detection and disease classification.
This work marks a step towards more adaptable, accurate, and efficient AI models for medical imaging, offering a promising avenue for subsequent advancements in the field.