
TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation (2102.08005v2)

Published 16 Feb 2021 in cs.CV and cs.AI

Abstract: Medical image segmentation - the prerequisite of numerous clinical needs - has been significantly prospered by recent advances in convolutional neural networks (CNNs). However, it exhibits general limitations on modeling explicit long-range relation, and existing cures, resorting to building deep encoders along with aggressive downsampling operations, leads to redundant deepened networks and loss of localized details. Hence, the segmentation task awaits a better solution to improve the efficiency of modeling global contexts while maintaining a strong grasp of low-level details. In this paper, we propose a novel parallel-in-branch architecture, TransFuse, to address this challenge. TransFuse combines Transformers and CNNs in a parallel style, where both global dependency and low-level spatial details can be efficiently captured in a much shallower manner. Besides, a novel fusion technique - BiFusion module is created to efficiently fuse the multi-level features from both branches. Extensive experiments demonstrate that TransFuse achieves the newest state-of-the-art results on both 2D and 3D medical image sets including polyp, skin lesion, hip, and prostate segmentation, with significant parameter decrease and inference speed improvement.

TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation

Overview

The paper "TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation" proposes a novel architecture, TransFuse, which aims to optimize the task of medical image segmentation by combining two different types of neural network architectures: Convolutional Neural Networks (CNNs) and Transformers. This hybrid model is designed to leverage the strengths of both CNNs and Transformers by integrating them in a parallel-in-branch fashion, thereby addressing the limitations inherent to each architecture when used independently.

Introduction

CNNs have achieved considerable success in medical image segmentation tasks owing to their ability to extract hierarchical task-specific features. However, these networks often struggle with capturing global context due to the inherently local nature of convolution operations and aggressive downsampling strategies, which can lead to a loss of localized details and diminished feature reuse. Transformers, originally developed for NLP tasks, excel at capturing long-range dependencies via self-attention mechanisms. Despite their advantages in modeling global context, Transformers alone are less effective in capturing fine-grained details essential to dense prediction tasks like image segmentation.

Proposed Methodology

TransFuse employs a dual-branch architecture in which a CNN branch and a Transformer branch run in parallel. The CNN branch focuses on local feature extraction through progressive downsampling, while the Transformer branch models global context across the entire image via self-attention. The information from both branches is integrated by a novel BiFusion module, which combines attention mechanisms and the Hadamard product to selectively fuse multi-level features.

Transformer Branch

The Transformer branch follows a typical encoder-decoder design. The input image is divided into patches, which are flattened and passed through a linear embedding layer. The resulting token sequence is processed by multiple layers of Multi-Head Self-Attention (MSA) and Multi-Layer Perceptrons (MLPs), with layer normalization applied in each block. For decoding, the progressive upsampling (PUP) method converts the encoded sequence back into higher-resolution spatial feature maps.
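The branch described above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the paper's implementation: the patch size, embedding dimension, depth, and the two-stage PUP decoder are assumed hyperparameters chosen for clarity.

```python
import torch
import torch.nn as nn

class TransformerBranch(nn.Module):
    """Illustrative Transformer branch: patch embedding, self-attention
    encoding, and progressive upsampling (PUP) decoding.
    All sizes are assumptions, not the paper's exact configuration."""
    def __init__(self, img_size=224, patch_size=16, dim=384, depth=4, heads=6):
        super().__init__()
        self.grid = img_size // patch_size
        # Linear patch embedding, implemented as a strided convolution
        self.embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
        # Pre-norm blocks of MSA + MLP with layer normalization
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # PUP decoding: alternate 2x bilinear upsampling and 3x3 convs
        self.pup = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(dim, dim // 2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(dim // 2, dim // 4, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = self.encoder(tokens + self.pos)
        # Fold the token sequence back into a 2D feature map
        feat = tokens.transpose(1, 2).reshape(x.size(0), -1, self.grid, self.grid)
        return self.pup(feat)
```

Each 2x upsampling stage restores spatial resolution gradually rather than in one aggressive step, which is the point of PUP.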

CNN Branch

The shallow CNN branch captures low-level features and encodes them from local to global through gradually increasing receptive fields. Outputs from intermediate convolution layers are forwarded for fusion with corresponding Transformer features.
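A stand-in for the shallow CNN branch might look like the sketch below. The paper uses a truncated pretrained backbone; this plain three-stage stack of conv blocks is only an assumption that mirrors the idea of exposing intermediate features for fusion.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    """3x3 conv + BN + ReLU followed by stride-2 pooling (halves resolution)."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class ShallowCNNBranch(nn.Module):
    """Illustrative shallow CNN branch: a few stages whose intermediate
    outputs are kept for fusion with the Transformer features."""
    def __init__(self):
        super().__init__()
        self.stage1 = conv_block(3, 64)     # 1/2 resolution, low-level detail
        self.stage2 = conv_block(64, 128)   # 1/4 resolution
        self.stage3 = conv_block(128, 256)  # 1/8 resolution, wider receptive field

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        return [f1, f2, f3]  # multi-level features forwarded to BiFusion
```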

BiFusion Module

The BiFusion module incorporates channel attention, spatial attention, and a bilinear Hadamard product to harness the complementary characteristics of both branches. The attended CNN and Transformer features and their interaction term are combined into fused features, which are then passed through gated skip-connections to generate the final segmentation.
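The three ingredients of BiFusion can be sketched as follows. This is a simplified reading of the module, assuming an SE-style channel attention, a single-conv spatial attention, and 1x1 projections before the Hadamard interaction; the paper's exact layer choices may differ.

```python
import torch
import torch.nn as nn

class BiFusion(nn.Module):
    """Sketch of BiFusion: channel attention on the Transformer features,
    spatial attention on the CNN features, and an element-wise (Hadamard)
    interaction term, concatenated and projected. Layer sizes are assumed."""
    def __init__(self, ch):
        super().__init__()
        # Channel attention (squeeze-and-excitation style) for the Transformer branch
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid(),
        )
        # Spatial attention for the CNN branch
        self.sa = nn.Sequential(nn.Conv2d(ch, 1, 7, padding=3), nn.Sigmoid())
        # 1x1 projections feeding the bilinear Hadamard interaction
        self.wt = nn.Conv2d(ch, ch, 1)
        self.wc = nn.Conv2d(ch, ch, 1)
        self.out = nn.Conv2d(3 * ch, ch, 3, padding=1)

    def forward(self, t, c):
        t_att = t * self.ca(t)            # channel-attended Transformer features
        c_att = c * self.sa(c)            # spatially attended CNN features
        inter = self.wt(t) * self.wc(c)   # Hadamard product interaction
        return self.out(torch.cat([t_att, c_att, inter], dim=1))
```

In this reading, the Hadamard term captures where the two branches agree, while the attention paths let each branch emphasize its own strength (channels for the Transformer, spatial detail for the CNN) before fusion.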

Experimental Results

The efficacy of TransFuse was demonstrated on several medical image segmentation tasks, including polyp, skin lesion, hip, and prostate segmentation. TransFuse achieved superior performance, as evidenced by significant improvements in key metrics such as mean Dice coefficient (mDice) and mean Intersection-over-Union (mIoU).

Polyp Segmentation

TransFuse outperformed state-of-the-art models such as HarDNet-MSEG and PraNet with an increased mDice score. TransFuse-S exhibited an enhancement in both computational efficiency and segmentation accuracy, running at 98.7 FPS with just 26.3M parameters.

Skin Lesion Segmentation

The model delivered high performance on the ISIC 2017 dataset, achieving a higher Jaccard Index compared to leading methods such as SLSDeep, thereby underscoring its capacity to generalize effectively to different segmentation tasks.

Hip Segmentation

On the hip segmentation task, TransFuse-S reduced the Hausdorff Distance (HD) and Average Surface Distance (ASD) substantially, demonstrating its proficiency in capturing precise spatial details essential for clinical applications.

Prostate Segmentation

In 3D prostate MRI segmentation, TransFuse outperformed the nnU-Net framework, revealing the potential of the parallel-in-branch design for volumetric medical data.

Conclusion and Future Work

The innovative integration of CNNs and Transformers into the TransFuse architecture provides a robust solution for medical image segmentation, adeptly balancing the capture of global dependencies and preservation of localized details. The parallel-in-branch approach, coupled with the BiFusion module, establishes a new paradigm in medical imaging by significantly enhancing both segmentation accuracy and computational efficiency.

Future research may explore enhancing the efficiency of Transformer layers further and applying the TransFuse architecture to other medical tasks, including landmark detection and disease classification.

This work marks a step towards more adaptable, accurate, and efficient AI models for medical imaging, offering a promising avenue for subsequent advancements in the field.

Authors: Yundong Zhang, Huiye Liu, Qiang Hu

Citations: 784