nnFormer: Interleaved Transformer for Volumetric Segmentation (2109.03201v6)

Published 7 Sep 2021 in cs.CV

Abstract: Transformer, the model of choice for natural language processing, has drawn scant attention from the medical imaging community. Given their ability to exploit long-term dependencies, transformers are promising for helping typical convolutional neural networks overcome their inherent shortcomings in spatial inductive bias. However, most recently proposed transformer-based segmentation approaches simply treat transformers as auxiliary modules that help encode global context into convolutional representations. To address this issue, we introduce nnFormer, a 3D transformer for volumetric medical image segmentation. nnFormer not only exploits the combination of interleaved convolution and self-attention operations, but also introduces local and global volume-based self-attention mechanisms to learn volume representations. Moreover, nnFormer proposes to use skip attention to replace the traditional concatenation/summation operations in the skip connections of U-Net-like architectures. Experiments show that nnFormer significantly outperforms previous transformer-based counterparts by large margins on three public datasets. Compared to nnUNet, nnFormer produces significantly lower HD95 and comparable DSC results. Furthermore, we show that nnFormer and nnUNet are highly complementary to each other in model ensembling.

Citations (211)

Summary

  • The paper introduces a 3D transformer architecture that interleaves convolution with self-attention to enhance volumetric medical image segmentation.
  • It employs local and global volume-based self-attention along with skip attention to effectively aggregate features across network stages.
  • Experimental results on brain tumor, multi-organ, and cardiac segmentation show large gains over prior transformer-based methods, with lower Hausdorff Distance (HD95) and Dice Similarity Coefficient (DSC) comparable to nnUNet.

Overview of nnFormer: Volumetric Medical Image Segmentation via a 3D Transformer

In the paper titled "nnFormer: Volumetric Medical Image Segmentation via a 3D Transformer," Zhou et al. introduce an approach to volumetric medical image segmentation built around 3D transformers. Unlike conventional methodologies that integrate transformers as auxiliary modules for global context encoding, nnFormer positions the transformer as the principal architecture, aiming to fully harness both interleaved convolution and self-attention operations. The design emphasizes local and global volume-based self-attention mechanisms and implements skip attention to enhance segmentation performance.

Architecture and Methodology

The nnFormer model consists of an encoder-decoder architecture supplemented with a bottleneck section, drawing inspiration from, but extending beyond, the U-Net structure. Key features of nnFormer include:

  • Interleaved Convolution and Self-Attention: This combination allows for the retention of precise spatial information provided by convolutions and the integration of long-term dependencies captured through self-attention mechanisms.
  • Local and Global Volume-Based Self-Attention: Local Volume-based Multi-head Self-Attention (LV-MSA) and Global Volume-based Multi-head Self-Attention (GV-MSA) are employed to manage feature scaling and receptive-field size, ensuring comprehensive 3D volume representation learning (a minimal code sketch follows this list).
  • Skip Attention: Replacing traditional concatenation or summation in skip connections, skip attention facilitates effective feature aggregation across network stages.
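
The sketch below is a minimal PyTorch rendering of one interleaved encoder stage, not the authors' released code: a strided 3D convolution downsamples the feature map, and local volume-based self-attention is then computed inside non-overlapping volumes. The class names, volume size, head count, and channel widths are illustrative assumptions, and details of the full model (feed-forward sub-blocks, positional bias, GV-MSA in the bottleneck, and skip attention in the decoder) are omitted.

```python
# Minimal sketch (assumptions noted above) of interleaved convolution + LV-MSA.
import torch
import torch.nn as nn


class LocalVolumeSelfAttention(nn.Module):
    """Multi-head self-attention restricted to non-overlapping S x S x S volumes."""

    def __init__(self, dim: int, heads: int = 4, volume: int = 4):
        super().__init__()
        self.volume = volume
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, D, H, W) with D, H, W divisible by the volume size.
        b, c, d, h, w = x.shape
        s = self.volume
        # Partition the feature map into (d//s)*(h//s)*(w//s) local volumes,
        # each flattened into a sequence of s**3 tokens.
        x = x.view(b, c, d // s, s, h // s, s, w // s, s)
        x = x.permute(0, 2, 4, 6, 3, 5, 7, 1).reshape(-1, s ** 3, c)
        y = self.norm(x)
        y, _ = self.attn(y, y, y)          # self-attention within each volume
        x = x + y                           # residual connection
        # Reverse the partition back to (B, C, D, H, W).
        x = x.view(b, d // s, h // s, w // s, s, s, s, c)
        x = x.permute(0, 7, 1, 4, 2, 5, 3, 6).reshape(b, c, d, h, w)
        return x


class InterleavedStage(nn.Module):
    """Strided 3D convolution (downsampling) interleaved with LV-MSA blocks."""

    def __init__(self, in_ch: int, out_ch: int, depth: int = 2):
        super().__init__()
        self.down = nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.blocks = nn.ModuleList(
            [LocalVolumeSelfAttention(out_ch) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.down(x)
        for blk in self.blocks:
            x = blk(x)
        return x


# Example: a 64-channel stage applied to a small volumetric feature map.
feats = torch.randn(1, 32, 32, 32, 32)
stage = InterleavedStage(32, 64)
print(stage(feats).shape)  # torch.Size([1, 64, 16, 16, 16])
```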

Experimental Results

nnFormer was evaluated on three public datasets covering different medical imaging tasks: brain tumor segmentation, multi-organ segmentation, and cardiac segmentation. Notably, nnFormer demonstrated substantial improvements over existing transformer-based approaches, achieving lower 95th-percentile Hausdorff Distance (HD95) and higher Dice Similarity Coefficient (DSC) in several evaluations (a sketch of both metrics follows the list below). A summary of notable results includes:

  • On the brain tumor segmentation task, nnFormer significantly reduced the average HD95 and enhanced DSC compared to baselines like UNETR.
  • In multi-organ segmentation, nnFormer outperformed other methods on most organ classes, particularly in accurately delineating complex anatomical structures such as the pancreas and stomach.
  • In the cardiac task, nnFormer segmented cardiac structures more accurately than state-of-the-art approaches.
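
As a reference for the reported metrics, here is a sketch of how DSC and HD95 are commonly computed for binary 3D masks, following the usual MedPy-style convention (boundary voxels extracted by erosion, symmetric surface distances, 95th percentile); the paper's exact evaluation scripts and voxel-spacing handling may differ.

```python
# Hedged sketch of DSC and HD95 for binary 3D masks (not the paper's own code).
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial import cKDTree


def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom else 1.0


def _surface_points(mask: np.ndarray, spacing) -> np.ndarray:
    # Boundary voxels = mask minus its erosion, scaled to physical units.
    border = mask & ~binary_erosion(mask)
    return np.argwhere(border) * np.asarray(spacing)


def hd95(pred: np.ndarray, gt: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> float:
    p = _surface_points(pred.astype(bool), spacing)
    g = _surface_points(gt.astype(bool), spacing)
    d_pg = cKDTree(g).query(p)[0]   # pred-surface -> gt-surface distances
    d_gp = cKDTree(p).query(g)[0]   # gt-surface -> pred-surface distances
    return float(np.percentile(np.hstack([d_pg, d_gp]), 95))


# Toy example on two overlapping cubes.
a = np.zeros((32, 32, 32), bool)
b = np.zeros((32, 32, 32), bool)
a[8:20, 8:20, 8:20] = True
b[10:22, 10:22, 10:22] = True
print(dice(a, b), hd95(a, b))
```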

Implications and Future Directions

nnFormer carries substantial implications for medical image analysis, particularly by enhancing segmentation accuracy and robustness across varied volumetric datasets. Its transformer-centric framework with interleaved convolution makes it a potent hybrid approach, paving the way for adoption in clinical pipelines where segmentation accuracy is paramount.

The authors also highlight the potential for nnFormer and nnUNet to complement each other effectively, suggesting that further exploration into model ensembling strategies could yield additional improvements in medical image segmentation. Future developments might focus on optimizing computational efficiency and exploring the application of nnFormer to other domains beyond medical imaging, such as remote sensing or video segmentation, where volumetric data play a critical role. Further research could also elaborate on the adaptation of skip attention and its application to other neural network architectures, possibly refining its integration for broader use cases.
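
As an illustration of the kind of ensembling the authors allude to, the following is a minimal sketch, assuming both networks output per-voxel class logits on identically preprocessed inputs, that averages softmax probabilities before taking the argmax; the paper's actual ensembling strategy may differ.

```python
# Hedged sketch of probability-averaging ensembling of two segmentation nets.
import torch


@torch.no_grad()
def ensemble_predict(models, volume: torch.Tensor) -> torch.Tensor:
    """volume: (1, C_in, D, H, W); returns per-voxel labels of shape (D, H, W)."""
    probs = None
    for m in models:
        p = torch.softmax(m(volume), dim=1)   # (1, num_classes, D, H, W)
        probs = p if probs is None else probs + p
    return (probs / len(models)).argmax(dim=1).squeeze(0)


# Usage (assuming `nnformer` and `nnunet` are loaded segmentation networks):
# labels = ensemble_predict([nnformer, nnunet], ct_volume)
```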
