Overview of "UNetFormer: A Unified Vision Transformer Model and Pre-Training Framework for 3D Medical Image Segmentation"
The paper introduces UNetFormer, a vision transformer model designed for 3D medical image segmentation. It pairs a 3D Swin Transformer-based encoder with either a CNN-based or a transformer-based decoder to address the segmentation challenges of medical imaging. The design combines the strengths of Vision Transformers (ViTs), namely long-range dependency modeling and scalability, with the efficiency of Convolutional Neural Networks (CNNs) at extracting localized features from volumetric data.
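To make the encoder design concrete, the sketch below shows the core windowing step behind Swin-style attention, in which self-attention is restricted to local volumetric windows rather than computed over the full volume. The function name, tensor layout, and window size of 7 are illustrative assumptions, not taken from the paper's code.

```python
import torch

def window_partition_3d(x: torch.Tensor, window: int = 7) -> torch.Tensor:
    """Split a feature map of shape (B, D, H, W, C) into non-overlapping
    volumetric windows, returning (num_windows * B, window**3, C) so that
    multi-head self-attention can run within each window independently."""
    B, D, H, W, C = x.shape
    x = x.view(B, D // window, window, H // window, window, W // window, window, C)
    # Group the three per-window axes together, then flatten them into tokens.
    return x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, window ** 3, C)

# Example: a 56^3 feature volume with 96 channels yields 512 windows of 343 tokens.
feat = torch.randn(1, 56, 56, 56, 96)
print(window_partition_3d(feat).shape)  # torch.Size([512, 343, 96])
```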
Methodological Insights
UNetFormer comprises two primary architectures:
- UNetFormer with CNN-based Decoder: This variant combines the Swin Transformer encoder with a CNN-based decoder through skip connections at multiple resolutions, capitalizing on convolutions for accurate spatial localization of features.
- UNetFormer+ with Transformer-based Decoder: This configuration replaces the CNN decoder with a transformer-based one, aiming to improve global context modeling at each decoding stage.
Both architectures are designed to balance computational efficiency and segmentation accuracy, making them adaptable to various clinical contexts; a minimal sketch of the shared encoder-decoder layout follows.
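The shared layout can be summarized in a small PyTorch skeleton. Plain strided convolutions stand in for the Swin Transformer stages here, and the module names and channel widths (e.g., UNetFormerSketch) are hypothetical placeholders rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class EncoderStage(nn.Module):
    """Stand-in for one 3D Swin Transformer stage: halve resolution, widen channels."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(c_in, c_out, kernel_size=3, stride=2, padding=1),
            nn.InstanceNorm3d(c_out),
            nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)

class CNNDecoderStage(nn.Module):
    """Upsample, concatenate the skip feature, and fuse with a conv block."""
    def __init__(self, c_in: int, c_skip: int, c_out: int):
        super().__init__()
        self.up = nn.ConvTranspose3d(c_in, c_out, kernel_size=2, stride=2)
        self.fuse = nn.Sequential(
            nn.Conv3d(c_out + c_skip, c_out, kernel_size=3, padding=1),
            nn.InstanceNorm3d(c_out),
            nn.GELU(),
        )

    def forward(self, x, skip):
        return self.fuse(torch.cat([self.up(x), skip], dim=1))

class UNetFormerSketch(nn.Module):
    """Encoder-decoder with skip connections at multiple resolutions."""
    def __init__(self, in_ch: int = 1, n_classes: int = 3, widths=(24, 48, 96, 192)):
        super().__init__()
        chans = (in_ch,) + tuple(widths)
        self.stages = nn.ModuleList(
            EncoderStage(chans[i], chans[i + 1]) for i in range(len(widths)))
        self.decoders = nn.ModuleList(
            CNNDecoderStage(widths[i + 1], widths[i], widths[i])
            for i in reversed(range(len(widths) - 1)))
        self.head = nn.Conv3d(widths[0], n_classes, kernel_size=1)

    def forward(self, x):
        skips = []
        for stage in self.stages:
            x = stage(x)
            skips.append(x)
        x = skips.pop()  # deepest feature map
        for dec in self.decoders:
            x = dec(x, skips.pop())
        return self.head(x)

# A 96^3 CT patch in, per-voxel class logits out (at half the input resolution,
# since the first stand-in stage downsamples).
logits = UNetFormerSketch()(torch.randn(1, 1, 96, 96, 96))
print(logits.shape)  # torch.Size([1, 3, 48, 48, 48])
```

In the UNetFormer+ variant, each CNNDecoderStage would be replaced by a transformer block operating on the upsampled features, with the skip-connection wiring left unchanged.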
Pre-Training Strategy
A significant contribution of the paper is its proposal of a self-supervised pre-training strategy using masked image modeling. The model learns to predict randomly masked volumetric tokens by leveraging the visible context, akin to BERT-style pre-training in NLP. This approach is especially valuable in medical imaging, where annotated data is limited, enabling efficient transfer learning and better performance on downstream tasks.
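A minimal sketch of this objective is shown below, assuming the volume has already been tokenized into patch embeddings. The 60% masking ratio, learned mask token, and L1 reconstruction loss follow the common masked-image-modeling recipe and are assumptions here, not the paper's exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_modeling_loss(tokens, encoder, head, mask_token, mask_ratio=0.6):
    """Hide a random subset of volumetric tokens behind a learned mask token,
    encode the corrupted sequence, and score reconstruction on masked positions.

    tokens: (B, N, C) patch embeddings of a 3D volume.
    """
    B, N, C = tokens.shape
    mask = torch.rand(B, N, device=tokens.device) < mask_ratio  # True = masked
    corrupted = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, C), tokens)
    pred = head(encoder(corrupted))                             # (B, N, C)
    # Only masked positions contribute, as in BERT-style pre-training.
    return F.l1_loss(pred[mask], tokens[mask])

# Toy usage with stand-in modules and hypothetical sizes.
C = 96
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(C, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(C, C)
mask_token = nn.Parameter(torch.zeros(1, 1, C))
loss = masked_modeling_loss(torch.randn(2, 216, C), encoder, head, mask_token)
loss.backward()
```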
Experimental Validation
The UNetFormer models were pre-trained on 5050 CT images and fine-tuned on the liver and liver tumor segmentation task of the Medical Segmentation Decathlon (MSD) and on the BraTS 21 brain tumor dataset. The reported results show notable performance improvements:
- Liver and Liver Tumor Segmentation: UNetFormer achieved Dice scores of 96.03% for the liver and 59.16% for liver tumors, outperforming strong baselines such as nnUNet and nnFormer.
- Brain Tumor Segmentation: A Dice score of 91.54% was attained on BraTS 21, exceeding prior methods and demonstrating generalizability across segmentation tasks (a minimal Dice implementation follows for reference).
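For reference, the Dice score used in these comparisons measures volumetric overlap between a predicted mask P and the ground truth G, Dice = 2|P∩G| / (|P| + |G|). A minimal per-class implementation, with a small smoothing term added to avoid division by zero, might look as follows.

```python
import torch

def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice = 2|P∩G| / (|P| + |G|) for binary masks of a single class."""
    pred, target = pred.bool(), target.bool()
    intersection = (pred & target).sum().float()
    return (2 * intersection + eps) / (pred.sum() + target.sum() + eps)

# A perfect prediction scores 1.0; disjoint masks score ~0.
mask = torch.zeros(64, 64, 64, dtype=torch.bool)
mask[20:40, 20:40, 20:40] = True
print(dice_score(mask, mask))  # tensor(1.)
```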
The approach delivers high-quality segmentation outputs while keeping computational requirements moderate. The reported ablations also show sensitivity to pre-training configuration, notably the masking ratio and patch size, offering guidance on settings that work well; the sketch below illustrates the knobs involved.
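Such settings can be captured in a small configuration object; the field names and default values below are illustrative placeholders, not the paper's reported optima.

```python
from dataclasses import dataclass

@dataclass
class PretrainConfig:
    """Hypothetical pre-training knobs of the kind explored in the ablations."""
    patch_size: int = 16      # edge length of each cubic token, in voxels
    mask_ratio: float = 0.6   # fraction of tokens hidden during pre-training
    roi_size: int = 96        # edge length of the sampled 3D training crop
    batch_size: int = 2
    lr: float = 1e-4

# Sweeping the masking ratio is one of the ablation axes discussed above.
cfg = PretrainConfig(mask_ratio=0.75)
print(cfg)
```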
Practical and Theoretical Implications
Practically, UNetFormer offers a robust solution for clinical applications where accurate segmentation is crucial, such as tumor delineation and volumetric analysis. Theoretically, it advances medical image analysis by unifying transformer-based models with CNNs and by providing a framework for leveraging large-scale unannotated datasets.
Future Directions
The integration of ViTs in medical imaging could be further explored by assessing the impact of different transformer configurations and pre-training objectives. Additionally, examining the reconstruction quality of pre-trained models in relation to downstream task effectiveness could yield deeper insights. The potential of hybrid models and their application in other domains of medical imaging remain promising areas for research.
In summary, the UNetFormer framework represents an important evolution in leveraging transformer-based architectures for 3D medical image segmentation, providing a foundation for future advancements in medical image analysis.