Self Pre-training with Masked Autoencoders for Medical Image Classification and Segmentation (2203.05573v2)

Published 10 Mar 2022 in eess.IV, cs.CV, and cs.LG

Abstract: Masked Autoencoder (MAE) has recently been shown to be effective in pre-training Vision Transformers (ViT) for natural image analysis. By reconstructing full images from partially masked inputs, a ViT encoder aggregates contextual information to infer masked image regions. We believe that this context aggregation ability is particularly essential to the medical image domain where each anatomical structure is functionally and mechanically connected to other structures and regions. Because there is no ImageNet-scale medical image dataset for pre-training, we investigate a self pre-training paradigm with MAE for medical image analysis tasks. Our method pre-trains a ViT on the training set of the target data instead of another dataset. Thus, self pre-training can benefit more scenarios where pre-training data is hard to acquire. Our experimental results show that MAE self pre-training markedly improves diverse medical image tasks including chest X-ray disease classification, abdominal CT multi-organ segmentation, and MRI brain tumor segmentation. Code is available at https://github.com/cvlab-stonybrook/SelfMedMAE

Self Pre-training with Masked Autoencoders for Medical Image Classification and Segmentation

The paper "Self Pre-training with Masked Autoencoders for Medical Image Classification and Segmentation" explores Masked Autoencoders (MAE) as a self-supervised pre-training approach for Vision Transformers (ViT) in medical imaging. The work builds on the context aggregation capability of MAE, which is particularly important for medical image analysis because anatomical structures are functionally and mechanically interdependent.

Methodology Overview

The approach uses a ViT as the backbone for both self pre-training and the subsequent downstream tasks. The MAE framework, notable for its asymmetric encoder-decoder architecture, masks a random subset of image patches; the encoder processes only the visible patches, and a lightweight decoder reconstructs the full image from the encoded patches plus learnable mask tokens. Because the decoder is deliberately small, the burden of inferring the masked regions falls on the encoder, which must aggregate contextual information from partial observations. This is crucial for medical images, especially CT and MRI scans, where complex structural relationships must be contextualized for accurate analysis. A minimal sketch of this masking-and-reconstruction flow appears below.
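The following PyTorch sketch illustrates the MAE flow described above. It is not the authors' implementation (see their repository for that); the patch size, embedding dimensions, layer counts, and 75% mask ratio are illustrative assumptions, and positional embeddings are omitted for brevity.

```python
# Minimal, illustrative MAE sketch; hyperparameters are assumptions.
import torch
import torch.nn as nn


class TinyMAE(nn.Module):
    def __init__(self, img_size=224, patch=16, enc_dim=768, dec_dim=512,
                 mask_ratio=0.75):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        self.mask_ratio = mask_ratio
        self.patchify = nn.Conv2d(3, enc_dim, kernel_size=patch, stride=patch)
        # Asymmetric design: a full-size encoder, a lightweight decoder.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True),
            num_layers=12)
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True),
            num_layers=4)
        self.head = nn.Linear(dec_dim, patch * patch * 3)  # pixel regression

    def forward(self, imgs):
        x = self.patchify(imgs).flatten(2).transpose(1, 2)  # (B, N, D)
        B, N, D = x.shape
        keep = int(N * (1 - self.mask_ratio))
        # Random masking: shuffle patch indices, keep the first `keep`.
        noise = torch.rand(B, N, device=x.device)
        shuffle = noise.argsort(dim=1)
        restore = shuffle.argsort(dim=1)
        visible = torch.gather(
            x, 1, shuffle[:, :keep, None].expand(-1, -1, D))
        # The encoder sees only the visible patches.
        latent = self.encoder(visible)
        # The decoder sees encoded patches plus mask tokens, in original order.
        dec_in = self.enc_to_dec(latent)
        mask_tokens = self.mask_token.expand(B, N - keep, -1)
        full = torch.cat([dec_in, mask_tokens], dim=1)
        full = torch.gather(
            full, 1, restore[:, :, None].expand(-1, -1, full.shape[-1]))
        return self.head(self.decoder(full))  # per-patch pixel predictions
```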

For pre-training, the authors propose the self pre-training paradigm. Unlike conventional transfer learning, where models are first trained on a large external dataset such as ImageNet, self pre-training uses the same data twice: the ViT is first pre-trained with MAE on the training set of the target task and then fine-tuned on that same set with labels (a compact rendering of this two-stage recipe follows). This both sidesteps the difficulty of assembling a large medical pre-training corpus and alleviates the domain discrepancy incurred when transferring models from general-purpose natural image datasets.
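Below is a compact rendering of the two-stage recipe, under the same assumptions as the sketch above; `TinyMAE`, the hyperparameters, and the pooled linear head are illustrative placeholders, not the paper's configuration.

```python
# Hypothetical two-stage self pre-training pipeline; settings are placeholders.
import torch
import torch.nn.functional as F


def patchify_pixels(imgs, patch=16):
    # Flatten each image into per-patch pixel targets: (B, N, 3*patch*patch).
    cols = F.unfold(imgs, kernel_size=patch, stride=patch)  # (B, 3*p*p, N)
    return cols.transpose(1, 2)


def self_pretrain(mae, loader, epochs=100, lr=1.5e-4):
    # Stage 1: masked-image reconstruction on the target training set.
    # Labels are ignored; only the images drive the pretext task.
    opt = torch.optim.AdamW(mae.parameters(), lr=lr)
    for _ in range(epochs):
        for imgs, _ in loader:
            pred = mae(imgs)  # (B, N, 3*patch*patch)
            # Note: MAE computes the loss on masked patches only;
            # regressing all patches here keeps the sketch short.
            loss = F.mse_loss(pred, patchify_pixels(imgs))
            opt.zero_grad()
            loss.backward()
            opt.step()


def finetune_classifier(mae, loader, num_classes, epochs=75, lr=1e-4):
    # Stage 2: reuse the pre-trained encoder on the SAME dataset, now with
    # labels, attaching a fresh linear head (the pooling choice is an
    # assumption; a multi-label task like ChestX-ray14 would use
    # F.binary_cross_entropy_with_logits instead of cross-entropy).
    head = torch.nn.Linear(768, num_classes)
    params = (list(mae.patchify.parameters())
              + list(mae.encoder.parameters())
              + list(head.parameters()))
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(epochs):
        for imgs, labels in loader:
            tokens = mae.patchify(imgs).flatten(2).transpose(1, 2)
            feats = mae.encoder(tokens).mean(dim=1)  # global average pooling
            loss = F.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```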

Experimental Results

The paper demonstrates the efficacy of MAE self pre-training across multiple medical image tasks:

  1. Chest X-ray Disease Classification: On the ChestX-ray14 dataset of over 112,000 images, the MAE self pre-trained ViT outperformed even ImageNet pre-trained models, showing that self pre-training can surpass conventional transfer learning.
  2. Abdominal CT Segmentation: On the BTCV multi-organ segmentation benchmark, MAE self pre-training yielded a notable increase in Dice Similarity Coefficient (DSC) across the abdominal organs (the standard DSC definition is sketched after this list).
  3. MRI Brain Tumor Segmentation: Similarly, in the case of the Medical Segmentation Decathlon’s brain tumor segmentation task, the framework achieved superior results on key performance metrics, further validating its applicability across diverse medical imaging modalities and tasks.
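For reference, the DSC cited in items 2 and 3 has the standard set-overlap definition, DSC = 2|X ∩ Y| / (|X| + |Y|). A minimal binary-mask implementation is shown below; this helper is not from the paper's code, and the soft, multi-class variants used in practice differ.

```python
# Textbook Dice Similarity Coefficient for binary masks; not the paper's code.
import torch


def dice_coefficient(pred, target, eps=1e-6):
    """DSC = 2 * |X intersect Y| / (|X| + |Y|) for masks of equal shape."""
    pred, target = pred.bool(), target.bool()
    intersection = (pred & target).sum().float()
    return (2.0 * intersection / (pred.sum() + target.sum() + eps)).item()
```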

Implications and Speculation on Future Developments

The research suggests that self-supervised learning paradigms, particularly MAE, have considerable potential in medical imaging applications where large annotated datasets are scarce. By focusing on the dataset pertinent to the task, self pre-training not only optimizes the model for improved performance but also ensures better utilization of the limited data. Future research can explore extending this approach to other imaging modalities or pathologies, potentially broadening the framework’s applicability.

Additionally, the integration of context aggregation into the self-supervised learning paradigm presents new opportunities for enhancing model interpretability, a critical requirement in medical applications. It's plausible that future innovations might focus on refining this aspect to provide richer, more interpretable insights from medical data.

In summary, the paper posits MAE self pre-training as an effective strategy for advancing medical image analysis, presenting strong numerical results across varied tasks. The promising performance improvements observed suggest a path forward for the integration of self-supervised learning into clinical workflows, provided further validation across even larger datasets and more diverse medical imaging applications.

Authors (6)
  1. Lei Zhou (126 papers)
  2. Huidong Liu (13 papers)
  3. Joseph Bae (14 papers)
  4. Junjun He (77 papers)
  5. Dimitris Samaras (125 papers)
  6. Prateek Prasanna (47 papers)
Citations (49)