Medical Transformer: Gated Axial-Attention for Medical Image Segmentation
The paper "Medical Transformer: Gated Axial-Attention for Medical Image Segmentation" by Jeya Maria Jose Valanarasu et al. proposes a novel approach to medical image segmentation leveraging transformer-based architectures. Transformers have revolutionized natural language processing by their ability to model long-range dependencies via self-attention mechanisms, which coax researchers to explore their applicability in computer vision tasks. This paper introduces a transformer-based solution tailored specifically to medical image segmentation which addresses the current limitations posed by convolutional neural networks (ConvNets).
Medical image segmentation plays a pivotal role in clinical settings, aiding accurate diagnosis and precise planning of surgical interventions. Although ConvNets such as U-Net, V-Net, and their variants have achieved commendable performance in medical image segmentation, the limited receptive field of their convolutional structure constrains their ability to capture long-range dependencies in images. The proposed Medical Transformer (MedT) aims to remedy this limitation with a gated axial-attention mechanism and a Local-Global (LoGo) training strategy.
Methodological Contributions
MedT rests on two key innovations, each contributing to its performance:
- Gated Axial-Attention Mechanism: Axial-attention mechanisms, though efficient, rely on large datasets to learn accurate positional encodings, an assumption rarely met in medical imaging, where labeled data is scarce. The proposed gated axial-attention layer adds learnable gates that modulate the influence of the relative positional embeddings, so the model can down-weight them when they are poorly learned. This enables effective training even on smaller datasets and improves representation learning and generalization; a simplified sketch of such a layer follows this list.
- Local-Global (LoGo) Training Strategy: The LoGo strategy feeds both global and local context into the training pipeline. The global branch of the network processes the entire image to capture overarching contextual features, while the local branch operates on image patches to capture fine detail. This dual view yields both a holistic and a granular understanding of the image, which is crucial for precise segmentation; a minimal sketch of the two-branch structure also appears below.
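To make the gating concrete, here is a minimal single-head sketch of a gated axial-attention layer along one axis, written in PyTorch. The class name, the plain linear QKV projection, the logit scaling, and the gate initializations are illustrative simplifications rather than the paper's exact implementation; what it does preserve is the key idea that learnable gates (named g_q, g_k, g_v1, g_v2 here) scale the relative positional terms in both the attention logits and the aggregated output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedAxialAttention1D(nn.Module):
    """Sketch of gated axial attention along a single axis (e.g., width).

    Single head and simplified projections; the learnable gates control
    how strongly the relative positional terms contribute, which is the
    core idea of the gated layer described above.
    """

    def __init__(self, dim, span):
        super().__init__()
        self.dim = dim
        self.span = span  # number of positions along the attended axis
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        # Relative positional embeddings for queries, keys, and values.
        self.rel_q = nn.Parameter(torch.randn(span, dim))
        self.rel_k = nn.Parameter(torch.randn(span, dim))
        self.rel_v = nn.Parameter(torch.randn(span, dim))
        # Learnable gates; this initialization is illustrative, not the paper's.
        self.g_q = nn.Parameter(torch.zeros(1))
        self.g_k = nn.Parameter(torch.zeros(1))
        self.g_v1 = nn.Parameter(torch.ones(1))
        self.g_v2 = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # x: (batch, span, dim), one row or column of the feature map.
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # Content logits plus gated query- and key-positional logits.
        logits = torch.einsum("bnd,bmd->bnm", q, k)
        logits = logits + self.g_q * torch.einsum("bnd,md->bnm", q, self.rel_q)
        key_pos = torch.einsum("bmd,md->bm", k, self.rel_k)
        logits = logits + self.g_k * key_pos.unsqueeze(1)
        attn = F.softmax(logits / self.dim ** 0.5, dim=-1)
        # Gated aggregation of values and positional values.
        out = self.g_v1 * torch.einsum("bnm,bmd->bnd", attn, v)
        out = out + self.g_v2 * torch.einsum("bnm,md->bnd", attn, self.rel_v)
        return out


# Example: attend along a 32-position row with 16 channels.
layer = GatedAxialAttention1D(dim=16, span=32)
row = torch.randn(2, 32, 16)   # (batch, width, channels)
print(layer(row).shape)        # torch.Size([2, 32, 16])
```

In an axial-attention model, one such layer attends along the image height and another along the width, so full 2D self-attention factorizes into two cheaper 1D passes.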
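The LoGo strategy can likewise be sketched as two branches whose outputs are combined. The module below is a deliberately simplified stand-in: plain convolutional stacks replace the paper's gated axial-attention branches, and the concatenation-plus-1x1-convolution fusion is a hypothetical choice made for brevity. It is meant only to show the data flow: the global branch sees the full image, while the local branch processes it patch by patch.

```python
import torch
import torch.nn as nn


class LoGoSketch(nn.Module):
    """Illustrative Local-Global (LoGo) forward pass.

    The branches are small conv stacks purely to show the data flow; in
    MedT they are built from gated axial-attention layers, and the
    fusion details differ.
    """

    def __init__(self, in_ch=3, num_classes=1, grid=4):
        super().__init__()
        self.grid = grid  # the image is tiled into a grid x grid set of patches
        self.global_branch = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, num_classes, 1),
        )
        self.local_branch = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, num_classes, 1),
        )
        self.fuse = nn.Conv2d(num_classes * 2, num_classes, 1)

    def forward(self, x):
        _, _, h, w = x.shape
        # Global branch: the whole image at once for coarse context.
        g = self.global_branch(x)
        # Local branch: one patch at a time for fine detail.
        ph, pw = h // self.grid, w // self.grid
        rows = []
        for i in range(self.grid):
            cols = []
            for j in range(self.grid):
                tile = x[:, :, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
                cols.append(self.local_branch(tile))
            rows.append(torch.cat(cols, dim=3))
        local = torch.cat(rows, dim=2)
        # Combine coarse and fine predictions into one segmentation map.
        return self.fuse(torch.cat([g, local], dim=1))


# Example: a 128x128 RGB image split into a 4x4 grid of 32x32 patches.
model = LoGoSketch()
img = torch.randn(1, 3, 128, 128)
print(model(img).shape)  # torch.Size([1, 1, 128, 128])
```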
Experimental Validation
The proposed MedT architecture was evaluated on three distinct medical image segmentation datasets: Brain Anatomy Segmentation (ultrasound), Gland Segmentation (GlaS, microscopy), and MoNuSeg (microscopy). The datasets were chosen to cover a diverse range of imaging modalities and segmentation challenges.
The paper reports that MedT consistently outperformed both convolutional and existing transformer-based models. Specifically, MedT demonstrated an improvement in F1 scores over fully convolutional baselines (U-Net, Res-UNet) and Axial Attention U-Net as follows:
- Brain US Dataset: Improvement by 0.92% over Axial Attention U-Net and 1.32% over Res-UNet.
- GlaS Dataset: Improvement by 4.76% over Axial Attention U-Net and 2.19% over Res-UNet.
- MoNuSeg Dataset: Improvement by 2.72% over Axial Attention U-Net and marginal improvement over Res-UNet.
Implications and Future Directions
The implications of this research are significant for both the theoretical understanding and practical deployment of transformers in medical imaging:
- Theoretical Advancement: By demonstrating the efficacy of transformer-based architectures in the medical domain, the paper extends the applicability of attention mechanisms beyond their prevailing use-cases in natural language processing and general computer vision.
- Practical Impact: The ability to train effective segmentation models on limited data without the need for pre-training on large datasets makes MedT particularly valuable in medical contexts where annotated data is often scarce and the labeling process costly.
Future work might focus on optimizing the computational efficiency of these models, investigating their robustness across a wider array of medical imaging modalities, and further improving their interpretability for clinical use. Expanding the scope to multi-modal inputs and cross-domain transfer learning could also be explored.
In conclusion, the "Medical Transformer" paper lays critical groundwork for integrating advanced transformer-based models into medical image analysis. Through the gated axial-attention mechanism and the LoGo training strategy, it provides a robust framework capable of achieving superior segmentation performance even in data-constrained scenarios. This work is a significant step forward in applying state-of-the-art machine learning methodologies to medical image processing, with substantial potential for improving clinical outcomes.