Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images (2306.07831v1)

Published 13 Jun 2023 in cs.CV

Abstract: Contrastive visual language pretraining has emerged as a powerful method for either training new language-aware image encoders or augmenting existing pretrained models with zero-shot visual recognition capabilities. However, existing works typically train on large datasets of image-text pairs and have been designed to perform downstream tasks involving only small to medium sized-images, neither of which are applicable to the emerging field of computational pathology where there are limited publicly available paired image-text datasets and each image can span up to 100,000 x 100,000 pixels. In this paper we present MI-Zero, a simple and intuitive framework for unleashing the zero-shot transfer capabilities of contrastively aligned image and text models on gigapixel histopathology whole slide images, enabling multiple downstream diagnostic tasks to be carried out by pretrained encoders without requiring any additional labels. MI-Zero reformulates zero-shot transfer under the framework of multiple instance learning to overcome the computational challenge of inference on extremely large images. We used over 550k pathology reports and other available in-domain text corpora to pre-train our text encoder. By effectively leveraging strong pre-trained encoders, our best model pretrained on over 33k histopathology image-caption pairs achieves an average median zero-shot accuracy of 70.2% across three different real-world cancer subtyping tasks. Our code is available at: https://github.com/mahmoodlab/MI-Zero.


Overview of "Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images"

This paper proposes MI-Zero, a framework that enables zero-shot transfer on gigapixel histopathology images using contrastive visual-language pretraining. The authors extend existing methods in computational pathology with a mechanism that reuses pretrained image and text encoders, addressing both the high computational cost of inference on very large images and the lack of large labeled datasets in this domain. The core idea is to use multiple instance learning (MIL) to perform zero-shot classification on whole slide images (WSIs).

Background and Motivation

In computational pathology (CPATH), models that reach clinical-grade performance typically require large annotated datasets, which are not always feasible to assemble in a specialized field such as pathology. Traditional methods follow a multi-step workflow: subdivide each WSI into patches, train classifiers on these patches with corresponding labels, and aggregate the patch-level predictions into a WSI-level decision. Recent advances in self-supervised and weakly-supervised learning have shown promise for learning localized morphological representations. However, these methods still rely substantially on labeled data.
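To give a sense of scale for the patching step, the sketch below enumerates the top-left coordinates of non-overlapping tiles on a slide. The function name `patch_grid` and the 256-pixel patch size are illustrative assumptions, not values taken from the paper; real pipelines also typically skip background regions, which this sketch does not do.

```python
def patch_grid(width, height, patch_size=256, stride=None):
    """Return (x, y) top-left corners of patches that fit fully inside the slide.

    patch_size and stride are hypothetical defaults chosen for illustration.
    """
    if stride is None:
        stride = patch_size  # non-overlapping tiles by default
    xs = range(0, width - patch_size + 1, stride)
    ys = range(0, height - patch_size + 1, stride)
    # Row-major order: iterate over rows (y), then columns (x).
    return [(x, y) for y in ys for x in xs]


if __name__ == "__main__":
    coords = patch_grid(100_000, 100_000)
    print(len(coords))
```

Under these assumptions, a 100,000 x 100,000 pixel slide yields 390 tiles per side, or 152,100 patches in total, which is why slide-level inference has to be organized around patches rather than the whole image.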

Contrastive visual-language pretraining, exemplified by models like CLIP, has opened new avenues by aligning image and text representations in a shared latent space, allowing for flexible and robust zero-shot learning. This approach, however, has predominantly targeted natural images rather than pathology images, due to the formidable challenges posed by the latter's massive image sizes and the paucity of large-scale, paired image-text datasets.
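As a rough illustration of what aligning image and text representations in a shared space means, the sketch below computes a CLIP-style symmetric contrastive loss over a batch of paired embeddings. It is a minimal NumPy sketch, not the training code of CLIP or MI-Zero; the function names and the temperature of 0.07 are assumptions made for the example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy losses.

    Row i of image_emb is assumed to be paired with row i of text_emb;
    every other row in the batch acts as a negative.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature          # (batch, batch) similarity matrix
    diag = np.arange(logits.shape[0])           # indices of the matching pairs
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

The loss is small when each image embedding is most similar to its own caption embedding and grows when pairs are mismatched, which is the property that later allows class names written as text prompts to serve as classifiers.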

Contributions and Methodology

The primary contribution of this work is MI-Zero, which extends contrastive visual-language models to handle the unique challenges associated with WSIs. MI-Zero is built upon several key innovations:

Dataset Expansion: The authors curated a dataset specifically tailored for pathology. It comprises 33,480 image-caption pairs, making it larger than previously available datasets such as ARCH.
Pretraining and Alignment: Employing over 550,000 pathology reports and other text corpora, a domain-specific text encoder, HistPathGPT, was pretrained. This was coupled with a visual encoder leveraging self-supervised learning, allowing domain-specific feature extraction from pathology images.
Framework Integration: MI-Zero reformulates zero-shot transfer as a MIL problem so that inference remains tractable on gigapixel images. It aggregates patch-level outputs with a permutation-invariant pooling operator, such as top-K or average pooling, to obtain WSI-level predictions without labeled data.
Evaluation: The model is evaluated on several independent in-house datasets and compared against state-of-the-art supervised methods. The paper reports an average median zero-shot accuracy of 70.2% across three cancer subtyping tasks, achieved without task-specific labels.
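The MIL reformulation in the Framework Integration item can be sketched as follows: score every patch against one text prompt embedding per class, then pool the patch scores into a slide-level score. This is a minimal NumPy sketch of one plausible reading of that description, not the authors' implementation; the function names, the cosine-similarity scoring, and the default `k` are assumptions, and the embeddings are taken as given rather than produced by real encoders.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def topk_pool(patch_scores, k):
    """Mean of the k highest scores per class; the order of patches is irrelevant.

    patch_scores has shape (num_patches, num_classes). Setting k to
    num_patches reduces this to average pooling.
    """
    k = min(k, patch_scores.shape[0])
    top = np.sort(patch_scores, axis=0)[-k:]    # k largest values in each column
    return top.mean(axis=0)


def zero_shot_slide_predict(patch_emb, class_text_emb, k=5):
    """Return (predicted class index, slide-level scores) for one slide.

    patch_emb:      (num_patches, dim) patch embeddings from an image encoder
    class_text_emb: (num_classes, dim) prompt embeddings from a text encoder
    """
    patches = l2_normalize(patch_emb)
    prompts = l2_normalize(class_text_emb)
    scores = patches @ prompts.T                # (num_patches, num_classes)
    slide_scores = topk_pool(scores, k)
    return int(np.argmax(slide_scores)), slide_scores
```

The choice of `k` matters: a small `k` lets a handful of strongly matching patches decide the slide label, which may suit slides where diagnostic tissue occupies a small area, whereas average pooling weights every patch equally.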

Results and Implications

The results indicate that MI-Zero performs competently even against supervised models trained with extensive labeled samples. The paper underscores the potential of zero-shot learning frameworks in pathology, which can aid rapid deployment across different diagnostic tasks without extensive retraining or labeled data acquisition, thus enhancing the scalability of computational tools in medical diagnostics.

Future Directions

This paper opens pathways for integrating more comprehensive visual-language tasks beyond classification, such as visual question answering or automated medical reporting. It also suggests the need for further expansion of pathology-specific datasets and exploration of models that can enhance sample efficiency. From a broader perspective, the lessons from MI-Zero may be extrapolated to other domains that handle high-resolution imagery, suggesting potential applications in fields like satellite imaging and remote sensing.

The authors' integration of MIL with zero-shot learning for handling the unique challenges of WSIs is a commendable advancement in utilizing machine learning to tackle real-world problems in pathology.


Authors (9)

Ming Y. Lu (23 papers)
Bowen Chen (50 papers)
Andrew Zhang (20 papers)
Drew F. K. Williamson (24 papers)
Richard J. Chen (28 papers)
Tong Ding (14 papers)
Long Phi Le (10 papers)
Yung-Sung Chuang (37 papers)
Faisal Mahmood (53 papers)

