A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model (2407.15362v2)
Abstract: Task-agnostic foundation models (FMs) have driven remarkable strides in computational pathology (CPath), advancing performance across a wide array of downstream clinical tasks. Despite this promising performance, several challenges remain. First, prior works have relied on either vision-only or vision-caption data, disregarding invaluable pathology reports and gene expression profiles, each of which offers distinct knowledge for versatile clinical applications. Second, current progress in pathology FMs predominantly concentrates on the patch level, where the restricted context of patch-level pretraining fails to capture whole-slide patterns. Here we curated the largest multimodal dataset consisting of H&E diagnostic whole slide images and their associated pathology reports and RNA-Seq data, resulting in 26,169 slide-level modality pairs from 10,275 patients across 32 cancer types. To leverage these data for CPath, we propose a novel whole-slide pretraining paradigm, called Multimodal Self-TAught PRetraining (mSTAR), which injects multimodal knowledge at the whole-slide context into the pathology FM. The proposed paradigm revolutionizes the pretraining workflow for CPath by enabling the pathology FM to acquire the whole-slide context. To our knowledge, this is the first attempt to incorporate multimodal knowledge at the slide level for enhancing pathology FMs, expanding the modelling context from unimodal to multimodal knowledge and from patch level to slide level. To systematically evaluate the capabilities of mSTAR, extensive experiments, including slide-level unimodal and multimodal applications, are conducted across 7 diverse types of tasks on 43 subtasks, resulting in the largest spectrum of downstream tasks. Averaged across these slide-level applications, mSTAR consistently demonstrates significant performance enhancements compared to state-of-the-art (SOTA) FMs.