
Benchmarking foundation models as feature extractors for weakly-supervised computational pathology (2408.15823v2)

Published 28 Aug 2024 in eess.IV and cs.CV

Abstract: Advancements in artificial intelligence have driven the development of numerous pathology foundation models capable of extracting clinically relevant information. However, there is currently limited literature independently evaluating these foundation models on truly external cohorts and clinically-relevant tasks to uncover adjustments for future improvements. In this study, we benchmarked 19 histopathology foundation models on 13 patient cohorts with 6,818 patients and 9,528 slides from lung, colorectal, gastric, and breast cancers. The models were evaluated on weakly-supervised tasks related to biomarkers, morphological properties, and prognostic outcomes. We show that a vision-language foundation model, CONCH, yielded the highest performance when compared to vision-only foundation models, with Virchow2 as close second. The experiments reveal that foundation models trained on distinct cohorts learn complementary features to predict the same label, and can be fused to outperform the current state of the art. An ensemble combining CONCH and Virchow2 predictions outperformed individual models in 55% of tasks, leveraging their complementary strengths in classification scenarios. Moreover, our findings suggest that data diversity outweighs data volume for foundation models. Our work highlights actionable adjustments to improve pathology foundation models.

Authors (16)
  1. Peter Neidlinger
  2. Omar S. M. El Nahhas
  3. Hannah Sophie Muti
  4. Tim Lenz
  5. Michael Hoffmeister
  6. Hermann Brenner
  7. Marko van Treeck
  8. Rupert Langer
  9. Bastian Dislich
  10. Hans Michael Behrens
  11. Christoph Röcken
  12. Sebastian Foersch
  13. Daniel Truhn
  14. Antonio Marra
  15. Oliver Lester Saldanha
  16. Jakob Nikolas Kather

Summary

Benchmarking Foundation Models as Feature Extractors for Weakly-Supervised Computational Pathology

The paper "Benchmarking Foundation Models as Feature Extractors for Weakly-Supervised Computational Pathology" undertakes a comprehensive evaluation of 19 histopathology foundation models across extensive and diverse datasets. This work assesses the robustness and performance of these models on clinically relevant tasks in digital pathology (DP).

Experimental Setup

The authors benchmarked 19 foundation models across various histopathology tasks using 13 patient cohorts encompassing 6,818 patients and 9,528 whole-slide images (WSIs) from lung, colorectal, gastric, and breast cancers. The models were tested on weakly-supervised tasks related to biomarkers, morphological properties, and prognostic outcomes. This benchmarking effort is the most extensive to date and mitigates potential bias from data leakage by ensuring that the test datasets were not part of any model's pretraining data.
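
The core recipe behind these experiments is to use each foundation model as a frozen tile-level feature extractor and train only a lightweight slide-level aggregator on the weak labels. The paper's exact aggregator is not restated in this summary; a minimal attention-based multiple-instance-learning head in the style of Ilse et al. (2018), a common choice for such pipelines, could look like the following sketch (PyTorch; all dimensions and features are placeholders):

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Slide-level head over a bag of frozen tile embeddings."""
    def __init__(self, feat_dim: int, hidden_dim: int = 256, n_classes: int = 2):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),  # one attention score per tile
        )
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, tile_feats: torch.Tensor):
        # tile_feats: (n_tiles, feat_dim) embeddings of all tiles from one slide
        scores = self.attention(tile_feats)             # (n_tiles, 1)
        weights = torch.softmax(scores, dim=0)          # attention over the bag
        slide_feat = (weights * tile_feats).sum(dim=0)  # weighted average pooling
        return self.classifier(slide_feat), weights

# Embeddings would come from a frozen foundation model (CONCH, Virchow2, ...)
# applied to tessellated WSI tiles; random features stand in for them here.
feats = torch.randn(1000, 512)         # 1000 tiles x 512-dim embeddings
mil_head = AttentionMIL(feat_dim=512)  # only this small head is trained
logits, attn = mil_head(feats)
print(logits.shape, attn.shape)        # torch.Size([2]) torch.Size([1000, 1])
```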

Key Findings

Model Performance

  • Vision-Language Models vs Vision-Only Models: The vision-language model CONCH outperformed vision-only models across several tasks, yielding the highest Area Under the Receiver Operating Characteristic curve (AUROC) in 42% of tasks and leading in the morphology, biomarker, and prognostication categories.
  • Ensemble Models: An ensemble of complementary foundation models outperformed any single model, beating CONCH in 66% of tasks. This highlights the potential of fusing different foundation models to surpass the current state of the art (a minimal fusion sketch follows this list).
  • Data Diversity vs Data Quantity: The paper found that data diversity outweighs data quantity for model performance: models pretrained on smaller but more diverse datasets performed better on downstream tasks than those pretrained on larger but less diverse datasets.
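
The fusion rule behind these ensembles is not restated in this summary. A simple late-fusion baseline, averaging the slide-level probabilities of two downstream classifiers (for example, one trained on CONCH features and one on Virchow2 features, per the abstract), might look like this hypothetical sketch:

```python
import numpy as np

def ensemble_probs(p_conch: np.ndarray, p_virchow2: np.ndarray) -> np.ndarray:
    """Average per-slide positive-class probabilities from two models."""
    assert p_conch.shape == p_virchow2.shape
    return (p_conch + p_virchow2) / 2.0

# Toy example: probabilities for five slides from each downstream model.
p_a = np.array([0.91, 0.40, 0.75, 0.22, 0.58])
p_b = np.array([0.85, 0.55, 0.60, 0.30, 0.70])
print(ensemble_probs(p_a, p_b))  # [0.88 0.475 0.675 0.26 0.64]
```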

Numerical Results

  • Performance Metrics: CONCH achieved an AUROC of 0.71 averaged across all tasks, followed by UNI, H-optimus-0, and Prov-GigaPath with AUROCs of 0.69. By task domain, the vision-language model outperformed the vision-only models, scoring an AUROC of 0.77 for morphology, 0.73 for biomarkers, and 0.62 for prognostication.
  • Task Difficulty: In high-performance tasks, where at least one model achieved a mean AUROC above 0.7 with a standard deviation below 0.05, three models (CONCH, UNI, H-optimus-0) averaged an AUROC of 0.79. On low-performance tasks, CONCH performed best on eight of 15 tasks, with an average AUROC of 0.63 (a sketch of this selection criterion follows this list).
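
To make the high-performance criterion above concrete, here is a hedged sketch of computing a model's mean AUROC and standard deviation over repeated runs and applying the mean > 0.7, SD < 0.05 threshold. The data is synthetic and the protocol is an assumption, not the paper's exact code:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_auroc(y_true_runs, y_score_runs):
    """Mean and SD of AUROC over repeated runs of one model on one task."""
    aurocs = [roc_auc_score(y, s) for y, s in zip(y_true_runs, y_score_runs)]
    return float(np.mean(aurocs)), float(np.std(aurocs))

rng = np.random.default_rng(0)
y_runs = [rng.integers(0, 2, 200) for _ in range(5)]        # binary labels
s_runs = [y + rng.normal(0, 0.8, 200) for y in y_runs]      # noisy scores

mean, sd = mean_auroc(y_runs, s_runs)
is_high_performance = mean > 0.7 and sd < 0.05
print(f"AUROC {mean:.2f} ± {sd:.2f}, high-performance task: {is_high_performance}")
```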

Model Interpretability

  • The paper includes an attention-heatmap analysis to compare model behaviors. It reveals that models focus on different tissue regions, suggesting they prioritize different morphological features. These differences in attention further support ensemble approaches, which integrate the models' varied perspectives into a more accurate overall prediction (a sketch of assembling such a heatmap follows).
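
The paper's heatmap method is not detailed in this summary; one common way to build such a visualization is to scatter each tile's MIL attention weight back onto its grid position in the slide, as in this illustrative sketch (coordinates and weights are placeholders):

```python
import numpy as np

def attention_heatmap(coords, weights, grid_shape):
    """coords: (n_tiles, 2) tile grid positions; weights: (n_tiles,) attention."""
    heat = np.zeros(grid_shape)
    w = (weights - weights.min()) / (np.ptp(weights) + 1e-8)  # scale to [0, 1]
    for (gx, gy), wi in zip(coords, w):
        heat[gy, gx] = wi  # later tiles overwrite on collision; fine for a sketch
    return heat

rng = np.random.default_rng(1)
coords = rng.integers(0, 20, size=(50, 2))  # 50 tiles on a 20x20 grid
weights = rng.random(50)                    # attention weights from the MIL head
heat = attention_heatmap(coords, weights, (20, 20))
print(heat.shape, float(heat.max()))        # (20, 20) 1.0
```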

Implications and Future Directions

Practical Implications

The paper's findings suggest that ensemble models could substantially improve predictive performance in clinical DP settings. Additionally, the strong showing of vision-language models like CONCH underscores the value of multimodal pretraining for tasks traditionally approached with unimodal data such as WSIs.

Theoretical Implications

The superiority of data diversity over sheer volume has significant implications for the design of future foundation models in computational pathology. Training on diverse datasets enhances model robustness and generalizability, potentially leading to better real-world applicability in varied clinical scenarios.

Future Developments

Future research should explore advanced ensemble techniques and dimensionality reduction methods to optimize model combinations without the drawbacks of an inflated feature space. Expanding the benchmark to more cancer types and to histopathology tasks beyond classification, such as regression or segmentation, would further validate the generalizability of these models.
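
As a hedged illustration of that fusion-plus-reduction idea, one could concatenate the embeddings of two foundation models and PCA-reduce the result; the dimensions and data here are illustrative, not the paper's:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
feats_a = rng.normal(size=(500, 512))   # e.g. CONCH tile embeddings
feats_b = rng.normal(size=(500, 1280))  # e.g. Virchow2 tile embeddings

# Early fusion: concatenate per-tile features, then shrink the joint space.
fused = np.concatenate([feats_a, feats_b], axis=1)  # (500, 1792)
reduced = PCA(n_components=256).fit_transform(fused)
print(reduced.shape)  # (500, 256)
```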

Conclusion

This paper provides a meticulous benchmark of histopathology foundation models in DP, elucidating the strengths of vision-language models, the efficacy of ensemble approaches, and the critical role of data diversity. These findings offer valuable insights for the future development and deployment of AI-driven diagnostic tools in clinical pathology.
