- The paper demonstrates that foundation models UNI and CONCH yield strong baseline grading performance, with UNI achieving a kappa of 0.888 on the full PANDA dataset.
- It finds that both models suffer significant performance drops under distribution shifts, highlighting sensitivity to variations in imaging across different medical sites.
- The study emphasizes that even minimal inclusion of diverse training data greatly enhances model generalizability, underlining the need for robust dataset curation in computational pathology.
Evaluating Computational Pathology Foundation Models for Prostate Cancer Grading Under Distribution Shifts
The paper "Evaluating Computational Pathology Foundation Models for Prostate Cancer Grading under Distribution Shifts" examines the robustness of foundation models in computational pathology, focusing specifically on prostate cancer grading. It evaluates two foundation models, UNI and CONCH, as feature extractors under the kinds of distribution shifts encountered in real-world pathology.
Summary of the Problem
Foundation models are designed as versatile feature extractors and have gained attention across different domains, including computational pathology. While these models are trained on extensive datasets, their robustness in practical scenarios, particularly under distribution shifts, remains uncertain. Distribution shifts may occur due to variations in imaging processes across different medical institutions, impacting the reliability of downstream tasks such as cancer grading.
Methodology
Two foundation models were selected for evaluation: UNI, trained on over 100,000 whole-slide images (WSIs), and CONCH, trained on more than 1.1 million image-caption pairs. Both were used as frozen feature extractors for histological grading of prostate cancer on the PANDA dataset. Three ISUP (International Society of Urological Pathology) grade classifiers were trained on the extracted features: ABMIL (attention-based multiple instance learning), a Mean Feature classifier, and kNN. Experiments were conducted under conditions simulating common distribution shifts, primarily shifts in image data due to scanner variation between sites and shifts in label distributions.
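The frozen-extractor setup can be sketched as follows. This is a toy illustration, not the paper's implementation: the random `extract_patch_features` function stands in for the actual UNI/CONCH encoders, and the feature dimension, slide counts, and choice of k are all assumptions for demonstration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-in for a frozen foundation-model encoder: UNI or
# CONCH would map each tissue patch to an embedding; here we emit
# random vectors of an assumed dimension.
def extract_patch_features(n_patches, dim=512):
    return rng.normal(size=(n_patches, dim))

# Each slide is a bag of patch embeddings; the "Mean Feature" baseline
# pools them into a single slide-level vector.
def slide_feature(n_patches=100):
    return extract_patch_features(n_patches).mean(axis=0)

# Toy training set: 40 slides with ISUP grades 0-5 (labels are random).
X_train = np.stack([slide_feature() for _ in range(40)])
y_train = rng.integers(0, 6, size=40)

# kNN over the frozen slide-level features, mirroring the paper's
# simplest classifier (k=5 is an assumption, not taken from the paper).
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
pred = knn.predict(slide_feature()[None, :])
print(pred.shape)  # one predicted ISUP grade per slide
```

The ABMIL variant would replace the mean pooling with a learned attention-weighted pooling over patch embeddings; the frozen-extractor principle is the same.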
Key Findings and Numerical Results
- Performance Relative to Resnet-IN: UNI and CONCH both outperformed the ImageNet-pretrained ResNet baseline (Resnet-IN), reflecting the benefit of pathology-specific pretraining. For instance, on the full PANDA dataset, UNI achieved a kappa score of 0.888 with ABMIL.
- Sensitivity to Distribution Shifts: Despite the large training datasets, both foundation models showed significant performance degradation when tested on data from a different site than they were trained on, indicating sensitivity to distribution shifts.
- Comparison of Models: UNI consistently outperformed CONCH and Resnet-IN, but its robustness had limits: when trained on Radboud data and evaluated on Karolinska data, UNI's kappa with ABMIL dropped to 0.247, highlighting how severe the effect of such shifts can be.
- Impact of Training Data Diversity: Adding even a small amount of diverse training data significantly improved model generalizability, suggesting that diversity in training data is crucial for mitigating distribution shift effects.
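The kappa scores quoted above are quadratically weighted Cohen's kappa, the standard metric for ISUP grading: a prediction is penalized by the squared distance between predicted and true grade, so a one-grade miss costs far less than a five-grade miss. A minimal illustration with scikit-learn, using invented toy labels:

```python
from sklearn.metrics import cohen_kappa_score

# True ISUP grades for six hypothetical slides.
y_true = [0, 1, 2, 3, 4, 5]

# Off-by-one errors are penalized mildly...
near = [1, 2, 3, 4, 5, 4]
# ...while large grade jumps are penalized heavily.
far = [5, 4, 5, 0, 0, 0]

k_near = cohen_kappa_score(y_true, near, weights="quadratic")
k_far = cohen_kappa_score(y_true, far, weights="quadratic")
print(k_near > k_far)  # True
```

This weighting is why a drop from 0.888 to 0.247 signals not just more errors but larger grade disagreements.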
Theoretical and Practical Implications
Practically, the findings expose a critical weakness in the current paradigm of using foundation models in pathology: robustness to real-world variability is not guaranteed by large, varied pretraining datasets alone. More broadly, the paper underscores the need for stronger dataset curation strategies and architectural improvements that enhance model robustness.
Future Directions
Future research may focus on broadening these evaluations to include other emergent foundation models and investigating the impact of more diverse data augmentation techniques. Studies could also explore systematic methods to quantify and incorporate variability, potentially enhancing model generalizability across diverse medical imaging datasets.
Conclusion
This paper provides a comprehensive evaluation of UNI and CONCH models, contributing valuable insights into the robustness of foundation models in computational pathology. Despite demonstrating strong relative performance in baseline comparisons, these models reveal substantial vulnerabilities under distribution shifts, prompting the need for targeted efforts to address these challenges in the field.