From Modern CNNs to Vision Transformers: Assessing the Performance, Robustness, and Classification Strategies of Deep Learning Models in Histopathology (2204.05044v2)

Published 11 Apr 2022 in eess.IV, cs.LG, and stat.ML

Abstract: While machine learning is currently transforming the field of histopathology, the domain lacks a comprehensive evaluation of state-of-the-art models based on essential but complementary quality requirements beyond a mere classification accuracy. In order to fill this gap, we developed a new methodology to extensively evaluate a wide range of classification models, including recent vision transformers, and convolutional neural networks such as: ConvNeXt, ResNet (BiT), Inception, ViT and Swin transformer, with and without supervised or self-supervised pretraining. We thoroughly tested the models on five widely used histopathology datasets containing whole slide images of breast, gastric, and colorectal cancer and developed a novel approach using an image-to-image translation model to assess the robustness of a cancer classification model against stain variations. Further, we extended existing interpretability methods to previously unstudied models and systematically reveal insights of the models' classification strategies that can be transferred to future model architectures.

Citations (29)

Summary

  • The paper evaluates modern CNN and Vision Transformer models using a comprehensive framework assessing predictive performance, robustness against stain variations, and interpretability in histopathology.
  • Complex pretrained models like ConvNeXt-L show high performance, but transfer learning and self-supervised pretraining are critical, especially for transformer variants and with limited data.
  • Models primarily focus on nuclei but show significant performance degradation under cross-distribution and cross-stain conditions, indicating vulnerability to realistic domain shifts.

The paper presents a comprehensive evaluation framework for deep learning architectures in histopathology, comparing modern convolutional neural networks (CNNs) and vision transformer (ViT)-based models with respect to predictive performance, robustness against staining variations, and interpretability. It benchmarks a diverse set of models, including ResNet variants (BiT), ConvNeXt (tiny and large), Inception V3, ViT (tiny and large), Swin transformers, BoTNet50, and a hybrid model (GasHis, combining Inception V3 and BoTNet50), across five publicly available datasets covering breast, gastric, and colorectal cancers.
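
As a rough illustration of this benchmark setup, the sketch below shows how a comparable model zoo could be instantiated with the timm library. The timm identifiers, the binary tumor/non-tumor head, and the specific pretrained checkpoints are assumptions for illustration and need not match the exact models used by the authors.

```python
# Minimal sketch (not the authors' code) of building a comparable model zoo with timm.
import timm
import torch

MODEL_IDS = {
    "ResNet-50 (BiT)": "resnetv2_50x1_bitm",
    "ConvNeXt-T":      "convnext_tiny",
    "ConvNeXt-L":      "convnext_large",
    "Inception V3":    "inception_v3",
    "ViT-T":           "vit_tiny_patch16_224",
    "ViT-L":           "vit_large_patch16_224",
    "Swin-T":          "swin_tiny_patch4_window7_224",
}

def build_classifier(name: str, pretrained: bool = True, num_classes: int = 2) -> torch.nn.Module:
    """Binary tumor/non-tumor classifier on top of an (optionally pretrained) backbone."""
    return timm.create_model(MODEL_IDS[name], pretrained=pretrained, num_classes=num_classes)

model = build_classifier("ConvNeXt-L")                  # transfer learning from ImageNet weights
baseline = build_classifier("ViT-T", pretrained=False)  # same architecture trained from scratch
```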

The evaluation framework is structured around three principal axes:

  • Predictive Performance:
    • Models are trained on datasets such as PCam, BreaKHis (×40), IDC, GasHisSDB, and MHIST with rigorous model selection based on the area under the receiver operating characteristic curve (AUC) and accuracy.
    • Extensive experiments using k = 5 independent runs per model quantify both algorithmic (training-process) and statistical uncertainties (via bootstrapping); a minimal sketch of these uncertainty estimates appears after this list.
    • On the PCam dataset, for instance, ConvNeXt-L achieved an accuracy of 90.31% with an AUC of 0.9722, while Inception V3 and the GasHis hybrid were also consistently competitive.
    • A notable set of experiments contrasted models initialized with pretrained weights from large-scale datasets (e.g., ImageNet-21k/22k) against models trained from scratch. Complex architectures, especially transformer variants and ConvNeXt, showed a substantial performance drop when training data were limited, underscoring the importance of transfer learning for histopathology tasks.
    • The paper also incorporates self-supervised pretraining on histopathology data, reporting significant performance gains for models such as RetCCL (ResNet‑50) and a self-supervised ResNet‑18, though these improvements appear to be dataset-dependent.
  • Robustness Evaluation:
    • To decouple the effect of staining variations from other distributional shifts, a cycle-consistent adversarial network (CycleGAN) is employed to perform unpaired image-to-image translation between datasets (e.g., recoloring IDC images to mimic BreaKHis staining and vice versa).
    • The robustness tests are organized into four conditions, sketched in code after this list:
      1. In-distribution, in-stain: standard test set evaluation with the original stain.
      2. In-distribution, recolored: test images are recolored using the CycleGAN while remaining in-distribution.
      3. Cross-distribution, cross-stain: test images originate from a different lab, reflecting both distribution and staining differences.
      4. Cross-distribution, recolored: cross-distribution images are recolored to mimic the staining of the training set.
    • Results indicate that while models perform best on in-distribution, in-stain images, even the recolored but in-distribution condition shows a performance drop of 1.8% to 5.7%, and up to 18.8% depending on the source and target of the translation. Cross-distribution evaluations result in AUC decreases on the order of 20–40%, highlighting the vulnerability of these models to domain shifts beyond mere global hue transformations.
    • The CycleGAN generators, despite successfully shifting the mean and variance of the hue distributions, exhibit limitations in fully reproducing the target stain distribution, a factor that is directly linked to varying degrees of performance degradation.
  • Interpretability Analysis:
    • To elucidate the decision-making processes of these deep models, the paper applies layer-wise relevance propagation (LRP) to generate attribution heatmaps. These maps are quantitatively compared with segmentation masks delineating cell nuclei (obtained from an nnU-Net trained on the MoNuSeg dataset) as well as tissue and background regions.
    • The quantitative metrics include mass accuracy (the fraction of total relevance overlapping with a segmentation mask) and the Pearson correlation between the relevance map and the segmentation map; both are sketched in code after this list. Across all models, a significant fraction of the attributed relevance is localized to cell nuclei, which aligns with the clinical emphasis on nuclear morphology in cancer diagnosis.
    • Although models like Inception V3 and GasHis exhibit slightly stronger Pearson correlations with nuclei segments, even the best-performing ConvNeXt-L does not always yield the highest mass accuracy. This suggests that high predictive performance may be achieved through different integration strategies of local nuclear features.
    • For attention-based architectures such as ViT, the paper further analyzes attention maps from the classification token. When rearranged spatially (as sketched below), these maps reveal that many attention heads concentrate strongly on the nuclei, with a subset also attending to background features. A per-head analysis (over the 16 heads of ViT-L and the 3 heads of ViT-T) shows varying degrees of correlation with nuclei, tissue, and background, offering insight into the functional specialization of individual attention heads.
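
A minimal sketch of the two uncertainty estimates mentioned under predictive performance: the run-to-run (algorithmic) spread of the AUC across k independently trained models, and a bootstrap (statistical) estimate on a single test set. The exact resampling scheme used in the paper may differ.

```python
# Hedged sketch of run-to-run and bootstrap AUC uncertainty; not the authors' code.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc(y_true, y_score, n_boot=1000, seed=0):
    """Statistical uncertainty: resample one test set with replacement."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if np.unique(y_true[idx]).size < 2:  # AUC undefined for single-class resamples
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return float(np.mean(aucs)), float(np.std(aucs))

def run_to_run_auc(y_true, scores_per_run):
    """Algorithmic uncertainty: AUC spread across k = 5 independently trained models."""
    aucs = [roc_auc_score(np.asarray(y_true), np.asarray(s)) for s in scores_per_run]
    return float(np.mean(aucs)), float(np.std(aucs))
```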
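
The four-condition robustness protocol can be summarized as the evaluation loop below. The data loaders and the two `to_*_stain` generators are hypothetical placeholders; the latter stand in for the trained CycleGAN translators between dataset pairs (e.g., IDC and BreaKHis) described above.

```python
# Hedged sketch of the four-condition stain-robustness evaluation; placeholders throughout.
import torch
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def auc_on(model, loader, recolor=None, device="cuda"):
    model.eval().to(device)
    ys, ps = [], []
    for x, y in loader:
        x = x.to(device)
        if recolor is not None:
            x = recolor(x)  # apply the stain-translation generator to each batch
        ps.append(model(x).softmax(dim=1)[:, 1].cpu())
        ys.append(y)
    return roc_auc_score(torch.cat(ys).numpy(), torch.cat(ps).numpy())

def robustness_report(model, in_dist_loader, cross_dist_loader,
                      to_foreign_stain, to_training_stain):
    return {
        "in-dist / in-stain":       auc_on(model, in_dist_loader),
        "in-dist / recolored":      auc_on(model, in_dist_loader, recolor=to_foreign_stain),
        "cross-dist / cross-stain": auc_on(model, cross_dist_loader),
        "cross-dist / recolored":   auc_on(model, cross_dist_loader, recolor=to_training_stain),
    }
```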
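
The two interpretability metrics can be computed as in the following sketch, assuming a per-image relevance heatmap (e.g., produced with an LRP implementation such as the zennit library, not shown here) and a binary nuclei segmentation mask of the same size.

```python
# Hedged sketch of mass accuracy and relevance-mask correlation; assumptions noted above.
import numpy as np
from scipy.stats import pearsonr

def mass_accuracy(relevance: np.ndarray, mask: np.ndarray) -> float:
    """Share of positive relevance attributed to pixels inside the segmentation mask."""
    r = np.clip(relevance, 0, None)
    total = r.sum()
    return float(r[mask.astype(bool)].sum() / total) if total > 0 else 0.0

def relevance_mask_correlation(relevance: np.ndarray, mask: np.ndarray) -> float:
    """Pearson correlation between the relevance heatmap and the binary mask."""
    r, _ = pearsonr(relevance.ravel(), mask.astype(float).ravel())
    return float(r)
```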
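
Finally, a sketch of how the classification-token attention of a ViT can be rearranged into per-head spatial maps. The tensor layout (CLS token at index 0, square patch grid) matches standard ViT implementations but is an assumption, not the authors' extraction code.

```python
# Hedged sketch: per-head CLS-token attention maps for a standard ViT.
import torch

def cls_attention_maps(attn_weights: torch.Tensor, grid: int) -> torch.Tensor:
    """
    attn_weights: (batch, heads, 1 + grid*grid, 1 + grid*grid) softmaxed attention
                  of one transformer block, with the CLS token at index 0.
    Returns maps of shape (batch, heads, grid, grid) showing where the CLS token attends.
    """
    cls_to_patches = attn_weights[:, :, 0, 1:]  # drop the CLS token's attention to itself
    return cls_to_patches.reshape(*cls_to_patches.shape[:2], grid, grid)

# Example: a ViT-L block has 16 heads on a 14x14 patch grid for 224x224 inputs.
dummy = torch.softmax(torch.randn(1, 16, 1 + 14 * 14, 1 + 14 * 14), dim=-1)
maps = cls_attention_maps(dummy, grid=14)  # -> shape (1, 16, 14, 14)
```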

In summary, the paper establishes that:

  • Model Architecture Selection:
    • Lightweight models such as Inception V3 (which perform well even when trained from scratch) and more complex, pretrained models such as ConvNeXt-L (which deliver further performance gains) are recommended for histopathology image analysis.
  • Importance of Pretraining and Data Augmentation:
    • Transfer learning and self-supervised pretraining are critical for achieving high performance, particularly for models with intrinsically low inductive biases (e.g., transformer architectures). Traditional global color transformations used during augmentation (an example follows this list) may not suffice to counteract the effects of realistic stain variations.
  • Need for Enhanced Robustness Strategies:
    • The large performance degradation observed under cross-distribution and cross-stain conditions indicates that current architectures, despite excellent in-distribution performance, remain fragile when faced with realistic domain shifts. This underlines the necessity for further research in domain adaptation and robust learning in histopathology.
  • Interpretability in Clinical Context:
    • While all models predominantly focus on nuclei, there are subtle yet measurable differences in the allocation of relevance that may hint at distinct classification strategies. This observation opens avenues for model ensemble formation and more fine-grained analyses aimed at achieving both accuracy and interpretability.
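
For concreteness, the snippet below shows the kind of global color augmentation referred to above: a uniform jitter of brightness, contrast, saturation, and hue over the whole patch (here via torchvision, as an illustrative assumption), which the paper argues is not enough on its own to bridge realistic stain shifts.

```python
# Illustrative global color augmentation; parameter values are arbitrary examples.
from torchvision import transforms

global_color_aug = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    transforms.ToTensor(),
])
```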

Overall, the work provides an extensive methodological framework that incorporates both quantitative performance evaluation and interpretability assessment, serving as a foundation for future research on robust and explainable deep learning applications in computational histopathology.