Zero-Shot Disease Classification
- Zero-shot disease classification is a clinical AI method that diagnoses diseases without prior labeled examples by leveraging semantic and multimodal representations.
- It utilizes dual encoder architectures, prompt engineering, and generative feature synthesis to bridge gaps between seen and unseen disease categories.
- Evaluation metrics like AUROC and F1 assess performance while addressing challenges such as class imbalance, calibration, and domain shifts in medical imaging.
Zero-shot disease classification is a subfield of clinical artificial intelligence in which a system identifies, classifies, or diagnoses diseases for which it has never seen labeled training examples. Addressing this problem is critical for medical imaging, electronic health records, and other biomedical applications, where the label space is vast and real-world disease distributions are highly long-tailed. Traditional supervised models are inherently limited in their generalization: they can only predict classes present in the training data, whereas zero-shot learning (ZSL) approaches leverage auxiliary semantic knowledge, modality alignment, or generative mechanisms to enable inference on previously unseen disease categories.
1. Conceptual Foundations and Task Setting
Zero-shot disease classification seeks to bridge the gap between supervised deep learning and the open-world challenge of unknown or rare disease entities. Given an input (e.g., a chest X-ray, pathology WSI, clinical text), a zero-shot classifier is required to predict the most relevant disease label(s) from an expanded label set, which includes both "seen" classes (those present in training) and "unseen" classes (those reserved for zero-shot evaluation) (Hayat et al., 2021, Lin et al., 9 Jun 2025).
Key ingredients enabling zero-shot performance include semantic representations of disease classes (e.g., vector embeddings derived from medical text corpora by BioBERT, GPT-4, or curated clinical attributes), architectural designs for cross-modality alignment (e.g., visual features with semantic features), and learning objectives that encourage a model to generalize beyond the closed set. Formal evaluation typically distinguishes between "conventional ZSL" (test only on unseen classes) and the more clinically realistic "generalized ZSL" (test on both seen and unseen with separation quantified by harmonic mean metrics).
2. Architectures and Methodological Innovations
A variety of architectures underpin state-of-the-art zero-shot disease classification systems across modalities.
Vision-Language Alignment: The dominant paradigm for image-based ZSL uses dual encoders to map images and disease descriptions into a shared latent space, with predictions made by measuring similarity (e.g., cosine) between visual and textual embeddings. Notable instantiations include:
- CXR-ML-GZSL maps DenseNet-121 visual features and BioBERT disease embeddings into a 128-dimensional joint space, training with ranking, alignment, and semantic-consistency losses (Hayat et al., 2021).
- CLIP-based frameworks employ large-scale pre-trained transformer vision and language backbones together with zero-shot prompt engineering to match images to prompt-encoded diseases (Benabbas et al., 24 Nov 2025, Liu et al., 2023).
- CARZero enhances image-text alignment by replacing pooled vector similarity with cross-attention between local image patches and word/sentence-level text features, improving fine-grained matching for rare diseases (Lai et al., 2024).
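The scoring step shared by these dual-encoder systems can be sketched as follows. This is a minimal NumPy illustration assuming image and text embeddings have already been produced by some encoders; the random vectors below stand in for real encoder outputs (e.g., a DenseNet/ViT image tower and a BioBERT/CLIP text tower):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize embeddings so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_scores(image_emb, class_text_embs, temperature=0.07):
    """Score one image against disease-label text embeddings.

    image_emb:       (d,) visual embedding from the image encoder
    class_text_embs: (C, d) text embeddings, one per disease prompt
    Returns a (C,) probability vector over the candidate labels.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(class_text_embs)
    logits = txt @ img / temperature      # cosine similarity, temperature-scaled
    logits -= logits.max()                # numerical stability for softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

# Toy usage with random stand-in embeddings.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=128)
class_text_embs = rng.normal(size=(5, 128))   # 5 candidate diseases
probs = zero_shot_scores(image_emb, class_text_embs)
```

Because the label set enters only through `class_text_embs`, unseen diseases are added at inference time simply by encoding new prompts, with no retraining.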
Prompt Engineering: Effective performance depends on constructing semantically rich, clinically relevant prompts for disease labels. Empirical studies demonstrate that LLM-generated or human-curated prompts that encapsulate specific visual/semantic features outperform naive class names, especially for rare diseases or in long-tailed distributions (Liu et al., 2023, Lin et al., 9 Jun 2025).
Patch-based & Hybrid Models: Histopathology and volumetric imaging modalities require multi-resolution and multi-instance approaches (e.g., hybrid fusion of global and local features, attention-weighted patch embeddings, montage construction) to capture diagnostic context critical for zero-shot generalization (Rahaman et al., 13 Mar 2025, Uden et al., 2023).
Generative Feature Synthesis: Some frameworks address the inability to observe unseen-class samples by synthesizing latent features conditioned on semantic descriptors, either via conditional Wasserstein GANs with auxiliary losses (attribute consistency, hierarchy, or keyword reconstruction) or adversarial learning (Song et al., 2019, Mahapatra, 2022).
Similarity Retrieval and Clustering: Hybrid systems (e.g., RURA-Net) use Siamese networks for disease similarity retrieval, lesion segmentation with U-Nets, and unsupervised clustering of deep features for pseudo-label assignment, achieving ZSL in the absence of explicit class supervision (Su et al., 26 Feb 2025).
3. Loss Functions, Training, and Latent Space Regularization
Zero-shot generalization is sensitive to the geometry and structure of the learned latent space. Key loss and training constructs include:
- Ranking Losses: Ensure that positive disease classes receive higher scores than negatives in multi-label settings (Hayat et al., 2021).
- Alignment and Consistency Losses: Encourage image embeddings to be closely aligned (cosine) with semantic class prototypes; force projected class embeddings to preserve their relative structure (Hayat et al., 2021, Lin et al., 9 Jun 2025).
- Contrastive Losses (InfoNCE): Jointly maximize similarity between matched image and text pairs while minimizing for mismatched pairs (Benabbas et al., 24 Nov 2025, Uden et al., 2023).
- Class-weighting and Clustering: Gaussian Mixture Model and Student’s t-distribution clustering, followed by triplet loss and class-weighted objectives, considerably improves performance for rare, long-tailed classes (Madhipati et al., 25 Jul 2025).
- Generative/Adversarial Losses: Conditional WGAN-GP, cycle consistency, and keyword-based reconstruction for synthesizing unseen-class representations (Song et al., 2019, Mahapatra, 2022).
Empirically, inclusion of alignment and semantic consistency (beyond vanilla ranking/contrastive objectives) robustly boosts unseen-class recall without sacrificing performance on seen classes (Hayat et al., 2021, Madhipati et al., 25 Jul 2025). Fine-tuning both visual and text encoders or leveraging domain-adaptive pretraining can further improve domain transfer for scarce or emerging diseases (Uden et al., 2023).
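As an illustration of the contrastive objective listed above, here is a minimal NumPy implementation of the symmetric InfoNCE loss over a batch of matched image-report pairs (real frameworks compute this over learned encoder outputs; the arrays below are placeholders):

```python
import numpy as np

def info_nce_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image/text pairs.

    img_embs, txt_embs: (B, d) arrays where row i of each is a matched pair.
    Matched pairs are pulled together; all other pairings in the
    batch act as negatives.
    """
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (B, B) similarity matrix

    def xent_diag(l):
        # Cross-entropy with targets on the diagonal (the correct pairings).
        l = l - l.max(axis=1, keepdims=True)
        logprobs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logprobs))

    # Average the image->text and text->image directions.
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

rng = np.random.default_rng(1)
a = rng.normal(size=(8, 32))
loss_mismatched = info_nce_loss(a, rng.normal(size=(8, 32)))
loss_matched = info_nce_loss(a, a)   # perfectly aligned pairs -> low loss
```

Minimizing this loss is what shapes the shared latent space so that cosine similarity between an image and an arbitrary disease prompt becomes a meaningful classification score.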
4. Evaluation Protocols and Benchmark Results
Evaluation protocols for zero-shot disease classification are built to measure both overall classification accuracy and the model’s ability to maintain sensitivity on rare/unseen disease classes. Commonly reported metrics include:
- Recall@k, Precision@k, F1@k for top-k multi-label tasks (Hayat et al., 2021, Lin et al., 9 Jun 2025).
- Area Under the Receiver Operating Characteristic Curve (AUROC), reported separately for seen, unseen, and as the harmonic mean (Hayat et al., 2021, Lai et al., 2024, Madhipati et al., 25 Jul 2025).
- Macro-Averaged Precision/Recall/F1 for imbalanced, multi-class problems (Benabbas et al., 24 Nov 2025, Madhipati et al., 25 Jul 2025).
- Calibration Errors and robust error analysis on long-tailed benchmarks (Lin et al., 9 Jun 2025).
Experimental results highlight that vision-language ZSL models typically achieve substantial gains over both naive baselines and few-shot approaches on rare classes, although the absolute mAP or AUROC on strictly "zero-shot" test classes remains lower than for seen cases (e.g., CXR-LT 2024 Task 3, mAP of 0.129–0.13 vs long-tail supervised baseline at 0.136) (Lin et al., 9 Jun 2025). Across diverse domains (chest X-ray, fundus, pathology, plant disease, EHRs), techniques that attentionally align or cluster embeddings, or that leverage multi-cue prompt aggregation, consistently achieve state-of-the-art zero-shot classification, especially under domain shift (Rahaman et al., 13 Mar 2025, Benabbas et al., 24 Nov 2025, Liu et al., 2023).
5. Applications, Impact, and Extensibility
Zero-shot disease classification is applicable in multiple healthcare contexts:
- Medical imaging triage and discovery: Rapid adaptation to emergent diseases, rare condition detection, and open-world triage without retraining (Lin et al., 9 Jun 2025, Rahman et al., 2024).
- Computational pathology: Multi-resolution prompt-guided ZSL approaches enable histological subtyping and tumor vs. benign discrimination without manual annotations (Rahaman et al., 13 Mar 2025).
- Electronic health records (EHR) phenotyping: Frameworks such as LLM-based MapReduce pipelines support cohort discovery and rare disease case finding, outperforming hand-crafted rules (Thompson et al., 2023).
- Plant and agricultural diagnostics: CLIP-based ZSL closes the gap between curated datasets and field deployment for plant disease classification under domain shift (Benabbas et al., 24 Nov 2025).
- Clinical text and code assignment: Generalized ZSL generators for ICD or MeSH coding enhance recognition of rare or never-before-annotated diagnosis codes (Song et al., 2019, Lupart et al., 2022).
Extensibility to multi-modal settings (e.g., combining imaging, text reports, and structured labs), as well as refinement for interpretability through attention or saliency mapping, has been demonstrated for clinical robustness and transparency (Liu et al., 2023, Lai et al., 2024).
6. Limitations, Challenges, and Open Research Directions
Despite progress, challenges in zero-shot disease classification persist:
- Extreme class imbalance and calibration: Very low prevalence of some diseases leads to miscalibration and high false-positive rates, especially where positive samples are rare even at test time (Lin et al., 9 Jun 2025, Madhipati et al., 25 Jul 2025).
- Semantic and visual domain shift: Variability in disease descriptions, synonym usage, and differing imaging artifacts can impact robustness; crafting generalized prompts or learning robust mappings remains nontrivial (Lin et al., 9 Jun 2025).
- Interpretability and reliability: Although attention or feature alignment can support explainability, wide adoption in clinical workflows requires further validation and integration with radiologist review or human-in-the-loop systems (Liu et al., 2023, Lai et al., 2024).
- Scalability and annotation efficiency: Many methods (e.g., RURA-Net, generative ZSL) still require segmentation datasets, curated prompts, or keyword mining, which can introduce additional annotation or domain adaptation burdens (Su et al., 26 Feb 2025, Song et al., 2019).
- Generalization to new organs and modalities: Cross-disease transferability has been demonstrated primarily within similar organs (e.g., lung X-rays); robust generalization to other organs, multi-class tasks, or non-imaging data remains a frontier (Rahman et al., 2024).
Recommendations from leading challenges and ablation studies suggest incorporating external knowledge graphs, region-proposal supervision, adaptive per-class decision thresholds, and integration of dense retrieval or chain-of-thought LLM prompting strategies to further elevate ZSL performance.
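One of these recommendations, adaptive per-class decision thresholds, can be sketched as a simple validation-set search. The helper and data below are hypothetical, synthetic illustrations of the idea rather than any cited method:

```python
import numpy as np

def best_f1_threshold(scores, labels, grid=None):
    """Pick a per-class decision threshold that maximizes F1 on validation data.

    scores: (N,) predicted probabilities for one disease class
    labels: (N,) binary ground truth for that class
    Returns (threshold, f1) for the best grid point.
    """
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Synthetic rare-class validation data (~10% prevalence).
rng = np.random.default_rng(2)
labels = (rng.random(200) < 0.1).astype(int)
scores = np.clip(0.3 * labels + rng.random(200) * 0.5, 0, 1)
t, f1 = best_f1_threshold(scores, labels)
```

Tuning the threshold separately per class directly addresses the miscalibration that a single global cutoff causes on long-tailed label distributions.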
7. Representative Frameworks and Comparative Analysis
| Framework | Modality | Core Mechanism | Highlighted Metric | Unseen Class Performance |
|---|---|---|---|---|
| CXR-ML-GZSL (Hayat et al., 2021) | Chest X-ray | Visual-semantic joint latent mapping | AUROC (U=0.66, H=0.72) | +22% AUROC over baseline |
| CARZero (Lai et al., 2024) | Chest X-ray | Cross-attention alignment | AUC on rare diseases (+0.10) | 0.837 on PadChest20 |
| MR-PHE (Rahaman et al., 13 Mar 2025) | Histopathology | Multi-res patch, hybrid fusion, prompt enrichment | Delta F1 = +2–17 pts | Outperforms fully sup. baseline |
| CXR-CML (Madhipati et al., 25 Jul 2025) | Chest X-ray | Weighted contrastive loss, GMM+t clustering | AUC=0.720 (rare) | +0.089 over CheXzero |
| RURA-Net (Su et al., 26 Feb 2025) | Ophthalmic (CFP) | Siamese retrieval, U-Net, clustering | F1=0.83, AUC=0.92 | Beats most few-shot/one-shot |
| Plant CLIP (Benabbas et al., 24 Nov 2025) | Plant leaf | CLIP zero-shot, symptom prompt | Macro F1=66.3% | Robust to field domain shift |
| EHR LLM-RAG (Thompson et al., 2023) | Clinical text (EHR) | Retrieval-Augmented Generation, LLM MapReduce | F1=0.75 (zero-shot PH) | +21% over rule-based |
Each method demonstrates that explicitly leveraging semantic alignment, clustering, domain-adaptive pretraining, or prompt engineering is critical to achieving clinically meaningful zero-shot disease classification performance while highlighting the remaining challenges for open-world diagnostic systems.