ProbMed: Probing Evaluation in Medical Diagnosis
- ProbMed is a rigorous evaluation framework that challenges automated diagnostic models with adversarial examples and multi-step reasoning tasks to expose hidden errors.
- It incorporates adversarial question pairs and detailed procedural diagnostics that simulate real clinical tasks such as modality recognition, organ identification, and abnormality detection.
- Empirical results show that even top-performing multimodal models can drop to 35–40% accuracy or lower on fine-grained diagnostic tasks when subjected to ProbMed testing.
Probing Evaluation for Medical Diagnosis (ProbMed) is a rigorous methodology for assessing the performance, robustness, and diagnostic reliability of automated models, particularly large multimodal models, in medical imaging and clinical decision tasks. Unlike conventional benchmark testing, probing evaluation deliberately exposes model brittleness using adversarial constructs, multi-step diagnostic workflows, and dimension-specific metrics that simulate real clinical challenges. This approach isolates diagnostic specificity and exposes the limitations of even top-performing systems on fine-grained, multi-stage medical reasoning.
1. Probing Evaluation Methodology
Probing evaluation systematically interrogates models beyond routine accuracy metrics by introducing adversarial, paired questions and requiring models to distinguish genuine findings from hallucinated (non-existent) features. For each ground-truth (“yes”) diagnostic question (e.g., “Is there evidence of abnormality X in region R?”), an adversarial negated counterpart is generated that references an attribute or abnormality not present in the image. Models are thus directly tested on their ability to reject false positives and recognize fine-grained attributes, a setting that uncovers overconfidence and hallucination errors.
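As an illustration, this paired scoring rule can be expressed in a few lines of Python; the `answer_yes_no` interface and the question texts below are hypothetical placeholders, not the benchmark's actual implementation:

```python
# Minimal sketch of paired adversarial scoring (hypothetical interface,
# not the authors' code). A model is credited for an attribute only if
# it accepts the ground-truth question AND rejects the negated one.

def score_pair(answer_yes_no, positive_q: str, adversarial_q: str) -> bool:
    """Return True only if the model says 'yes' to the real finding
    and 'no' to the hallucinated counterpart."""
    accepts_truth = answer_yes_no(positive_q)        # expected: True
    rejects_fake = not answer_yes_no(adversarial_q)  # expected: True
    return accepts_truth and rejects_fake

# A toy model that always answers "yes" (a sycophantic baseline)
# passes the positive question but fails the pair as a whole.
always_yes = lambda q: True
print(score_pair(always_yes,
                 "Is there evidence of cardiomegaly in this chest X-ray?",
                 "Is there evidence of a femoral fracture in this chest X-ray?"))
# -> False
```

Under this rule, overconfident agreement is penalized directly: a model that cannot reject the hallucinated counterpart earns no credit for the attribute.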
The methodology is not limited to isolated question-answering but includes procedural diagnostics, which require the model to reason correctly across sequential diagnostic dimensions—such as first recognizing image modality and organ before detecting abnormalities and identifying findings with precise anatomical localization. This approach systematically quantifies not only task-level performance but also error propagation in diagnostic reasoning, mirroring clinical workflows (Yan et al., 30 May 2024).
2. Description and Construction of the ProbMed Dataset
The ProbMed dataset consists of 6,303 medical images from existing benchmarks (MedICaT, ChestX-ray14), encompassing X-ray, CT, and MRI modalities and multiple organs (abdomen, brain, chest, spine). For each image, detailed metadata is auto-curated using GPT-4 and a dedicated positional reasoning engine, cataloging condition names, abnormalities, and position descriptors. Using this structured metadata, 57,132 question–answer pairs are generated: each ground-truth question (positive instance) is paired with a corresponding adversarial (negated) question constructed by selecting alternate organs, modalities, or hallucinated attributes.
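The pairing scheme can be sketched as follows; the metadata fields and organ vocabulary are illustrative stand-ins for the GPT-4-curated records, not the actual ProbMed generation pipeline:

```python
# Illustrative construction of an adversarial (negated) question from
# curated metadata, following the pairing scheme described above.
import random

ORGAN_VOCAB = {"abdomen", "brain", "chest", "spine"}

def make_adversarial_organ_question(record: dict, rng: random.Random) -> tuple[str, str]:
    """Pair a ground-truth organ question with one naming an absent organ."""
    true_organ = record["organ"]
    fake_organ = rng.choice(sorted(ORGAN_VOCAB - {true_organ}))
    positive = f"Is this an image of the {true_organ}?"     # answer: yes
    adversarial = f"Is this an image of the {fake_organ}?"  # answer: no
    return positive, adversarial

rng = random.Random(0)
record = {"image_id": "cxr_0001", "modality": "X-ray", "organ": "chest"}
print(make_adversarial_organ_question(record, rng))
```

The same swap-an-absent-attribute pattern extends to modalities, conditions, and position descriptors, yielding the negated counterpart for every positive question.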
Questions are organized along five diagnostic dimensions:
| Dimension | Example Task | Evaluation Focus |
|---|---|---|
| Modality Recognition | Distinguish between X-ray and CT | Basic comprehension |
| Organ Identification | Identify organ imaged (e.g., chest vs. brain) | Anatomical grounding |
| Abnormality Detection | Detect presence of a specific abnormality | Fine-grained clinical features |
| Condition/Finding | Assign clinical finding/label | Diagnostic specificity |
| Positional Reasoning | Localize abnormality within the organ | Spatial accuracy |
This structure enables axis-specific analysis of both general and highly specialized reasoning required for clinically valid diagnosis (Yan et al., 30 May 2024).
3. Impact and Results of Probing Evaluation on State-of-the-Art Models
Empirical results on ProbMed demonstrate a stark contrast between conventional benchmark accuracy and performance under probing evaluation. State-of-the-art large multimodal models (LMMs) such as GPT-4o, GPT-4V, and Gemini Pro achieve >90% accuracy on general recognition tasks (modality and organ identification) yet perform worse than random guessing on specialized diagnostic questions when challenged with adversarial pairs. For instance, their accuracy on fine-grained condition/finding identification or positional reasoning can drop to 35–40% or lower, a limitation camouflaged on typical leaderboards. This degradation reflects a pronounced tendency to hallucinate (agreeing with false or irrelevant features) and to confuse visual-semantic cues under adversarial perturbation.
Error analyses further show that a substantial share of mistakes stems from failures to reject hallucinated conditions or positions. For example, a model may accurately recognize the imaging modality yet accept an adversarial attribute (e.g., reporting a nonexistent pulmonary mass), illustrating poor clinical reliability in high-stakes applications (Yan et al., 30 May 2024).
Procedural diagnosis evaluation compounds this effect: models must correctly answer a sequence of dependent diagnostic queries per image. Failure at any stage (e.g., mislocalizing an abnormality after correctly identifying modality and organ) leads to an overall task failure, exposing frailty in step-wise reasoning chains.
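A minimal sketch of this all-or-nothing chained scoring, assuming a simple per-stage correctness record (the stage names follow the five dimensions above; the interface is illustrative, not the benchmark's actual API):

```python
# Sketch of procedural (chained) evaluation: the image counts as solved
# only if every dependent stage is answered correctly, in order.

STAGES = ["modality", "organ", "abnormality", "condition", "position"]

def procedural_correct(model_answers: dict[str, bool]) -> bool:
    """All-or-nothing credit: a miss at any stage fails the image."""
    for stage in STAGES:
        if not model_answers.get(stage, False):
            return False  # error propagates; later stages cannot recover
    return True

# A model that identifies modality and organ but mislocalizes the
# abnormality still receives zero credit for the image.
print(procedural_correct({"modality": True, "organ": True,
                          "abnormality": True, "condition": True,
                          "position": False}))  # -> False
```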
4. Methodological Significance and Technical Formulation
The core significance of probing evaluation is its ability to rigorously differentiate models that appear competitive on surface metrics but lack sufficient diagnostic granularity. The approach formalizes accuracy by strict per-image, dimension-specific aggregation: for an image set $\mathcal{I}$, with $\mathcal{D}_i$ the diagnostic dimensions required for image $i$ and $q^{+}_{i,d}$, $q^{-}_{i,d}$ the ground-truth question and its adversarial negation in dimension $d$,

$$\mathrm{Acc} = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \prod_{d \in \mathcal{D}_i} \mathbb{1}\!\left[\,q^{+}_{i,d} \text{ and } q^{-}_{i,d} \text{ both answered correctly}\,\right].$$
This strict metric penalizes error propagation in multi-step evaluations and enforces robust integration across clinical reasoning stages. Detailed error breakdowns are enabled by the systematic pairing of every diagnostic query with its adversarial negation, making the evaluation sensitive to overfitting, shortcut exploitation, and domain-specific hallucination.
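The strict aggregation above translates directly into code; the following is a sketch under assumed data structures, not the released evaluation script:

```python
# Dataset-level strict accuracy: each image record lists, per dimension,
# whether the ground-truth and adversarial questions were answered
# correctly. One wrong answer anywhere zeroes the image.

def strict_accuracy(results: list[dict[str, tuple[bool, bool]]]) -> float:
    """Fraction of images where every dimension's (positive, adversarial)
    pair is answered correctly."""
    if not results:
        return 0.0
    solved = sum(
        1 for per_dim in results
        if all(pos and adv for pos, adv in per_dim.values())
    )
    return solved / len(results)

results = [
    {"modality": (True, True), "organ": (True, True), "condition": (True, False)},
    {"modality": (True, True), "organ": (True, True), "condition": (True, True)},
]
print(strict_accuracy(results))  # -> 0.5
```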
5. Implications for Domain Knowledge and Model Design
The findings from ProbMed underscore that current LMMs and general-purpose vision-language models, though adept at general visual and textual tasks, do not possess sufficient diagnostic specificity for deployment in critical clinical settings. Specialized domain models such as CheXagent outperform general LMMs in domain-constrained settings (e.g., chest X-ray conditions) and demonstrate some out-of-domain transfer, but even these architectures struggle with fine-grained attributes outside their training distribution.
A central implication is that robust medical diagnosis by automated systems requires:
- Explicit training with adversarial, probing-style diagnostic questions to harden against hallucination.
- Modular, sequential, or agentic reasoning pipelines that mimic clinical workflows and enable error correction at each diagnostic stage (see the sketch after this list).
- Domain- and task-specific models or adapters, as generic LMMs do not generalize reliably to nuanced medical reasoning without such interventions.
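As a concrete illustration of the second point, a stage-gated pipeline can validate each intermediate finding before the chain proceeds, so errors are caught early rather than propagating silently. All component names, checks, and interfaces below are hypothetical assumptions, not a published architecture:

```python
# Hypothetical sketch of a modular, stage-gated diagnostic pipeline.
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Stage:
    name: str
    run: Callable[[dict], str]       # produces this stage's finding
    validate: Callable[[str], bool]  # cheap sanity check / verifier

@dataclass
class PipelineResult:
    findings: dict = field(default_factory=dict)
    failed_stage: Optional[str] = None

def run_pipeline(stages: list[Stage], image_ctx: dict) -> PipelineResult:
    result = PipelineResult()
    for stage in stages:
        finding = stage.run(image_ctx | result.findings)
        if not stage.validate(finding):
            result.failed_stage = stage.name  # stop and flag for review
            return result
        result.findings[stage.name] = finding
    return result

# Toy instantiation: modality -> organ -> abnormality.
stages = [
    Stage("modality", lambda ctx: "X-ray", lambda f: f in {"X-ray", "CT", "MRI"}),
    Stage("organ", lambda ctx: "chest", lambda f: f in {"abdomen", "brain", "chest", "spine"}),
    Stage("abnormality", lambda ctx: "cardiomegaly", lambda f: bool(f)),
]
print(run_pipeline(stages, {"image_id": "cxr_0001"}).findings)
```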
6. Future Directions Identified by Probing Paradigms
The adoption of probing evaluation frameworks like ProbMed directs future research toward:
- Improved training regimens that incorporate adversarial examples and multi-step procedural reasoning.
- Integration of domain-specific toolchains as seen in agentic workflows (e.g., segmentation and quantitative assessment in ophthalmology or radiology pipelines) (Wang et al., 21 Mar 2025).
- Systematic preference-based or expert-in-the-loop evaluations as additional layers of assurance for clinical adoption (Ruan et al., 7 Dec 2024).
- Expansion of benchmarking beyond binary question answering into open-ended clinical report generation and case-based reasoning, aligned with real patient journeys.
This suggests that reliable AI medical diagnosis will depend not only on raw accuracy but also on explicit robustness to probing, adversarial challenge, and multi-stage reasoning, the hallmarks captured by the ProbMed evaluation paradigm.
7. Summary Table: Performance Drop Under Probing Evaluation
| Model | Modality/Organ Recognition | Specialized Diagnosis (ProbMed) | Accuracy Drop with Probing |
|---|---|---|---|
| GPT-4V | >90% | 35–40% or lower | ~42–45% |
| Gemini Pro | >90% | 35–40% or lower | ~42–45% |
| CheXagent | High for chest X-ray | Robust in-domain; drops out-of-domain | Moderate |
| LLaVA-Med | Moderate | Struggles with general tasks | High |
The table illustrates that apparent proficiency on general benchmarks does not equate to clinically sufficient reliability under stringent, adversarial, or procedural challenge (Yan et al., 30 May 2024; Ruan et al., 7 Dec 2024). The design and adoption of probing evaluation therefore mark a pivotal step toward truly dependable AI systems in medicine.