ProbMed: Probing Evaluation in Medical Diagnosis
- ProbMed is a rigorous evaluation framework that challenges automated diagnostic models with adversarial examples and multi-step reasoning tasks to expose hidden errors.
- It incorporates adversarial question pairs and detailed procedural diagnostics that simulate real clinical tasks such as modality recognition, organ identification, and abnormality detection.
- Empirical results show that even top-performing multimodal models can drop to 35–40% accuracy or lower on fine-grained diagnostic tasks when subjected to ProbMed testing.
Probing Evaluation for Medical Diagnosis (ProbMed) is a rigorous methodology for assessing the performance, robustness, and diagnostic reliability of automated models, particularly large multimodal models, in medical imaging and clinical decision tasks. Unlike conventional benchmark testing, probing evaluation deliberately exposes model brittleness using adversarial constructs, multi-step diagnostic workflows, and dimension-specific metrics that simulate real clinical challenges. This approach isolates diagnostic specificity and exposes the limitations of even top-performing systems on fine-grained, multi-stage medical reasoning.
1. Probing Evaluation Methodology
Probing evaluation systematically interrogates models beyond routine accuracy metrics by introducing adversarial, paired questions and requiring models to distinguish genuine findings from hallucinated (non-existent) features. For each ground-truth (“yes”) diagnostic question (e.g., “Is there evidence of abnormality X in region R?”), an adversarial negated counterpart is generated that references an attribute or abnormality not present in the image. Models are thus directly tested on their ability to reject false positives and recognize fine-grained attributes, a setting that uncovers overconfidence and hallucination errors.
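As an illustration, this paired scoring rule can be expressed in a few lines of Python; the `answer_yes_no` interface and the question texts below are hypothetical placeholders, not the benchmark's actual implementation:

```python
# Minimal sketch of paired adversarial scoring (hypothetical interface,
# not the authors' code). A model is credited for an attribute only if
# it accepts the ground-truth question AND rejects the negated one.

def score_pair(answer_yes_no, positive_q: str, adversarial_q: str) -> bool:
    """Return True only if the model says 'yes' to the real finding
    and 'no' to the hallucinated counterpart."""
    accepts_truth = answer_yes_no(positive_q)        # expected: True
    rejects_fake = not answer_yes_no(adversarial_q)  # expected: True
    return accepts_truth and rejects_fake

# A toy model that always answers "yes" (a sycophantic baseline)
# passes the positive question but fails the pair as a whole.
always_yes = lambda q: True
print(score_pair(always_yes,
                 "Is there evidence of cardiomegaly in this chest X-ray?",
                 "Is there evidence of a femoral fracture in this chest X-ray?"))
# -> False
```

Under this rule, overconfident agreement is penalized directly: a model that cannot reject the hallucinated counterpart earns no credit for the attribute.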
The methodology is not limited to isolated question-answering but includes procedural diagnostics, which require the model to reason correctly across sequential diagnostic dimensions—such as first recognizing image modality and organ before detecting abnormalities and identifying findings with precise anatomical localization. This approach systematically quantifies not only task-level performance but also error propagation in diagnostic reasoning, mirroring clinical workflows (Yan et al., 30 May 2024).
2. Description and Construction of the ProbMed Dataset
The ProbMed dataset consists of 6,303 medical images from existing benchmarks (MedICaT, ChestX-ray14), encompassing X-ray, CT, and MRI modalities and multiple organs (abdomen, brain, chest, spine). For each image, detailed metadata is auto-curated using GPT-4 and a dedicated positional reasoning engine, cataloging condition names, abnormalities, and position descriptors. Using this structured metadata, 57,132 question–answer pairs are generated: each ground-truth question (positive instance) is paired with a corresponding adversarial (negated) question constructed by selecting alternate organs, modalities, or hallucinated attributes.
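The pairing scheme can be sketched as follows; the metadata fields and organ vocabulary are illustrative stand-ins for the GPT-4-curated records, not the actual ProbMed generation pipeline:

```python
# Illustrative construction of an adversarial (negated) question from
# curated metadata, following the pairing scheme described above.
import random

ORGAN_VOCAB = {"abdomen", "brain", "chest", "spine"}

def make_adversarial_organ_question(record: dict, rng: random.Random) -> tuple[str, str]:
    """Pair a ground-truth organ question with one naming an absent organ."""
    true_organ = record["organ"]
    fake_organ = rng.choice(sorted(ORGAN_VOCAB - {true_organ}))
    positive = f"Is this an image of the {true_organ}?"     # answer: yes
    adversarial = f"Is this an image of the {fake_organ}?"  # answer: no
    return positive, adversarial

rng = random.Random(0)
record = {"image_id": "cxr_0001", "modality": "X-ray", "organ": "chest"}
print(make_adversarial_organ_question(record, rng))
```

The same swap-an-absent-attribute pattern extends to modalities, conditions, and position descriptors, yielding the negated counterpart for every positive question.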
Questions are organized along five diagnostic dimensions:
| Dimension | Example Task | Evaluation Focus |
|---|---|---|
| Modality Recognition | Distinguish between X-ray and CT | Basic comprehension |
| Organ Identification | Identify organ imaged (e.g., chest vs. brain) | Anatomical grounding |
| Abnormality Detection | Detect presence of a specific abnormality | Fine-grained clinical features |
| Condition/Finding | Assign clinical finding/label | Diagnostic specificity |
| Positional Reasoning | Localize abnormality within the organ | Spatial accuracy |
This structure enables axis-specific analysis of both general and highly specialized reasoning required for clinically valid diagnosis (Yan et al., 30 May 2024).
3. Impact and Results of Probing Evaluation on State-of-the-Art Models
Empirical results on ProbMed demonstrate a stark contrast between conventional benchmark accuracy and performance under probing evaluation. State-of-the-art large multimodal models (LMMs) such as GPT-4o, GPT-4V, and Gemini Pro achieve >90% accuracy on general recognition tasks (modality and organ identification) yet perform worse than random guessing on specialized diagnostic questions when challenged with adversarial pairs. For instance, their accuracy on fine-grained condition/finding identification or positional reasoning can drop to 35–40% or lower, a limitation camouflaged on typical leaderboards. This degradation reflects a pronounced tendency to hallucinate (agreeing with false or irrelevant features) and to confuse visual-semantic cues under adversarial perturbation.
Error analyses further show that a substantial share of mistakes stems from failures to reject hallucinated conditions or positions. For example, a model may accurately recognize the imaging modality yet accept an adversarial attribute (e.g., reporting a nonexistent pulmonary mass), illustrating poor clinical reliability in high-stakes applications (Yan et al., 30 May 2024).
Procedural diagnosis evaluation compounds this effect: models must correctly answer a sequence of dependent diagnostic queries per image. Failure at any stage (e.g., mislocalizing an abnormality after correctly identifying modality and organ) leads to an overall task failure, exposing frailty in step-wise reasoning chains.
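A minimal sketch of this all-or-nothing chained scoring, assuming a simple per-stage correctness record (the stage names follow the five dimensions above; the interface is illustrative, not the benchmark's actual API):

```python
# Sketch of procedural (chained) evaluation: the image counts as solved
# only if every dependent stage is answered correctly, in order.

STAGES = ["modality", "organ", "abnormality", "condition", "position"]

def procedural_correct(model_answers: dict[str, bool]) -> bool:
    """All-or-nothing credit: a miss at any stage fails the image."""
    for stage in STAGES:
        if not model_answers.get(stage, False):
            return False  # error propagates; later stages cannot recover
    return True

# A model that identifies modality and organ but mislocalizes the
# abnormality still receives zero credit for the image.
print(procedural_correct({"modality": True, "organ": True,
                          "abnormality": True, "condition": True,
                          "position": False}))  # -> False
```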
4. Methodological Significance and Technical Formulation
The core significance of probing evaluation is its ability to rigorously differentiate models that appear competitive on surface metrics but lack sufficient diagnostic granularity. The approach formalizes accuracy by strict per-image, dimension-specific aggregation: for an image set $\mathcal{I}$, with $\mathcal{D}_i$ the diagnostic dimensions required for image $i$ and $q^{+}_{i,d}$, $q^{-}_{i,d}$ the ground-truth question and its adversarial negation in dimension $d$,

$$\mathrm{Acc} = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \prod_{d \in \mathcal{D}_i} \mathbb{1}\!\left[\,q^{+}_{i,d} \text{ and } q^{-}_{i,d} \text{ both answered correctly}\,\right].$$
This strict metric penalizes error propagation in multi-step evaluations and enforces robust integration across clinical reasoning stages. Detailed error breakdowns are enabled by the systematic pairing of every diagnostic query with its adversarial negation, making the evaluation sensitive to overfitting, shortcut exploitation, and domain-specific hallucination.
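The strict aggregation above translates directly into code; the following is a sketch under assumed data structures, not the released evaluation script:

```python
# Dataset-level strict accuracy: each image record lists, per dimension,
# whether the ground-truth and adversarial questions were answered
# correctly. One wrong answer anywhere zeroes the image.

def strict_accuracy(results: list[dict[str, tuple[bool, bool]]]) -> float:
    """Fraction of images where every dimension's (positive, adversarial)
    pair is answered correctly."""
    if not results:
        return 0.0
    solved = sum(
        1 for per_dim in results
        if all(pos and adv for pos, adv in per_dim.values())
    )
    return solved / len(results)

results = [
    {"modality": (True, True), "organ": (True, True), "condition": (True, False)},
    {"modality": (True, True), "organ": (True, True), "condition": (True, True)},
]
print(strict_accuracy(results))  # -> 0.5
```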
5. Implications for Domain Knowledge and Model Design
The findings from ProbMed underscore that current LMMs and general-purpose vision-language models, though adept at general visual and textual tasks, do not possess sufficient diagnostic specificity for deployment in critical clinical settings. Specialized domain models such as CheXagent outperform general LMMs in domain-constrained settings (e.g., chest X-ray conditions) and demonstrate some out-of-domain transfer, but even these architectures struggle with fine-grained attributes outside their training distribution.
A central implication is that robust medical diagnosis by automated systems requires:
- Explicit training with adversarial, probing-style diagnostic questions to harden against hallucination.
- Modular, sequential, or agentic reasoning pipelines that mimic clinical workflows and enable error correction at each diagnostic stage (see the sketch after this list).
- Domain- and task-specific models or adapters, as generic LMMs do not generalize reliably to nuanced medical reasoning without such interventions.
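As a concrete illustration of the second point, a stage-gated pipeline can validate each intermediate finding before the chain proceeds, so errors are caught early rather than propagating silently. All component names, checks, and interfaces below are hypothetical assumptions, not a published architecture:

```python
# Hypothetical sketch of a modular, stage-gated diagnostic pipeline.
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Stage:
    name: str
    run: Callable[[dict], str]       # produces this stage's finding
    validate: Callable[[str], bool]  # cheap sanity check / verifier

@dataclass
class PipelineResult:
    findings: dict = field(default_factory=dict)
    failed_stage: Optional[str] = None

def run_pipeline(stages: list[Stage], image_ctx: dict) -> PipelineResult:
    result = PipelineResult()
    for stage in stages:
        finding = stage.run(image_ctx | result.findings)
        if not stage.validate(finding):
            result.failed_stage = stage.name  # stop and flag for review
            return result
        result.findings[stage.name] = finding
    return result

# Toy instantiation: modality -> organ -> abnormality.
stages = [
    Stage("modality", lambda ctx: "X-ray", lambda f: f in {"X-ray", "CT", "MRI"}),
    Stage("organ", lambda ctx: "chest", lambda f: f in {"abdomen", "brain", "chest", "spine"}),
    Stage("abnormality", lambda ctx: "cardiomegaly", lambda f: bool(f)),
]
print(run_pipeline(stages, {"image_id": "cxr_0001"}).findings)
```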
6. Future Directions Identified by Probing Paradigms
The adoption of probing evaluation frameworks like ProbMed directs future research toward:
- Improved training regimens that incorporate adversarial examples and multi-step procedural reasoning.
- Integration of domain-specific toolchains as seen in agentic workflows (e.g., segmentation and quantitative assessment in ophthalmology or radiology pipelines) (Wang et al., 21 Mar 2025).
- Systematic preference-based or expert-in-the-loop evaluations as additional layers of assurance for clinical adoption (Ruan et al., 7 Dec 2024).
- Expansion of benchmarking beyond binary question answering into open-ended clinical report generation and case-based reasoning, aligned with real patient journeys.
This suggests that reliable AI medical diagnosis will depend not only on raw accuracy but also on explicit robustness to probing, adversarial challenge, and multi-stage reasoning, the hallmarks captured by the ProbMed evaluation paradigm.
7. Summary Table: Performance Drop Under Probing Evaluation
| Model | Modality/Organ Recognition | Specialized Diagnosis (ProbMed) | Accuracy Drop with Probing |
|---|---|---|---|
| GPT-4V | >90% | 35–40% or lower | ~42–45% |
| Gemini Pro | >90% | 35–40% or lower | ~42–45% |
| CheXagent | High for chest X-ray | Robust in-domain; drops out-of-domain | Moderate |
| LLaVA-Med | Moderate | Struggles with general tasks | High |
The table illustrates that apparent proficiency on general benchmarks does not equate to clinically sufficient reliability under stringent, adversarial, or procedural challenge (Yan et al., 30 May 2024; Ruan et al., 7 Dec 2024). The design and adoption of probing evaluation therefore mark a pivotal step toward truly dependable AI systems in medicine.