Radiology's Last Exam (RadLE) Benchmark
- Radiology's Last Exam (RadLE) is a benchmark that evaluates advanced multimodal AI systems using 50 challenging 'spot diagnosis' imaging cases curated through expert radiologist input.
- It tests multiple AI models in simulated clinical scenarios under strict diagnostic protocols that mirror real-world radiology exams.
- Results show a significant performance gap, with expert radiologists achieving 83% accuracy compared to the highest-performing AI model at 30%, underscoring the need for improved AI diagnostic reasoning.
Radiology’s Last Exam (RadLE) is a pilot benchmark for evaluating the diagnostic reasoning capabilities of advanced multimodal artificial intelligence systems in comparison with expert human radiologists. Developed to address the limitations of existing evaluations—often restricted to public datasets featuring common pathologies—RadLE prioritizes difficult, expert-level “spot diagnosis” cases spanning multiple imaging modalities. Results demonstrate a persistent performance gap: current frontier models lag markedly behind experts in challenging diagnostic scenarios. In addition to quantitative benchmarking, RadLE takes a systematic approach to analyzing reasoning errors, proposing a taxonomy of visual reasoning failures unique to AI interpretation of medical images.
1. Benchmark Construction and Case Selection
RadLE comprises 50 difficult “spot diagnosis” cases, each requiring a single, specific diagnosis based solely on provided imaging—mirroring real-world board exams and clinical practice. Case development proceeded via a crowdsourcing effort with input from radiologists and residents at multiple institutions. Two board-certified radiologists, each with more than five years’ clinical experience, independently screened and selected cases for inclusion.
Selection criteria mandated:
- A complex diagnostic scenario suitable for advanced reasoning,
- A single, unambiguous reference diagnosis per case (no ancillary clinical data necessary),
- Exclusion of cases requiring multimodality correlation or broad differentials.
Modalities are distributed as follows:

| Imaging Modality | Case Count | Proportion |
|----------------------|------------|------------|
| Radiography (X-ray) | 13 | 26% |
| Computed Tomography | 24 | 48% |
| Magnetic Resonance | 13 | 26% |
Systems represented include cardiothoracic, gastrointestinal, genitourinary, musculoskeletal, neuro/head & neck, and pediatric categories. This composition ensures rigorous assessment across anatomical and pathological domains.
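The selection rules above can be captured in a lightweight case-record schema. The following Python sketch is illustrative only; field names such as `modality` and `reference_diagnosis` are assumptions, not the benchmark's actual data format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Modality(Enum):
    """Imaging modalities represented in the RadLE case set."""
    XRAY = "Radiography (X-ray)"
    CT = "Computed Tomography"
    MRI = "Magnetic Resonance"


@dataclass
class SpotDiagnosisCase:
    """One 'spot diagnosis' case: images plus a single reference diagnosis.

    Hypothetical schema for illustration, not the benchmark's actual format.
    """
    case_id: str
    modality: Modality
    body_system: str               # e.g. "neuro/head & neck", "pediatric"
    image_paths: List[str] = field(default_factory=list)
    reference_diagnosis: str = ""  # single, unambiguous ground-truth label
```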
2. Model Cohorts and Evaluation Protocols
Five generalist multimodal AI systems were chosen for evaluation:
- OpenAI o3
- OpenAI GPT-5
- Gemini 2.5 Pro
- Grok-4
- Claude Opus 4.1
Models were tested through their respective native web interfaces in “reasoning” or “thinking” modes to simulate extended, deliberate diagnostic attempts. For GPT-5, additional analyses employed API-based prompting in three reasoning modes (low, medium, high effort). Prompts enforced a strict output format: only the most specific diagnosis, with no summary or extraneous language, enabling direct comparison with human readers.
AI outputs were reviewed by blinded experts, and reproducibility was assessed through three independent runs for each model. Human performance was measured for board-certified radiologists and radiology trainees; scores were averaged across all participants in each cohort.
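A minimal sketch of this protocol is given below, assuming a hypothetical `query_model` client that wraps each vendor's interface and reusing the `SpotDiagnosisCase` schema sketched earlier; the exact prompt wording and client code used in the study are not reproduced here.

```python
from typing import Callable, Dict, List

# Strict output instruction paraphrasing the protocol described above;
# the exact wording used in the study is an assumption here.
STRICT_PROMPT = (
    "Provide only the single most specific diagnosis for this imaging study. "
    "Do not include a summary, differential list, or any extra language."
)

N_RUNS = 3  # three independent runs per model to assess reproducibility


def evaluate_model(
    model_name: str,
    cases: List["SpotDiagnosisCase"],
    query_model: Callable[[str, List[str], str], str],  # hypothetical client wrapper
) -> Dict[str, List[str]]:
    """Collect N_RUNS raw diagnostic outputs per case for later blinded grading."""
    outputs: Dict[str, List[str]] = {}
    for case in cases:
        outputs[case.case_id] = [
            query_model(model_name, case.image_paths, STRICT_PROMPT)
            for _ in range(N_RUNS)
        ]
    return outputs
```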
3. Performance Metrics and Statistical Analysis
Diagnostic accuracy was graded using a reference-standard approach (a minimal scoring sketch follows the list):
- 1.0: Exact match to reference diagnosis
- 0.5: Partially correct differential diagnosis
- 0.0: Incorrect diagnosis
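Translated literally into code, the rubric looks as follows; in the study the partial-credit judgment is made by blinded expert reviewers, so the `partially_correct` flag below stands in for that human decision rather than any automatic string matching.

```python
def grade_diagnosis(model_output: str, reference: str,
                    partially_correct: bool = False) -> float:
    """Apply the 1.0 / 0.5 / 0.0 rubric.

    `partially_correct` stands in for the expert judgment that the output
    contains the reference diagnosis within a reasonable differential.
    """
    if model_output.strip().lower() == reference.strip().lower():
        return 1.0   # exact match to the reference diagnosis
    if partially_correct:
        return 0.5   # reference appears in a partially correct differential
    return 0.0       # incorrect diagnosis


def cohort_accuracy(scores: list) -> float:
    """Mean accuracy across all cases for a cohort or model."""
    return sum(scores) / len(scores) if scores else 0.0
```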
Key results:

| Cohort | Mean Accuracy | 95% Confidence Interval |
|------------------------------|---------------|-------------------------|
| Board-certified Radiologists | 83% | 75%–90% |
| Radiology Trainees | 45% | 39%–52% |
| GPT-5 (Best AI Model) | 30% | 20%–42% |
| Gemini 2.5 Pro | 29% | 19%–39% |
| OpenAI o3 | 23% | 14%–33% |
| Grok-4 | 12% | 6%–19% |
| Claude Opus 4.1 | 1% | 0%–3% |
Statistical methods included the Friedman rank test (χ² = 336, Kendall’s W = 0.56, p < 1×10⁻⁶⁴). Reliability was quantified via intra-class correlation (ICC), revealing substantial agreement for GPT-5 and o3, moderate for Gemini 2.5 Pro and Grok-4, and poor for Claude Opus 4.1.
Modality- and system-specific breakdowns, as well as Wilson confidence intervals, are presented in supplementary materials.
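The headline statistics can be reproduced with standard scientific Python libraries. The sketch below uses synthetic per-case scores purely for illustration, SciPy's Friedman test, and statsmodels' Wilson interval; it is not the authors' analysis code.

```python
import numpy as np
from scipy.stats import friedmanchisquare
from statsmodels.stats.proportion import proportion_confint

# Hypothetical per-case scores (0.0 / 0.5 / 1.0), one array per cohort or model,
# each of length 50 (the number of RadLE cases).
rng = np.random.default_rng(0)
radiologists = rng.choice([0.0, 0.5, 1.0], size=50, p=[0.1, 0.1, 0.8])
gpt5 = rng.choice([0.0, 0.5, 1.0], size=50, p=[0.6, 0.2, 0.2])
gemini = rng.choice([0.0, 0.5, 1.0], size=50, p=[0.6, 0.2, 0.2])

# Friedman rank test across related samples (the same 50 cases graded per cohort).
chi2, p_value = friedmanchisquare(radiologists, gpt5, gemini)
print(f"Friedman chi-squared = {chi2:.1f}, p = {p_value:.3g}")

# Wilson 95% confidence interval for a proportion, treating the rounded
# summed score out of 50 cases as the "success" count.
successes = int(round(gpt5.sum()))
lower, upper = proportion_confint(successes, 50, alpha=0.05, method="wilson")
print(f"Wilson 95% CI: {lower:.2f}-{upper:.2f}")
```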
4. Reasoning Quality Analysis and Taxonomy of Errors
Qualitative expert review identified reasoning failures in AI outputs, culminating in a taxonomy of visual reasoning errors divided into three principal categories:
Perceptual Errors
- Under-detection: Failure to identify readily visible abnormalities (e.g., missed ureterocele dilatation).
- Over-detection: False positives for non-existent findings (e.g., hallucinated renal cysts).
- Mislocalization: Correct recognition of abnormality but assignment to incorrect anatomical region (e.g., mediastinal cyst attributed to heart chamber).
Interpretive Errors
- Misinterpretation/misattribution: Incorrect pathologic assignment following correct detection (e.g., AC joint injury misdiagnosed as shoulder dislocation).
- Premature diagnostic closure: Early cessation of reasoning, omitting alternate diagnoses (e.g., single-hypothesis conclusion in complex neuroimaging).
Communication Errors
- Findings–summary discordance: Contradiction between observed image features and final diagnostic label.
Additional error modifiers include cognitive biases:
- Confirmation/anchoring, availability, inattentional, and prompt framing effects were observed to distort model reasoning.
This taxonomy provides a framework for systematic analysis and future model improvement. For instance, under-detection and over-detection are failures of perception, whereas premature closure and misattribution reflect lapses in the reasoning process itself.
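For systematic annotation, the taxonomy lends itself to a simple coding schema. The enum values below mirror the categories listed above, while the annotation record itself is a hypothetical structure rather than the authors' published tooling.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class PerceptualError(Enum):
    UNDER_DETECTION = "under-detection"
    OVER_DETECTION = "over-detection"
    MISLOCALIZATION = "mislocalization"


class InterpretiveError(Enum):
    MISINTERPRETATION = "misinterpretation/misattribution"
    PREMATURE_CLOSURE = "premature diagnostic closure"


class CommunicationError(Enum):
    FINDINGS_SUMMARY_DISCORDANCE = "findings-summary discordance"


class CognitiveBias(Enum):
    CONFIRMATION_ANCHORING = "confirmation/anchoring"
    AVAILABILITY = "availability"
    INATTENTIONAL = "inattentional"
    PROMPT_FRAMING = "prompt framing"


@dataclass
class ReasoningErrorAnnotation:
    """One expert-coded failure for a single model output (illustrative schema)."""
    case_id: str
    model_name: str
    perceptual: Optional[PerceptualError] = None
    interpretive: Optional[InterpretiveError] = None
    communication: Optional[CommunicationError] = None
    bias_modifiers: List[CognitiveBias] = field(default_factory=list)
```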
5. Clinical Implications and Risks of Model Deployment
RadLE demonstrates a pronounced gap between human and AI diagnostic accuracy on expert-level cases: radiologists achieved 83% accuracy, while the best-performing AI model reached 30%. Even high-effort prompting with GPT-5 yielded only marginal gains in diagnostic accuracy while increasing latency roughly sixfold.
These findings lead to the following practical implications:
- Generalist frontier models are not suitable for unsupervised clinical use in challenging, high-stakes radiologic diagnosis.
- Rigorous external validation and mandatory expert oversight are required prior to any clinical deployment.
- Model refinement must emphasize perceptual sensitivity and stable, error-resistant reasoning processes.
Reliability and reproducibility deficits—especially in less capable models—underscore the risk of inconsistent outputs. The results caution against the uncritical adoption of consumer-facing chatbots for clinical radiology without stringent safeguards and case-specific validation.
6. Impact on Evaluation Standards and Future Directions
The establishment of RadLE provides an actionable benchmark for future development and validation of multimodal generalist AI in radiology. By focusing on realistic, diagnostically difficult cases and articulating a reasoning error taxonomy, RadLE guides both model improvement and evaluation protocol design.
Recognized limitations include:
- Case count (n=50) restricts statistical generalizability; future versions could scale with expanded datasets and diverse pathology spectra.
- Absence of ancillary data and multimodality correlation (imposed for purity of imaging diagnosis) may not reflect all clinical scenarios.
Nevertheless, RadLE offers:
- Strong evidence for the current boundary between expert radiologist and AI diagnostic capability in medical imaging.
- A practical taxonomy of reasoning errors that provides systematic targets for fine-tuning and model architecture innovations.
This benchmark is expected to catalyze higher standards in both clinical validation and AI research, informing the responsible design and deployment of AI systems in expert-level radiologic diagnostic tasks (Datta et al., 29 Sep 2025).