- The paper introduces CPC-Bench, a physician-validated benchmark built from a century of clinicopathological cases, and Dr. CaBot, an AI discussant system, to rigorously benchmark LLM performance.
- It demonstrates that advanced LLMs, especially OpenAI's o3, substantially outperform physicians in diagnostic accuracy (60% vs. 24% top-1), and that o3 gains 29 percentage points in accuracy as sequential case information is revealed.
- The study highlights persistent challenges in multimodal integration and literature retrieval, emphasizing the need for broader benchmarks in real-world clinical settings.
Advancing Medical AI with a Century of Clinicopathological Cases: An Expert Analysis
Introduction
This paper presents a comprehensive effort to advance the evaluation and development of medical artificial intelligence by leveraging a century’s worth of Clinicopathological Conferences (CPCs) from the New England Journal of Medicine. The authors introduce CPC-Bench, a large-scale, physician-validated benchmark encompassing ten diverse text-based and multimodal tasks, and Dr. CaBot, an AI system designed to emulate expert clinical discussants in both written and video formats. The work systematically evaluates leading large language models (LLMs) on these tasks, quantifies their strengths and limitations, and provides a public resource for ongoing benchmarking and research.
Construction of CPC-Bench and Annotation Pipeline
The dataset comprises 7,102 CPCs (1923–2025) and 1,021 NEJM Image Challenges (2006–2025). Pre-1945 cases were digitized using a vision-language model, and all cases were segmented and annotated for clinical events, event types, and test results. Ten physicians performed blinded, independent annotations on a stratified sample, which were then used to train and validate an LLM-based annotation pipeline. This approach enabled scalable, high-fidelity extraction of diagnostic touchpoints and event-level information, supporting granular evaluation of model reasoning as new information is revealed.
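To make the event-level annotation concrete, the sketch below shows how an LLM-based extraction step could be structured. The schema fields, prompt, and `call_llm` hook are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of LLM-based event annotation (illustrative assumptions only;
# the paper's actual schema and prompts are not reproduced here).
import json
from dataclasses import dataclass

@dataclass
class ClinicalEvent:
    order: int          # position of the event in the case narrative
    event_type: str     # e.g., "history", "exam", "lab", "imaging", "pathology"
    description: str    # free-text summary of the event
    result: str | None  # test result, if the event is a diagnostic test

EXTRACTION_PROMPT = """Segment the following clinicopathological case into
ordered clinical events. For each event, return order, event_type,
description, and result (null if not a test) as a JSON list."""

def annotate_case(case_text: str, call_llm) -> list[ClinicalEvent]:
    """call_llm is any function that sends a prompt to an LLM and returns its text output."""
    raw = call_llm(f"{EXTRACTION_PROMPT}\n\nCASE:\n{case_text}")
    return [ClinicalEvent(**event) for event in json.loads(raw)]
```

In the paper's workflow, outputs like these would be validated against the physicians' stratified-sample annotations before the pipeline is applied at scale.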
CPC-Bench defines ten tasks, including differential diagnosis generation, testing plan formulation, literature search, sequential diagnostic reasoning, information omission, visual question answering, and image-only diagnosis. Each task is rigorously validated, with LLM-based judges cross-validated against physician annotations and achieving high concordance (e.g., 86% accuracy and an F1 of 89% for differential diagnosis scoring).
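As a rough illustration of how an LLM judge can be cross-validated against physician annotations, the snippet below computes accuracy and F1 over paired labels; the label values are hypothetical and serve only to show the calculation.

```python
# Cross-validating an LLM judge against physician labels (illustrative).
# Labels are hypothetical binary judgments: 1 = "diagnosis credited",
# 0 = "not credited". The paper reports ~86% accuracy and an F1 of ~89%.
from sklearn.metrics import accuracy_score, f1_score

physician_labels = [1, 0, 1, 1, 0, 1, 1, 0]   # gold-standard physician calls
llm_judge_labels = [1, 0, 1, 0, 0, 1, 1, 1]   # LLM judge's calls on the same items

print("accuracy:", accuracy_score(physician_labels, llm_judge_labels))  # 0.75
print("F1:", f1_score(physician_labels, llm_judge_labels))              # 0.80
```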
Evaluation of LLMs on CPC-Bench
Text-Based Diagnostic Reasoning
On 377 contemporary CPCs, the o3 model (OpenAI) achieved 60% top-1 and 84% top-10 accuracy in final diagnosis ranking, substantially outperforming a 20-physician baseline (24% top-1, 45% top-10) and other LLMs (e.g., Gemini 2.5 Pro at 78% top-10, Claude 4.0 Sonnet at 69%). Next-test selection accuracy for o3 reached 98%. Sequential event-level analysis demonstrated that diagnostic accuracy increases as more information is revealed, with o3 showing a 29 percentage point gain from event 1 to event 5. Notably, omission of key normal findings led to a 4–5% drop in top-1 accuracy, underscoring the importance of negative evidence in clinical reasoning.
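For readers less familiar with ranked-differential metrics, the sketch below shows how top-1 and top-10 accuracy are derived from ranked differentials. Note that the paper scores matches with an LLM judge rather than exact string comparison, so the matching step here is a simplification.

```python
# Top-k accuracy over ranked differential diagnoses (simplified sketch).
# The paper uses an LLM judge to decide whether a listed diagnosis matches
# the final CPC diagnosis; exact string matching here is a stand-in.
def top_k_accuracy(ranked_ddx: list[list[str]], truths: list[str], k: int) -> float:
    hits = sum(
        any(dx.lower() == truth.lower() for dx in ddx[:k])
        for ddx, truth in zip(ranked_ddx, truths)
    )
    return hits / len(truths)

ddx_lists = [["sarcoidosis", "tuberculosis", "lymphoma"],
             ["lyme disease", "syphilis", "HIV"]]
final_dx = ["lymphoma", "lyme disease"]
print(top_k_accuracy(ddx_lists, final_dx, k=1))   # 0.5
print(top_k_accuracy(ddx_lists, final_dx, k=10))  # 1.0
```

The sequential analysis applies the same scoring repeatedly, truncating each case after event 1, 2, and so on, which is how the 29-point gain from event 1 to event 5 is measured.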
Literature Search and Clinical Knowledge
LLMs performed less robustly on literature retrieval. Gemini 2.5 Pro achieved 49% top-10 citation accuracy, while o3 reached 32%. Retrieval-augmented generation (RAG) improved performance (o4-mini: 47%), but citation accuracy declined for older and very recent articles. On multiple-choice clinical knowledge questions, o3 achieved 88% accuracy, with other frontier models in the 84% range.
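A minimal retrieval-augmented generation loop for citation recommendation might look like the sketch below. The embedding function, corpus format, and prompt are placeholder assumptions, not the specific RAG configuration the authors evaluated.

```python
# Minimal RAG sketch for literature retrieval (illustrative assumptions).
# `embed` is a placeholder for any sentence-embedding model; the corpus and
# retrieval depth are arbitrary, and this is not the paper's exact setup.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_citations(case_summary: str, corpus: list[dict], embed, k: int = 10) -> list[dict]:
    """Rank candidate articles (dicts with 'title' and 'abstract') by similarity to the case."""
    query_vec = embed(case_summary)
    scored = [(cosine(query_vec, embed(doc["abstract"])), doc) for doc in corpus]
    return [doc for _, doc in sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]]

def answer_with_context(case_summary: str, corpus: list[dict], embed, call_llm) -> str:
    """Prepend retrieved titles to the prompt so the model cites from real candidates."""
    context = "\n".join(doc["title"] for doc in retrieve_citations(case_summary, corpus, embed))
    prompt = f"Case:\n{case_summary}\n\nCandidate articles:\n{context}\n\nCite the 10 most relevant."
    return call_llm(prompt)
```

The reported drop in accuracy for very old and very recent articles is consistent with retrieval corpora and model training data covering those periods unevenly.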
Image Interpretation and Multimodal Tasks
Image-based tasks remain a significant challenge. In the NEJM Image Challenge, Gemini 2.5 Pro led with 84% accuracy, followed by o3 at 82%. When restricted to image-only diagnosis, accuracy dropped to 67% for both models, with dermatology images yielding the highest performance (o3: 76%) and radiology the lowest (o3: 55%). On the Visual Differential Diagnosis task (images and tables only), o3 achieved 19% top-1 and 40% top-10 accuracy, indicating substantial limitations in multimodal integration.
Dr. CaBot: Emulating the Expert Discussant
Dr. CaBot is an agentic system built on o3, capable of generating written and video differential diagnoses in the style of NEJM CPC discussants. The system retrieves similar historical cases using embedding similarity, mimics the discussant’s style, and iteratively queries a clinical literature search engine (OpenAlex, filtered to high-impact journals) to ground its reasoning. Video presentations are generated via LaTeX Beamer slides, synthesized narration, and FFmpeg-based assembly.
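The sketch below illustrates three of the described pipeline stages: nearest-case retrieval by embedding similarity, literature grounding via OpenAlex, and per-slide video assembly with FFmpeg. This is an assumed reconstruction for clarity, not the authors' code, and the journal filtering they apply to OpenAlex results is omitted.

```python
# Sketch of CaBot-style pipeline stages (illustrative; not the authors' implementation).
import subprocess
import numpy as np
import requests

def most_similar_cases(query_vec: np.ndarray, case_vecs: np.ndarray, n: int = 3) -> np.ndarray:
    """Return indices of the n nearest historical cases by cosine similarity."""
    sims = case_vecs @ query_vec / (
        np.linalg.norm(case_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return np.argsort(-sims)[:n]

def search_openalex(query: str, per_page: int = 5) -> list[dict]:
    """Query OpenAlex for candidate citations; the paper additionally restricts
    results to high-impact journals, which is not reproduced here."""
    resp = requests.get(
        "https://api.openalex.org/works",
        params={"search": query, "per-page": per_page},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["results"]

def render_segment(slide_png: str, narration_mp3: str, out_mp4: str) -> None:
    """Combine one Beamer-rendered slide with its narration; the full video
    concatenates many such segments."""
    subprocess.run(
        ["ffmpeg", "-y", "-loop", "1", "-i", slide_png, "-i", narration_mp3,
         "-c:v", "libx264", "-tune", "stillimage", "-c:a", "aac",
         "-shortest", out_mp4],
        check=True,
    )
```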
In blinded evaluations, physicians misclassified the source of the differential diagnosis (AI vs. human) in 74% of trials, and rated CaBot more favorably than human discussants across quality, justification, citation, and engagement metrics. This demonstrates that LLMs can now convincingly emulate expert clinical reasoning and presentation, at least in the highly curated CPC format.
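To put the 74% misclassification rate in context, one could test whether reviewers performed better than chance at identifying the AI. The trial count below is hypothetical and is used only to show the calculation.

```python
# Illustrative check of whether reviewers beat chance at spotting the AI.
# n_trials is hypothetical; the paper reports 74% misclassification, but the
# exact denominator is not reproduced here.
from scipy.stats import binomtest

n_trials = 100                       # hypothetical number of blinded trials
n_correct = round(0.26 * n_trials)   # reviewers were correct in ~26% of trials

# H0: reviewers identify the source at the 50% chance rate.
result = binomtest(n_correct, n_trials, p=0.5, alternative="less")
print(f"correct identifications: {n_correct}/{n_trials}, p = {result.pvalue:.4f}")
```

At any realistic trial count, a 26% correct-identification rate falls well below chance, reinforcing that reviewers could not reliably tell CaBot's differentials from the human discussants'.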
Historical Benchmarking and Disease Evolution
The authors benchmarked LLMs and human discussants across 5,673 CPCs spanning a century. o3 matched or exceeded expert discussant performance in every decade after the 1960s, with the largest margin in the 1940s–1950s (o3: 74%, experts: 62%). Diagnostic accuracy peaked in the 2000s–2010s (o3: 87%, experts: 86%). The analysis also documents the shifting epidemiology of CPC cases, with infectious diseases dominating early decades and malignancies rising post-1950s, reflecting both real-world disease trends and editorial selection.
Implications for Medical AI and Future Directions
Theoretical and Practical Implications
- Scaling Laws Dominate: The primary driver of LLM performance on medical reasoning tasks is model scale, not prompt engineering or domain-specific fine-tuning. The transition from GPT-3.5 (44% accuracy) to o3 (84%) occurred without additional clinical fine-tuning, paralleling trends in general ML.
- Limits of Current Benchmarks: CPCs, while valuable, are highly curated and information-dense, potentially overestimating real-world diagnostic performance. The tasks in CPC-Bench do not encompass long-context reasoning, structured EHR data, or outcome prediction.
- Multimodal Integration Remains Challenging: Despite strong text-based performance, LLMs lag in image interpretation and literature retrieval, with significant room for improvement in multimodal clinical reasoning.
- Human-AI Indistinguishability: The inability of physicians to reliably distinguish AI-generated from human-generated differentials, and their preference for the former, suggests that LLMs can now serve as credible educational tools and potentially as clinical decision support in structured settings.
Future Developments
- Expanded Benchmarks: Incorporation of long-context reasoning, structured data, and real-world clinical outcomes will be necessary to further stress-test and advance medical AI.
- Specialist Involvement: Broader annotation and evaluation by subspecialists will be required to assess generalizability beyond internal medicine.
- Open Resources: The public release of CPC-Bench, Dr. CaBot, and leaderboards provides a foundation for transparent, reproducible, and longitudinal tracking of progress in medical AI.
Conclusion
This work establishes a new standard for evaluating medical AI by leveraging a century of CPCs, introducing a comprehensive benchmark and an AI discussant capable of emulating expert reasoning and presentation. While LLMs now rival or surpass human experts in text-based diagnostic reasoning and presentation, significant challenges remain in literature retrieval, image interpretation, and multimodal integration. The public release of CPC-Bench and Dr. CaBot will facilitate ongoing research, benchmarking, and progress tracking, supporting both the development of more capable clinical AI systems and the rigorous assessment of their limitations.