- The paper introduces CPC-Bench, a physician-validated benchmark built from a century of clinicopathological cases, and Dr. CaBot, an AI discussant system, to rigorously benchmark LLM performance.
- It demonstrates that advanced LLMs, especially OpenAI's o3, substantially outperform physicians in diagnostic accuracy (60% vs. 24% top-1), and that o3 gains 29 percentage points in accuracy as sequential case information is revealed.
- The study highlights persistent challenges in multimodal integration and literature retrieval, emphasizing the need for broader benchmarks in real-world clinical settings.
Advancing Medical AI with a Century of Clinicopathological Cases: An Expert Analysis
Introduction
This paper presents a comprehensive effort to advance the evaluation and development of medical artificial intelligence by leveraging a century’s worth of Clinicopathological Conferences (CPCs) from the New England Journal of Medicine. The authors introduce CPC-Bench, a large-scale, physician-validated benchmark encompassing ten diverse text-based and multimodal tasks, and Dr. CaBot, an AI system designed to emulate expert clinical discussants in both written and video formats. The work systematically evaluates leading large language models (LLMs) on these tasks, quantifies their strengths and limitations, and provides a public resource for ongoing benchmarking and research.
Construction of CPC-Bench and Annotation Pipeline
The dataset comprises 7,102 CPCs (1923–2025) and 1,021 NEJM Image Challenges (2006–2025). Pre-1945 cases were digitized using a vision-language model, and all cases were segmented and annotated for clinical events, event types, and test results. Ten physicians performed blinded, independent annotations on a stratified sample, which were then used to train and validate an LLM-based annotation pipeline. This approach enabled scalable, high-fidelity extraction of diagnostic touchpoints and event-level information, supporting granular evaluation of model reasoning as new information is revealed.
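To make the event-level annotation concrete, the sketch below shows how an LLM-based extraction step could be structured. The schema fields, prompt, and `call_llm` hook are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of LLM-based event annotation (illustrative assumptions only;
# the paper's actual schema and prompts are not reproduced here).
import json
from dataclasses import dataclass

@dataclass
class ClinicalEvent:
    order: int          # position of the event in the case narrative
    event_type: str     # e.g., "history", "exam", "lab", "imaging", "pathology"
    description: str    # free-text summary of the event
    result: str | None  # test result, if the event is a diagnostic test

EXTRACTION_PROMPT = """Segment the following clinicopathological case into
ordered clinical events. For each event, return order, event_type,
description, and result (null if not a test) as a JSON list."""

def annotate_case(case_text: str, call_llm) -> list[ClinicalEvent]:
    """call_llm is any function that sends a prompt to an LLM and returns its text output."""
    raw = call_llm(f"{EXTRACTION_PROMPT}\n\nCASE:\n{case_text}")
    return [ClinicalEvent(**event) for event in json.loads(raw)]
```

In the paper's workflow, outputs like these would be validated against the physicians' stratified-sample annotations before the pipeline is applied at scale.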
CPC-Bench defines ten tasks, including differential diagnosis generation, testing plan formulation, literature search, sequential diagnostic reasoning, information omission, visual question answering, and image-only diagnosis. Each task is rigorously validated, with LLM-based judges cross-validated against physician annotations and achieving high concordance (e.g., 86% accuracy and an F1 of 89% for differential diagnosis scoring).
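As a rough illustration of how an LLM judge can be cross-validated against physician annotations, the snippet below computes accuracy and F1 over paired labels; the label values are hypothetical and serve only to show the calculation.

```python
# Cross-validating an LLM judge against physician labels (illustrative).
# Labels are hypothetical binary judgments: 1 = "diagnosis credited",
# 0 = "not credited". The paper reports ~86% accuracy and an F1 of ~89%.
from sklearn.metrics import accuracy_score, f1_score

physician_labels = [1, 0, 1, 1, 0, 1, 1, 0]   # gold-standard physician calls
llm_judge_labels = [1, 0, 1, 0, 0, 1, 1, 1]   # LLM judge's calls on the same items

print("accuracy:", accuracy_score(physician_labels, llm_judge_labels))  # 0.75
print("F1:", f1_score(physician_labels, llm_judge_labels))              # 0.80
```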
Evaluation of LLMs on CPC-Bench
Text-Based Diagnostic Reasoning
On 377 contemporary CPCs, the o3 model (OpenAI) achieved 60% top-1 and 84% top-10 accuracy in final diagnosis ranking, substantially outperforming a 20-physician baseline (24% top-1, 45% top-10) and other LLMs (e.g., Gemini 2.5 Pro at 78% top-10, Claude 4.0 Sonnet at 69%). Next-test selection accuracy for o3 reached 98%. Sequential event-level analysis demonstrated that diagnostic accuracy increases as more information is revealed, with o3 showing a 29 percentage point gain from event 1 to event 5. Notably, omission of key normal findings led to a 4–5% drop in top-1 accuracy, underscoring the importance of negative evidence in clinical reasoning.
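For readers less familiar with ranked-differential metrics, the sketch below shows how top-1 and top-10 accuracy are derived from ranked differentials. Note that the paper scores matches with an LLM judge rather than exact string comparison, so the matching step here is a simplification.

```python
# Top-k accuracy over ranked differential diagnoses (simplified sketch).
# The paper uses an LLM judge to decide whether a listed diagnosis matches
# the final CPC diagnosis; exact string matching here is a stand-in.
def top_k_accuracy(ranked_ddx: list[list[str]], truths: list[str], k: int) -> float:
    hits = sum(
        any(dx.lower() == truth.lower() for dx in ddx[:k])
        for ddx, truth in zip(ranked_ddx, truths)
    )
    return hits / len(truths)

ddx_lists = [["sarcoidosis", "tuberculosis", "lymphoma"],
             ["lyme disease", "syphilis", "HIV"]]
final_dx = ["lymphoma", "lyme disease"]
print(top_k_accuracy(ddx_lists, final_dx, k=1))   # 0.5
print(top_k_accuracy(ddx_lists, final_dx, k=10))  # 1.0
```

The sequential analysis applies the same scoring repeatedly, truncating each case after event 1, 2, and so on, which is how the 29-point gain from event 1 to event 5 is measured.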
Literature Search and Clinical Knowledge
LLMs performed less robustly on literature retrieval. Gemini 2.5 Pro achieved 49% top-10 citation accuracy, while o3 reached 32%. Retrieval-augmented generation (RAG) improved performance (o4-mini: 47%), but citation accuracy declined for older and very recent articles. On multiple-choice clinical knowledge questions, o3 achieved 88% accuracy, with other frontier models in the 84% range.
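A minimal retrieval-augmented generation loop for citation recommendation might look like the sketch below. The embedding function, corpus format, and prompt are placeholder assumptions, not the specific RAG configuration the authors evaluated.

```python
# Minimal RAG sketch for literature retrieval (illustrative assumptions).
# `embed` is a placeholder for any sentence-embedding model; the corpus and
# retrieval depth are arbitrary, and this is not the paper's exact setup.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_citations(case_summary: str, corpus: list[dict], embed, k: int = 10) -> list[dict]:
    """Rank candidate articles (dicts with 'title' and 'abstract') by similarity to the case."""
    query_vec = embed(case_summary)
    scored = [(cosine(query_vec, embed(doc["abstract"])), doc) for doc in corpus]
    return [doc for _, doc in sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]]

def answer_with_context(case_summary: str, corpus: list[dict], embed, call_llm) -> str:
    """Prepend retrieved titles to the prompt so the model cites from real candidates."""
    context = "\n".join(doc["title"] for doc in retrieve_citations(case_summary, corpus, embed))
    prompt = f"Case:\n{case_summary}\n\nCandidate articles:\n{context}\n\nCite the 10 most relevant."
    return call_llm(prompt)
```

The reported drop in accuracy for very old and very recent articles is consistent with retrieval corpora and model training data covering those periods unevenly.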
Image Interpretation and Multimodal Tasks
Image-based tasks remain a significant challenge. In the NEJM Image Challenge, Gemini 2.5 Pro led with 84% accuracy, followed by o3 at 82%. When restricted to image-only diagnosis, accuracy dropped to 67% for both models, with dermatology images yielding the highest performance (o3: 76%) and radiology the lowest (o3: 55%). On the Visual Differential Diagnosis task (images and tables only), o3 achieved 19% top-1 and 40% top-10 accuracy, indicating substantial limitations in multimodal integration.
Dr. CaBot: Emulating the Expert Discussant
Dr. CaBot is an agentic system built on o3, capable of generating written and video differential diagnoses in the style of NEJM CPC discussants. The system retrieves similar historical cases using embedding similarity, mimics the discussant’s style, and iteratively queries a clinical literature search engine (OpenAlex, filtered to high-impact journals) to ground its reasoning. Video presentations are generated via LaTeX Beamer slides, synthesized narration, and FFmpeg-based assembly.
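The sketch below illustrates three of the described pipeline stages: nearest-case retrieval by embedding similarity, literature grounding via OpenAlex, and per-slide video assembly with FFmpeg. This is an assumed reconstruction for clarity, not the authors' code, and the journal filtering they apply to OpenAlex results is omitted.

```python
# Sketch of CaBot-style pipeline stages (illustrative; not the authors' implementation).
import subprocess
import numpy as np
import requests

def most_similar_cases(query_vec: np.ndarray, case_vecs: np.ndarray, n: int = 3) -> np.ndarray:
    """Return indices of the n nearest historical cases by cosine similarity."""
    sims = case_vecs @ query_vec / (
        np.linalg.norm(case_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return np.argsort(-sims)[:n]

def search_openalex(query: str, per_page: int = 5) -> list[dict]:
    """Query OpenAlex for candidate citations; the paper additionally restricts
    results to high-impact journals, which is not reproduced here."""
    resp = requests.get(
        "https://api.openalex.org/works",
        params={"search": query, "per-page": per_page},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["results"]

def render_segment(slide_png: str, narration_mp3: str, out_mp4: str) -> None:
    """Combine one Beamer-rendered slide with its narration; the full video
    concatenates many such segments."""
    subprocess.run(
        ["ffmpeg", "-y", "-loop", "1", "-i", slide_png, "-i", narration_mp3,
         "-c:v", "libx264", "-tune", "stillimage", "-c:a", "aac",
         "-shortest", out_mp4],
        check=True,
    )
```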
In blinded evaluations, physicians misclassified the source of the differential diagnosis (AI vs. human) in 74% of trials, and rated CaBot more favorably than human discussants across quality, justification, citation, and engagement metrics. This demonstrates that LLMs can now convincingly emulate expert clinical reasoning and presentation, at least in the highly curated CPC format.
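To put the 74% misclassification rate in context, one could test whether reviewers performed better than chance at identifying the AI. The trial count below is hypothetical and is used only to show the calculation.

```python
# Illustrative check of whether reviewers beat chance at spotting the AI.
# n_trials is hypothetical; the paper reports 74% misclassification, but the
# exact denominator is not reproduced here.
from scipy.stats import binomtest

n_trials = 100                       # hypothetical number of blinded trials
n_correct = round(0.26 * n_trials)   # reviewers were correct in ~26% of trials

# H0: reviewers identify the source at the 50% chance rate.
result = binomtest(n_correct, n_trials, p=0.5, alternative="less")
print(f"correct identifications: {n_correct}/{n_trials}, p = {result.pvalue:.4f}")
```

At any realistic trial count, a 26% correct-identification rate falls well below chance, reinforcing that reviewers could not reliably tell CaBot's differentials from the human discussants'.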
Historical Benchmarking and Disease Evolution
The authors benchmarked LLMs and human discussants across 5,673 CPCs spanning a century. o3 matched or exceeded expert discussant performance in every decade after the 1960s, with the largest margin in the 1940s–1950s (o3: 74%, experts: 62%). Diagnostic accuracy peaked in the 2000s–2010s (o3: 87%, experts: 86%). The analysis also documents the shifting epidemiology of CPC cases, with infectious diseases dominating early decades and malignancies rising post-1950s, reflecting both real-world disease trends and editorial selection.
Implications for Medical AI and Future Directions
Theoretical and Practical Implications
- Scaling Laws Dominate: The primary driver of LLM performance on medical reasoning tasks is model scale, not prompt engineering or domain-specific fine-tuning. The transition from GPT-3.5 (44% accuracy) to o3 (84%) occurred without additional clinical fine-tuning, paralleling trends in general ML.
- Limits of Current Benchmarks: CPCs, while valuable, are highly curated and information-dense, potentially overestimating real-world diagnostic performance. The tasks in CPC-Bench do not encompass long-context reasoning, structured EHR data, or outcome prediction.
- Multimodal Integration Remains Challenging: Despite strong text-based performance, LLMs lag in image interpretation and literature retrieval, with significant room for improvement in multimodal clinical reasoning.
- Human-AI Indistinguishability: The inability of physicians to reliably distinguish AI-generated from human-generated differentials, and their preference for the former, suggests that LLMs can now serve as credible educational tools and potentially as clinical decision support in structured settings.
Future Developments
- Expanded Benchmarks: Incorporation of long-context reasoning, structured data, and real-world clinical outcomes will be necessary to further stress-test and advance medical AI.
- Specialist Involvement: Broader annotation and evaluation by subspecialists will be required to assess generalizability beyond internal medicine.
- Open Resources: The public release of CPC-Bench, Dr. CaBot, and leaderboards provides a foundation for transparent, reproducible, and longitudinal tracking of progress in medical AI.
Conclusion
This work establishes a new standard for evaluating medical AI by leveraging a century of CPCs, introducing a comprehensive benchmark and an AI discussant capable of emulating expert reasoning and presentation. While LLMs now rival or surpass human experts in text-based diagnostic reasoning and presentation, significant challenges remain in literature retrieval, image interpretation, and multimodal integration. The public release of CPC-Bench and Dr. CaBot will facilitate ongoing research, benchmarking, and progress tracking, supporting both the development of more capable clinical AI systems and the rigorous assessment of their limitations.