CPC-Bench: Benchmarking Medical AI Reasoning
- CPC-Bench is a physician-validated benchmarking resource that assesses AI clinical reasoning using over 7,100 NEJM CPC cases and 1,021 Image Challenge cases.
- It organizes evaluation into 10 discrete tasks including differential diagnosis, test planning, literature search, image analysis, and formal case presentation.
- State-of-the-art models, including the agentic Dr. CaBot system, now match or exceed physician performance on text-based reasoning tasks, while image-based diagnosis and literature retrieval remain comparatively weak.
CPC-Bench is a physician-validated benchmarking resource developed to assess and track the clinical reasoning, diagnostic skill, and presentation abilities of AI models, using a century-spanning dataset derived from New England Journal of Medicine Clinicopathological Conferences (CPCs) and NEJM Image Challenges. Unlike benchmarks that evaluate AI solely on end-point diagnoses, CPC-Bench covers the complete spectrum of expert clinical case analysis, spanning both textual and multimodal tasks, with a focus on realistically simulating the expert discussant’s role in complex, real-world cases.
1. Historical Motivation and Scope
CPC-Bench originates from over 7,100 NEJM CPCs (1923–2025) and 1,021 NEJM Image Challenge cases. These resources provide an extensive and systematically curated body of clinical case material that has long served both as a scientific archive and a pedagogical tool for testing sophisticated medical reasoning. Existing AI evaluations have traditionally measured the rate at which models arrive at a correct diagnosis, but have not systematically quantified granular reasoning steps, diagnostic trajectory, management recommendations, literature retrieval, or the generation of expert-level presentations.
CPC-Bench is explicitly constructed to address these deficiencies by organizing the evaluation space into 10 discrete tasks designed to model authentic expert workflow. These tasks include differential diagnosis generation, test planning, literature search and citation, event-level reasoning, image-based diagnosis, question answering, information omission, and formal case presentation—each with corresponding datasets and rigorous performance rubrics.
2. Task Structure and Data Organization
The CPC-Bench framework divides its evaluation corpus into the following task categories, each reflecting key activities in expert clinical reasoning:
| Task Name | Input Modality | Evaluation Metric(s) |
|---|---|---|
| Differential Diagnosis (DDx) | Free-text (“Presentation of Case”) | Top-1 and top-10 accuracy, mean rank |
| Testing Plan | Free-text | 0–2 rubric (LLM judge, physician-validated) |
| Literature Search | Free-text with citation | Top-10 hit rate, accuracy by year/subfield |
| Diagnostic Touchpoints | Event-segmented text | Accuracy per “event,” stepwise rank change |
| QA + Visual QA | Text, VQA images | Exact-match accuracy |
| Information Omission | Case with normal findings removed | Top-10 DDx accuracy |
| Visual Differential Diagnosis | Image (with/without vignette) | Percent accuracy by specialty |
| NEJM Image Challenge | Image, MCQ | Accuracy by specialty and prompt type |
| Formal Presentations | Model-generated text, slides | Human/LLM preference, misclassification rate |
For many tasks, gold standards and scoring are derived directly from physician annotation or validated LLM evaluation pipelines. For example, the differential diagnosis task measures whether the final diagnosis is present in the model’s top-n ranked list (e.g., 60% top-1, 84% top-10 accuracy for the o3 model), with confidence intervals computed via the Clopper–Pearson method. Testing plan tasks use a rubric that denotes “fully correct,” “partially matching,” or “nonhelpful” responses (0–2 scale), assessed by both automated and physician judges.
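To make the scoring concrete, a minimal sketch of top-n accuracy with a Clopper–Pearson interval is given below. The function names, substring-based matching, and data layout are illustrative assumptions, not the benchmark’s released scoring code, which relies on physician or LLM adjudication of diagnosis matches.

```python
from scipy.stats import beta

def clopper_pearson(successes: int, trials: int, alpha: float = 0.05):
    """Exact (Clopper-Pearson) binomial confidence interval for a proportion."""
    lower = 0.0 if successes == 0 else beta.ppf(alpha / 2, successes, trials - successes + 1)
    upper = 1.0 if successes == trials else beta.ppf(1 - alpha / 2, successes + 1, trials - successes)
    return lower, upper

def top_n_accuracy(ranked_ddx_lists, gold_diagnoses, n=10):
    """Fraction of cases whose final diagnosis appears in the model's top-n list.

    Simple substring matching stands in here for the physician/LLM match
    adjudication used in the actual pipeline.
    """
    hits = sum(
        any(gold.lower() in candidate.lower() for candidate in ranked[:n])
        for ranked, gold in zip(ranked_ddx_lists, gold_diagnoses)
    )
    accuracy = hits / len(gold_diagnoses)
    return accuracy, clopper_pearson(hits, len(gold_diagnoses))
```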
3. Evaluation Methodologies and Statistical Framework
CPC-Bench employs multiple forms of quantitative and qualitative assessment, including:
- Rank-based accuracy: For differential diagnosis and literature search, evaluation is based on the presence and ordering of correct entities in ranked model outputs.
- Stepwise analysis (“Touchpoints”): Cases are segmented into sequential clinical events; at each stage the diagnosis list is updated, and performance is tracked by mean and median rank as more information becomes available. Bootstrap resampling provides confidence intervals on incremental performance changes (see the bootstrap sketch after this list).
- Exact-match and multiple choice accuracy: Applied in QA, VQA, and Image Challenge tasks.
- LLM- and expert-annotated rubrics: Used in testing plan, information omission, and presentation quality assessment, benchmarked against validated human scoring.
- Embedding-based retrieval: For literature citation retrieval, OpenAI text-embedding models are used to build a nearest-neighbors index over 1.6M abstracts, with Euclidean distance used to rank candidates (see the retrieval sketch after this list).
- Blind source comparison: For the presentation task, physicians were challenged to distinguish AI-generated differentials from human CPC discussant text; the source was misclassified in 74% of cases, indicating high-fidelity AI emulation.
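For the “Touchpoints” analysis, the incremental confidence intervals could be obtained with a percentile bootstrap along the following lines. This is a sketch under the assumption that the gold diagnosis’s rank is tracked per case at consecutive events; it is not the released evaluation code.

```python
import numpy as np

def bootstrap_rank_change_ci(ranks_before, ranks_after, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean change in the gold diagnosis's rank
    between two consecutive clinical events (negative values = improvement)."""
    rng = np.random.default_rng(seed)
    deltas = np.asarray(ranks_after, dtype=float) - np.asarray(ranks_before, dtype=float)
    # Resample cases with replacement and recompute the mean rank change each time.
    samples = rng.integers(0, len(deltas), size=(n_boot, len(deltas)))
    boot_means = deltas[samples].mean(axis=1)
    return float(np.quantile(boot_means, alpha / 2)), float(np.quantile(boot_means, 1 - alpha / 2))
```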
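The embedding-based citation retrieval can likewise be sketched as a nearest-neighbors search over abstract embeddings. The specific embedding model and index implementation below are assumptions; the source states only that OpenAI text embeddings and Euclidean distance are used.

```python
import numpy as np
from openai import OpenAI
from sklearn.neighbors import NearestNeighbors

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts, model="text-embedding-3-small"):  # model choice is an assumption
    response = client.embeddings.create(model=model, input=texts)
    return np.array([item.embedding for item in response.data])

def build_index(abstracts):
    """Fit a Euclidean nearest-neighbors index over abstract embeddings.
    (Batching over ~1.6M abstracts and persisting the index are omitted for brevity.)"""
    return NearestNeighbors(n_neighbors=10, metric="euclidean").fit(embed(abstracts))

def search(index, abstracts, query):
    """Return the top-10 abstracts closest to the query, with distances."""
    distances, indices = index.kneighbors(embed([query]))
    return [(abstracts[i], float(d)) for i, d in zip(indices[0], distances[0])]
```

In practice the abstracts would be embedded in batches and the index stored on disk, but the ranking logic is the same.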
4. Performance of State-of-the-Art AI Models
CPC-Bench provides a standardized platform for head-to-head comparison of LLMs and retrieval-augmented systems in medical AI. The leading o3 (OpenAI) model ranked the correct diagnosis first in 60% of contemporary CPCs and within the top 10 in 84%, outperforming a baseline of 20 physicians. Testing plan selection accuracy reached 98%. Performance was lower on multimodal and literature retrieval tasks, with o3 and Gemini 2.5 Pro (Google) achieving 67% accuracy on image challenges. Event-level physician annotation of incremental diagnostic accuracy enables fine-grained quantification of how model reasoning evolves as information accumulates.
A salient result is that for most text-based differential diagnosis and reasoning tasks, LLMs now exceed or rival expert physician performance. However, there are persistent deficits in literature search (correct citation ranking) and nuanced image interpretation (particularly in specialties such as radiology).
5. Dr. CaBot: Agentic AI Discussant and Presentation Generator
In parallel with CPC-Bench, the “Dr. CaBot” system was developed as an agentic AI expert modeled on the workflow of NEJM CPC discussants. Dr. CaBot operates as follows:
- Consumes only the “Presentation of Case” and retrieves the two most similar historical CPCs from a pool of more than 6,000 cases using embedding similarity.
- Composes differentials by stylistically mimicking the “Differential Diagnosis” sections of these comparators, yielding text and slide-based presentations in the expert format (including LaTeX Beamer slide code).
- Iteratively queries a custom clinical literature search engine accessing >1.6 million curated journal abstracts, citing relevant literature to support reasoning.
- In blinded studies, physicians could not reliably distinguish Dr. CaBot’s text from that of human experts and consistently rated its differential quality, diagnostic justification, citation quality, and learner engagement as equal or superior.
For example, Dr. CaBot’s slide code might render:
```latex
\begin{frame}
  \frametitle{Differential Diagnosis}
  \begin{itemize}
    \item Granulomatous hepatitis
    \item Drug-induced liver injury
    \item Viral hepatitis
    \item ...
  \end{itemize}
\end{frame}
```
6. Broader Implications for Medical AI Evaluation
CPC-Bench, by encompassing text-based, multimodal, and longitudinal event-based assessment, responds to the need for benchmarks that reflect the authentic complexity of clinical reasoning. Its release alongside Dr. CaBot establishes a rigorous, transparent infrastructure for tracking the progress of medical AI across multiple dimensions:
- It enables comparison of LLMs not just on diagnostic accuracy, but also on test planning, literature retrieval, incremental reasoning, image analysis, and practitioner-facing presentation.
- The ability of agentic systems like CaBot to generate high-quality, human-imitative clinical presentations points to new benchmarks for medical education and decision support.
- Persistent weaknesses (e.g., image-based diagnosis, literature search) delineate the present boundaries of LLM capability and indicate areas requiring further research and technical advancement.
A plausible implication is that the comprehensive nature of CPC-Bench may drive the next stage of evaluation for medical AI, emphasizing not just accuracy, but reasoning transparency, realism, and multimodal fluency.
7. Significance, Limitations, and Future Directions
CPC-Bench stands out in its scope (a century of cases), granularity (discrete task segmentation), and validation (physician annotation, LLM/human adjudication). Its data-driven, statistically robust approach enables continuous tracking of LLM progress, capturing both advances and persistent challenges. However, image interpretation and literature search remain suboptimal relative to expert clinicians, and additional work is necessary to expand specialty coverage, reduce bias, and further automate rubric-based judgment.
Looking forward, the release of CPC-Bench and CaBot is positioned to catalyze research in diagnostic reasoning, agentic clinical AI, and multimodal fusion, offering an evolving testbed for ongoing improvement in clinical AI systems.