NEJM CPC Cases: A Benchmark in Diagnostic Reasoning
- NEJM Clinicopathological Conference cases are real-world diagnostic puzzles featuring multifaceted, temporally ordered clinical data and autopsy-confirmed diagnoses.
- They drive methodological advances in computational models, including noisy OR-gate and deep learning approaches, to enhance diagnostic reasoning.
- Benchmark datasets derived from these cases enable rigorous evaluation of both human experts and AI systems in complex, real-world scenarios.
The New England Journal of Medicine Clinicopathological Conference (NEJM CPC) cases are a benchmark for complex diagnostic reasoning, representing authentic real-world clinical puzzles that challenge both experienced physicians and computational models. These cases are characterized by multifaceted presentations, a high prevalence of rare diseases, temporally ordered and evolving clinical information, and confirmatory gold-standard diagnoses, often derived from autopsy or comprehensive clinical consensus. The analysis and application of NEJM CPC cases have spurred significant methodological advances in medical decision support, dataset construction, LLM evaluation, and computational pathology.
1. Structure and Nature of NEJM CPC Cases
NEJM CPC cases are structured as narrative diagnostic challenges, typically sourced from the Massachusetts General Hospital (MGH) and other tertiary care centers. Each case is based on a real patient and unfolds through temporally sequenced clinical observations, including initial presentations, laboratory and imaging findings, and consult notes across specialties. These features mirror the stepwise reasoning of clinical practice, requiring the integration of evolving information to construct and refine differential diagnoses.
Cases often feature:
- Rare or atypically presenting diseases.
- Multiple concurrent diagnoses (one to seven target diagnoses per case in NEJM-inspired datasets).
- Chronicling of hypotheses, iterative collection of evidence, and diagnostic closure.
- Gold-standard reference diagnoses, pathologically or otherwise confirmed.
This structure, coupled with the case complexity, renders NEJM CPCs ideal benchmarks for both human diagnosticians and AI systems.
2. Computational Models in Diagnostic Reasoning
The evaluation of probabilistic inference models on CPC cases has illuminated key limitations and strengths in computational reasoning under uncertainty.
Noisy OR-Gate, Multimembership Bayes, and Simple Bayes Models (1303.1463)
- Noisy OR-gate Model: Assumes marginal independence of diseases, conditional independence of findings, and models the interaction between diseases and findings through a noisy OR mechanism. Causal independence is captured, enabling partial competition among diseases for explanatory power. Posterior probability calculation uses disease priors, disease–finding conditional probabilities, and leak probabilities (see the toy implementation after the comparison table below).
- Multimembership Bayes Model: Models each disease in isolation, ignoring interaction, yielding systematic overestimation of comorbid diagnosis probabilities due to double-counting of explanatory findings.
- Simple Bayes Model: Restricts reasoning to a single disease, enforcing mutually exclusive diagnoses, which systematically underestimates disease probabilities in the presence of true comorbidity.
Empirical evaluation on 20 CPC cases demonstrated that the noisy OR-gate model yields probability distributions most closely matching gold-standard (autopsy-confirmed) diagnoses. In contrast, the multimembership Bayes model often grossly overestimates probabilities due to evidence overcounting, and the simple Bayes model underestimates comorbid diagnoses due to forced exclusivity. These results highlight the need for causal interaction modeling in diagnostic AI and establish the noisy OR-gate model’s utility, while identifying substantial residual gaps for future refinement.
| Model | Assumptions | Posterior Estimate Pattern | Main Diagnostic Flaw |
|---|---|---|---|
| Simple Bayes | Mutually exclusive diseases; independent findings | Underestimates when multi-fault | Misses comorbidities |
| Multimembership Bayes | Marginally independent diseases; isolated findings | Overestimates (double-counts evidence) | Inflates comorbidities |
| Noisy OR-gate | Marginal independence; noisy OR causal independence | Best match; intermediate | Calibration/refinement gap |
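To make the noisy OR mechanism concrete: the probability that a finding $f$ is present given the set $S$ of present diseases is $P(f^{+} \mid S) = 1 - (1 - \ell_f)\prod_{d \in S}(1 - q_{df})$, where $\ell_f$ is the leak probability and $q_{df}$ is the probability that disease $d$ alone causes $f$. The sketch below enumerates disease subsets under a toy network of this kind; all priors, link strengths, and leak values are hypothetical, and exact enumeration is feasible only at toy scale (QMR-sized networks require approximate inference).

```python
import itertools

# Toy noisy OR-gate network: all numbers below are hypothetical.
PRIORS = {"D1": 0.02, "D2": 0.05}                # marginal disease priors
LEAK = {"F1": 0.01, "F2": 0.01}                  # P(finding | no modeled cause)
LINK = {("D1", "F1"): 0.90, ("D2", "F1"): 0.40,  # P(disease alone causes finding)
        ("D2", "F2"): 0.70}

def p_finding_present(finding, present_diseases):
    """Noisy OR: the finding is absent only if the leak and every
    present cause independently fail to produce it."""
    q_absent = 1.0 - LEAK[finding]
    for d in present_diseases:
        q_absent *= 1.0 - LINK.get((d, finding), 0.0)
    return 1.0 - q_absent

def posterior(observed_findings):
    """Posterior over disease subsets given positively observed findings."""
    diseases = list(PRIORS)
    joint = {}
    for mask in itertools.product([False, True], repeat=len(diseases)):
        present = {d for d, on in zip(diseases, mask) if on}
        p = 1.0
        for d in diseases:                        # subset prior
            p *= PRIORS[d] if d in present else 1.0 - PRIORS[d]
        for f in observed_findings:               # finding likelihoods
            p *= p_finding_present(f, present)
        joint[frozenset(present)] = p
    z = sum(joint.values())
    return {s: p / z for s, p in joint.items()}

for subset, p in sorted(posterior({"F1", "F2"}).items(), key=lambda kv: -kv[1]):
    print(sorted(subset) or ["no disease"], f"{p:.4f}")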
3. Benchmark Datasets Inspired by NEJM CPCs
Recent research has produced several benchmarking datasets that expand the scope and reproducibility of CPC-inspired diagnostic evaluation.
DC³ — Diagnostic Case Challenge Collection (1908.08581)
DC³ compiles 31 diagnostically challenging NEJM CPC-MGH cases, each presented as temporally ordered, physician-generated observations paired with a gold-standard diagnosis. Cases are annotated with Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs), supporting multi-label classification (mean 6.9 diagnoses per case). The dataset links cases to relevance-mapped PubMed biomedical literature using high-performance Named Entity Recognition (NER), enabling robust evaluation of information retrieval (IR) systems with metrics such as normalized Discounted Cumulative Gain (nDCG).
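For reference, nDCG in its standard graded-relevance form (the dataset's exact variant may differ) rewards rankings that place relevant literature near the top:

$$\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i+1)}, \qquad \mathrm{nDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k},$$

where $rel_i$ is the relevance grade of the document at rank $i$ and $\mathrm{IDCG}@k$ is the DCG of the ideal ordering.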
This format supports research into literature-based decision support, algorithmic benchmarking, and the development of evaluation pipelines aligned with authentic clinical reasoning.
CUPCase Dataset (2503.06204)
The CUPCase dataset generalizes this paradigm, curating 3,562 real-world case reports with free-text and multiple-choice gold-standard diagnoses, explicitly selected for rare diseases, uncommon presentations, and uncharacteristic treatment responses. Diagnoses are assessed by both MCQ accuracy and semantic similarity (BERTScore F1) in open-ended (free-text) generation.
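BERTScore compares a candidate diagnosis against the reference by greedily matching contextual token embeddings; in its standard formulation, with reference tokens $x$ and candidate tokens $\hat{x}$,

$$R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^{\top} \hat{x}_j, \qquad P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^{\top} \hat{x}_j, \qquad F_{\mathrm{BERT}} = \frac{2\, P_{\mathrm{BERT}} R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}.$$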
Empirical findings show that the general-purpose GPT-4o model attains MCQ accuracy of 87.9% and BERTScore F1 of 0.764, outperforming both domain-general and clinical LLMs. Importantly, these models maintain high diagnostic power with partial information (first 20% of case tokens), highlighting their utility in simulating stepwise diagnostic inference as in CPCs.
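A minimal scoring sketch using the open-source bert_score package; the candidate and reference strings below are invented, and CUPCase's actual evaluation harness and scorer configuration may differ:

```python
from bert_score import score

# Hypothetical model outputs and gold diagnoses for illustration.
candidates = ["Disseminated histoplasmosis", "Acute HIV infection"]
references = ["Progressive disseminated histoplasmosis", "Acute retroviral syndrome"]

# Returns per-pair precision, recall, and F1 tensors computed from
# contextual embeddings of a pretrained transformer.
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())  # corpus-level BERTScore F1
```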
4. Sequential Diagnostic Evaluation and AI-Orchestrated Panels
Recent advances advocate for stepwise, interactive diagnostic evaluation that mirrors real clinical workflows, moving beyond static vignette evaluation.
Sequential Diagnosis Benchmark (SDBench) and MAI Diagnostic Orchestrator (MAI-DxO) (2506.22405)
SDBench transforms 304 NEJM CPCs into interactive encounters. Agents (human or AI) receive minimal initial information and proceed to iteratively request history, order tests, or make diagnostic decisions. All information is revealed only in response to explicit queries, mediated by a gatekeeper model. Diagnostic performance is measured both in terms of accuracy and cost, reflecting cumulative physician visits and test expenditures.
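A hypothetical skeleton of this gatekeeper interaction is sketched below; the real benchmark mediates queries through an LLM gatekeeper and uses a detailed cost model, and the fact keys, prices, and base visit fee here are invented:

```python
from dataclasses import dataclass, field

@dataclass
class Encounter:
    case_facts: dict          # query key -> finding, revealed only on request
    test_costs: dict          # test name -> price in USD
    spent: float = 300.0      # assumed base physician-visit fee (illustrative)
    transcript: list = field(default_factory=list)

    def ask(self, query: str) -> str:
        """Gatekeeper: reveal information only in response to an explicit query,
        accumulating cost as tests are ordered."""
        self.spent += self.test_costs.get(query, 0.0)
        answer = self.case_facts.get(query, "Not available.")
        self.transcript.append((query, answer))
        return answer

enc = Encounter(
    case_facts={"history": "3 weeks of fever and night sweats",
                "CBC": "WBC 18k with marked eosinophilia"},
    test_costs={"CBC": 40.0},
)
print(enc.ask("history"))
print(enc.ask("CBC"), f"(cumulative cost: ${enc.spent:,.0f})")
```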
MAI-DxO orchestrates AI reasoning via a simulated panel of “virtual physicians” with distinct reasoning roles: differential maintenance, test selection, adversarial challenge (to counter anchoring and premature closure), stewardship (cost and test appropriateness), and checklist assurance. The orchestrator systematically applies Bayesian updates and information-theoretic test selection.
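One standard formalization of this step (a sketch, not necessarily the paper's exact objective) selects the next test $t$ to maximize expected information gain about the diagnosis $D$, normalized by test cost $c(t)$:

$$t^{*} = \arg\max_{t} \; \frac{H(D \mid e) - \mathbb{E}_{o \sim P(o \mid t, e)}\left[ H(D \mid e, o) \right]}{c(t)},$$

where $e$ is the evidence gathered so far and $o$ ranges over the possible outcomes of test $t$.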
Empirical results show MAI-DxO achieves 80% diagnostic accuracy, quadrupling the 20% accuracy observed for generalist physicians, and reduces diagnostic costs by 20% compared to physicians and by 70% compared to off-the-shelf LLMs. Ensemble strategies yield up to 85.5% accuracy. Performance improvements generalize across multiple LLM providers (OpenAI, Gemini, Claude, Grok, DeepSeek, Llama).
| Agent | Accuracy (%) | Average Cost (USD) |
|---|---|---|
| Generalist Physicians | 19.9 | 2,963 |
| Off-the-shelf GPT-4o | 49.3 | 2,745 |
| Off-the-shelf o3 | 78.6 | 7,850 |
| MAI-DxO (o3, No Budget) | 81.9 | 4,735 |
| MAI-DxO (Ensemble) | 85.5 | 7,184 |
5. Deep Learning Applications in Pathology and Diagnostic Decision Support
Integrative deep learning approaches have emerged as assistive tools in the interpretation of complex clinical and pathological cases.
Deep Learning-Based Computational Pathology (2006.13932)
For Cancers of Unknown Primary (CUP), the TOAD model leverages multi-task, attention-based deep learning applied to high-resolution digitized histopathology slides, generating probabilistic differentials for tumor origin. Using a training corpus of 17,486 whole slide images (across 18 primary origins), TOAD attains internal test top-1 accuracy of 83.6% and top-3 accuracy of 94.4%; on external tests (662 cases), accuracy is 78.5% (top-1) and 92.6% (top-3), with strong performance even in unseen institutional settings.
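The sketch below shows attention-based multiple-instance pooling in the spirit of such slide-level classifiers; the dimensions, gated-attention scheme, and paired task heads are illustrative assumptions, not the published TOAD architecture:

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Gated-attention pooling over patch embeddings with two task heads
    (tumor origin and primary-vs-metastatic); hyperparameters are illustrative."""
    def __init__(self, feat_dim=1024, hidden=256, n_origins=18):
        super().__init__()
        self.attn_V = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh())
        self.attn_U = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden, 1)
        self.origin_head = nn.Linear(feat_dim, n_origins)  # tumor-origin logits
        self.site_head = nn.Linear(feat_dim, 2)            # primary vs. metastatic

    def forward(self, patch_feats):                        # (n_patches, feat_dim)
        a = self.attn_w(self.attn_V(patch_feats) * self.attn_U(patch_feats))
        a = torch.softmax(a, dim=0)                        # attention over patches
        slide = (a * patch_feats).sum(dim=0)               # attention-pooled slide embedding
        return self.origin_head(slide), self.site_head(slide), a

model = AttentionMIL()
origin_logits, site_logits, attn = model(torch.randn(500, 1024))
probs = torch.softmax(origin_logits, dim=-1)               # differential over origins
print(probs.topk(3))                                       # top-3 candidate origins
```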
The model's output (differential diagnosis probabilities and attention maps) enables streamlined clinical triage for further immunohistochemistry, molecular studies, or imaging, particularly benefiting resource-constrained environments by reducing reliance on extensive testing.
6. Evaluation of LLMs on Open-Ended Clinical Reasoning
LLMs have been benchmarked on CPC-style cases to assess their capability in generating clinically meaningful differentials.
LLMs on Case Records (2305.05609)
GPT-4 and GPT-3.5, when prompted for top-three diagnoses on 50 recent NEJM CPC cases, achieved top-1 diagnosis accuracy of 26% and 22% (respectively), improving to 46% and 42% when considering any of the top three suggestions. Diagnostic test selection followed similar patterns. Repeated trials yielded 62% recovery of the correct diagnosis within fifteen attempts. These findings demonstrate that current LLMs can usefully expand the differential but are not yet precise enough for independent diagnostic closure in complex, open-ended scenarios.
| Metric | GPT-4 | GPT-3.5 |
|---|---|---|
| Correct Diagnosis (Choice 1) | 26% | 22% |
| Correct Diagnosis (Any of Choices 1–3) | 46% | 42% |
| Correct Diagnostic Test (Choice 1) | 28% | 24% |
| Correct Diagnostic Test (Choices 1–3) | 44% | 50% |
A plausible implication is that current LLMs are best positioned as adjuncts that support differential diagnosis, with their main value lying in broadening the range of hypotheses considered rather than in providing definitive answers.
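The top-1 versus top-3 distinction above corresponds to standard top-k accuracy. A minimal sketch follows; note that in the study matches were adjudicated by physicians, whereas this illustration uses naive string normalization (a simplifying assumption):

```python
def normalize(dx: str) -> str:
    """Crude stand-in for physician adjudication of diagnostic equivalence."""
    return dx.lower().strip()

def top_k_accuracy(ranked_lists, gold, k=3):
    """Fraction of cases where any of the top-k ranked diagnoses matches gold."""
    hits = sum(
        any(normalize(p) == normalize(g) for p in ranked[:k])
        for ranked, g in zip(ranked_lists, gold)
    )
    return hits / len(gold)

preds = [["sarcoidosis", "tuberculosis", "lymphoma"]]  # one case's ranked differential
print(top_k_accuracy(preds, ["Lymphoma"], k=3))        # -> 1.0
```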
7. Implications and Future Directions
The systematic evaluation of NEJM CPC cases—using both probabilistic models and state-of-the-art AI systems—reveals crucial lessons for computational clinical diagnosis:
- Modeling of disease interactions, competition, and comorbidity (as in the noisy OR-gate framework) is necessary to approximate real-world diagnostic reasoning.
- Temporal narrative datasets and benchmarks (DC³, CUPCase, SDBench) advance the evaluation of both human and artificial reasoning, emphasizing stepwise information accrual and evidence-based cost effectiveness.
- Orchestrated, panel-based LLM approaches that integrate adversarial roles and cost stewardship (MAI-DxO) outperform both naive models and generalist physicians, offering potential pathways toward scalable, equitable expert consultation in clinical care.
- Deep learning systems in pathology (e.g., TOAD) highlight the promise and limitations of data-driven diagnosis for complex entity recognition tasks, especially in under-resourced settings.
- Limitations of current models—including evidence overcounting, forced exclusivity, lack of temporal reasoning, and challenges with open-ended rare cases—define concrete targets for future improvement.
Collectively, the NEJM CPC tradition continues to drive methodological rigor and innovation in computational diagnostics, serving as a cornerstone for the development and benchmarking of next-generation clinical decision support systems.