Sequential Diagnosis with Language Models (2506.22405v1)

Published 27 Jun 2025 in cs.CL

Abstract: Artificial intelligence holds great promise for expanding access to expert medical knowledge and reasoning. However, most evaluations of LLMs rely on static vignettes and multiple-choice questions that fail to reflect the complexity and nuance of evidence-based medicine in real-world settings. In clinical practice, physicians iteratively formulate and revise diagnostic hypotheses, adapting each subsequent question and test to what they've just learned, and weigh the evolving evidence before committing to a final diagnosis. To emulate this iterative process, we introduce the Sequential Diagnosis Benchmark, which transforms 304 diagnostically challenging New England Journal of Medicine clinicopathological conference (NEJM-CPC) cases into stepwise diagnostic encounters. A physician or AI begins with a short case abstract and must iteratively request additional details from a gatekeeper model that reveals findings only when explicitly queried. Performance is assessed not just by diagnostic accuracy but also by the cost of physician visits and tests performed. We also present the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians, proposes likely differential diagnoses and strategically selects high-value, cost-effective tests. When paired with OpenAI's o3 model, MAI-DxO achieves 80% diagnostic accuracy--four times higher than the 20% average of generalist physicians. MAI-DxO also reduces diagnostic costs by 20% compared to physicians, and 70% compared to off-the-shelf o3. When configured for maximum accuracy, MAI-DxO achieves 85.5% accuracy. These performance gains with MAI-DxO generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families. We highlight how AI systems, when guided to think iteratively and act judiciously, can advance diagnostic precision and cost-effectiveness in clinical care.

Summary

  • The paper presents SDBench and MAI-DxO, a sequential diagnostic framework that simulates a virtual panel of physicians to markedly improve diagnostic accuracy and cost efficiency.
  • The methodology employs a multi-agent orchestration where language models simulate distinct clinical roles to iteratively gather and validate patient information.
  • Empirical results demonstrate significant gains over traditional methods, with LM-driven approaches achieving up to 80% accuracy and substantially lower costs compared to clinician baselines.

Sequential Diagnosis with LLMs: An Expert Overview

The paper "Sequential Diagnosis with LLMs" (2506.22405) presents a comprehensive framework for evaluating and improving the diagnostic reasoning capabilities of LLMs (LMs) in clinical settings. The authors introduce the Sequential Diagnosis Benchmark (SDBench), a novel evaluation suite that transforms 304 New England Journal of Medicine (NEJM) clinicopathological conference (CPC) cases into interactive, stepwise diagnostic encounters. They further propose the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestration system that simulates a virtual panel of physicians to optimize both diagnostic accuracy and cost-effectiveness.

Benchmark Design and Methodology

SDBench is constructed to emulate the iterative, information-seeking process characteristic of real-world clinical diagnosis. Each case begins with a brief patient vignette, after which the diagnostic agent (human or AI) must iteratively request additional information, either by asking questions or ordering diagnostic tests, before committing to a final diagnosis. The information flow is mediated by a Gatekeeper agent, implemented as an LM with access to the full case file, which reveals findings only when explicitly queried. This design enforces a realistic constraint on information access and prevents leakage of diagnostic clues.
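
The encounter protocol can be pictured as a simple turn loop. The Python sketch below is illustrative only: the agent, gatekeeper, case, and action interfaces are assumed stand-ins, not an API published with the paper.

```python
# Illustrative sketch of one SDBench encounter; all interfaces here
# (agent, gatekeeper, case, action) are hypothetical stand-ins.

def run_encounter(agent, gatekeeper, case, max_turns: int = 20) -> str:
    """The agent starts from the case abstract and iterates: ask a question
    or order a test, receive only the explicitly queried finding, repeat."""
    transcript = [case.abstract]                    # short initial vignette
    for _ in range(max_turns):
        action = agent.next_action(transcript)      # question, test, or diagnosis
        if action.kind == "diagnose":
            return action.text                      # commit to a final diagnosis
        finding = gatekeeper.reveal(case, action)   # answers only what was asked
        transcript.append(f"{action.text} -> {finding}")
    return agent.final_diagnosis(transcript)        # turn budget exhausted
```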

A key innovation is the Gatekeeper's ability to synthesize plausible, case-consistent findings for queries not covered in the original CPC narrative, thereby maintaining clinical realism and avoiding implicit signaling from missing data. The evaluation protocol incorporates both diagnostic accuracy (using a clinically validated, rubric-based Judge agent) and cumulative diagnostic cost, the latter estimated via CPT code mapping and U.S. health system pricing data.
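
Both evaluation axes reduce to straightforward computations once a judge verdict and a price table are available. The sketch below is a plausible rendering under stated assumptions: the CPT prices, the flat visit fee, and the rubric threshold are illustrative values, not figures from the paper.

```python
# Hypothetical scoring of one completed encounter; the visit fee, CPT
# prices, and rubric threshold are assumed for illustration.

VISIT_FEE = 300                      # assumed flat fee per physician visit (USD)
CPT_PRICES = {"85025": 30,           # complete blood count (illustrative price)
              "70450": 900}          # CT head without contrast (illustrative price)

def episode_cost(num_visits: int, ordered_cpt_codes: list[str]) -> int:
    """Cumulative diagnostic cost: visits plus every test mapped via CPT code."""
    return num_visits * VISIT_FEE + sum(CPT_PRICES.get(c, 0) for c in ordered_cpt_codes)

def is_correct(candidate: str, reference: str, judge_lm) -> bool:
    """Rubric-based Judge: an LM grades the candidate diagnosis against the
    NEJM reference; a 5-point scale with 4+ counted correct is assumed here."""
    return judge_lm.grade(candidate=candidate, reference=reference) >= 4
```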

MAI Diagnostic Orchestrator (MAI-DxO)

MAI-DxO operationalizes a multi-agent, role-based approach to diagnostic reasoning. A single LM is prompted to simulate five distinct clinical personas:

  • Dr. Hypothesis: Maintains and updates a probability-ranked differential diagnosis.
  • Dr. Test-Chooser: Selects high-yield diagnostic tests to discriminate among hypotheses.
  • Dr. Challenger: Identifies potential cognitive biases and proposes falsification strategies.
  • Dr. Stewardship: Advocates for cost-effective care and vetoes low-yield, expensive tests.
  • Dr. Checklist: Ensures internal consistency and validity of test requests.

The orchestrator enables consensus-driven decision-making, balancing diagnostic certainty with cost and resource stewardship. Multiple operational modes are explored, including budget-constrained and ensemble configurations, allowing navigation of the accuracy-cost Pareto frontier.
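
A minimal way to realize the panel with a single backend model is round-robin, role-conditioned prompting. The sketch below assumes a generic `lm.complete` text-completion interface; the persona instructions are paraphrased from the paper's role descriptions, not its actual prompts.

```python
# One deliberation round of the virtual panel; lm.complete is a hypothetical
# text-completion interface, and the instructions are paraphrases.

PANEL = {
    "Dr. Hypothesis":   "Update the probability-ranked differential diagnosis.",
    "Dr. Test-Chooser": "Select the tests that best discriminate the leading hypotheses.",
    "Dr. Challenger":   "Flag likely cognitive biases and propose a falsifying test.",
    "Dr. Stewardship":  "Veto low-yield, expensive tests; suggest cheaper alternatives.",
    "Dr. Checklist":    "Verify the requested tests are valid and mutually consistent.",
}

def panel_round(lm, transcript: str) -> dict[str, str]:
    """Query the same LM once per persona; downstream logic merges the five
    outputs into the next question, test order, or final diagnosis."""
    return {role: lm.complete(f"You are {role}. {task}\n\nCase so far:\n{transcript}")
            for role, task in PANEL.items()}
```

Under this framing, the budget-constrained and ensemble modes amount to different stopping and aggregation rules layered on top of the same deliberation round.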

Empirical Results

The evaluation spans both human physicians (n=21, median 12 years of experience) and a suite of state-of-the-art LMs (OpenAI, Gemini, Claude, Grok, DeepSeek, Llama). Key findings include:

  • Physician Baseline: Physicians achieved 19.9% accuracy at an average cost of $2,963 per case on SDBench, underscoring the benchmark's difficulty.
  • Off-the-shelf LMs: Performance varied; GPT-4o reached 49.3% accuracy at $2,745/case, while o3 achieved 78.6% at $7,850/case, indicating a trade-off between accuracy and cost.
  • MAI-DxO Performance: When paired with o3, MAI-DxO achieved 80% accuracy (4x higher than physicians) at a 20% lower cost than physicians and a 70% reduction compared to off-the-shelf o3. The ensemble configuration reached 85.5% accuracy at $7,184/case.
  • Model-Agnostic Gains: MAI-DxO consistently improved diagnostic accuracy and/or cost efficiency across all tested LMs, with particularly pronounced gains for weaker models.

The robustness of these results was confirmed on a held-out test set of 56 recent NEJM cases, published after the training cut-off of the evaluated models, mitigating concerns about memorization or overfitting.

Implications

Practical Implications:

  • Clinical Decision Support: The orchestration framework demonstrates that structured, multi-agent prompting can substantially enhance both the accuracy and efficiency of LM-driven diagnostic agents, surpassing experienced physicians on challenging cases.
  • Cost-Conscious AI: Explicit modeling of diagnostic cost and information value is critical for real-world deployment, especially in resource-constrained settings.
  • Model-Agnostic Deployment: The system's independence from any single LM backend reduces the need for continual re-engineering as new models are released, facilitating sustainable integration into clinical workflows.

Theoretical Implications:

  • Beyond Static Benchmarks: SDBench exposes limitations of static, vignette-based evaluations and provides a more rigorous test of sequential reasoning, information gathering, and decision-making under uncertainty.
  • Cognitive Modeling: The virtual panel approach operationalizes key aspects of human clinical reasoning, such as hypothesis management, adversarial challenge, and stewardship, offering a blueprint for future AI systems that emulate team-based medical practice.

Limitations

  • Case Distribution: The NEJM CPC cases are skewed toward rare and complex diagnoses, limiting generalizability to routine clinical practice and precluding assessment of false positive rates.
  • Cost Estimation: The use of U.S.-centric cost data and omission of non-test-related costs (e.g., patient discomfort, time delays) constrain the fidelity of economic evaluation.
  • Physician Comparison: The study design barred physicians from using external resources, which may understate their real-world performance.

Future Directions

  • Real-World Validation: Prospective studies in everyday clinical environments are needed to assess generalizability and clinical impact.
  • Expanded Benchmarks: Development of diagnostic corpora reflecting real-world prevalence and case mix will be essential for comprehensive evaluation.
  • Educational Applications: The interactive, synthetic findings framework could be leveraged for medical education and training, providing adaptive, AI-guided simulation environments.
  • Multimodal Integration: Incorporating imaging and other sensory modalities may further enhance diagnostic performance and realism.

Conclusion

This work establishes a new standard for evaluating and optimizing AI-driven diagnostic agents, demonstrating that structured orchestration of LMs can achieve high diagnostic accuracy and cost efficiency on complex clinical cases. The SDBench benchmark and MAI-DxO system provide a robust foundation for future research and deployment of AI in healthcare, with significant implications for both clinical practice and the development of cognitively inspired AI systems.
