- The paper presents SDBench, an interactive benchmark built from 304 NEJM clinicopathological conference cases, and MAI-DxO, a sequential diagnostic framework that simulates a virtual panel of physicians to improve both diagnostic accuracy and cost efficiency.
- The methodology employs multi-agent orchestration in which language models simulate distinct clinical roles to iteratively gather and validate patient information.
- Empirical results demonstrate significant gains over baselines: MAI-DxO paired with o3 reaches 80% accuracy at lower cost than physicians, and 85.5% accuracy in an ensemble configuration.
Sequential Diagnosis with LLMs: An Expert Overview
The paper "Sequential Diagnosis with LLMs" (2506.22405) presents a comprehensive framework for evaluating and improving the diagnostic reasoning capabilities of LLMs (LMs) in clinical settings. The authors introduce the Sequential Diagnosis Benchmark (SDBench), a novel evaluation suite that transforms 304 New England Journal of Medicine (NEJM) clinicopathological conference (CPC) cases into interactive, stepwise diagnostic encounters. They further propose the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestration system that simulates a virtual panel of physicians to optimize both diagnostic accuracy and cost-effectiveness.
Benchmark Design and Methodology
SDBench is constructed to emulate the iterative, information-seeking process characteristic of real-world clinical diagnosis. Each case begins with a brief patient vignette, after which the diagnostic agent (human or AI) must iteratively request additional information, either by asking questions or by ordering diagnostic tests, before committing to a final diagnosis. Information flow is mediated by a Gatekeeper agent, an LM with access to the full case file that reveals findings only when explicitly queried. This design enforces a realistic constraint on information access and prevents leakage of diagnostic clues.
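A minimal sketch of this encounter loop is shown below. The helper names (`query_llm`, `Action`, the `agent` interface, the turn limit) are hypothetical stand-ins; the paper's actual prompts and interfaces are not reproduced here.

```python
# Minimal sketch of a SDBench-style encounter loop (hypothetical interfaces).
from dataclasses import dataclass

def query_llm(prompt: str) -> str:
    """Placeholder for any chat-completion backend (the design is model-agnostic)."""
    raise NotImplementedError("plug in an LLM client here")

@dataclass
class Action:
    kind: str      # "ask" | "test" | "diagnose"
    content: str   # question text, test order, or final diagnosis

def gatekeeper_answer(case_file: str, query: str) -> str:
    """Reveal only what is explicitly asked; the real Gatekeeper also
    synthesizes case-consistent findings absent from the CPC narrative."""
    return query_llm(
        "You are the Gatekeeper for this case file:\n" + case_file +
        "\nAnswer ONLY the query below; volunteer nothing else.\nQuery: " + query
    )

def run_encounter(case_file: str, vignette: str, agent, max_turns: int = 20) -> str:
    """Drive one sequential diagnostic encounter to a final diagnosis."""
    transcript = [("vignette", vignette)]
    for _ in range(max_turns):
        action: Action = agent.next_action(transcript)
        if action.kind == "diagnose":
            return action.content
        finding = gatekeeper_answer(case_file, action.content)
        transcript.append((action.content, finding))
    # Turn budget exhausted: force the agent to commit.
    return agent.next_action(transcript, force_diagnosis=True).content
```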
A key innovation is the Gatekeeper's ability to synthesize plausible, case-consistent findings for queries not covered in the original CPC narrative, thereby maintaining clinical realism and avoiding implicit signaling from missing data. The evaluation protocol incorporates both diagnostic accuracy (using a clinically validated, rubric-based Judge agent) and cumulative diagnostic cost, the latter estimated via CPT code mapping and U.S. health system pricing data.
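The cost side of the evaluation can be pictured as a running ledger over ordered tests. The prices, CPT codes, and flat per-visit fee below are illustrative assumptions for this sketch, not the paper's actual pricing tables.

```python
# Illustrative cost ledger for a diagnostic session. Prices and codes here
# are made up; the paper maps ordered tests to CPT codes and U.S. prices.
ILLUSTRATIVE_PRICES = {   # hypothetical CPT code -> price (USD)
    "80053": 33.0,   # comprehensive metabolic panel
    "85025": 28.0,   # complete blood count
    "71045": 120.0,  # chest X-ray
    "70553": 1450.0, # brain MRI with/without contrast
}

VISIT_FEE = 300.0  # flat fee per physician visit (an assumption in this sketch)

def session_cost(ordered_cpt_codes: list[str], n_visits: int = 1) -> float:
    """Cumulative cost = visit fees + price of every ordered test."""
    return n_visits * VISIT_FEE + sum(
        ILLUSTRATIVE_PRICES.get(code, 0.0) for code in ordered_cpt_codes
    )

print(session_cost(["85025", "80053", "71045"]))  # 481.0
```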
MAI Diagnostic Orchestrator (MAI-DxO)
MAI-DxO operationalizes a multi-agent, role-based approach to diagnostic reasoning. A single LM is prompted to simulate five distinct clinical personas (a structural sketch follows the list):
- Dr. Hypothesis: Maintains and updates a probability-ranked differential diagnosis.
- Dr. Test-Chooser: Selects high-yield diagnostic tests to discriminate among hypotheses.
- Dr. Challenger: Identifies potential cognitive biases and proposes falsification strategies.
- Dr. Stewardship: Advocates for cost-effective care and vetoes low-yield, expensive tests.
- Dr. Checklist: Ensures internal consistency and validity of test requests.
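Structurally, the panel can be pictured as one model answering in five voices and then merging those views into a single next action. The role prompts and the `query_llm` helper below are paraphrased assumptions, not the paper's published prompt text.

```python
# Structural sketch of the virtual-panel round: one model, five personas,
# one consensus action. Prompts are paraphrased, not the paper's.
PERSONAS = {
    "Dr. Hypothesis":   "Maintain a probability-ranked differential diagnosis.",
    "Dr. Test-Chooser": "Pick tests that best discriminate the leading hypotheses.",
    "Dr. Challenger":   "Argue against the leading hypothesis; flag anchoring bias.",
    "Dr. Stewardship":  "Veto expensive tests with low expected information value.",
    "Dr. Checklist":    "Check that test names are valid and requests are consistent.",
}

def panel_round(query_llm, transcript: str) -> str:
    """Collect each persona's view, then ask the same model to merge them
    into a single next action (question, test order, or final diagnosis)."""
    views = []
    for name, role in PERSONAS.items():
        views.append(name + ": " + query_llm(
            f"You are {name}. {role}\nCase so far:\n{transcript}\nYour view:"
        ))
    return query_llm(
        "Panel views:\n" + "\n".join(views) +
        "\nAs chair, output ONE next action: ask <question>, "
        "order <tests>, or diagnose <diagnosis>."
    )
```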
The orchestrator enables consensus-driven decision-making, balancing diagnostic certainty with cost and resource stewardship. Multiple operational modes are explored, including budget-constrained and ensemble configurations, allowing navigation of the accuracy-cost Pareto frontier.
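The budget-constrained mode can be sketched as a spending cap wrapped around the panel loop; the $1,500 cap and the forced-stop rule below are illustrative assumptions rather than the paper's exact policy.

```python
# Illustrative budget-constrained mode. BUDGET_USD and the forced-stop
# rule are assumptions for this sketch.
BUDGET_USD = 1500.0

def run_with_budget(panel_step, estimate_cost) -> str:
    """panel_step(force_diagnosis) -> (kind, content);
    estimate_cost(order) -> projected USD for a proposed test order.
    Gather information until the next order would breach the budget,
    then force the panel to commit to its leading diagnosis."""
    spent = 0.0
    while True:
        kind, content = panel_step(force_diagnosis=False)
        if kind == "diagnose":
            return content
        projected = estimate_cost(content)
        if spent + projected > BUDGET_USD:
            _, diagnosis = panel_step(force_diagnosis=True)
            return diagnosis
        spent += projected
```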
Empirical Results
The evaluation spans both human physicians (n=21, median 12 years of experience) and a suite of state-of-the-art LMs from OpenAI, Google (Gemini), Anthropic (Claude), xAI (Grok), DeepSeek, and Meta (Llama). Key findings include:
- Physician Baseline: Physicians achieved 19.9% accuracy at an average cost of $2,963 per case on SDBench, underscoring the benchmark's difficulty.
- Off-the-shelf LMs: Performance varied widely; GPT-4o reached 49.3% accuracy at $2,745/case, while o3 achieved 78.6% at $7,850/case, illustrating a trade-off between accuracy and cost.
- MAI-DxO Performance: Paired with o3, MAI-DxO achieved 80% accuracy (roughly 4x the physician baseline) at a 20% lower cost than physicians and roughly 70% lower than off-the-shelf o3. The ensemble configuration reached 85.5% accuracy at $7,184/case (see the sketch after this list).
- Model-Agnostic Gains: MAI-DxO consistently improved diagnostic accuracy and/or cost efficiency across all tested LMs, with particularly pronounced gains for weaker models.
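One plausible reading of the ensemble configuration referenced above is to run several independent panels and aggregate their final answers; the majority-vote rule below is an assumption for illustration, not the paper's published aggregation method.

```python
# Sketch of ensemble aggregation over independent panel runs (assumed rule).
from collections import Counter

def ensemble_diagnose(run_panel, n_runs: int = 5) -> str:
    """run_panel() -> final diagnosis string from one independent panel."""
    diagnoses = [run_panel() for _ in range(n_runs)]
    # Exact-string majority vote is a crude stand-in; clinically equivalent
    # wordings would need an LM- or rubric-based equivalence check.
    winner, _ = Counter(diagnoses).most_common(1)[0]
    return winner
```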
The robustness of these results was confirmed on a held-out test set of 56 recent NEJM cases, published after the training cut-off of the evaluated models, mitigating concerns about memorization or overfitting.
Implications
Practical Implications:
- Clinical Decision Support: The orchestration framework demonstrates that structured, multi-agent prompting can substantially enhance both the accuracy and efficiency of LM-driven diagnostic agents, surpassing experienced physicians on challenging cases.
- Cost-Conscious AI: Explicit modeling of diagnostic cost and information value is critical for real-world deployment, especially in resource-constrained settings.
- Model-Agnostic Deployment: The system's independence from any single LM backend reduces the need for continual re-engineering as new models are released, facilitating sustainable integration into clinical workflows.
Theoretical Implications:
- Beyond Static Benchmarks: SDBench exposes limitations of static, vignette-based evaluations and provides a more rigorous test of sequential reasoning, information gathering, and decision-making under uncertainty.
- Cognitive Modeling: The virtual panel approach operationalizes key aspects of human clinical reasoning, such as hypothesis management, adversarial challenge, and stewardship, offering a blueprint for future AI systems that emulate team-based medical practice.
Limitations
- Case Distribution: The NEJM CPC cases are skewed toward rare and complex diagnoses, limiting generalizability to routine clinical practice and precluding assessment of false positive rates.
- Cost Estimation: The use of U.S.-centric cost data and omission of non-test-related costs (e.g., patient discomfort, time delays) constrain the fidelity of economic evaluation.
- Physician Comparison: The study design barred physicians from using external resources (e.g., colleagues or reference materials), which may underestimate their real-world performance.
Future Directions
- Real-World Validation: Prospective studies in everyday clinical environments are needed to assess generalizability and clinical impact.
- Expanded Benchmarks: Development of diagnostic corpora reflecting real-world prevalence and case mix will be essential for comprehensive evaluation.
- Educational Applications: The interactive, synthetic findings framework could be leveraged for medical education and training, providing adaptive, AI-guided simulation environments.
- Multimodal Integration: Incorporating imaging and other sensory modalities may further enhance diagnostic performance and realism.
Conclusion
This work establishes a new standard for evaluating and optimizing AI-driven diagnostic agents, demonstrating that structured orchestration of LMs can achieve high diagnostic accuracy and cost efficiency on complex clinical cases. The SDBench benchmark and MAI-DxO system provide a robust foundation for future research and deployment of AI in healthcare, with significant implications for both clinical practice and the development of cognitively inspired AI systems.