MAI-DxO: Diagnostic Orchestration Framework

Updated 1 July 2025

MAI-DxO is a model-agnostic orchestration framework that emulates a clinical expert panel through specialized agent roles to enhance diagnostic accuracy and cost efficiency.
It utilizes an iterative evidence-gathering methodology that mirrors clinical reasoning by dynamically updating differential diagnoses and test selection.
The system achieves up to 85.5% diagnostic accuracy in its accuracy-maximizing configuration, and in its budgeted configuration it lowers per-case cost by about 20% relative to physicians and about 70% relative to off-the-shelf o3.

The MAI Diagnostic Orchestrator (MAI-DxO) is a model-agnostic orchestration framework designed to emulate the diagnostic workflow of a team of medical specialists, with the primary aim of improving diagnostic accuracy and cost-effectiveness in clinical medicine. At its core, MAI-DxO uses a structured, iterative, multi-agent approach to information gathering and test selection, which distills the collective intelligence of a virtual physician panel while controlling resource utilization. MAI-DxO interfaces with state-of-the-art LLMs from various families (including OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama), coordinating their outputs to match or surpass the diagnostic precision of expert clinicians while limiting diagnostic costs.

1. Orchestration of a Virtual Diagnostic Panel

MAI-DxO is explicitly designed to simulate the nuanced, iterative reasoning of a physician panel. Rather than relying on a single model output or a static case vignette, MAI-DxO instantiates five prototypical clinical personas:

Dr. Hypothesis: Maintains a ranked differential diagnosis and updates it in a Bayesian manner as new findings arrive.
Dr. Test-Chooser: Selects the most discriminative and high-yield tests.
Dr. Challenger: Functions as a devil’s advocate, seeking contradictory evidence and challenging early closure or anchoring bias.
Dr. Stewardship: Advocates for cost-effective and high-value testing, vetoing unnecessary expenses.
Dr. Checklist: Ensures orders are valid and maintains internal logic.
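
The Bayesian updating attributed to Dr. Hypothesis can be illustrated with a minimal Python sketch. In MAI-DxO this step is carried out by an LLM following a prompt, not by explicit arithmetic, so the function names, the diagnosis labels, and every probability below are hypothetical and chosen only to show the mechanics of reweighting a differential after one finding.

```python
from typing import Dict, List, Tuple


def bayes_update(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Posterior P(dx | finding) is proportional to P(finding | dx) * P(dx)."""
    unnormalized = {dx: prior[dx] * likelihood.get(dx, 0.0) for dx in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("finding is impossible under every listed diagnosis")
    return {dx: p / total for dx, p in unnormalized.items()}


def ranked(differential: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return diagnoses sorted from most to least probable."""
    return sorted(differential.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical prior differential (illustrative numbers, not clinical data).
prior = {"diagnosis A": 0.5, "diagnosis B": 0.3, "diagnosis C": 0.2}
# Hypothetical P(new finding | diagnosis).
likelihood = {"diagnosis A": 0.4, "diagnosis B": 0.95, "diagnosis C": 0.3}

posterior = bayes_update(prior, likelihood)
for dx, p in ranked(posterior):
    print(f"{dx}: {p:.3f}")
```

With these made-up inputs, the finding moves diagnosis B ahead of diagnosis A, which is the kind of reordering the ranked differential is meant to capture.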

Through prompt orchestration, these agents interact, each fulfilling a specialized cognitive role, before their suggestions are synthesized into a single, stepwise clinical action. This architecture emulates panel consensus and encourages exploration of alternative hypotheses, correction of cognitive errors, and resource stewardship.
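
As a rough illustration of prompt-based role assignment, the sketch below assembles one panel prompt from the five personas. The role duties paraphrase the list above; the prompt wording, the `build_panel_prompt` helper, and the ASK/TEST/DIAGNOSE output convention are assumptions made for this example and are not the prompts used by the authors.

```python
# Role duties paraphrased from the persona list; all wording is illustrative.
PANEL_ROLES = {
    "Dr. Hypothesis": "Maintain a ranked differential and update it after each finding.",
    "Dr. Test-Chooser": "Propose the most discriminative, high-yield tests.",
    "Dr. Challenger": "Argue against the leading hypothesis and look for contradictory evidence.",
    "Dr. Stewardship": "Favor cost-effective testing and veto unnecessary expense.",
    "Dr. Checklist": "Check that orders are valid and the reasoning is internally consistent.",
}


def build_panel_prompt(case_summary: str) -> str:
    """Assemble a single prompt that asks one model to play all five roles."""
    lines = [
        "You are a panel of five physicians deliberating on one case.",
        f"Case so far: {case_summary}",
        "",
    ]
    for name, duty in PANEL_ROLES.items():
        lines.append(f"{name}: {duty}")
    lines.append("")
    # Hypothetical output convention so the reply can be parsed into one action.
    lines.append("After deliberation, output exactly one action: ASK, TEST, or DIAGNOSE.")
    return "\n".join(lines)


prompt = build_panel_prompt("Adult with fever and cough for three days.")
print(prompt)
```

Keeping the roles in plain data means the same panel definition can be sent to any backend model unchanged.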

2. Sequential Diagnostic Methodology

MAI-DxO mirrors real clinical reasoning by engaging in an iterative, evidence-gathering process. Starting with a brief clinical vignette (analogous to an initial patient summary), the orchestrator must decide at each round to either:

Pose targeted questions for more clinical details,
Order diagnostic tests, or
Issue a final diagnosis once sufficiently confident.

A "gatekeeper model" releases new clinical findings only in response to explicit, stepwise information requests, closely approximating real clinical workflow constraints. After each new finding or test result, the panel re-evaluates the differential diagnosis (with Dr. Hypothesis applying a Bayesian update to the ranked list), considers the costs incurred so far, and collectively determines the next high-yield, cost-appropriate step. Testing cost, physician visit cost, and cumulative resource usage are explicitly tracked, and ordering decisions are influenced not only by informational value but also by economic stewardship and internal panel debate.
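
The ask/test/diagnose loop with a gatekeeper and running cost can be sketched as follows. This is a minimal sketch under stated assumptions: the `Gatekeeper` here is a dictionary lookup standing in for the gatekeeper model, the policy is a fixed script standing in for the panel, and the visit fee, test names, and prices are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

VISIT_COST = 300.0  # hypothetical flat visit fee, for illustration only


@dataclass
class Gatekeeper:
    """Holds the full case but reveals findings only on explicit request."""
    answers: Dict[str, str]
    test_results: Dict[str, Tuple[str, float]]  # test name -> (result, price)

    def ask(self, question: str) -> str:
        return self.answers.get(question, "Not available.")

    def order(self, test: str) -> Tuple[str, float]:
        return self.test_results.get(test, ("Not available.", 0.0))


@dataclass
class Encounter:
    vignette: str
    findings: List[str] = field(default_factory=list)
    cost: float = VISIT_COST


Action = Tuple[str, str]  # (kind, content) with kind in {"ask", "test", "diagnose"}


def run_case(gk: Gatekeeper, vignette: str,
             policy: Callable[[Encounter], Action], max_rounds: int = 10):
    """Iterate until the policy commits to a diagnosis or rounds run out."""
    enc = Encounter(vignette)
    for _ in range(max_rounds):
        kind, content = policy(enc)
        if kind == "diagnose":
            return content, enc
        if kind == "ask":
            enc.findings.append(f"Q: {content} -> {gk.ask(content)}")
        elif kind == "test":
            result, price = gk.order(content)
            enc.cost += price  # cumulative cost is tracked per case
            enc.findings.append(f"T: {content} -> {result}")
        else:
            raise ValueError(f"unknown action kind: {kind}")
    return "undetermined", enc


def make_scripted_policy(script: List[Action]) -> Callable[[Encounter], Action]:
    """Stand-in for the panel: replay a fixed list of actions in order."""
    return lambda enc: script[min(len(enc.findings), len(script) - 1)]


gk = Gatekeeper(
    answers={"duration of symptoms": "three days"},
    test_results={"test X": ("positive", 120.0)},
)
policy = make_scripted_policy(
    [("ask", "duration of symptoms"), ("test", "test X"), ("diagnose", "diagnosis A")]
)
diagnosis, enc = run_case(gk, "Adult with fever and cough.", policy)
print(diagnosis, enc.cost, enc.findings)
```

In a real system the scripted policy would be replaced by a call to the orchestrated panel, which would read `enc.findings` and `enc.cost` before choosing the next action.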

3. Performance: Accuracy and Cost-Effectiveness

On the Sequential Diagnosis Benchmark (SDBench), MAI-DxO demonstrates substantial improvements over both expert human generalists and off-the-shelf LLMs:

Diagnostic accuracy: Using OpenAI’s o3 model as the backend, MAI-DxO reaches 80% diagnostic accuracy (Likert score $\geq 4$ out of 5), compared to 19.9% for generalist physicians and 78.6% for off-the-shelf o3.
Cost efficiency: MAI-DxO reduces diagnostic costs by about 20% relative to physicians and by about 70% relative to off-the-shelf o3 (budgeted mode: $2,396 per case vs. $7,850 for off-the-shelf o3).
Maximum accuracy configuration: When maximized for accuracy, MAI-DxO achieves 85.5% accuracy.
Model-agnostic benefit: All tested LLM families benefit from orchestration, with consistent performance gains in both accuracy and cost metrics.
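
The quoted cost reductions can be reproduced from the per-case figures reported in this article ($2,963 for generalist physicians, $7,850 for off-the-shelf o3, $2,396 for budgeted MAI-DxO). The helper name below is hypothetical; the arithmetic is a simple relative difference.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline


physicians, off_the_shelf_o3, budgeted = 2963.0, 7850.0, 2396.0

vs_physicians = relative_reduction(physicians, budgeted)
vs_o3 = relative_reduction(off_the_shelf_o3, budgeted)

print(f"vs physicians: {vs_physicians:.1%}")  # about 19%, rounded to 20% in the text
print(f"vs off-the-shelf o3: {vs_o3:.1%}")    # about 69.5%, rounded to 70% in the text
```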

These performance measures are rigorously defined: accuracy is judged on a validated 5-point Likert scale reflecting clinical acceptability, and cost includes standardized visit and test fees. All gains are reported as statistically significant (permutation test, $p < 0.005$).
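
For readers unfamiliar with permutation tests, the sketch below shows a generic two-sample version applied to per-case correctness (1 = correct, 0 = incorrect). The article does not specify the exact test design, so the unpaired formulation, the function name, and the data are assumptions; the outcome vectors are synthetic and are not the study's results.

```python
import random
from typing import Sequence


def permutation_p_value(a: Sequence[int], b: Sequence[int],
                        n_perm: int = 2000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference in mean accuracy."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign outcomes to the two groups
        pa, pb = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(pa) / len(pa) - sum(pb) / len(pb))
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Synthetic outcomes for two hypothetical systems on 100 cases each.
system_a = [1] * 80 + [0] * 20
system_b = [1] * 20 + [0] * 80
p = permutation_p_value(system_a, system_b)
print(f"p = {p:.4f}")
```

When the same cases are scored by both systems, a paired variant that swaps labels within each case would usually be the more appropriate design.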

4. Model-Agnostic Orchestration Architecture

MAI-DxO is fundamentally decoupled from the underlying LLM, relying on prompt-based orchestration and agent role assignment rather than model-specific training or fine-tuning. This enables:

plug-and-play compatibility with a wide array of LLMs,
rapid adoption of advances in foundation model capabilities,
preservation of orchestrator logic and panel discipline regardless of backend model choice.

Empirically, orchestrator-induced gains are observed across OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families, with orchestration either lifting absolute accuracy/cost efficiency or preserving peak accuracy at much lower resource spend.
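
One way to picture this decoupling is an orchestrator that depends only on a narrow text-in, text-out interface. The `ChatBackend` protocol, the `CannedBackend` stand-in, and the `Orchestrator` class below are hypothetical names for this sketch; a real backend would wrap a vendor SDK, which is deliberately not shown here.

```python
from typing import Protocol


class ChatBackend(Protocol):
    """Minimal interface the orchestrator needs from any LLM."""

    def complete(self, prompt: str) -> str: ...


class CannedBackend:
    """Test double that returns a fixed reply; a real one would call a model API."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def complete(self, prompt: str) -> str:
        return self.reply


class Orchestrator:
    """Orchestration logic that never references a specific model family."""

    def __init__(self, backend: ChatBackend) -> None:
        self.backend = backend

    def next_action(self, case_summary: str) -> str:
        prompt = f"Panel, deliberate on: {case_summary}\nReply with one action."
        return self.backend.complete(prompt).strip()


# Swapping the backend requires no change to the orchestrator.
first = Orchestrator(CannedBackend(" ASK: duration of symptoms "))
second = Orchestrator(CannedBackend("TEST: test X"))
print(first.next_action("fever and cough"))
print(second.next_action("fever and cough"))
```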

5. Strategic, Value-Based Test Ordering

MAI-DxO’s test selection process is distinguished by explicit, collaborative deliberation:

Dr. Test-Chooser recommends tests of maximal discriminatory value for the then-current differential.
Dr. Stewardship evaluates each suggestion for expected yield per cost, proposing lower-cost alternatives or vetoing unnecessary tests.
Dr. Challenger proposes broadening or redirecting diagnostic focus, while Dr. Checklist maintains logic and format compliance.

Test ordering is modulated in real time: before any test is finalized, MAI-DxO can consult a simulated budget to enforce hard or soft constraints (budgeted mode), proceed unbounded for accuracy maximization (no-budget mode), or restrict itself to purely question-based reasoning (question-only mode). This deliberate test selection has measurable outcomes: in challenging cases, MAI-DxO often achieves the correct diagnosis with fewer, higher-yield tests and substantially less cost than both human and naive AI baselines.
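
The three operating modes can be sketched as a small selection function. Everything here is an assumption for illustration: the yield scores would in practice come from panel deliberation instead of fixed numbers, the test names and prices are made up, and only a hard budget cap is modeled (a soft constraint would penalize overspending without forbidding it).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Mode(Enum):
    BUDGETED = "budgeted"
    NO_BUDGET = "no-budget"
    QUESTION_ONLY = "question-only"


@dataclass
class CandidateTest:
    name: str
    expected_yield: float  # hypothetical 0..1 score for discriminative value
    price: float

    def __post_init__(self) -> None:
        if self.price <= 0:
            raise ValueError("price must be positive")


def choose_test(options: List[CandidateTest], mode: Mode,
                spent: float = 0.0, budget: Optional[float] = None) -> Optional[CandidateTest]:
    """Pick the next test, or None if no test should be ordered."""
    if mode is Mode.QUESTION_ONLY:
        return None  # reasoning proceeds through questions alone
    if mode is Mode.NO_BUDGET:
        return max(options, key=lambda o: o.expected_yield, default=None)
    if budget is None:
        raise ValueError("budgeted mode requires a budget")
    affordable = [o for o in options if spent + o.price <= budget]
    # Stewardship-style criterion: expected yield per unit cost.
    return max(affordable, key=lambda o: o.expected_yield / o.price, default=None)


options = [
    CandidateTest("imaging", 0.90, 1200.0),
    CandidateTest("blood panel", 0.50, 100.0),
    CandidateTest("biopsy", 0.95, 3000.0),
]
unbounded = choose_test(options, Mode.NO_BUDGET)
capped = choose_test(options, Mode.BUDGETED, spent=1500.0, budget=2000.0)
questions = choose_test(options, Mode.QUESTION_ONLY)
print(unbounded, capped, questions)
```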

6. Impact and Implications in Clinical Care

MAI-DxO offers several salient contributions and implications for clinical care and future AI development:

Superhuman diagnostic accuracy is accompanied by significant cost savings, strictly improving the accuracy-cost Pareto frontier. In budget mode, MAI-DxO roughly quadruples physician accuracy (79.9% vs. 19.9%) while lowering per-case cost by about 20% ($2,396 vs. $2,963).
The orchestrated, panel-based reasoning structure serves as a bias-correcting framework, reducing anchoring error and premature closure that otherwise limit both AI and physician performance.
The model-agnostic approach ensures extensibility and future-proofing across evolving LLM technologies.
By simulating stepwise interactions and test-budget negotiations, MAI-DxO aligns closely with real clinical workflows and can serve as a foundation for benchmarking, education, and decision-support system design.
The orchestration technique mitigates model hallucinations/overconfidence by enforcing continuous reevaluation and consensus, a property prized in high-stakes clinical decision-making.

A plausible implication is that as orchestration frameworks like MAI-DxO become more widely embedded in clinical informatics infrastructure, structured, cost-effective, and explainable AI-supported diagnosis may become feasible at scale, both augmenting and standardizing care across heterogeneous healthcare environments.

Agent	Diagnostic Accuracy	Avg. Cost per Case	Relative to Physicians (accuracy, cost)
Generalist Physicians	19.9%	$2,963	Baseline
Off-the-shelf o3	78.6%	$7,850	+58.7 pp, +$4,887
MAI-DxO (o3, standard)	81.9%	$4,735	+62.0 pp, +$1,772
MAI-DxO (o3, budget mode)	79.9%	$2,396	+60.0 pp, –$567
MAI-DxO (o3, ensemble)	85.5%	$7,184	+65.6 pp, +$4,221
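
The accuracy and cost columns of the table above can be checked for Pareto dominance with a short script. The figures are copied from the table; the `dominates` and `pareto_frontier` helpers are hypothetical names for this sketch.

```python
from typing import List, Tuple

Row = Tuple[str, float, float]  # (agent, accuracy in %, average cost in USD)

rows: List[Row] = [
    ("Generalist Physicians", 19.9, 2963.0),
    ("Off-the-shelf o3", 78.6, 7850.0),
    ("MAI-DxO (o3, standard)", 81.9, 4735.0),
    ("MAI-DxO (o3, budget mode)", 79.9, 2396.0),
    ("MAI-DxO (o3, ensemble)", 85.5, 7184.0),
]


def dominates(a: Row, b: Row) -> bool:
    """a dominates b if it is no worse on both axes and better on at least one."""
    no_worse = a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_frontier(points: List[Row]) -> List[Row]:
    """Keep only the rows that no other row dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


frontier = pareto_frontier(rows)
for name, acc, cost in frontier:
    print(f"{name}: {acc}% at ${cost:,.0f}")
```

On these figures the frontier consists of the three MAI-DxO configurations: budget mode dominates the physician baseline, and the standard configuration dominates off-the-shelf o3.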

References: All data and performance results derive from "Sequential Diagnosis with Language Models" (arXiv:2506.22405).


References

Sequential Diagnosis with Language Models (2025)