MedConsultBench: Medical Consultation Benchmarks
- MedConsultBench is a comprehensive suite of benchmarks that evaluates medical consultation agents through structured, multi-phase clinical workflows.
- It employs granular Atomic Information Units and dynamic patient simulators to assess diagnostic reasoning, treatment safety, and adaptive follow-up.
- The framework integrates diverse data sources and dense metrics to bridge the gap between static accuracy and real-time clinical decision-making.
MedConsultBench is a suite of process-aware, multi-phase benchmarks designed to rigorously evaluate medical consultation agents, particularly LLMs and multi-modal LLMs (MLLMs), across the entire clinical workflow. It integrates structured patient simulators, dynamic interaction paradigms, and dense metrics that capture static accuracy as well as real-time information-gathering logic, diagnostic reasoning, treatment safety, and adaptive follow-up. Distinct variants have been published emphasizing fine-grained process fidelity (Qiao et al., 19 Jan 2026), confidence estimation under evidence sufficiency (Ren et al., 22 Jan 2026), multimodal capabilities (Liu et al., 2024), and comprehensive multi-task medical dialogue evaluation (Chen et al., 2022). Collectively, MedConsultBench underpins the research community's efforts to align medical AI systems with the nuanced, safety-critical requirements of real-world clinical care.
1. Foundational Frameworks and Formal Concepts
MedConsultBench is defined by two central architectural innovations. The first is the decomposition of clinical cases into Atomic Information Units (AIUs), a granular information representation in which each unit denotes a self-contained fact (e.g., “fever for 3 days”, “history of hypertension”), annotated with category metadata (symptom/sign/history), diagnostic/safety relevance, and “red-flag” status (Qiao et al., 19 Jan 2026). AIUs facilitate sub-turn tracking of information acquisition during history taking: at each turn, the patient simulator reveals matching units only in response to compatible structured queries.
Each possible diagnosis is paired with a minimal necessary information (MNI) set $\mathcal{U}_{\mathrm{mni}}$, which anchors the evaluation of whether the agent elicits all essential evidence prior to decision making. The union of these sets across gold diagnoses provides the case-specific completion target.
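This AIU/MNI bookkeeping maps naturally onto a small data structure. The following Python sketch is illustrative only; the field names and the helper function are assumptions, not the benchmark's published schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AIU:
    """One Atomic Information Unit: a self-contained clinical fact."""
    uid: str                       # stable identifier, e.g. "aiu_012"
    text: str                      # e.g. "fever for 3 days"
    category: str                  # "symptom" | "sign" | "history"
    diagnostically_relevant: bool  # relevance annotation
    red_flag: bool                 # safety-critical status

def completion_target(mni_per_diagnosis: dict[str, set[str]]) -> set[str]:
    """Case-specific target: the union of the minimal necessary
    information sets over all gold diagnoses."""
    target: set[str] = set()
    for uids in mni_per_diagnosis.values():
        target |= uids
    return target
```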
The second pillar is a comprehensive, multi-phase consultation workflow, progressing through history-taking, diagnosis, treatment planning, and constraint-driven follow-up Q&A. Interactions are tightly controlled by a protocol: patient simulators respond only to precise queries and incrementally reveal evidence, supporting granular evaluation of inquiry logic and adaptive reasoning (Qiao et al., 19 Jan 2026, Liu et al., 2024, Chen et al., 2022).
2. Data Sources and Consultation Scenarios
MedConsultBench unifies multiple data sources and formats to enable realistic evaluation:
- Large-scale real consultations: Utilizes 35,792 de-identified online text dialogues from 17 specialties, providing raw interaction data for scenario induction and template clustering (Qiao et al., 19 Jan 2026).
- Synthetic and annotated corpora: Includes DDXPlus (10,000+ structured synthetic reports), MediTOD (1,000+ real doctor-patient history-taking dialogues), and MedQA (12,723 medical exam-style MCQs reformulated as free-text diagnostic tasks) (Ren et al., 22 Jan 2026).
- IMCS-21 corpus: 4,116 pediatric dialogues with multi-level annotation, supporting tasks from named entity recognition (NER) and dialogue act classification (DAC) to symptom label inference (SLI) and diagnosis-oriented policy learning (Chen et al., 2022).
- Med-PMC: 30 real-world surgical consultation cases (ages 15–81) incorporating structured text, laboratory results, and radiology images, supporting multimodal scenario evaluation (Liu et al., 2024).
Simulated patient agents are engineered for controlled evidence delivery. Architectures split into state trackers for action classification, response generators for fact retrieval or ambiguity handling, and persona modules for naturalistic, personalized dialogue (covering 10 actor templates by profession and gender) (Liu et al., 2024).
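A minimal skeleton of that three-module split might look as follows; all class and method names here are assumptions for illustration, not the published implementation.

```python
class StateTracker:
    """Classifies the agent's utterance into a structured action."""
    KEYWORDS = {"fever": "ask_fever", "image": "request_image"}

    def classify(self, utterance: str) -> str:
        for kw, action in self.KEYWORDS.items():
            if kw in utterance.lower():
                return action
        return "other"

class ResponseGenerator:
    """Retrieves the matching case fact, or flags ambiguity."""
    def __init__(self, case_facts: dict[str, str]):
        self.case_facts = case_facts

    def answer(self, action: str) -> str:
        return self.case_facts.get(action, "Could you be more specific?")

class PersonaModule:
    """Restyles factual replies in a persona's voice (stubbed here)."""
    def __init__(self, profession: str, gender: str):
        self.profession, self.gender = profession, gender

    def stylize(self, reply: str) -> str:
        return reply  # a real module would rewrite tone and register

class PatientSimulator:
    def __init__(self, tracker, generator, persona):
        self.tracker, self.generator, self.persona = tracker, generator, persona

    def respond(self, agent_utterance: str) -> str:
        action = self.tracker.classify(agent_utterance)
        return self.persona.stylize(self.generator.answer(action))
```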
3. Multi-Stage Consultation Cycle and Interaction Paradigms
MedConsultBench operationalizes the medical consultation process as four or five sequential stages:
- History Taking: Agents pose queries mapped to canonical templates, inducing incremental evidence release as AIUs (Qiao et al., 19 Jan 2026). In Med-PMC, an “Ask-First-Observe-Next” protocol requires active interrogation before passive data or image acquisition (Liu et al., 2024).
- Diagnosis: Emission of free-text differentials, ranked candidate lists, and rationales conditioned on the accumulated evidence (Ren et al., 22 Jan 2026, Qiao et al., 19 Jan 2026).
- Treatment Planning: Drug regimen proposals, with safety gatekeeping via a curated knowledge base of contraindications, drug-drug interactions (DDIs), and dosing limits; unsafe plans are penalized (see the regimen-safety sketch after this list) (Qiao et al., 19 Jan 2026).
- Follow-Up Q&A: Adaptive plan revision in response to patient-imposed constraints (e.g., cost, phobias); the objective is dynamic constraint satisfaction while maintaining plan quality (Qiao et al., 19 Jan 2026).
- (In IMCS-21 and Med-PMC, medical report generation and explicit NLU/NLG tasks are also evaluated) (Chen et al., 2022, Liu et al., 2024).
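The regimen-safety gate referenced in the treatment-planning stage can be pictured as a rule check against the curated knowledge base. This is a hedged sketch: the KB layout (`ddi_pairs`, `contraindications`, `max_daily_dose`) and the violation labels are assumptions, not the benchmark's actual schema.

```python
def check_regimen(regimen, patient, kb):
    """Return a list of violations: DDIs, contraindications, dose limits.

    regimen: list of dicts with "name" and "daily_dose"
    patient: dict with a "conditions" set
    kb: dict with "ddi_pairs" (set of frozensets), "contraindications"
        (drug -> set of conditions), "max_daily_dose" (drug -> float)
    """
    violations = []
    drugs = [d["name"] for d in regimen]
    # Pairwise drug-drug interaction check against the curated KB.
    for i, a in enumerate(drugs):
        for b in drugs[i + 1:]:
            if frozenset((a, b)) in kb["ddi_pairs"]:
                violations.append(("DDI", a, b))
    for d in regimen:
        # Profile conflicts: drug contraindicated by a patient condition.
        for cond in patient["conditions"]:
            if cond in kb["contraindications"].get(d["name"], set()):
                violations.append(("contraindication", d["name"], cond))
        # Dosing limits.
        max_dose = kb["max_daily_dose"].get(d["name"])
        if max_dose is not None and d["daily_dose"] > max_dose:
            violations.append(("overdose", d["name"], d["daily_dose"]))
    return violations
```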
In all MedConsultBench variants, multi-turn interaction is the norm. For confidence estimation, evidence units (dialogue turns or report sentences) are revealed in staged increments, prompting open-ended diagnosis and confidence generation at each stage (Ren et al., 22 Jan 2026).
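The staged-reveal protocol can be sketched as a simple evaluation loop; `query_model` is a hypothetical stand-in for the model under test, and the prompt wording is an assumption.

```python
def staged_confidence_eval(evidence_units, stages, query_model):
    """Reveal evidence cumulatively and elicit (diagnosis, confidence)
    at each stage, as in the staged-increment protocol."""
    records = []
    for k in stages:                   # e.g. stages = [2, 4, 8, len(units)]
        visible = evidence_units[:k]   # cumulative reveal
        prompt = (
            "Patient evidence so far:\n- " + "\n- ".join(visible) +
            "\nGive your most likely diagnosis and a confidence in [0, 1]."
        )
        diagnosis, confidence = query_model(prompt)
        records.append({"stage": k, "diagnosis": diagnosis,
                        "confidence": confidence})
    return records
```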
4. Evaluation Metrics and Methodological Rigor
MedConsultBench employs an extensive suite of evaluation metrics:
Process-fidelity Metrics (12 Primary, 10 Secondary) (Qiao et al., 19 Jan 2026):
- MNI-Completion (see the computation sketch after this list): $\mathrm{MNI\!-\!Comp} = \frac{|\mathcal{U}_{T_{\mathrm{dx}}} \cap \mathcal{U}_{\mathrm{mni}}|}{|\mathcal{U}_{\mathrm{mni}}|} \times \frac{T_{\max}}{T_{\mathrm{dx}}}$, where $\mathcal{U}_{T_{\mathrm{dx}}}$ denotes the AIUs elicited by the diagnosis turn $T_{\mathrm{dx}}$ within a turn budget $T_{\max}$
- Redundancy: quantifies queries wasted on units outside the MNI set, weighted by clinical importance
- Information Gain Efficiency (IGE): the reduction in entropy over diagnostic posteriors achieved per query
- Next-Question Hit@1, Precedence-Violation Rate (POVR): assess the structure of inquiry logic
- Core-Diagnosis F1, Severity-Weighted Diagnostic Score (SWDS): diagnosis quality, ranked and weighted by clinical relevance
- Patient Safety Compliance (PSC), DDI Violation Rate (DDIV), Profile Conflict Rate (PCR): regimen safety against curated knowledge bases
- Follow-up Question Response (FQR), Dynamic Constraint-Satisfaction Ratio (DCSR): adaptation to patient needs, intent fulfillment
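The MNI-Completion formula above transcribes directly into code; variable names mirror the symbols, though the helper itself is not part of the benchmark's released tooling.

```python
def mni_completion(u_tdx: set, u_mni: set, t_dx: int, t_max: int) -> float:
    """MNI-Comp = |U_Tdx ∩ U_mni| / |U_mni| * (T_max / T_dx).

    u_tdx: AIUs elicited by the diagnosis turn t_dx
    u_mni: the minimal necessary information set
    The turn factor rewards reaching the diagnosis in fewer turns.
    """
    coverage = len(u_tdx & u_mni) / len(u_mni)
    return coverage * (t_max / t_dx)
```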
Confidence Estimation Metrics (Ren et al., 22 Jan 2026):
- Pearson/Spearman correlation coefficients between model confidence and accuracy labels
- AUROC, AUPRC: discrimination between correct and incorrect predictions via confidence score thresholding
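These metrics correspond to standard scipy/scikit-learn calls; the wrapper below is a convenience sketch, assuming confidences in [0, 1] and binary correctness labels.

```python
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import average_precision_score, roc_auc_score

def confidence_metrics(confidences, correct_labels):
    """Correlation and discrimination metrics for confidence estimates."""
    return {
        "pearson_r": pearsonr(confidences, correct_labels)[0],
        "spearman_rho": spearmanr(confidences, correct_labels)[0],
        "auroc": roc_auc_score(correct_labels, confidences),
        "auprc": average_precision_score(correct_labels, confidences),
    }
```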
Information Gathering and Decision Recall (Med-PMC) (Liu et al., 2024):
- Inquiry, examination, multi-modal analysis recall rates (ROUGE-based)
- Diagnosis and treatment recall (ROUGE-L, ROUGE-1)
- LLM-based subjective scoring on inquiry logic, image interpretation, decision quality
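ROUGE-L recall, the flavor of recall used for the coverage rates above, reduces to a longest-common-subsequence computation. This from-scratch sketch uses plain whitespace tokenization rather than the benchmark's exact preprocessing.

```python
def rouge_l_recall(reference: str, hypothesis: str) -> float:
    """LCS length over reference length (whitespace tokens)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0
    # Longest common subsequence via dynamic programming.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, h in enumerate(hyp, 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if r == h
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[-1][-1] / len(ref)
```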
NLU/NLG Metrics (IMCS-21) (Chen et al., 2022):
- Named entity recognition: Precision/Recall/F1
- Dialogue act classification: Accuracy, macro F1
- Symptom label inference: Subset accuracy, Hamming loss, macro F1
- Medical report generation: ROUGE, concept-F1, diagnosis accuracy
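Most of these NLU metrics map onto standard scikit-learn calls; the sketch below covers the multi-label symptom-inference task, assuming labels are encoded as binary indicator matrices.

```python
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

def sli_metrics(y_true, y_pred):
    """y_true, y_pred: binary indicator matrices (n_samples, n_labels)."""
    return {
        "subset_accuracy": accuracy_score(y_true, y_pred),  # exact match
        "hamming_loss": hamming_loss(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }
```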
5. Experimental Results and Benchmark Insights
Comprehensive evaluation across 19 LLMs and 12 MLLMs reveals key findings:
- High static diagnosis accuracy does not predict successful information gathering: Gemini-3-Pro attains strong static diagnostic accuracy, yet its MNI-Completion remains low and ordering violations in its questioning (POVR) persist (Qiao et al., 19 Jan 2026).
- Safety compliance is nontrivial: best models exhibit DDIV 0.015–0.017 and PCR 0.04–0.06, indicating persistent regimen risks (Qiao et al., 19 Jan 2026).
- Follow-up adaptation is weakest; DCSR peaks at $0.49$ for LLMs vs. $0.81$ for human-process baselines (Qiao et al., 19 Jan 2026).
- Confidence estimation improves as information sufficiency increases: well-calibrated methods yield Pearson correlations that consistently outperform token-level baselines (Ren et al., 22 Jan 2026).
- Multimodal reasoning lags: even top MLLMs such as GPT-4o reach only 70% inquiry recall, 25% multi-modal analysis recall, and 30% on medical image interpretation. Persona simulation further reduces performance by 5–15 points on inquiry and 2–8 points on diagnosis (Liu et al., 2024).
- On IMCS-21, models achieve high NER/DAC F1 (90%), but symptom label inference (macro F1 64%) and diagnosis-oriented dialogue policy (DX-Acc 56%) remain challenging (Chen et al., 2022).
6. Comparative Analysis, Limitations, and Recommendations
The MedConsultBench paradigm surpasses static, outcome-only medical NLU/NLG and VQA baselines by simulating dynamic evidence accumulation, constraining information flow, and stressing real-time reasoning and safety behaviors. Key advances include systematic multi-turn interaction, structured evidence gating, rigorously annotated gold standards, and dense process metrics.
Limitations persist:
- Scenario coverage concentrates on general surgery, pediatrics, and selected specialties; expansion to cardiology, neurology, and cross-cultural settings is needed (Liu et al., 2024, Chen et al., 2022).
- Simulator fidelity and persona diversity are restricted; speech patterns, age ranges, and cultural features could be extended (Liu et al., 2024).
- Multimodal depth is limited (images only on request; no video or streaming vitals) (Liu et al., 2024).
- The metric landscape may benefit from calibration error, precision, and F1 for probabilistic outputs (Liu et al., 2024).
- Human-in-the-loop evaluation remains an open avenue for soft-skill benchmarking and failure mode decomposition.
Recommended directions include:
- Process-aware training objectives that reward early acquisition of essential information and penalize improper inquiry (see the reward sketch after this list) (Qiao et al., 19 Jan 2026).
- Entropy-based uncertainty calibration and "thinking" prompts for structured reasoning (Qiao et al., 19 Jan 2026).
- Embedded external safety critics at inference and hard regimen-safety gating (Qiao et al., 19 Jan 2026).
- Simulator enrichment via stochastic patient behavior and multimodal cues (Qiao et al., 19 Jan 2026, Liu et al., 2024).
- Iterative audits using LLM judges for empathy and shared decision-making (SDM), anchored in rule-based core metrics (Qiao et al., 19 Jan 2026).
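One possible shape for the process-aware objective in the first recommendation is sketched below; the weighting scheme is entirely an assumption, included only to make the reward structure concrete.

```python
def process_reward(turn: int, t_max: int, elicited_uid: str,
                   mni: set, violated_precedence: bool) -> float:
    """Per-turn reward: early MNI hits score higher; off-target or
    precedence-violating queries are penalized. Weights are illustrative."""
    reward = 0.0
    if elicited_uid in mni:
        reward += 1.0 * (t_max - turn) / t_max  # earlier hits score higher
    else:
        reward -= 0.2                           # redundancy penalty
    if violated_precedence:
        reward -= 0.5                           # improper inquiry order
    return reward
```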
7. Significance and Future Directions
MedConsultBench establishes a robust, extensible reference for medical AI evaluation, privileging process and safety as core requirements. Its multi-phase structure, meticulous information-tracking, and comprehensive metric design elucidate the theory–practice gap: memorization of medical knowledge does not guarantee operational competence in realistic, dynamic, safety-sensitive consultation settings. Future work is expected to expand disease coverage, modality depth, scenario realism, and soft-skill evaluation while developing process-aligned training and inference protocols. MedConsultBench is positioned as a rigorous foundation for the development and audit of consultation agents towards credible, safe, and context-adaptive clinical reasoning (Qiao et al., 19 Jan 2026, Ren et al., 22 Jan 2026, Liu et al., 2024, Chen et al., 2022).