- The paper introduces the MAP multi-agent framework and the IPDS benchmark derived from MIMIC-IV to evaluate LLM performance on the sequential Triage-Diagnosis-Treatment workflow in inpatient care.
- MAP achieved 78.10% diagnosis accuracy on IPDS, an absolute improvement of 25.10 percentage points over the strongest baseline LLM, and was statistically significantly more accurate than human clinicians on a 100-case subset.
- The framework enhances LLMs through a multi-agent structure, a Record Review module for data filtering, a trainable retrieval-enhanced generation module for better reasoning, and an Expert Guidance module for simulated supervision.
The paper "MAP: Evaluation and Multi-Agent Enhancement of LLMs for Inpatient Pathways" (2503.13205) introduces a multi-agent framework (MAP) designed to enhance LLM performance in simulating clinical decision-making for inpatient pathways, specifically addressing the Triage-Diagnosis-Treatment (TDT) workflow. It also presents the Inpatient Pathway Decision Support (IPDS) benchmark, derived from MIMIC-IV, to facilitate evaluation in this domain.
The IPDS Benchmark
A significant challenge in applying AI to inpatient care is the lack of large-scale, pathway-oriented datasets. To address this, the IPDS benchmark was constructed using data from MIMIC-IV (specifically MIMIC-IV-Hosp, MIMIC-IV-ICU, MIMIC-IV-Note). IPDS comprises 51,274 patient cases, each containing curated information essential for inpatient decision-making, including demographics, radiological reports (extracted via regular expressions from free-text notes), and structured medical history.
The benchmark is structured around three sequential classification tasks mirroring the inpatient journey:
- Triage: Assigning a patient to one of 9 clinical departments (e.g., CVICU, MICU, SICU) based on initial presentation and urgency cues often found in radiological reports.
- Diagnosis: Determining the primary disease from 17 major categories (derived by reclassifying 1,298 ICD codes based on international standards) using comprehensive patient data available post-admission.
- Treatment: Selecting an appropriate treatment plan from 16 standardized options (e.g., Vascular surgery, General medical treatment) based on the outcomes of the triage and diagnosis stages.
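As a rough illustration of how one IPDS case and its three labels could be represented, the sketch below uses a small dataclass. The class name, field names, and integer label encoding are assumptions made for this example; only the label-space sizes (9, 17, 16) come from the benchmark description.

```python
from dataclasses import dataclass, field
from typing import List

# Label-space sizes as described for IPDS. Everything else here is hypothetical.
NUM_DEPARTMENTS = 9
NUM_DIAGNOSIS_CATEGORIES = 17
NUM_TREATMENT_OPTIONS = 16


@dataclass
class IPDSCase:
    """One patient case: input fields plus the three sequential task labels."""
    demographics: dict
    radiology_report: str
    medical_history: List[str] = field(default_factory=list)
    triage_label: int = 0      # index into the 9 departments (assumed encoding)
    diagnosis_label: int = 0   # index into the 17 disease categories
    treatment_label: int = 0   # index into the 16 treatment options

    def validate(self) -> None:
        """Raise ValueError if any label falls outside its label space."""
        checks = [
            ("triage_label", self.triage_label, NUM_DEPARTMENTS),
            ("diagnosis_label", self.diagnosis_label, NUM_DIAGNOSIS_CATEGORIES),
            ("treatment_label", self.treatment_label, NUM_TREATMENT_OPTIONS),
        ]
        for name, value, size in checks:
            if not 0 <= value < size:
                raise ValueError(f"{name}={value} outside [0, {size})")
```

Keeping the three labels on the same record reflects the sequential design: each later task can be conditioned on the earlier ones for the same patient.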
IPDS distinguishes itself from prior medical benchmarks (e.g., MedQA, PubMedQA) by focusing explicitly on the sequential TDT workflow within an inpatient context, using realistic EHR-derived data inputs rather than exam-style questions or isolated Q&A pairs.
The MAP Framework Architecture
The MAP framework employs a multi-agent system built upon LLaMA-3-8B as the foundational LLM for each agent. It aims to simulate the collaborative and sequential nature of clinical teams managing inpatient care. The framework consists of four distinct agents:
- Triage Agent: Responsible for the initial patient assessment and departmental assignment. It primarily analyzes symptoms, medical history, and radiological reports to determine the appropriate admission department.
- Diagnosis Agent: Functions as the core decision-maker within the assigned department. It leverages comprehensive patient information (demographics, history, radiology, physical exams) and employs Chain-of-Thought (CoT) reasoning to arrive at a diagnosis.
- Treatment Agent: Receives information from the preceding agents (Triage, Diagnosis) along with patient data to formulate a suitable treatment plan.
- Chief Agent: Operates primarily during the training phase. It supervises the clinician agents, evaluates their outputs against ground truth and clinical guidelines (from the knowledge base), provides structured feedback, and guides refinement, particularly for the Diagnosis Agent's reasoning process via the Expert Guidance Module.
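The sequential hand-off between the three clinician agents can be sketched as below. This is a minimal illustration, not the paper's implementation: each agent is modeled as a plain callable from prompt to answer, and the prompt wording, the `run_pathway` name, and the patient dictionary keys are all assumptions. The Chief Agent is omitted because it acts mainly during training.

```python
from typing import Callable, Dict

# Hypothetical stand-in for an LLM-backed agent: prompt in, answer string out.
Agent = Callable[[str], str]


def run_pathway(patient: Dict[str, str], triage: Agent,
                diagnosis: Agent, treatment: Agent) -> Dict[str, str]:
    """Run the clinician agents in order, passing earlier outputs forward."""
    department = triage(
        f"History: {patient['history']}\nRadiology: {patient['radiology']}\n"
        "Assign an admission department."
    )
    # The Diagnosis Agent sees the triage result plus the patient data.
    disease = diagnosis(
        f"Department: {department}\nHistory: {patient['history']}\n"
        f"Radiology: {patient['radiology']}\n"
        "Reason step by step, then state the primary disease category."
    )
    # The Treatment Agent sees both upstream decisions.
    plan = treatment(
        f"Department: {department}\nDiagnosis: {disease}\n"
        f"History: {patient['history']}\nSelect a treatment plan."
    )
    return {"triage": department, "diagnosis": disease, "treatment": plan}
```

Because every stage consumes the previous stage's output, an error at triage can propagate downstream, which is one reason the three tasks are evaluated as a pathway rather than in isolation.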
These agents are supported by three crucial modules designed to enhance data processing and reasoning:
- Record Review Module: Pre-processes patient input data. It utilizes ClinicalBERT embeddings to compute the cosine similarity between medical history entries and the corresponding radiological report. This allows filtering of irrelevant or contradictory historical information, aiming to improve the signal-to-noise ratio of the input provided to the agents.
- Trainable Retrieval-Enhanced Generation (REG) Module: Augments the Diagnosis Agent's reasoning. It integrates a knowledge base containing real patient cases from IPDS and NICE guidelines. Using LlamaIndex, it performs semantic retrieval to fetch relevant cases or guideline snippets based on the current patient's profile. This retrieved context is incorporated into the prompt alongside a CoT template. Crucially, this REG module is trainable; it learns during fine-tuning to better filter and utilize the retrieved information, reducing the impact of potentially noisy or irrelevant retrieved data.
- Expert Guidance Module: Implemented during MAP's training phase. The Chief Agent assesses the Diagnosis Agent's intermediate reasoning steps against the knowledge base (guidelines, similar cases). It identifies logical gaps, overlooked evidence, or deviations from standards, providing corrective feedback to refine the agent's decision-making process. This simulates expert clinical supervision.
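The similarity-based steps in the Record Review and REG modules can be sketched in a few lines. This is an illustrative sketch under stated assumptions: `embed` is a hypothetical callable standing in for a ClinicalBERT encoder, the `threshold` and `k` values are arbitrary, and the top-k search is a plain cosine ranking rather than the LlamaIndex retrieval the paper describes. The trainable part of the REG module is not represented here.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Embedder = Callable[[str], Vector]  # hypothetical wrapper around an encoder


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def review_records(history: List[str], report: str, embed: Embedder,
                   threshold: float = 0.5) -> List[str]:
    """Keep history entries whose similarity to the report meets the threshold."""
    report_vec = embed(report)
    return [entry for entry in history
            if cosine_similarity(embed(entry), report_vec) >= threshold]


def retrieve_top_k(query: str, knowledge_base: List[str], embed: Embedder,
                   k: int = 2) -> List[str]:
    """Return the k knowledge-base entries most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(knowledge_base,
                    key=lambda doc: cosine_similarity(embed(doc), query_vec),
                    reverse=True)
    return ranked[:k]
```

In practice the retrieved entries would be placed into the Diagnosis Agent's prompt alongside the CoT template; how the cutoff for filtering is chosen is not stated in this summary, so the default above is only a placeholder.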
Agent communication follows a structured format including 'context', 'thinking' (capturing the reasoning trace), and 'answer' fields, promoting transparency and facilitating analysis of the decision pathway.
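One way to picture this message format is a small record with the three named fields, serialized as JSON. The `AgentMessage` class, its methods, and the choice of JSON are assumptions for illustration; the source only names the fields.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AgentMessage:
    """Structured message passed between agents (field names from the text)."""
    context: str   # patient data and upstream decisions the agent saw
    thinking: str  # reasoning trace, kept for later inspection
    answer: str    # the agent's final decision

    def to_json(self) -> str:
        """Serialize the message to a JSON string."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "AgentMessage":
        """Parse a JSON string; raises KeyError if a required field is missing."""
        data = json.loads(raw)
        return cls(context=data["context"],
                   thinking=data["thinking"],
                   answer=data["answer"])
```

Separating `thinking` from `answer` lets a reviewer, or a supervising agent, inspect the reasoning trace without having to parse it out of the final decision.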
The effectiveness of MAP was evaluated on the IPDS benchmark against several baseline LLMs, including general models (LLaMA-3-8B) and medical domain-specific models (HuatuoGPT2-13B, Meditron-70B, Clinical-Camel-70B).
Baseline LLMs performed poorly on the IPDS tasks, with diagnosis accuracies ranging from 47.50% (Clinical-Camel-70B) to 53.00% (HuatuoGPT2-13B); LLaMA-3-8B achieved 49.30% and Meditron-70B 50.90%. The paper describes these results as "unsatisfying," particularly for complex diagnostic categories.
The MAP framework exhibited substantially improved performance:
- Diagnosis Accuracy: MAP achieved a diagnosis accuracy of 78.10%.
- Improvement over Baselines: This is an absolute improvement of 25.10 percentage points (p<0.001) over the strongest baseline, HuatuoGPT2-13B. Significant gains were also observed over LLaMA-3-8B (+28.80 points), Meditron-70B (+27.20 points), and Clinical-Camel-70B (+30.60 points).
- Task-Specific Performance: MAP outperformed baselines across all three tasks (Triage, Diagnosis, Treatment).
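The reported gains are simple differences between MAP's diagnosis accuracy and each baseline's. The short sketch below recomputes them from the accuracies quoted above and also shows how a relative gain would differ from an absolute one; the function names are chosen for this example.

```python
from typing import Dict

MAP_ACCURACY = 78.10  # MAP diagnosis accuracy, in percent

# Baseline diagnosis accuracies, in percent, as quoted in the text.
BASELINE_ACCURACY: Dict[str, float] = {
    "LLaMA-3-8B": 49.30,
    "HuatuoGPT2-13B": 53.00,
    "Meditron-70B": 50.90,
    "Clinical-Camel-70B": 47.50,
}


def absolute_gains(map_acc: float, baselines: Dict[str, float]) -> Dict[str, float]:
    """Difference in percentage points between MAP and each baseline."""
    return {name: round(map_acc - acc, 2) for name, acc in baselines.items()}


def relative_gain(map_acc: float, baseline_acc: float) -> float:
    """Gain expressed as a fraction of the baseline's own accuracy."""
    return (map_acc - baseline_acc) / baseline_acc


if __name__ == "__main__":
    for name, gain in absolute_gains(MAP_ACCURACY, BASELINE_ACCURACY).items():
        print(f"{name}: +{gain:.2f} points")
```

The distinction matters when reading the headline number: 25.10 is a gap in percentage points, whereas the same gap measured relative to the 53.00% baseline is considerably larger.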
A key aspect of the evaluation involved assessing clinical compliance and comparing MAP against human experts. Three board-certified clinicians (5-15 years of experience) evaluated a random subset of 100 cases from the IPDS test set.
- Accuracy vs. Clinicians: MAP was statistically significantly more accurate than the clinicians on these 100 cases, achieving 10%-12% higher accuracy (p=0.0067).
- Agreement (Intraclass Correlation Coefficient, ICC):
  - MAP showed excellent agreement with the ground truth labels (ICC = 0.81).
  - The clinicians exhibited good agreement with the ground truth, but lower than MAP (ICC range [0.67, 0.68]). Note: The paper text mentions one clinician reached 0.80 agreement in a specific analysis, but the primary reported range from the figure caption is lower.
  - MAP also maintained strong agreement with the individual clinicians' judgments (ICC range [0.75, 0.84]).
- Error Analysis: MAP showed a more distributed pattern of misdiagnoses across disease categories compared to the best-performing clinician, whose errors tended to cluster in specific difficult categories (D1: infectious diseases, D17: symptoms/signs not elsewhere classified). MAP also exhibited fewer false positives in its predictions.
The ablation studies confirmed the contribution of each component (multi-agent structure, Record Review, REG, Expert Guidance), with the full MAP configuration yielding the best results. The trainable nature of the REG module was shown to be particularly beneficial over standard REG.
Conclusion
The MAP framework demonstrates that a multi-agent LLM system, designed to mirror clinical workflows and augmented with specialized modules for data filtering (Record Review), retrieval-augmented reasoning (Trainable REG), and simulated expert supervision (Expert Guidance), can significantly improve performance on complex inpatient pathway tasks compared to monolithic LLMs. The introduction of the IPDS benchmark provides a valuable resource for evaluating such systems. The finding that MAP outperformed board-certified clinicians in accuracy on the benchmark subset underscores the potential utility of structured AI systems in supporting clinical decision-making within inpatient settings, although further validation in real-world clinical environments is necessary.