- The paper presents MultiDx, a two-stage framework that integrates diverse evidence sources to produce interpretable diagnostic reasoning.
- It employs SOAP structuring, hierarchical retrieval, and dynamic web search combined with voting and differential diagnosis to align with clinical workflows.
- Empirical results demonstrate significant improvements, with reasoning recall up to 0.665 and higher Hit@k scores across multiple clinical datasets.
MultiDx: Integrating Multi-Source Knowledge for Diagnostic Reasoning
Problem Motivation and Clinical Context
Diagnostic reasoning in medicine requires not only accurate prediction but also robust explainability: structured clinical inference grounded in multi-source evidence is essential for clinician trust and patient safety. LLMs have advanced commonsense and mathematical reasoning, but they often underperform in clinical scenarios because of insufficient domain knowledge, rigid internalization of static knowledge bases, and poor alignment with verified reasoning paths. Standard evaluation metrics have prioritized accuracy while neglecting whether the reasoning aligns with clinical standards, which reduces interpretability and practical utility.
Figure 1: Example showing diagnosis reasoning, emphasizing the interpretability and traceability of diagnostic predictions.
MultiDx Framework Architecture
MultiDx introduces a two-stage diagnostic reasoning pipeline that explicitly incorporates evidence from heterogeneous sources: web search, SOAP-formatted clinical case reports, clinician-annotated case databases, and fine-grained retrieval of case reasoning traces. The framework provides interpretable reasoning paths and integrates these via evidence matching, voting, and differential diagnosis, aligning with the diagnostic workflow of generating a suspected disease list followed by granular differentiation.
Figure 2: The overall architecture of MultiDx, depicting multi-source evidence retrieval and integration followed by differential diagnosis.
Stage 1: Multi-Source Knowledge-Guided Diagnosis Generation
- SOAP Structuring: Raw case reports (often unstructured) are parsed into Subjective, Objective, Assessment, and Plan categories using LLMs. This standardizes input, facilitates downstream reasoning, and improves evidence modularity (Equation 1).
- Case Database Retrieval: The system leverages hierarchical retrieval (top-k similar cases via BM25, top-k reasoning steps via biomedical entity overlap), enabling both high-level and granular diagnostic evidence injection. These modules allow alignment with authentic clinical workflows and support for rare or ambiguous cases.
- Web Search Module: Dynamic queries and tool usage plans (search, navigation, extraction) are generated and executed iteratively, updating internal memory (Equation 7) and retrieving up-to-date external medical knowledge, crucial for addressing out-of-distribution cases and rapid domain evolution.
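The hierarchical retrieval step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the whitespace tokenizer, the BM25 parameters `k1` and `b`, the shared-entity count used as the overlap score, the pre-annotated `entities` sets, and the toy case records are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def hierarchical_retrieve(query_text, query_entities, cases, k_cases=2, k_steps=2):
    """Level 1: top-k cases by BM25. Level 2: top-k reasoning steps by entity overlap."""
    scores = bm25_scores(query_text, [c["text"] for c in cases])
    ranked = sorted(range(len(cases)), key=lambda i: scores[i], reverse=True)
    top = [cases[i] for i in ranked[:k_cases]]
    # Assumption: overlap is the number of entities shared with the query.
    steps = [
        (len(query_entities & step["entities"]), step["text"])
        for case in top
        for step in case["steps"]
    ]
    steps.sort(key=lambda pair: pair[0], reverse=True)
    return [c["id"] for c in top], [text for _, text in steps[:k_steps]]


# Invented toy records, only to make the sketch runnable.
EXAMPLE_CASES = [
    {"id": "c1", "text": "headache fever neck stiffness photophobia",
     "steps": [
         {"text": "Neck stiffness with fever suggests meningitis",
          "entities": {"neck stiffness", "fever", "meningitis"}},
         {"text": "Photophobia supports meningeal irritation",
          "entities": {"photophobia"}},
     ]},
    {"id": "c2", "text": "chest pain radiating to left arm with sweating",
     "steps": [
         {"text": "Radiating chest pain suggests myocardial infarction",
          "entities": {"chest pain", "myocardial infarction"}},
     ]},
    {"id": "c3", "text": "fever cough and shortness of breath",
     "steps": [
         {"text": "Fever with cough suggests pneumonia",
          "entities": {"fever", "cough", "pneumonia"}},
     ]},
]

case_ids, reasoning_steps = hierarchical_retrieve(
    "fever with neck stiffness and headache",
    {"fever", "neck stiffness", "headache"},
    EXAMPLE_CASES,
)
print(case_ids, reasoning_steps)
```

Splitting retrieval into two levels means the coarse lexical match narrows the candidate pool before the finer entity-level comparison picks individual reasoning steps. A real system would replace the hand-written entity sets with the output of a biomedical entity extractor.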
Stage 2: Evidence Integration and Differential Diagnosis
MultiDx performs disease term harmonization, aggregates cross-source support via voting, and applies clinical logic-based differential diagnosis on top-ranked hypotheses. The LLM is explicitly prompted to analyze, promote, or demote candidate diseases based on fit to the case, integrating multi-source evidence into coherent reasoning trajectories and ranked lists.
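A rough sketch of harmonization followed by cross-source voting is given below. The synonym table, the rule of counting one vote per supporting source, and the tie-break on best rank are illustrative assumptions; the source does not specify the exact voting scheme, and the subsequent differential diagnosis is an LLM prompting step that is not modeled here.

```python
from collections import defaultdict

# Hypothetical synonym table; a real system would need a proper terminology mapping.
SYNONYMS = {
    "pcnsl": "primary cns lymphoma",
    "gbm": "glioblastoma",
}


def harmonize(term):
    """Normalize case and whitespace, then map known synonyms to one canonical name."""
    key = " ".join(term.lower().split())
    return SYNONYMS.get(key, key)


def vote(source_rankings, top_n=3):
    """Rank diseases by how many sources propose them.

    Assumption: each source contributes at most one vote per disease, and ties
    are broken by the best (lowest) rank the disease reached in any source,
    then alphabetically for determinism.
    """
    supporters = defaultdict(set)
    best_rank = {}
    for source, ranking in source_rankings.items():
        for rank, term in enumerate(ranking):
            name = harmonize(term)
            supporters[name].add(source)
            best_rank[name] = min(best_rank.get(name, rank), rank)
    ordered = sorted(
        supporters,
        key=lambda name: (-len(supporters[name]), best_rank[name], name),
    )
    return ordered[:top_n]


candidates = vote({
    "web_search": ["PCNSL", "Glioblastoma"],
    "case_retrieval": ["Primary CNS lymphoma", "Toxoplasmosis"],
    "step_retrieval": ["GBM", "primary  CNS lymphoma"],
})
print(candidates)
```

The shortlist returned by `vote` would then be handed to the LLM, which promotes or demotes each candidate according to how well it fits the case.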
Experimental Evaluation
Datasets and Baselines
Evaluation was conducted on MedCaseReasoning and DiReCT, both containing detailed human-annotated clinical reasoning statements and diagnostic QA cases. Comparative baselines include base and fine-tuned LLMs, as well as recent agentic methods (Self-Refinement, MedAgents, OpenAI-DR).
Key Numerical Results
- On MedCaseReasoning, MultiDx achieved a reasoning recall of 0.662 and Hit@5/10 accuracy of 0.577/0.617, outperforming DeepSeek-R1 (0.419/0.442) and OpenAI-DR (0.553/0.602).
- On DiReCT, MultiDx yielded a reasoning recall of 0.665 and Hit@10 accuracy of 0.587, surpassing all agentic baselines.
- Integration of multiple knowledge sources (see the ablation table) consistently improved accuracy and recall versus single-source variants.
- Compatibility tests demonstrated MultiDx’s effectiveness across backbone models (e.g., Qwen3-14B), outperforming agentic methods in all metrics.
- For unseen diseases, web search-based modules demonstrated strong generalization, with MultiDx achieving Hit@1/5 of 0.338/0.448 in unseen settings.
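For readers unfamiliar with the metrics reported above, the following sketch shows one way Hit@k and reasoning recall could be computed. The case-insensitive exact match for diagnoses and the substring-based `is_covered` default are simplifying assumptions; the benchmarks may use a different matching procedure, such as a judge model, to decide whether a gold reasoning statement is covered.

```python
def hit_at_k(ranked_predictions, gold_labels, k):
    """Fraction of cases whose gold diagnosis appears in the top-k predictions.

    Assumption: a prediction matches when it equals the gold label
    after lowercasing and trimming whitespace.
    """
    hits = 0
    for predictions, gold in zip(ranked_predictions, gold_labels):
        top_k = [p.strip().lower() for p in predictions[:k]]
        if gold.strip().lower() in top_k:
            hits += 1
    return hits / len(gold_labels)


def reasoning_recall(gold_statements, predicted_reasoning, is_covered=None):
    """Fraction of gold reasoning statements covered by the predicted reasoning.

    Assumption: by default a statement counts as covered when it occurs as a
    case-insensitive substring; pass a custom is_covered(gold, predicted)
    callable to plug in a stricter or model-based judge.
    """
    if is_covered is None:
        is_covered = lambda gold, predicted: gold.lower() in predicted.lower()
    covered = sum(is_covered(g, predicted_reasoning) for g in gold_statements)
    return covered / len(gold_statements)


# Invented toy data, only to show the call pattern.
predictions = [
    ["meningitis", "encephalitis", "migraine"],
    ["pneumonia", "bronchitis", "pulmonary embolism"],
]
gold = ["encephalitis", "pulmonary embolism"]
for k in (1, 2, 3):
    print(k, hit_at_k(predictions, gold, k))
```

Because Hit@k is monotone in k, Hit@10 is always at least as large as Hit@5 for the same system, which matches the pattern in the reported pairs.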
Case Study and Qualitative Analysis
The reasoning paths and ranked lists from different modules demonstrate MultiDx's ability to consolidate diverse evidence and align with expert clinical decision-making. For a complex CNS case, MultiDx correctly prioritized primary CNS lymphoma in both reasoning path and ranking, providing detailed justification congruent with expert ground truth. Modules contributed complementary hypotheses and exclusion logic, enhancing robustness.
Computational Efficiency and Practical Modularity
MultiDx operates with competitive token usage and latency compared to agentic baselines, with Stage 1 modules parallelizable and configurable. This enables flexible trade-offs between diagnostic quality and computational cost, critical for deployment in varied healthcare environments.
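The parallel, configurable nature of Stage 1 could look roughly like the sketch below. The module names, their stub bodies, and the `enabled` switch are hypothetical placeholders; the source states only that Stage 1 modules can run in parallel and be configured.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage1(case, modules, enabled=None):
    """Run the selected Stage 1 modules concurrently and collect their outputs.

    modules maps a module name to a callable taking the case.
    enabled optionally restricts which modules run, trading quality for cost.
    """
    names = [name for name in modules if enabled is None or name in enabled]
    # Threads suit this workload because real modules would mostly wait on I/O
    # (LLM calls, retrieval, web requests).
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        futures = {name: pool.submit(modules[name], case) for name in names}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins for the real modules.
def web_search_module(case):
    return ["diagnosis from web evidence"]


def case_retrieval_module(case):
    return ["diagnosis from similar cases"]


MODULES = {
    "web_search": web_search_module,
    "case_retrieval": case_retrieval_module,
}

print(run_stage1("example case text", MODULES))
print(run_stage1("example case text", MODULES, enabled={"case_retrieval"}))
```

Disabling a module through `enabled` is one simple way to express the quality versus cost trade-off: fewer modules mean fewer LLM and retrieval calls, at the price of less evidence for Stage 2.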
MultiDx addresses limitations identified in prior agentic frameworks (MedAgents, ConfAgents, MedAgent-Pro, OpenAI-DR), specifically their inability to dynamically integrate diverse knowledge sources and generate reasoning trajectories aligned with clinical standards. Retrieval-enhanced generation, fine-grained entity-based alignment, and real-time web search integration represent key methodological improvements.
Theoretical Implications and Future Directions
The explicit alignment with clinical workflows, multi-perspective evidence aggregation, and modular diagnostic reasoning pipeline highlight the increasing necessity for interpretable and evidence-grounded AI in healthcare. Future developments may include joint optimization of extraction and differential diagnosis stages, advanced entity harmonization, integration of multimodal (e.g., imaging, laboratory) evidence, and further advances in provenance-tracking for clinical verifiability.
Conclusion
MultiDx provides a modular, interpretable, and practically robust two-stage diagnostic reasoning framework, integrating multi-source knowledge and producing coherent reasoning trajectories that align with medical norms. Empirical results demonstrate significant improvements in both diagnostic accuracy and reasoning recall, with strong adaptability to unseen clinical scenarios and generalizability across LLM backbones. The approach sets a rigorous standard for AI-assisted diagnostic systems requiring structured, verifiable clinical support.