MultiDx: A Multi-Source Knowledge Integration Framework towards Diagnostic Reasoning

Published 27 Apr 2026 in cs.CL and cs.AI | (2604.24186v1)

Abstract: Diagnostic prediction and clinical reasoning are critical tasks in healthcare applications. While LLMs have shown strong capabilities in commonsense reasoning, they still struggle with diagnostic reasoning due to limited domain knowledge. Existing approaches often rely on internal model knowledge or static knowledge bases, resulting in knowledge insufficiency and limited adaptability, which hinder their capacity to perform diagnostic reasoning. Moreover, these methods focus solely on the accuracy of final predictions, overlooking alignment with standard clinical reasoning trajectories. To this end, we propose MultiDx, a two-stage diagnostic reasoning framework that performs differential diagnosis by analyzing evidence collected from multiple knowledge sources. Specifically, it first generates suspected diagnoses and reasoning paths by leveraging knowledge from web search, SOAP-formatted case, and clinical case database. Then it integrates multi-perspective evidence through matching, voting, and differential diagnosis to generate the final prediction. Extensive experiments on two public benchmarks demonstrate the effectiveness of our approach.


Authors (12)

Summary

The paper presents MultiDx, a two-stage framework that integrates diverse evidence sources to produce interpretable diagnostic reasoning.
It employs SOAP structuring, hierarchical retrieval, and dynamic web search combined with voting and differential diagnosis to align with clinical workflows.
Empirical results on two clinical benchmarks show reasoning recall of up to 0.665 and higher Hit@k scores than the compared baselines.

MultiDx: Integrating Multi-Source Knowledge for Diagnostic Reasoning

Problem Motivation and Clinical Context

Diagnostic reasoning in medicine requires not only accurate prediction but also robust explainability: structured clinical inference grounded in evidence from multiple sources is essential for clinician trust and patient safety. LLMs have advanced commonsense and mathematical reasoning, but they often underperform in clinical scenarios because of insufficient domain knowledge, reliance on internal model knowledge or static knowledge bases, and poor alignment with verified reasoning paths. Standard evaluation has prioritized final-prediction accuracy while neglecting alignment with clinical reasoning standards, which reduces interpretability and practical utility.

Figure 1: Example showing diagnosis reasoning, emphasizing the interpretability and traceability of diagnostic predictions.

MultiDx Framework Architecture

MultiDx introduces a two-stage diagnostic reasoning pipeline that explicitly incorporates evidence from heterogeneous sources: web search, SOAP-formatted clinical case reports, clinician-annotated case databases, and fine-grained retrieval of case reasoning traces. The framework provides interpretable reasoning paths and integrates these via evidence matching, voting, and differential diagnosis, aligning with the diagnostic workflow of generating a suspected disease list followed by granular differentiation.

Figure 2: The overall architecture of MultiDx, depicting multi-source evidence retrieval and integration followed by differential diagnosis.

Stage 1: Multi-Source Knowledge-Guided Diagnosis Generation

SOAP Structuring: Raw case reports (often unstructured) are parsed into Subjective, Objective, Assessment, and Plan categories using LLMs. This standardizes input, facilitates downstream reasoning, and improves evidence modularity (Equation 1).
Case Database Retrieval: The system leverages hierarchical retrieval (top-k similar cases via BM25, top-k reasoning steps via biomedical entity overlap), enabling both high-level and granular diagnostic evidence injection. These modules allow alignment with authentic clinical workflows and support for rare or ambiguous cases.
Web Search Module: Dynamic queries and tool usage plans (search, navigation, extraction) are generated and executed iteratively, updating internal memory (Equation 7) and retrieving up-to-date external medical knowledge, crucial for addressing out-of-distribution cases and rapid domain evolution.
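The hierarchical retrieval step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a whitespace tokenizer, a standard Okapi BM25 scorer with default-style parameters (`k1`, `b`), Jaccard overlap as the entity-overlap measure, and pre-extracted entity sets. The case records, field names (`text`, `steps`, `entities`), and function names are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for this sketch)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(items, scores, k):
    """Return the k items with the highest scores (ties keep input order)."""
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    return [items[i] for i in order[:k]]


def entity_overlap(a, b):
    """Jaccard overlap between two entity sets (assumed overlap measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def hierarchical_retrieve(query_text, query_entities, cases, k_cases=2, k_steps=2):
    """Level 1: top-k similar cases by BM25. Level 2: top-k reasoning steps
    from those cases, ranked by entity overlap with the query."""
    case_scores = bm25_scores(query_text, [c["text"] for c in cases])
    similar_cases = top_k(cases, case_scores, k_cases)
    steps = [s for c in similar_cases for s in c["steps"]]
    step_scores = [entity_overlap(query_entities, s["entities"]) for s in steps]
    return similar_cases, top_k(steps, step_scores, k_steps)


# Toy case database, invented purely for illustration.
cases = [
    {"id": "c1", "text": "fever cough and chest pain with lung infiltrate",
     "steps": [
         {"text": "Infiltrate plus fever suggests pneumonia",
          "entities": {"fever", "infiltrate"}},
         {"text": "Normal troponin argues against infarction",
          "entities": {"troponin"}},
     ]},
    {"id": "c2", "text": "headache and visual loss with brain lesion",
     "steps": [{"text": "Enhancing lesion suggests lymphoma",
                "entities": {"lesion"}}]},
    {"id": "c3", "text": "joint pain and morning stiffness",
     "steps": [{"text": "Symmetric stiffness suggests arthritis",
                "entities": {"stiffness"}}]},
]

similar, steps = hierarchical_retrieve(
    "fever and cough with infiltrate",
    {"fever", "cough", "infiltrate"},
    cases, k_cases=1, k_steps=1,
)
print(similar[0]["id"], "|", steps[0]["text"])
```

Retrieving whole cases first and reasoning steps second keeps the second-level comparison small, since only steps from the already-selected cases are scored.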

Stage 2: Evidence Integration and Differential Diagnosis

MultiDx performs disease term harmonization, aggregates cross-source support via voting, and applies clinical logic-based differential diagnosis on top-ranked hypotheses. The LLM is explicitly prompted to analyze, promote, or demote candidate diseases based on fit to the case, integrating multi-source evidence into coherent reasoning trajectories and ranked lists.
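The harmonization and voting steps can be pictured with a small sketch. It assumes each Stage 1 source returns a ranked list of disease names, that harmonization is a simple synonym lookup, and that a diagnosis receives one vote per supporting source with ties broken by best rank. None of these details are confirmed by the summary above; the synonym table, source names, and example rankings are invented for illustration, and the subsequent LLM-driven differential diagnosis is not modelled here.

```python
from collections import defaultdict

# Hypothetical synonym table; a real system would need a far richer
# harmonization step (e.g. ontology mapping or LLM-based matching).
SYNONYMS = {
    "pcnsl": "primary cns lymphoma",
    "primary central nervous system lymphoma": "primary cns lymphoma",
    "gbm": "glioblastoma",
}


def harmonize(term):
    """Map a disease term to a canonical lowercase form."""
    key = " ".join(term.lower().split())
    return SYNONYMS.get(key, key)


def vote(source_rankings, top_n=3):
    """Count one vote per source for each harmonized diagnosis.

    Ties are broken by the best (lowest) rank across sources, then
    alphabetically. This tie-breaking rule is an assumption.
    """
    votes = defaultdict(int)
    best_rank = {}
    for ranking in source_rankings.values():
        seen = set()
        for rank, term in enumerate(ranking, start=1):
            name = harmonize(term)
            if name in seen:  # a source votes at most once per diagnosis
                continue
            seen.add(name)
            votes[name] += 1
            best_rank[name] = min(best_rank.get(name, rank), rank)
    ordered = sorted(votes, key=lambda d: (-votes[d], best_rank[d], d))
    return ordered[:top_n]


# Invented example rankings from the three Stage 1 sources.
rankings = {
    "web_search": ["PCNSL", "Glioblastoma", "Toxoplasmosis"],
    "soap": ["Glioblastoma", "Primary CNS lymphoma"],
    "case_db": ["Primary central nervous system lymphoma", "Multiple sclerosis"],
}

shortlist = vote(rankings)
print(shortlist)
```

In this sketch the shortlist returned by `vote` would be the set of top-ranked hypotheses handed to the differential diagnosis prompt.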

Experimental Evaluation

Datasets and Baselines

Evaluation was conducted on MedCaseReasoning and DiReCT, both containing detailed human-annotated clinical reasoning statements and diagnostic QA cases. Comparative baselines include base and fine-tuned LLMs, as well as recent agentic methods (Self-Refinement, MedAgents, OpenAI-DR).

Key Numerical Results

On MedCaseReasoning, MultiDx achieved a reasoning recall of 0.662 and Hit@5/10 accuracy of 0.577/0.617, outperforming DeepSeek-R1 (0.419/0.442) and OpenAI-DR (0.553/0.602).
On DiReCT, MultiDx yielded a reasoning recall of 0.665 and a Hit@10 accuracy of 0.587, surpassing all agentic baselines.
Integrating multiple knowledge sources consistently improved accuracy and recall over single-source variants (ablation table).
Compatibility tests demonstrated MultiDx's effectiveness across backbone models (e.g., Qwen3-14B), where it outperformed agentic methods on all metrics.
For unseen diseases, web search-based modules demonstrated strong generalization, with MultiDx achieving Hit@1/5 of 0.338/0.448 in unseen settings.
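For readers unfamiliar with the Hit@k figures reported above, the metric is conventionally the fraction of cases whose ground-truth diagnosis appears among the top k predictions. The sketch below assumes exact matching after lowercasing; the paper may use a more lenient or LLM-based matching procedure, and the example predictions are invented.

```python
def normalize(term):
    """Lowercase and collapse whitespace (assumed matching rule)."""
    return " ".join(term.lower().split())


def hit_at_k(predictions, gold, k):
    """Fraction of cases whose gold diagnosis is in the top-k predictions.

    predictions: list of ranked diagnosis lists, one per case.
    gold: list of ground-truth diagnoses, aligned with predictions.
    """
    if not gold:
        raise ValueError("gold must contain at least one case")
    hits = 0
    for ranked, truth in zip(predictions, gold):
        top = {normalize(p) for p in ranked[:k]}
        if normalize(truth) in top:
            hits += 1
    return hits / len(gold)


# Invented toy data: three cases with ranked predictions.
predictions = [
    ["pneumonia", "bronchitis", "asthma"],
    ["migraine", "glioblastoma", "Primary CNS lymphoma"],
    ["gout", "septic arthritis"],
]
gold = ["Bronchitis", "primary CNS lymphoma", "rheumatoid arthritis"]

for k in (1, 2, 3):
    print(f"Hit@{k} = {hit_at_k(predictions, gold, k):.3f}")
```

Because Hit@k can only stay the same or grow as k increases, Hit@10 is always at least as large as Hit@5 on the same predictions.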

Case Study and Qualitative Analysis

The reasoning paths and ranked lists from different modules demonstrate MultiDx's ability to consolidate diverse evidence and align with expert clinical decision-making. For a complex CNS case, MultiDx correctly prioritized primary CNS lymphoma in both reasoning path and ranking, providing detailed justification congruent with expert ground truth. Modules contributed complementary hypotheses and exclusion logic, enhancing robustness.

Computational Efficiency and Practical Modularity

MultiDx operates with competitive token usage and latency compared to agentic baselines, with Stage 1 modules parallelizable and configurable. This enables flexible trade-offs between diagnostic quality and computational cost, critical for deployment in varied healthcare environments.
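One way to picture parallel, configurable Stage 1 modules is a thread pool over independently enabled functions. This is a sketch under stated assumptions, not the authors' design: the three module functions are placeholder stubs, and their names and the `enabled` parameter are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


# Placeholder stubs standing in for the three Stage 1 sources.
def web_search_module(case):
    return [f"web-search hypothesis for: {case}"]


def soap_module(case):
    return [f"SOAP-based hypothesis for: {case}"]


def case_db_module(case):
    return [f"case-database hypothesis for: {case}"]


ALL_MODULES = {
    "web_search": web_search_module,
    "soap": soap_module,
    "case_db": case_db_module,
}


def run_stage1(case, enabled=None):
    """Run the enabled modules concurrently and collect their outputs.

    Disabling a module trades diagnostic evidence for lower cost.
    """
    names = list(ALL_MODULES) if enabled is None else list(enabled)
    modules = {name: ALL_MODULES[name] for name in names}
    with ThreadPoolExecutor(max_workers=max(1, len(modules))) as pool:
        futures = {name: pool.submit(fn, case) for name, fn in modules.items()}
        return {name: future.result() for name, future in futures.items()}


full = run_stage1("example case text")
cheap = run_stage1("example case text", enabled=["soap", "case_db"])
print(sorted(full), sorted(cheap))
```

Threads suit this pattern when each module is dominated by network or LLM API latency, since the calls then overlap instead of running back to back.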

MultiDx addresses limitations identified in prior agentic frameworks (MedAgents, ConfAgents, MedAgent-Pro, OpenAI-DR), specifically their inability to dynamically integrate diverse knowledge sources and generate reasoning trajectories aligned with clinical standards. Retrieval-enhanced generation, fine-grained entity-based alignment, and real-time web search integration represent key methodological improvements.

Theoretical Implications and Future Directions

The explicit alignment with clinical workflows, multi-perspective evidence aggregation, and modular diagnostic reasoning pipeline highlight the increasing necessity for interpretable and evidence-grounded AI in healthcare. Future developments may include joint optimization of extraction and differential diagnosis stages, advanced entity harmonization, integration of multimodal (e.g., imaging, laboratory) evidence, and further advances in provenance-tracking for clinical verifiability.

Conclusion

MultiDx provides a modular, interpretable, and practically robust two-stage diagnostic reasoning framework, integrating multi-source knowledge and producing coherent reasoning trajectories that align with medical norms. Empirical results demonstrate significant improvements in both diagnostic accuracy and reasoning recall, with strong adaptability to unseen clinical scenarios and generalizability across LLM backbones. The approach sets a rigorous standard for AI-assisted diagnostic systems requiring structured, verifiable clinical support.
