DocCHA Framework for Diagnostic Dialogue
- The paper demonstrates that modular, confidence-guided reasoning in DocCHA improves diagnostic accuracy by up to 5.18 percentage points compared to conventional LLM pipelines.
- DocCHA is a framework that segments diagnostic consultations into symptom elicitation, history acquisition, and causal graph reasoning to mimic clinical reasoning.
- Its iterative confidence quantification mechanisms trigger dynamic clarifications, ensuring structured diagnoses while maintaining computational efficiency.
DocCHA is a modular, confidence-aware framework for LLM-driven online diagnostic dialogue, explicitly designed to emulate clinical reasoning. Unlike static Conversational Health Agents (CHAs), DocCHA dynamically prioritizes symptom clarification, history acquisition, and explanatory causal reasoning through a staged, confidence-guided process, producing structured and transparent diagnostic consultations. In evaluations on two real-world Chinese datasets (IMCS21 and DX), DocCHA instantiated with GPT-4o achieves up to 5.18 percentage points higher diagnostic accuracy and over 30% greater information recall than prompting-based LLM pipelines, without a substantial increase in dialogue length (Liu et al., 10 Jul 2025).
1. System Architecture and Modular Pipeline
DocCHA is structured as a sequential, three-stage pipeline reflecting canonical clinical practice:
- Symptom Elicitation: Extraction and clarification of presented symptoms.
- History Acquisition: Collection of relevant patient history, exposures, and risk factors.
- Causal Graph Construction & Refinement: Reasoned chaining of symptoms and history into explanatory links leading to candidate diagnoses.
All modules interact via a shared global confidence state, which controls stage advancement and triggers iterative backtracking or refinement when uncertainty or missing data are detected. Low confidence in a downstream module can force a return to prior stages, supporting adaptive multi-turn dialogue. The overall workflow enforces strict query budgets (a maximum of four follow-ups per module, by default) for computational and user efficiency.
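Diagrammatic overview (text):
Symptom Elicitation → History Acquisition → Causal Graph Construction & Refinement → candidate diagnoses, with each transition gated by the shared confidence state and low confidence returning control to an earlier stage.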
2. Confidence Quantification Mechanisms
Each module computes an interpretable, multi-component confidence score, combining coverage, detail, relevance, and model uncertainty as appropriate. A module proceeds to the next stage only if its confidence score meets a pre-set threshold; otherwise, it triggers additional evidence-seeking queries or clarifications.
2.1 Symptom Elicitation Module
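This module scores the extracted symptoms against the current candidate diagnoses. Key metrics:
- Discriminative Power: how strongly a symptom separates the candidate diagnoses from one another.
- Coverage Confidence: how completely the hallmark symptoms of the candidate diagnoses have been addressed.
- Detail Confidence: how fully the attributes of each reported symptom (e.g., onset, severity, duration) are specified.
- Combined Symptom Confidence: the aggregate of the component scores, compared against the stage threshold.
Low combined symptom confidence triggers question selection according to what most increases discriminative yield or fills missing attribute details.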
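As a rough illustration of how such a score could be assembled, the sketch below computes coverage and detail and combines them with a weighted sum, with discriminative power kept as a separate ranking signal. The formulas, the equal weights, the attribute set, and every name used (`Symptom`, `discriminative_power`, `hallmarks`, and so on) are assumptions made for illustration; the source does not specify them.

```python
from dataclasses import dataclass, field

# Attribute names are the examples given in the text; treating them as the
# full attribute set is an assumption of this sketch.
SYMPTOM_ATTRIBUTES = ("onset", "severity", "duration")


@dataclass
class Symptom:
    name: str
    attributes: dict = field(default_factory=dict)  # e.g. {"onset": "yesterday"}


def discriminative_power(symptom: str, hallmarks: dict[str, set[str]]) -> float:
    """Assumed proxy: 1.0 when exactly one candidate lists the symptom,
    0.0 when all candidates (or none) list it."""
    n = len(hallmarks)
    hits = sum(symptom in listed for listed in hallmarks.values())
    if n <= 1 or hits == 0:
        return 0.0
    return (n - hits) / (n - 1)


def coverage_confidence(symptoms: list[Symptom], hallmarks: dict[str, set[str]]) -> float:
    """Share of hallmark symptoms, pooled over candidates, already mentioned."""
    expected = set().union(*hallmarks.values())
    if not expected:
        return 1.0
    return len({s.name for s in symptoms} & expected) / len(expected)


def detail_confidence(symptoms: list[Symptom]) -> float:
    """Share of attribute slots filled across the mentioned symptoms."""
    if not symptoms:
        return 0.0
    filled = sum(a in s.attributes for s in symptoms for a in SYMPTOM_ATTRIBUTES)
    return filled / (len(symptoms) * len(SYMPTOM_ATTRIBUTES))


def symptom_confidence(symptoms, hallmarks, w_cov=0.5, w_det=0.5) -> float:
    """Assumed combination: weighted sum of coverage and detail."""
    return (w_cov * coverage_confidence(symptoms, hallmarks)
            + w_det * detail_confidence(symptoms))
```

In this reading, a symptom shared by every candidate diagnosis scores zero discriminative power, so a follow-up about it would be deprioritized even if it is still unmentioned.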
2.2 History Acquisition Module
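This module compares the acquired history against the union of the history categories expected for each candidate diagnosis. Key metrics:
- Coverage: the share of expected history categories that have been collected.
- Relevance: how pertinent the acquired history is to the candidate diagnoses.
- Certainty: how confidently each history statement is established.
- Combined History Confidence: the aggregate of the three scores above, compared against the stage threshold.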
Follow-up policy targets missing history categories or low-certainty statements.
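A minimal sketch of the history score and its follow-up targets is given below. It assumes that relevance and certainty arrive as precomputed values in [0, 1] (for example from an LLM judgment), that coverage is the covered fraction of expected categories, and that the three are combined by a weighted sum; the weights, the `min_certainty` cutoff, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HistoryItem:
    category: str     # e.g. "medications", "exposures"
    relevance: float  # assumed precomputed score in [0, 1]
    certainty: float  # assumed precomputed score in [0, 1]


def history_confidence(items, expected_by_diagnosis, weights=(0.4, 0.3, 0.3)):
    """Return the three component scores and an assumed weighted combination."""
    expected = set().union(*expected_by_diagnosis.values())
    covered = {i.category for i in items} & expected
    coverage = len(covered) / len(expected) if expected else 1.0
    relevance = sum(i.relevance for i in items) / len(items) if items else 0.0
    certainty = sum(i.certainty for i in items) / len(items) if items else 0.0
    w_cov, w_rel, w_cer = weights
    combined = w_cov * coverage + w_rel * relevance + w_cer * certainty
    return {"coverage": coverage, "relevance": relevance,
            "certainty": certainty, "combined": combined}


def follow_up_targets(items, expected_by_diagnosis, min_certainty=0.5):
    """Missing categories first, then categories whose statements are uncertain."""
    expected = set().union(*expected_by_diagnosis.values())
    missing = sorted(expected - {i.category for i in items})
    uncertain = [i.category for i in items if i.certainty < min_certainty]
    return missing, uncertain
```

Returning the components alongside the combined value keeps the score interpretable: a follow-up policy can see whether a shortfall comes from missing categories or from hedged patient statements.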
2.3 Causal Graph Module
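A directed graph is built whose nodes are symptoms, history items, and candidate diagnoses, and whose directed edges encode hypothesized causal or temporal links. Scoring includes:
- Coherence (BARTScore): how coherently the explanatory chain reads.
- Medical Plausibility: whether each link is supported by medical knowledge such as UMLS.
- Entailment (MedNLI): whether the stated evidence entails each link.
- Combined Causal Confidence: the aggregate of the three scores above, compared against the stage threshold.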
Edges with weak entailment or lacking UMLS support are targeted for clarification.
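The edge-level scoring and clarification targeting can be sketched as follows. The sketch assumes that coherence and entailment scores have already been computed by external models and normalized to [0, 1] (raw BARTScore values are log-likelihoods, so some normalization step is presumed), that UMLS support is a boolean lookup result, and that the components are averaged with equal weights. None of these choices, nor the names `Edge` and `edges_to_clarify`, come from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str           # symptom or history node
    target: str           # diagnosis (or intermediate) node
    coherence: float      # assumed normalized coherence score in [0, 1]
    umls_supported: bool  # assumed result of a knowledge-base lookup
    entailment: float     # assumed entailment probability in [0, 1]


def edge_confidence(edge: Edge, weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Assumed weighted sum of coherence, plausibility, and entailment."""
    w_coh, w_pla, w_ent = weights
    return (w_coh * edge.coherence
            + w_pla * float(edge.umls_supported)
            + w_ent * edge.entailment)


def causal_confidence(edges: list[Edge]) -> float:
    """Graph-level score: mean edge confidence (0.0 for an empty graph)."""
    if not edges:
        return 0.0
    return sum(edge_confidence(e) for e in edges) / len(edges)


def edges_to_clarify(edges: list[Edge], min_entailment: float = 0.5) -> list[Edge]:
    """Edges with weak entailment or no knowledge-base support."""
    return [e for e in edges
            if e.entailment < min_entailment or not e.umls_supported]
```

Each flagged edge names a concrete source and target, so a clarification question can be phrased about that specific link rather than about the diagnosis as a whole.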
Example: Pipeline Pseudocode
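A minimal Python sketch of the staged control loop described in the preceding sections is given here. It is an illustrative reading, not the authors' implementation: the threshold value, the `Stage` interface, and the backtracking rule (when a stage exhausts its own budget, spend a follow-up in the nearest earlier stage that still has budget) are all assumptions. Only the default budget of four follow-ups per module is taken from the text.

```python
from dataclasses import dataclass
from typing import Callable

MAX_FOLLOW_UPS = 4  # default per-module budget stated in the text
THRESHOLD = 0.7     # assumed value; the source does not give one


@dataclass
class Stage:
    name: str
    score: Callable[[dict], float]  # confidence computed from the dialogue state
    ask: Callable[[dict], None]     # issues one follow-up and updates the state


def run_pipeline(stages: list[Stage], state: dict):
    """Run stages in order; return (trace of (stage, confidence), budget used)."""
    used = {s.name: 0 for s in stages}
    trace = []
    for i, stage in enumerate(stages):
        conf = stage.score(state)
        while conf < THRESHOLD:
            if used[stage.name] < MAX_FOLLOW_UPS:
                asker = stage  # clarify within the current stage
            else:
                # Own budget spent: backtrack to the nearest earlier stage
                # that still has follow-ups left (assumed policy).
                asker = next((s for s in reversed(stages[:i])
                              if used[s.name] < MAX_FOLLOW_UPS), None)
                if asker is None:
                    break  # every budget is exhausted; proceed regardless
            asker.ask(state)
            used[asker.name] += 1
            conf = stage.score(state)  # re-score with the new evidence
        trace.append((stage.name, conf))
    return trace, used
```

Every loop iteration consumes one unit of some stage's budget, so the loop terminates after at most four follow-ups per stage in total, which mirrors the strict query caps described above.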
3. Dialogue Policies and Iterative Reasoning
Symptom and history clarification policies are shaped by the discriminative utility of information and confidence shortfalls. Symptoms are prioritized by discriminative power, triggering distinct prompts for low detail versus low coverage:
- Low coverage: General prompts about unmentioned hallmark symptoms for differential diagnoses.
- Low detail: Attribute-specific follow-ups (e.g., onset, severity, duration).
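The two prompt types above can be sketched as a small selection function. The rule used here (address whichever of coverage or detail is lower, and pick the unmentioned symptom with the highest discriminative power) is one plausible reading of the policy; the threshold, the question wording, and all names are hypothetical.

```python
from typing import Optional

ATTRIBUTES = ("onset", "severity", "duration")  # examples named in the text


def first_missing_attribute(symptoms: dict[str, dict]) -> Optional[tuple[str, str]]:
    """Return the first (symptom, attribute) pair that has not been filled in."""
    for name, attrs in symptoms.items():
        for attr in ATTRIBUTES:
            if attr not in attrs:
                return name, attr
    return None


def select_follow_up(coverage: float, detail: float,
                     unmentioned_power: dict[str, float],
                     symptoms: dict[str, dict],
                     threshold: float = 0.7) -> Optional[str]:
    """Pick one follow-up question, or None if nothing needs clarifying."""
    gap = first_missing_attribute(symptoms)
    low_cov = coverage < threshold and bool(unmentioned_power)
    low_det = detail < threshold and gap is not None
    if low_cov and (not low_det or coverage <= detail):
        # Low coverage: ask about the most discriminative unmentioned symptom.
        best = max(unmentioned_power, key=unmentioned_power.get)
        return f"Have you also noticed {best}?"
    if low_det:
        # Low detail: ask for a specific missing attribute.
        name, attr = gap
        return f"Can you describe the {attr} of your {name}?"
    return None
```

For example, with low coverage and adequate detail the function asks about a new hallmark symptom, whereas with adequate coverage and low detail it asks for a missing attribute of a symptom the patient already reported.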
The history module targets under-documented categories (e.g., exposures, medications) and enforces hard query caps to stay within budget. Both modules feed the shared confidence state that governs stage advancement.
The causal graph module uses LLM-driven chain-of-thought to explain candidate diagnoses via a directed graph; this structure is iteratively refined through clarification of uncertainty-heavy or medically unsupported links.
4. Performance Evaluation
DocCHA's efficacy is demonstrated via benchmarking against state-of-the-art LLM baselines, using two Chinese clinical consultation datasets: IMCS21 (100 pediatric cases) and DX (120 multi-turn adult cases), both with gold-standard diagnostic labels. The following metrics are reported:
- Accuracy (Acc.): Exact match to gold diagnosis.
- Cosine similarity (cos): Embedding-based label proximity.
- Information Recall (Recall_info): Proportion of gold-standard cues elicited.
- Average turns (n): Dialogue length efficiency.
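The four metrics can be computed along the following lines. This is a sketch under stated assumptions: label embeddings are taken as precomputed vectors (the source does not name the embedding model), gold cues and elicited cues are represented as sets of strings, and a dialogue is a list of turns. The function names are illustrative.

```python
import math


def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of cases whose predicted label exactly matches the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def info_recall(elicited: set[str], gold_cues: set[str]) -> float:
    """Proportion of gold-standard cues that the dialogue elicited."""
    if not gold_cues:
        return 1.0
    return len(elicited & gold_cues) / len(gold_cues)


def average_turns(dialogues: list[list[str]]) -> float:
    """Mean number of turns per dialogue."""
    return sum(len(d) for d in dialogues) / len(dialogues)
```

Exact-match accuracy penalizes near-synonymous labels, which is presumably why the embedding-based cosine score is reported alongside it.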
Results Summary
| Method | Accuracy (%) | Cosine | Recall_info (%) | Avg. turns |
|---|---|---|---|---|
| LLaMA-3 | 66.23 | 56.07 | 47.72 | 5.6 |
| GPT-3.5 | 87.69 | 62.20 | 52.04 | 6.4 |
| GPT-4o | 90.68 | 62.27 | 52.12 | 6.8 |
| DocCHA (GPT-4o) | 95.86 | 65.68 | 54.49 | 7.1 |
Relative to GPT-4o, DocCHA gains 5.18 percentage points in accuracy and 2.37 percentage points in information recall, at a cost of only 0.3 additional dialogue turns on average. On the DX dataset, accuracy reaches 94.14% (+4.18% over GPT-4o), with ~30% higher cue recall. Ablation experiments confirm that each module contributes: removing the symptom module reduces accuracy by ~2.4%, while removing graph reasoning costs ~1.6% (Liu et al., 10 Jul 2025).
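The differences quoted for the table can be reproduced directly from its GPT-4o and DocCHA rows. The short check below also computes the relative accuracy gain, to make explicit that 5.18 is an absolute difference in percentage points rather than a relative improvement; the variable names are illustrative.

```python
# Values copied from the GPT-4o and DocCHA (GPT-4o) rows of the results table.
rows = {
    "GPT-4o": {"accuracy": 90.68, "recall_info": 52.12, "turns": 6.8},
    "DocCHA (GPT-4o)": {"accuracy": 95.86, "recall_info": 54.49, "turns": 7.1},
}

base, ours = rows["GPT-4o"], rows["DocCHA (GPT-4o)"]

# Absolute differences (percentage points for accuracy and recall).
delta = {key: round(ours[key] - base[key], 2) for key in base}

# Relative accuracy gain, in percent of the GPT-4o baseline.
relative_accuracy_gain = round(100 * delta["accuracy"] / base["accuracy"], 2)

print(delta)
print(relative_accuracy_gain)
```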
5. Multilingual and Resource-Aware Operation
DocCHA supports Chinese by direct prompting in Chinese or with bidirectional translation of patient utterances. Knowledge base grounding (e.g., UMLS/SemMedDB) is maintained via language-agnostic mapping once extraction is complete. This design facilitates seamless deployment for bilingual and multilingual populations.
In low-resource settings, DocCHA operates as a prompt-driven orchestration layer atop any third-party LLM (including open-source LLaMA-3 or proprietary GPT-4o), requiring no model fine-tuning. Strict quota enforcement and local confidence computation make the framework computationally efficient and practical for constrained environments.
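The backend-agnostic, prompt-only design can be sketched as below. The sketch treats any LLM as a plain function from prompt to completion and layers optional translation around it; this interface, the prompt wording, and the helper names (`make_extractor`, `with_translation`) are assumptions for illustration and do not correspond to a specific vendor API.

```python
from typing import Callable

# Assumed minimal interface: prompt in, completion out. A hosted or local
# model would be wrapped in a function of this shape.
LLMBackend = Callable[[str], str]


def make_extractor(backend: LLMBackend, language: str = "zh") -> Callable[[str], str]:
    """Wrap any backend as a symptom-extraction step using prompting only."""
    def extract(utterance: str) -> str:
        prompt = (f"[language={language}] List the symptoms mentioned "
                  f"by the patient:\n{utterance}")
        return backend(prompt)
    return extract


def with_translation(backend: LLMBackend,
                     to_model_lang: Callable[[str], str],
                     to_patient_lang: Callable[[str], str]) -> LLMBackend:
    """Add bidirectional translation around a backend without changing it."""
    def wrapped(prompt: str) -> str:
        return to_patient_lang(backend(to_model_lang(prompt)))
    return wrapped
```

Because the orchestration layer only depends on this function shape, swapping one model for another would not require touching the confidence logic or the dialogue policies.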
6. Clinical Relevance and Methodological Implications
DocCHA directly embeds clinical best practices (iterative clarification, prioritization of high-yield information, and explicit causal explanation) into its modular design. By quantitatively managing conversational uncertainty and resource allocation, the framework supports trustworthy, interpretable decision support in high-stakes dialogue settings. Empirical results indicate that DocCHA enhances both diagnostic accuracy and interpretability for interactive LLM-based health agents (Liu et al., 10 Jul 2025).
This suggests that modular, confidence-driven orchestration offers a viable pathway for advancing structured, efficient, and transparent LLM-enabled diagnostic systems in diverse, high-variance real-world contexts.