DocCHA Framework for Diagnostic Dialogue

Updated 29 May 2026

The paper demonstrates that modular, confidence-guided reasoning in DocCHA improves diagnostic accuracy by up to 5.18 percentage points compared to conventional LLM pipelines.
DocCHA is a framework that segments diagnostic consultations into symptom elicitation, history acquisition, and causal graph reasoning to mimic clinical reasoning.
Its iterative confidence quantification mechanisms trigger dynamic clarifications, ensuring structured diagnoses while maintaining computational efficiency.

DocCHA is a modular, confidence-aware framework for LLM-driven online diagnostic dialogue, explicitly designed to emulate clinical reasoning. Unlike static Conversational Health Agents (CHAs), DocCHA dynamically prioritizes symptom clarification, history acquisition, and explanatory causal reasoning through a staged, confidence-guided process, producing structured and transparent diagnostic consultations. Empirical evaluation on real-world Chinese datasets (IMCS21, DX) shows that DocCHA, when instantiated with GPT-4o, achieves up to 5.18 percentage points higher diagnostic accuracy and over 30% greater information recall than prompting-based LLM pipelines, without a substantial increase in dialogue length (Liu et al., 10 Jul 2025).

1. System Architecture and Modular Pipeline

DocCHA is structured as a sequential, three-stage pipeline reflecting canonical clinical practice:

Symptom Elicitation: Extraction and clarification of presented symptoms.
History Acquisition: Collection of relevant patient history, exposures, and risk factors.
Causal Graph Construction & Refinement: Reasoned chaining of symptoms and history into explanatory links leading to candidate diagnoses.

All modules interact via a shared global confidence state, which controls stage advancement and triggers iterative backtracking or refinement when uncertainty or missing data are detected. Low confidence in a downstream module can force a return to prior stages, supporting adaptive multi-turn dialogue. The overall workflow enforces strict query budgets (by default, a maximum of four follow-ups per module) to bound both computation and the number of questions put to the patient.

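Diagrammatic overview (text):

Symptom Elicitation → History Acquisition → Causal Graph Construction & Refinement → candidate diagnoses, with each transition gated by the shared confidence state and low downstream confidence routing the dialogue back to an earlier stage.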

2. Confidence Quantification Mechanisms

Each module computes an interpretable, multi-component confidence score, combining coverage, detail, relevance, and model uncertainty as appropriate. A module proceeds to the next stage only if its confidence score meets a pre-set threshold; otherwise, it triggers additional evidence-seeking queries or clarifications.

2.1 Symptom Elicitation Module

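Let $S_n = \{s_1, \ldots, s_n\}$ denote extracted symptoms and $D = \{d_1, \ldots, d_K\}$ candidate diagnoses. In the formulas below, $S_{d_k}$ reads as the set of symptoms expected under diagnosis $d_k$, $A(s_i)$ as the attributes collected so far for symptom $s_i$, $A_{req}(s_i)$ as the attributes required for it, and $\alpha \in [0, 1]$ as a weighting parameter. Key metrics: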

Discriminative Power: $DP(s_i) = \mathrm{Var}_{d_k \in D} [P(s_i|d_k)]$
Coverage Confidence: $C_{cov} = \max_{d_k} |S_n \cap S_{d_k}| / |S_{d_k}|$
Detail Confidence: $C_{det} = (1/|S_n|) \sum_{s_i} |A(s_i)| / |A_{req}(s_i)|$
Combined Symptom Confidence: $C_{sym} = \alpha \cdot C_{cov} + (1-\alpha) \cdot C_{det}$

Low $C_{sym}$ triggers question selection according to what most increases discriminative yield or fills missing attribute details.
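
The four symptom metrics above can be computed directly from sets and likelihood lists. The following is a minimal sketch, not the authors' implementation: the function names, the toy data, the use of population variance for $DP$, and the default $\alpha = 0.5$ are all assumptions made for illustration.

```python
from statistics import pvariance


def discriminative_power(likelihoods):
    """DP(s_i): variance of P(s_i | d_k) across the candidate diagnoses.

    Population variance is an assumption; the text only writes Var.
    """
    return pvariance(likelihoods)


def coverage_confidence(extracted, expected_by_dx):
    """C_cov: best fraction of any diagnosis's expected symptoms already extracted."""
    return max(len(extracted & expected) / len(expected)
               for expected in expected_by_dx.values())


def detail_confidence(collected, required):
    """C_det: mean fraction of required attributes collected per extracted symptom."""
    # Intersecting with the required set keeps each ratio in [0, 1] (an assumption).
    ratios = [len(collected[s] & required[s]) / len(required[s]) for s in collected]
    return sum(ratios) / len(ratios)


def symptom_confidence(c_cov, c_det, alpha=0.5):
    """C_sym = alpha * C_cov + (1 - alpha) * C_det; alpha=0.5 is an assumed default."""
    return alpha * c_cov + (1 - alpha) * c_det


# Toy data, invented for illustration only.
extracted = {"cough", "fever"}
expected_by_dx = {
    "bronchitis": {"cough", "fever", "wheezing", "sputum"},
    "pneumonia": {"cough", "fever", "chest pain", "dyspnea", "chills"},
}
collected = {"cough": {"onset", "severity"}, "fever": {"onset", "severity"}}
required = {
    "cough": {"onset", "severity", "duration", "character"},
    "fever": {"onset", "severity"},
}

c_cov = coverage_confidence(extracted, expected_by_dx)  # bronchitis: 2 of 4 expected
c_det = detail_confidence(collected, required)          # cough 2/4, fever 2/2
c_sym = symptom_confidence(c_cov, c_det)
```

A symptom whose likelihood is the same under every diagnosis has zero discriminative power under this definition, which is why asking about it does little to separate the candidates.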

2.2 History Acquisition Module

Let $H_m = \{h_1, \ldots, h_m\}$ represent the acquired history, and let $H_{(D)}$ denote the union of the history categories expected across all candidate diagnoses. Key metrics:

Coverage: $C_{cov}^h = |H_m \cap H_{(D)}| / |H_{(D)}|$
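Relevance: how strongly the acquired history items bear on the candidate diagnoses in $D$.
Certainty: how certain each acquired history statement is; low-certainty statements are candidates for follow-up.
Combined History Confidence: an aggregate of coverage, relevance, and certainty that is compared against the module threshold.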

Follow-up policy targets missing history categories or low-certainty statements.
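
The coverage formula and the follow-up policy can be sketched as follows. This is an illustrative reading rather than the paper's code: the per-statement certainty scores, the `min_certainty` cutoff, and the ordering (missing categories first, then least certain statements) are assumptions; only the budget of four follow-ups comes from the text.

```python
def history_coverage(acquired, expected_by_dx):
    """Return (C_cov^h, missing categories) for the acquired history set."""
    expected = set().union(*expected_by_dx.values())  # H_(D): union over diagnoses
    if not expected:
        return 1.0, []
    missing = sorted(expected - acquired)
    return len(acquired & expected) / len(expected), missing


def followup_targets(missing, certainty, min_certainty=0.6, budget=4):
    """Pick what to ask about next: missing categories, then uncertain statements."""
    uncertain = sorted(
        (h for h, c in certainty.items() if c < min_certainty),
        key=certainty.get,  # least certain first
    )
    return (missing + uncertain)[:budget]


# Toy data, invented for illustration only.
acquired = {"medications", "allergies"}
expected_by_dx = {
    "asthma": {"allergies", "family history", "exposures"},
    "bronchitis": {"smoking", "exposures", "medications"},
}
certainty = {"medications": 0.9, "allergies": 0.4}  # hypothetical scores in [0, 1]

c_cov_h, missing = history_coverage(acquired, expected_by_dx)
targets = followup_targets(missing, certainty)
```

With these inputs two of five expected categories are covered, and the four available follow-ups go to the three missing categories plus the one low-certainty statement.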

2.3 Causal Graph Module

A directed graph $G = (V, E)$ is built with nodes $V$ comprising symptoms, history items, and diagnoses; $E$ contains directed edges encoding hypothesized causal or temporal links. Scoring includes:

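Coherence (BARTScore): a BARTScore-based measure of how coherently the graph's reasoning chain reads.
Medical Plausibility: the degree to which edges are supported by medical knowledge, e.g., UMLS relations.
Entailment (MedNLI): an entailment score for each edge, obtained by natural language inference in the style of MedNLI.
Combined Causal Confidence: an aggregate of the coherence, plausibility, and entailment scores that is compared against the module threshold.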

Edges with weak entailment or lacking UMLS support are targeted for clarification.
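
One way to picture the clarification rule is as a filter over scored edges. The sketch below assumes each edge already carries an entailment score and a knowledge-base support flag; the `Edge` fields, the 0.5 cutoff, and the example edges are hypothetical, and no real NLI model or UMLS lookup is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str         # symptom or history node
    target: str         # downstream node, e.g. a candidate diagnosis
    entailment: float   # hypothetical NLI-style score in [0, 1]
    kb_supported: bool  # hypothetical flag: a knowledge-base relation was found


def edges_needing_clarification(edges, min_entailment=0.5):
    """Return edges with weak entailment or no knowledge-base support."""
    return [e for e in edges if e.entailment < min_entailment or not e.kb_supported]


# Toy graph, invented for illustration only.
graph = [
    Edge("smoking history", "chronic cough", entailment=0.91, kb_supported=True),
    Edge("chronic cough", "bronchitis", entailment=0.42, kb_supported=True),
    Edge("recent travel", "bronchitis", entailment=0.77, kb_supported=False),
]

flagged = edges_needing_clarification(graph)
```

Here the second edge is flagged for weak entailment and the third for lacking knowledge-base support, so both would become targets for a clarifying question.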

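Example: Pipeline Pseudocode

Each module asks follow-up questions until its confidence score reaches the threshold or its follow-up budget is exhausted, then hands off to the next stage; low downstream confidence can send control back to an earlier stage.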
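
A minimal sketch of the confidence-gated stage loop, under stated assumptions: the `Stage` class, the 0.8 threshold, and the toy scorers are hypothetical, and backtracking to earlier stages is omitted to keep the example short. Only the default budget of four follow-ups per module is taken from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Stage:
    name: str
    confidence: Callable[[List[str]], float]  # hypothetical scorer over evidence
    threshold: float = 0.8                    # assumed value, not from the source
    max_followups: int = 4                    # default budget stated in the text


def run_pipeline(stages: List[Stage],
                 ask: Callable[[str], str]) -> Tuple[List[str], Dict[str, int]]:
    """Run stages in order; each asks until confident or out of budget."""
    evidence: List[str] = []
    followups_used: Dict[str, int] = {}
    for stage in stages:
        used = 0
        while stage.confidence(evidence) < stage.threshold and used < stage.max_followups:
            evidence.append(ask(stage.name))  # one follow-up turn with the patient
            used += 1
        followups_used[stage.name] = used
    return evidence, followups_used


def make_scorer(name: str, needed: int) -> Callable[[List[str]], float]:
    """Toy scorer: confidence grows with the number of replies for this stage."""
    return lambda ev: min(1.0, sum(e.startswith(name) for e in ev) / needed)


stages = [
    Stage("symptoms", make_scorer("symptoms", 2)),
    Stage("history", make_scorer("history", 10)),  # cannot reach 0.8 within budget
    Stage("graph", make_scorer("graph", 1)),
]
evidence, followups_used = run_pipeline(stages, ask=lambda s: f"{s}: patient reply")
```

In this toy run the history stage stops at its budget rather than its threshold, which is the situation where a fuller implementation would record the shortfall in the shared confidence state.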

3. Dialogue Policies and Iterative Reasoning

Symptom and history clarification policies are shaped by the discriminative utility of information and confidence shortfalls. Symptoms are prioritized by discriminative power, triggering distinct prompts for low detail versus low coverage:

Low coverage: General prompts about unmentioned hallmark symptoms for differential diagnoses.
Low detail: Attribute-specific follow-ups (e.g., onset, severity, duration).
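
The two prompt types can be sketched as a small selection function. The rule of addressing the weaker of the two scores first, the 0.8 threshold, and all names and data below are assumptions for illustration; the source states only that symptoms are prioritized by discriminative power and that low coverage and low detail trigger different prompts.

```python
def choose_followup(c_cov, c_det, dp, unmentioned, missing_attrs, threshold=0.8):
    """Return (kind, symptom, attribute) for the next question, or None.

    dp maps symptom -> discriminative power; unmentioned lists hallmark symptoms
    not yet discussed; missing_attrs maps symptom -> attributes still needed.
    """
    if c_cov < threshold and c_cov <= c_det and unmentioned:
        # Low coverage: ask about the most discriminative unmentioned symptom.
        symptom = max(unmentioned, key=lambda s: dp.get(s, 0.0))
        return ("coverage", symptom, None)
    if c_det < threshold and missing_attrs:
        # Low detail: ask for a missing attribute of the most discriminative symptom.
        symptom = max(missing_attrs, key=lambda s: dp.get(s, 0.0))
        return ("detail", symptom, missing_attrs[symptom][0])
    return None


# Toy values, invented for illustration only.
dp = {"wheezing": 0.16, "chills": 0.04, "cough": 0.01, "fever": 0.09}

low_cov = choose_followup(0.4, 0.9, dp, ["wheezing", "chills"], {})
low_det = choose_followup(0.9, 0.5, dp, [], {"cough": ["duration"], "fever": ["severity"]})
confident = choose_followup(0.9, 0.9, dp, [], {})
```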

The history module targets under-documented categories (e.g., exposures, medications) and enforces hard query caps to stay within budget. Both modules report their scores to the shared confidence state, which decides whether to advance, ask again, or backtrack.

The causal graph module uses LLM-driven chain-of-thought to explain candidate diagnoses via a directed graph; this structure is iteratively refined through clarification of uncertainty-heavy or medically unsupported links.

4. Performance Evaluation

DocCHA's efficacy is demonstrated via benchmarking against state-of-the-art LLM baselines, using two Chinese clinical consultation datasets: IMCS21 (100 pediatric cases) and DX (120 multi-turn adult cases), both with gold-standard diagnostic labels. The following metrics are reported:

Accuracy (Acc.): Exact match to gold diagnosis.
Cosine similarity (cos): Embedding-based label proximity.
Information Recall (Recall_info): Proportion of gold-standard cues elicited.
Average turns (n): Dialogue length efficiency.
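
For concreteness, the first three metrics can be written as below. This is a sketch of the standard definitions only: how the paper aggregates recall across dialogues and which embedding model produces the label vectors are not stated here, so the inputs are plain Python lists and sets chosen for illustration.

```python
import math


def accuracy(predicted, gold):
    """Fraction of cases whose predicted label exactly matches the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def information_recall(elicited, gold_cues):
    """Fraction of gold-standard cues that the dialogue elicited."""
    return len(elicited & gold_cues) / len(gold_cues)


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# Toy inputs, invented for illustration only.
acc = accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"])
rec = information_recall({"cough", "fever", "rash"},
                         {"cough", "fever", "wheezing", "night sweats"})
same = cosine_similarity([3.0, 4.0], [3.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```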

Results Summary

Method	Accuracy (%)	Cosine	Recall_info (%)	Avg. turns
LLaMA-3	66.23	56.07	47.72	5.6
GPT-3.5	87.69	62.20	52.04	6.4
GPT-4o	90.68	62.27	52.12	6.8
DocCHA (GPT-4o)	95.86	65.68	54.49	7.1

Relative to GPT-4o alone, DocCHA gains 5.18 percentage points in accuracy and 2.37 percentage points in information recall, at a cost of only 0.3 additional dialogue turns on average. On the DX dataset, accuracy reaches 94.14% (+4.18% over GPT-4o), with ~30% higher cue recall. Ablation experiments confirm critical contributions from each module: removing the symptom module reduces accuracy by ~2.4%, while removing graph reasoning costs ~1.6% (Liu et al., 10 Jul 2025).
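
The headline differences follow directly from the GPT-4o and DocCHA rows of the results table. The short check below recomputes them and also shows the relative accuracy gain, which is a derived figure not reported in the source; the dictionary layout is just a convenient assumption.

```python
# Values copied from the GPT-4o and DocCHA (GPT-4o) rows of the results table.
baseline = {"accuracy": 90.68, "recall_info": 52.12, "turns": 6.8}
doccha = {"accuracy": 95.86, "recall_info": 54.49, "turns": 7.1}

# Absolute differences (percentage points for accuracy and recall).
delta = {k: round(doccha[k] - baseline[k], 2) for k in baseline}

# Relative accuracy gain in percent of the baseline (derived, not from the source).
relative_accuracy_gain = round(100 * delta["accuracy"] / baseline["accuracy"], 2)

print(delta)
print(relative_accuracy_gain)
```

The gap between the absolute and relative figures is why the 5.18 figure is best read as percentage points rather than as a relative improvement.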

5. Multilingual and Resource-Aware Operation

DocCHA supports Chinese by direct prompting in Chinese or with bidirectional translation of patient utterances. Knowledge base grounding (e.g., UMLS/SemMedDB) is maintained via language-agnostic mapping once extraction is complete. This design facilitates seamless deployment for bilingual and multilingual populations.

In low-resource settings, DocCHA operates as a prompt-driven orchestration layer atop any third-party LLM (including open-source LLaMA-3 or proprietary GPT-4o), requiring no model fine-tuning. Strict quota enforcement and local confidence computation make the framework computationally efficient and practical for constrained environments.
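
The idea of a prompt-driven layer over an interchangeable model can be sketched with a small interface. Everything here is hypothetical: `LLMBackend`, `CannedBackend`, and `BudgetedModule` are illustrative names, and the canned backend stands in for a real model call so the example runs offline.

```python
from typing import Optional, Protocol


class LLMBackend(Protocol):
    """Anything that can turn a prompt into text (hosted or local model)."""

    def complete(self, prompt: str) -> str: ...


class CannedBackend:
    """Stand-in backend returning a fixed reply; a real one would call a model."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def complete(self, prompt: str) -> str:
        return self.reply


class BudgetedModule:
    """Wraps a backend and refuses further follow-ups once the quota is spent."""

    def __init__(self, backend: LLMBackend, max_followups: int = 4) -> None:
        self.backend = backend
        self.remaining = max_followups

    def follow_up(self, prompt: str) -> Optional[str]:
        if self.remaining == 0:
            return None  # quota exhausted: no further model call is made
        self.remaining -= 1
        return self.backend.complete(prompt)


module = BudgetedModule(CannedBackend("When did the cough start?"), max_followups=2)
replies = [module.follow_up("Generate the next history question.") for _ in range(3)]
```

Because the module depends only on the `complete` method, swapping one model for another means supplying a different backend object, with no change to the orchestration logic.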

6. Clinical Relevance and Methodological Implications

DocCHA directly embeds clinical best practices—such as iterative clarification, prioritization of high-yield information, and explicit causal explanation—into its modular design. By quantitatively managing conversational uncertainty and resource allocation, the framework supports trustworthy, interpretable decision support in high-stakes dialogue settings. Empirical results indicate that DocCHA enhances both diagnostic accuracy and interpretability for interactive LLM-based health agents (Liu et al., 10 Jul 2025).

This suggests that modular, confidence-driven orchestration offers a viable pathway for advancing structured, efficient, and transparent LLM-enabled diagnostic systems in diverse, high-variance real-world contexts.

References (1)

Liu et al. (10 Jul 2025). DocCHA: Towards LLM-Augmented Interactive Online Diagnosis System.
