CliniChat: A Multi-Source Knowledge-Driven Framework for Clinical Interview Dialogue Reconstruction and Evaluation (2504.10418v1)

Published 14 Apr 2025 in cs.CL

Abstract: LLMs hold great promise for assisting clinical interviews due to their fluent interactive capabilities and extensive medical knowledge. However, the lack of high-quality interview dialogue data and widely accepted evaluation methods has significantly impeded this process. So we propose CliniChat, a framework that integrates multi-source knowledge to enable LLMs to simulate real-world clinical interviews. It consists of two modules: Clini-Recon and Clini-Eval, each responsible for reconstructing and evaluating interview dialogues, respectively. By incorporating three sources of knowledge, Clini-Recon transforms clinical notes into systematic, professional, and empathetic interview dialogues. Clini-Eval combines a comprehensive evaluation metric system with a two-phase automatic evaluation approach, enabling LLMs to assess interview performance like experts. We contribute MedQA-Dialog, a high-quality synthetic interview dialogue dataset, and CliniChatGLM, a model specialized for clinical interviews. Experimental results demonstrate that CliniChatGLM's interview capabilities undergo a comprehensive upgrade, particularly in history-taking, achieving state-of-the-art performance.

Summary

  • The paper introduces CliniChat, a framework that reconstructs high-quality clinical interview dialogues using a multi-source, knowledge-driven approach.
  • It details Clini-Recon and Clini-Eval modules that generate structured dialogues and provide automated evaluation via LLM-guided metrics.
  • Experiments demonstrate improved dialogue quality and cost-effective LLM performance, particularly in history taking and interview techniques for clinical applications.

This paper introduces CliniChat, a framework designed to enable LLMs to simulate realistic clinical interviews, addressing the challenges of data scarcity and lack of standardized evaluation methods in this domain (2504.10418). The framework aims to facilitate the development of LLM-based tools for assisting physicians in clinical interviews, particularly in history taking, which is crucial for diagnosis.

CliniChat consists of two core modules:

  1. Clini-Recon: Responsible for reconstructing high-quality clinical interview dialogues from clinical notes.
  2. Clini-Eval: Provides a method for automatically evaluating the quality of these simulated dialogues using LLMs.

Source Data and Preprocessing

Due to privacy concerns surrounding real clinical notes, the framework uses case-style exam questions from the MedQA-USMLE dataset [jin2021disease] as a proxy. These questions simulate clinical scenarios and follow a structure similar to the standardized SOAP (Subjective, Objective, Assessment, Plan) format used in clinical documentation. The paper analyzes 9,123 such questions, covering 3,154 diseases across 19 hospital departments.
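
For concreteness, here is a minimal sketch of how one such case question might be represented once reorganized along SOAP-like lines; the field names and example values are illustrative assumptions, not the paper's actual schema.

```python
# Minimal sketch only: field names and example values are illustrative assumptions,
# not the paper's actual schema.
from dataclasses import dataclass

@dataclass
class CaseRecord:
    """A MedQA-USMLE case question reorganized along SOAP-like lines."""
    subjective: str   # patient-reported history and symptoms
    objective: str    # vitals, physical exam, lab/imaging findings
    assessment: str   # diagnosis implied by the question and answer
    plan: str         # confirmatory tests (treatment is excluded downstream)
    department: str   # one of the 19 hospital departments
    disease: str      # one of the 3,154 covered diseases

record = CaseRecord(
    subjective="A 54-year-old man presents with 2 hours of crushing chest pain...",
    objective="BP 150/90 mmHg, HR 110/min; ECG shows ST elevation in leads II, III, aVF.",
    assessment="Inferior ST-elevation myocardial infarction",
    plan="Serial troponins, urgent coronary angiography",
    department="Cardiology",
    disease="Myocardial infarction",
)
```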

Clini-Recon: Dialogue Reconstruction

Clini-Recon reconstructs dialogues by treating the clinical note as the core source material and supplying the remaining components through four sub-tasks (a code sketch of the pipeline follows this list):

  1. Interview Planning: Instead of relying on LLMs for planning, which can be unreliable [valmeekam2023planning], Clini-Recon uses a manually crafted interview plan based on the SOAP structure, authoritative guidelines (like UCL's "GUIDE TO HISTORY TAKING AND EXAMINATION"), and input from experienced physicians. This plan defines the interview sections (Subjective, Objective, Assessment, Plan - excluding treatment) and the specific questions/interactions within each. It incorporates techniques like using open-ended and closed-ended questions, avoiding leading questions, and adding a "customized inquiry" section for specific patient groups/diseases. Placeholders are used for information not directly present in the source note.
  2. Knowledge Preparation: To fill the placeholders and bridge the knowledge gap between the source note and the planned interview (especially regarding omitted symptoms or diagnostic reasoning in MedQA questions), this step uses an LLM (specifically, the cost-effective GLM-4-Air) to perform clinical reasoning on the source note. It generates structured knowledge, including 'Preliminary Diagnosis' (Most Likely Disease, Differential Diagnosis), 'Diagnostic Basis,' 'Confirmatory Tests,' 'Signs and Symptoms,' 'Risk Factors,' and 'Customized Inquiry' elements. This generated knowledge is mapped to the placeholders in the interview plan.
  3. Role Setting: Defines the characteristics and rules for the simulated physician and patient. The physician is prompted to be empathetic, professional, and use clear language. The patient is prompted to be cooperative, answer honestly based on the source note, use lay language, and respond with "No" or "Not sure" to questions outside the note's scope.
  4. Dialogue Generation: Using the manually defined plan, the prepared knowledge, the role settings, and the original source note (MedQA question), an LLM (GLM-4-Air in this paper) generates the final multi-turn interview dialogue. This structured approach allows less advanced, cheaper LLMs to produce high-quality results.
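
The following sketch illustrates how the four sub-tasks could fit together in code. It assumes a hypothetical `call_llm` helper standing in for any chat-completion API (the paper uses GLM-4-Air); the plan template, placeholder names, and prompt wording are illustrative, not the authors' actual prompts.

```python
# A minimal sketch of the four Clini-Recon sub-tasks. `call_llm` is a hypothetical
# helper standing in for any chat-completion API (the paper uses GLM-4-Air); the plan
# template, placeholder names, and prompt wording are illustrative, not the authors'.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up a chat-completion client here")

# 1. Interview Planning: a manually crafted, SOAP-structured plan with placeholders.
INTERVIEW_PLAN = """\
[Subjective] Open-ended chief complaint, then closed-ended follow-ups on {signs_and_symptoms}.
[Customized Inquiry] Targeted questions on {risk_factors} and {customized_inquiry}.
[Objective] Review of {confirmatory_tests} results.
[Assessment] Discussion of {preliminary_diagnosis} and {diagnostic_basis}.
[Plan] Next diagnostic steps (treatment excluded)."""

# 3. Role Setting: behavioural rules for the two simulated speakers.
PHYSICIAN_ROLE = ("You are an empathetic, professional physician; use clear language "
                  "and ask at most two questions per turn.")
PATIENT_ROLE = ("You are a cooperative patient; answer honestly from the clinical note, "
                "use lay language, and reply 'No' or 'Not sure' to anything outside the note.")

def reconstruct_dialogue(clinical_note: str) -> str:
    # 2. Knowledge Preparation: LLM clinical reasoning over the note produces the
    # structured knowledge that the paper maps onto the plan's placeholders.
    knowledge = call_llm(
        "From the clinical note below, produce: Preliminary Diagnosis, Diagnostic Basis, "
        "Confirmatory Tests, Signs and Symptoms, Risk Factors, Customized Inquiry.\n\n"
        + clinical_note
    )
    # 4. Dialogue Generation: plan + knowledge + roles + source note -> multi-turn dialogue.
    generation_prompt = "\n\n".join([
        "Interview plan:\n" + INTERVIEW_PLAN,
        "Prepared knowledge:\n" + knowledge,
        "Physician role:\n" + PHYSICIAN_ROLE,
        "Patient role:\n" + PATIENT_ROLE,
        "Source clinical note:\n" + clinical_note,
        "Write the full multi-turn interview dialogue.",
    ])
    return call_llm(generation_prompt)
```

The point of this structure is that the clinical scaffolding comes entirely from the fixed plan and the prepared knowledge, which is why a less advanced model suffices for the final generation step.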

Using Clini-Recon with GLM-4-Air, the authors created the MedQA-Dialog dataset, containing 10,263 dialogues with an average of 32 turns.

Clini-Eval: Dialogue Evaluation

Clini-Eval provides a comprehensive evaluation methodology:

  1. Evaluation Metrics: A detailed metric system with 6 main categories and 30 sub-metrics was developed, drawing from real-world clinical interview scoring criteria (e.g., Peking Union Medical College, Tulane's MASTER scale) and adapting them for LLM simulations. Key metrics include "Mastery of Patient Medical History," "Interviewing Techniques," "Medical Examination Results Consistency," "Diagnosis Consistency," "Diagnostic Basis Consistency," and "Confirmatory Tests Consistency." Specific metrics like "Max Two Questions per Inquiry" and "Brief and To-the-Point Responses" cater to LLM interaction styles. (See Table 6 in the paper for full details).
  2. Demo2Eval Method: An LLM-based, two-phase automated evaluation approach inspired by demonstration teaching in medical training (a minimal code sketch follows this list):
    • Demo Generation: A powerful LLM (GPT-4o) acts as a "senior physician" to analyze the source clinical note and generate an "interview demonstration." This involves extracting/inferring diagnostic conclusions and creating a detailed history-taking plan based on the note.
    • Comparative Evaluation: The same LLM (GPT-4o), now acting as a "clinical instructor," evaluates the simulated interview dialogue by comparing it against the generated demonstration, using the defined metric system. It performs a subjective assessment followed by quantitative scoring.
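
Below is a minimal sketch of the two-phase Demo2Eval procedure, reusing the hypothetical `call_llm` helper from the earlier sketch (the paper uses GPT-4o for both phases). The metric names follow the six main categories listed above; the prompt wording is illustrative.

```python
# A minimal sketch of the two-phase Demo2Eval procedure, reusing the hypothetical
# `call_llm` helper from the earlier sketch (the paper uses GPT-4o for both phases).
# The metric names follow the six main categories; the prompts are illustrative.

METRIC_CATEGORIES = [
    "Mastery of Patient Medical History",
    "Interviewing Techniques",
    "Medical Examination Results Consistency",
    "Diagnosis Consistency",
    "Diagnostic Basis Consistency",
    "Confirmatory Tests Consistency",
]

def demo2eval(clinical_note: str, simulated_dialogue: str) -> str:
    # Phase 1 (Demo Generation): a "senior physician" builds an interview demonstration.
    demo = call_llm(
        "Act as a senior physician. From the clinical note below, extract or infer the "
        "diagnostic conclusions and write a detailed history-taking plan.\n\n" + clinical_note
    )
    # Phase 2 (Comparative Evaluation): a "clinical instructor" compares the simulated
    # dialogue against the demonstration, then scores each metric category.
    return call_llm(
        "Act as a clinical instructor. Compare the simulated interview with the "
        "demonstration. Give a qualitative assessment, then a quantitative score for: "
        + "; ".join(METRIC_CATEGORIES)
        + "\n\nDemonstration:\n" + demo
        + "\n\nSimulated interview:\n" + simulated_dialogue
    )
```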

Experiments and Implementation

  • Intrinsic Evaluation: Dialogues generated by Clini-Recon (+ GLM-4-Air) were compared to dialogues generated using direct and interactive role-play prompts with GLM-4-Air and GPT-4o. Clini-Recon produced significantly longer dialogues (avg. 28.7 turns vs. ~8-11) with more concise utterances, achieving much higher scores (+28.9% total score vs. best baseline), especially in "Mastery of Patient Medical History" (+50.6%) and "Interview Techniques" (+22.5%). Adaptability varied by department, performing best in Cardiology/Neurology and less effectively in Psychiatry.
  • Extrinsic Evaluation: The ChatGLM2-6B model was fine-tuned on the MedQA-Dialog dataset using P-Tuning v2 [liu2021p] to create CliniChatGLM.
    • Hyperparameters: Key settings included prefix_sequence_length=128, max_source_length=2048, max_target_length=128, and learning_rate=1e-2 (see Table 5 in the paper and the configuration sketch after this list).
    • Evaluation: CliniChatGLM was compared against its base model, other medical LLMs (BianQue, HuatuoGPT), and general LLMs (GLM-4-Air, Spark4.0 Ultra) in simulated interviews where GLM-4-Air played the patient role based on MedQA test set questions. Clini-Eval was used for assessment.
    • Results: CliniChatGLM significantly outperformed other models, including GLM-4-Air, in history-taking (+36.4%) and interview techniques (+28.7%), demonstrating the effectiveness of fine-tuning on the structured MedQA-Dialog data. It inherited the high turn count and conciseness. While its diagnostic consistency improved vastly over the base model, it still lagged slightly behind the more powerful GLM-4-Air in these areas, suggesting a need for further enhancement of its knowledge base and reasoning.
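
As a rough illustration of how these settings fit together, here is a hedged configuration sketch; the key names mirror common P-Tuning v2 setups for ChatGLM2-6B rather than the authors' exact script, and the data path is hypothetical.

```python
# A rough configuration sketch assuming the hyperparameters reported in the paper's
# Table 5; the key names mirror common P-Tuning v2 setups for ChatGLM2-6B and the
# data path is hypothetical, as only the values themselves come from the paper.
ptuning_config = {
    "base_model": "THUDM/chatglm2-6b",
    "train_file": "medqa_dialog_train.json",  # hypothetical path to MedQA-Dialog
    "pre_seq_len": 128,          # "prefix_sequence_length" in the paper
    "max_source_length": 2048,   # maximum source (input) length
    "max_target_length": 128,    # maximum length of the generated physician utterance
    "learning_rate": 1e-2,       # a high LR is typical when only the prefix is trained
}
```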

Practical Implications and Contributions

  • End-to-End Framework: Provides a structured pipeline from data generation (Clini-Recon) to model training and evaluation (Clini-Eval) for developing LLM-based clinical interview assistants.
  • High-Quality Synthetic Data: The MedQA-Dialog dataset offers a valuable resource for training LLMs in clinical dialogue, simulating key aspects like structured history taking and empathy.
  • Cost-Effective Data Generation: Clini-Recon's design allows the use of less expensive LLMs (like GLM-4-Air) for dialogue generation, reducing costs compared to relying solely on top-tier models.
  • Specialized Model: CliniChatGLM demonstrates the potential of fine-tuning smaller models on specialized, high-quality dialogue data for specific clinical tasks like history taking.
  • Comprehensive Evaluation: Clini-Eval and its Demo2Eval method offer a detailed, automated way to assess LLM performance in simulated interviews, using metrics relevant to clinical practice.
  • Open Resources: The authors plan to release the MedQA-Dialog dataset, CliniChatGLM model, and code, facilitating further research and application.

The paper acknowledges limitations, such as not using state-of-the-art LLMs for reconstruction and relying solely on automated evaluation. It also highlights the ethical considerations, emphasizing that CliniChatGLM is a research model not intended for direct clinical use due to potential inaccuracies and limitations in flexibility compared to human physicians.
