A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models

Published 23 Feb 2024 in cs.CL, cs.AI, and cs.LG | (2402.15422v2)

Abstract: Patients often face difficulties in understanding their hospitalizations, while healthcare workers have limited resources to provide explanations. In this work, we investigate the potential of LLMs to generate patient summaries based on doctors' notes and study the effect of training data on the faithfulness and quality of the generated summaries. To this end, we release (i) a rigorous labeling protocol for errors in medical texts and (ii) a publicly available dataset of annotated hallucinations in 100 doctor-written and 100 generated summaries. We show that fine-tuning on hallucination-free data effectively reduces hallucinations from 2.60 to 1.55 per summary for Llama 2, while preserving relevant information. We observe a similar effect on GPT-4 (0.70 to 0.40), when the few-shot examples are hallucination-free. We also conduct a qualitative evaluation using hallucination-free and improved training data. We find that common quantitative metrics do not correlate well with faithfulness and quality. Finally, we test GPT-4 for automatic hallucination detection, which clearly outperforms common baselines.

Summary

The paper introduces a data-centric method that uses detailed hallucination annotation to curate a high-quality dataset for training LLMs.
The study demonstrates that fine-tuning models on cleaned, hallucination-free summaries can significantly reduce hallucinations while preserving essential medical information.
The evaluation reveals that automatic metrics like ROUGE often fail to capture human judgments of quality, emphasizing the need for expert evaluation in clinical settings.

This essay summarizes the paper "A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models" (2402.15422) and discusses its practical implications. The authors investigate the application of LLMs to generate patient summaries from doctors' notes, emphasizing the impact of training data quality on faithfulness and overall quality. The study introduces a rigorous protocol for hallucination annotation, which is then used to curate a high-quality dataset for fine-tuning LLMs.

The Imperative for Patient-Oriented Summaries

The medical community faces a persistent challenge wherein a substantial portion of patients struggle to comprehend their hospitalization experiences and subsequent care instructions. This comprehension gap has demonstrable negative consequences, including increased hospital readmission rates and reduced adherence to treatment regimens. Patient-oriented summaries, phrased in layperson language and encapsulating critical medical facts, represent a potential intervention to bridge this gap. However, the manual creation of such high-quality summaries is a labor-intensive endeavor, exacerbating the already heavy workload of healthcare professionals.

LLMs have shown considerable promise in various natural language generation (NLG) tasks, including medical summarization. Nevertheless, a significant obstacle to their deployment in clinical settings is their propensity for generating "hallucinations"—factual inaccuracies or information not supported by the source text. In medical contexts, such inaccuracies are particularly critical due to potential patient harm. The fragmented nature of clinical datasets, which may not always contain the complete patient history, further complicates this issue, as human-written summaries in these datasets may themselves contain unsupported information, leading to the propagation of such artifacts when used for model training.

Data Curation and Hallucination Annotation Protocol

To address the challenge of hallucinations, the authors adopt a data-centric approach, focusing on improving the quality of training data rather than solely refining model architectures or training algorithms. They construct a filtered dataset, MIMIC-IV-Note-DI, from the MIMIC-IV-Note dataset, comprising 100,175 hospital course and patient summary pairs. A crucial aspect of this work is the development and application of a detailed annotation protocol for identifying hallucinations in patient summaries. This protocol, adapted from prior work, divides hallucinations into nine subcategories of unsupported facts (e.g., condition, procedure, medication, time) and two general error labels, contradicted facts and incorrect facts, for eleven labels in total. Unlike previous approaches that might consider external knowledge, the protocol treats the Brief Hospital Course (BHC) section of the doctor's note as the sole ground truth for factual accuracy, which streamlines the annotation process for medical experts.
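A span-level annotation of this kind could be represented roughly as follows. This is a minimal sketch, not the authors' data format: the class name `HallucinationSpan`, the label strings, and the example sentence are all hypothetical, and only the label types named in this essay are listed (the remaining unsupported-fact subcategories are not reproduced here).

```python
from dataclasses import dataclass

# Hypothetical label strings. Only subtypes named in this essay are included;
# the protocol defines nine unsupported-fact subcategories in total.
UNSUPPORTED_SUBTYPES = ("condition", "procedure", "medication", "time", "word")
GENERAL_LABELS = ("contradicted_fact", "incorrect_fact")
KNOWN_LABELS = {f"unsupported_{s}" for s in UNSUPPORTED_SUBTYPES} | set(GENERAL_LABELS)


@dataclass(frozen=True)
class HallucinationSpan:
    """One annotated span in a summary (character offsets, end exclusive)."""

    start: int
    end: int
    label: str

    def __post_init__(self) -> None:
        if not (0 <= self.start < self.end):
            raise ValueError("span must satisfy 0 <= start < end")
        if self.label not in KNOWN_LABELS:
            raise ValueError(f"unknown label: {self.label}")

    def text(self, summary: str) -> str:
        """Return the annotated substring of the summary."""
        return summary[self.start:self.end]


# Invented example: the medication is assumed to be absent from the BHC.
summary = "You were given aspirin for three days."
start = summary.index("aspirin")
span = HallucinationSpan(start, start + len("aspirin"), "unsupported_medication")
print(span.label, "->", span.text(summary))
```

Storing offsets rather than the text itself keeps each annotation tied to one exact position in the summary, which is what span-overlap agreement measures need.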

Figure 1: A synthetic MIMIC example labeled with the developed annotation protocol for hallucinations. The protocol was adapted from [thomson_gold_2020] and we used eleven different labels.

Two medical students, having completed their second state examination, independently annotated 100 real-world MIMIC summaries and 100 machine-generated summaries using this protocol. The annotation process, including training and consensus discussions, revealed the inherent complexity and subjectivity of hallucination detection. Notably, "word unsupported" was the most frequent hallucination type in both human-written and generated summaries, indicating nuanced errors beyond simple entity misrepresentations. The inter-annotator agreement statistics underscore this difficulty, with F1-scores for span overlap remaining moderate even when ignoring class labels. This highlights the limitations of purely automatic detection and the continued necessity of expert human evaluation for robust quality assessment.
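To make the agreement measure concrete, the sketch below computes a label-agnostic span-overlap F1 between two annotators at the character level. The exact matching criterion used in the paper is not stated in this essay, so character-level overlap is an assumption, and the function names and example spans are invented for illustration.

```python
def covered_chars(spans):
    """Return the set of character positions covered by (start, end) spans."""
    chars = set()
    for start, end in spans:
        chars.update(range(start, end))  # end is exclusive
    return chars


def span_overlap_f1(spans_a, spans_b):
    """Character-level F1 between two annotators, ignoring class labels.

    Annotator A is treated as the reference; F1 is symmetric, so swapping
    the arguments gives the same value.
    """
    a = covered_chars(spans_a)
    b = covered_chars(spans_b)
    if not a and not b:
        return 1.0  # both marked nothing: perfect agreement
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(b)
    recall = overlap / len(a)
    return 2 * precision * recall / (precision + recall)


# Invented spans: the annotators agree on the regions but not the boundaries.
annotator_a = [(0, 10), (20, 30)]
annotator_b = [(5, 10), (20, 35)]
print(round(span_overlap_f1(annotator_a, annotator_b), 2))  # 0.75
```

Even when both annotators flag the same two regions, boundary disagreements alone pull the score down to 0.75 here, which illustrates how moderate F1 values can arise without any disagreement about labels.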

Impact of Data Quality on LLM Performance

The study shows that the quality of the training data directly affects the faithfulness of generated summaries. Fine-tuning Llama 2 70B on a "cleaned" dataset, where identified hallucinations were manually removed or replaced, reduced hallucinations from 2.60 to 1.55 per summary while preserving the number of key facts.

Figure 2: Generated summaries from Llama 70B trained on 100 original and 100 cleaned examples for synthetic context given in Figure 1.

GPT-4 showed a similar effect from a much lower baseline: hallucinations fell from 0.70 to 0.40 per summary when the few-shot examples were hallucination-free. The lower baseline suggests that larger, more capable models may be less susceptible to data quality issues, possibly because their extensive pre-training lets them disregard such errors. However, the study also reveals that GPT-4's performance is influenced by the number and quality of in-context examples, with a slight decrease in relevance and simplification when more examples are used, indicating a trade-off in prompt-based learning.
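The two reductions are easier to compare on a common scale. The short calculation below converts the reported per-summary rates into relative reductions; the helper name `relative_reduction` is illustrative, and the only inputs are the four figures quoted above.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a rate dropped, relative to its starting value."""
    if before <= 0:
        raise ValueError("'before' must be positive")
    return (before - after) / before


# Hallucinations per summary as reported: (original data, cleaned data).
reported = {
    "Llama 2 (fine-tuned)": (2.60, 1.55),
    "GPT-4 (few-shot)": (0.70, 0.40),
}

for model, (before, after) in reported.items():
    print(f"{model}: {before:.2f} -> {after:.2f} "
          f"({relative_reduction(before, after):.1%} fewer)")
# Llama 2 (fine-tuned): 2.60 -> 1.55 (40.4% fewer)
# GPT-4 (few-shot): 0.70 -> 0.40 (42.9% fewer)
```

In relative terms the two drops are close (about 40% and 43%), while the absolute drop is much larger for Llama 2 because it starts from a higher rate.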

Quantitative evaluation using metrics such as ROUGE, BERTScore, and SARI did not correlate well with human judgments of faithfulness and quality. For instance, the LED-large model, despite outperforming Llama 2 and GPT-4 on these automated metrics, yielded qualitatively inferior summaries, often mirroring the shortcomings of the original MIMIC summaries. This reinforces a recurring observation in NLP research: automated metrics often fail to capture the nuanced human perception of quality, especially in high-stakes domains like healthcare where factual accuracy and interpretability are paramount.
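A toy example shows why a lexical-overlap metric can reward an unfaithful summary. The code below is a simplified ROUGE-1 F1 (whitespace tokens, no stemming), not the official ROUGE implementation, and the three sentences are invented for illustration rather than taken from the dataset.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # per-token minimum of the counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Assume the source note says the patient received antibiotics.
reference = "you were treated for pneumonia with antibiotics for five days"
hallucinated = "you were treated for pneumonia with steroids for five days"
faithful = "you had a lung infection and received antibiotics"

print(f"hallucinated: {rouge1_f1(hallucinated, reference):.2f}")  # 0.90
print(f"faithful:     {rouge1_f1(faithful, reference):.2f}")      # 0.22
```

The summary that names the wrong drug scores 0.90 because it copies the reference wording, while the faithful lay paraphrase scores 0.22. A model trained to imitate reference summaries closely can therefore score well on such metrics without being more faithful.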

Qualitative Evaluation and Future Directions

A comprehensive qualitative evaluation, in which medical experts rated summaries for relevance, consistency, simplification, fluency, and coherence, highlighted GPT-4's superior performance across all dimensions, even in a zero-shot setting. It achieved high relevance scores (averaging one missing key fact per summary) and excellent ratings for simplification, fluency, and coherence (above 4.5 on a 5-point Likert scale), often surpassing the quality of human-written summaries. This suggests that LLMs, particularly advanced models like GPT-4, can produce patient-facing summaries that experts rate above existing human-authored ones, provided they are appropriately guided.

Figure 3: Qualitative evaluation for Llama 70B and GPT-4 5-shot trained and prompted on cleaned and improved data. We compared them to the MIMIC summaries, LED-large trained on all MIMIC data, and GPT-4 0-shot. 20 summaries were evaluated for each model by two medical experts.

The study also explored automatic hallucination detection using MedCAT for entity extraction and SapBERT embeddings. This approach performed poorly, underscoring the limitations of current medical entity-based methods for this complex task. GPT-4 performed considerably better at automatic hallucination detection, especially with class-aware prompting, although recall for certain hallucination types remained low. This indicates a promising but nascent area of research and a need for more sophisticated methods for reliable automatic detection.
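One plausible shape for an entity-and-embedding baseline is sketched below: every entity in the summary is flagged as unsupported if its best cosine similarity to any source entity falls under a threshold. This is an illustration under stated assumptions, not the paper's pipeline. Entity extraction is assumed to have been done already, the three-dimensional vectors are toy stand-ins for real embeddings, and the function names and the 0.8 threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_unsupported(summary_entities, source_entities, threshold=0.8):
    """Return summary entities with no sufficiently similar source entity."""
    flagged = []
    for name, vec in summary_entities.items():
        best = max(
            (cosine(vec, src_vec) for src_vec in source_entities.values()),
            default=0.0,  # an empty source supports nothing
        )
        if best < threshold:
            flagged.append(name)
    return flagged


# Toy embeddings (invented): entity name -> vector.
source = {"pneumonia": [1.0, 0.0, 0.0], "antibiotics": [0.0, 1.0, 0.0]}
summary = {"lung infection": [0.9, 0.1, 0.0], "steroids": [0.0, 0.3, 0.95]}

print(flag_unsupported(summary, source))  # ['steroids']
```

A method of this shape can only flag spans that the extractor recognizes as entities, so hallucinations expressed through ordinary wording rather than medical terms would pass unnoticed, which may partly explain the weak results.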

The findings have significant implications. Practically, the data-centric approach, even with a small number of carefully curated examples, proves effective for reducing hallucinations in LLM-generated medical summaries. This is particularly relevant for clinical applications where trust and accuracy are critical. Theoretically, the disparity between automated metrics and human judgment highlights a fundamental challenge in evaluating NLG systems in specialized domains. Future work should explore the integration of LLMs with different summary formats and conduct further clinical validation to assess their real-world impact on patient understanding and health outcomes. Such efforts will require interdisciplinary collaboration to ensure that AI systems in healthcare are not only performant but also safe, reliable, and genuinely beneficial to patients.
