
A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models

Published 23 Feb 2024 in cs.CL, cs.AI, and cs.LG | (2402.15422v2)

Abstract: Patients often face difficulties in understanding their hospitalizations, while healthcare workers have limited resources to provide explanations. In this work, we investigate the potential of LLMs to generate patient summaries based on doctors' notes and study the effect of training data on the faithfulness and quality of the generated summaries. To this end, we release (i) a rigorous labeling protocol for errors in medical texts and (ii) a publicly available dataset of annotated hallucinations in 100 doctor-written and 100 generated summaries. We show that fine-tuning on hallucination-free data effectively reduces hallucinations from 2.60 to 1.55 per summary for Llama 2, while preserving relevant information. We observe a similar effect on GPT-4 (0.70 to 0.40), when the few-shot examples are hallucination-free. We also conduct a qualitative evaluation using hallucination-free and improved training data. We find that common quantitative metrics do not correlate well with faithfulness and quality. Finally, we test GPT-4 for automatic hallucination detection, which clearly outperforms common baselines.

References (61)
  1. What’s in a Summary? Laying the Groundwork for Advances in Hospital-Course Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4794–4811, Online, June 2021. Association for Computational Linguistics. 10.18653/v1/2021.naacl-main.382. URL https://aclanthology.org/2021.naacl-main.382.
  2. Learning to Revise References for Faithful Summarization, October 2022. URL http://arxiv.org/abs/2204.10290. arXiv:2204.10290 [cs].
  3. A Meta-Evaluation of Faithfulness Metrics for Long-Form Hospital-Course Summarization, March 2023. URL http://arxiv.org/abs/2303.03948. arXiv:2303.03948 [cs].
  4. Focus Attention: Promoting Faithfulness and Diversity in Summarization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6078–6095, Online, August 2021. Association for Computational Linguistics. 10.18653/v1/2021.acl-long.474. URL https://aclanthology.org/2021.acl-long.474.
  5. Large language models in simplifying radiological reports: systematic review. preprint, Radiology and Imaging, January 2024. URL http://medrxiv.org/lookup/doi/10.1101/2024.01.05.24300884.
  6. Evaluation of drug information for cardiology patients. British Journal of Clinical Pharmacology, 31(5):525–531, 1991. ISSN 1365-2125. 10.1111/j.1365-2125.1991.tb05574.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1365-2125.1991.tb05574.x.
  7. Interventions to Improve Communication at Hospital Discharge and Rates of Readmission: A Systematic Review and Meta-analysis. JAMA Network Open, 4(8):e2119346, August 2021. ISSN 2574-3805. 10.1001/jamanetworkopen.2021.19346. URL https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2783547.
  8. Longformer: The Long-Document Transformer, December 2020. URL http://arxiv.org/abs/2004.05150. arXiv:2004.05150 [cs].
  9. Lukas Biewald. Experiment tracking with weights and biases. Software available from wandb.com, 2:233, 2020.
  10. Generation of Patient After-Visit Summaries to Support Physicians. In Proceedings of the 29th International Conference on Computational Linguistics, pages 6234–6247, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics. URL https://aclanthology.org/2022.coling-1.544.
  11. Shuyang Cao and Lu Wang. CLIFF: Contrastive Learning for Improving Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6633–6649, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. 10.18653/v1/2021.emnlp-main.532. URL https://aclanthology.org/2021.emnlp-main.532.
  12. Faithful to the original: fact-aware neural abstractive summarization. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18, pages 4784–4791, New Orleans, Louisiana, USA, February 2018. AAAI Press. ISBN 978-1-57735-800-8.
  13. Effect of personalised, mobile-accessible discharge instructions for patients leaving the emergency department: A randomised controlled trial. Emergency Medicine Australasia, 32(6):967–973, 2020. ISSN 1742-6723. 10.1111/1742-6723.13516. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/1742-6723.13516.
  14. SummEval: Re-evaluating Summarization Evaluation, February 2021. URL http://arxiv.org/abs/2007.12626. arXiv:2007.12626 [cs].
  15. Challenges optimizing the after visit summary. International journal of medical informatics, 120:14–19, December 2018. ISSN 1386-5056. 10.1016/j.ijmedinf.2018.09.009. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6326571/.
  16. Qlarify: Bridging Scholarly Abstracts and Papers with Recursively Expandable Summaries, October 2023. URL http://arxiv.org/abs/2310.07581. arXiv:2310.07581 [cs].
  17. Using Health Literacy and Learning Style Preferences to Optimize the Delivery of Health Information. Journal of Health Communication, 17(sup3):122–140, October 2012. ISSN 1081-0730, 1087-0415. 10.1080/10810730.2012.712610. URL http://www.tandfonline.com/doi/abs/10.1080/10810730.2012.712610.
  18. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation, 101(23), June 2000. ISSN 0009-7322, 1524-4539. 10.1161/01.CIR.101.23.e215. URL https://www.ahajournals.org/doi/10.1161/01.CIR.101.23.e215.
  19. Evaluating the Use of ChatGPT to Accurately Simplify Patient-centered Information about Breast Cancer Prevention and Screening. Radiology: Imaging Cancer, 6(2):e230086, March 2024. 10.1148/rycan.230086. URL https://pubs.rsna.org/doi/abs/10.1148/rycan.230086. Publisher: Radiological Society of North America.
  20. Karen S. Hayes. Randomized Trial of Geragogy-Based Medication Instruction in the Emergency Department. Nursing Research, 47(4):211, August 1998. ISSN 0029-6562. URL https://journals.lww.com/nursingresearchonline/fulltext/1998/07000/randomized_trial_of_geragogy_based_medication.6.aspx.
  21. MedTator: a serverless annotation tool for corpus development. Bioinformatics, 38(6):1776–1778, March 2022. ISSN 1367-4803. 10.1093/bioinformatics/btab880. URL https://doi.org/10.1093/bioinformatics/btab880.
  22. Teaching Machines to Read and Comprehend. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper_files/paper/2015/hash/afdec7005cc9f14302cd0474fd0f3c96-Abstract.html.
  23. Quality of Discharge Practices and Patient Understanding at an Academic Medical Center. JAMA Internal Medicine, August 2013. ISSN 2168-6106. 10.1001/jamainternmed.2013.9318. URL http://archinte.jamanetwork.com/article.aspx?doi=10.1001/jamainternmed.2013.9318.
  24. LoRA: Low-Rank Adaptation of Large Language Models, October 2021. URL http://arxiv.org/abs/2106.09685. arXiv:2106.09685 [cs].
  25. The Factual Inconsistency Problem in Abstractive Text Summarization: A Survey, April 2023. URL http://arxiv.org/abs/2104.14839. arXiv:2104.14839 [cs].
  26. MIMIC-IV-Note: Deidentified free-text clinical notes, January 2023. URL https://physionet.org/content/mimic-iv-note/2.2/.
  27. Patients’ Understanding of Their Hospitalizations and Association With Satisfaction. JAMA Internal Medicine, 174(10):1698–1700, October 2014. ISSN 2168-6106. 10.1001/jamainternmed.2014.3765. URL https://doi.org/10.1001/jamainternmed.2014.3765.
  28. Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit. Artificial Intelligence in Medicine, 117:102083, July 2021. ISSN 0933-3657. 10.1016/j.artmed.2021.102083. URL https://www.sciencedirect.com/science/article/pii/S0933365721000762.
  29. Klaus Krippendorff. Content Analysis: An Introduction to Its Methodology. SAGE Publications, May 2018. ISBN 978-1-5063-9567-8. Google-Books-ID: nE1aDwAAQBAJ.
  30. Evaluating the Factual Consistency of Abstractive Text Summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346, Online, November 2020. Association for Computational Linguistics. 10.18653/v1/2020.emnlp-main.750. URL https://aclanthology.org/2020.emnlp-main.750.
  31. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online, July 2020. Association for Computational Linguistics. 10.18653/v1/2020.acl-main.703. URL https://aclanthology.org/2020.acl-main.703.
  32. Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013.
  33. Effect of a patient-directed discharge letter on patient understanding of their hospitalisation. Internal Medicine Journal, 44(9):851–857, 2014. ISSN 1445-5994. 10.1111/imj.12482. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/imj.12482.
  34. Self-Alignment Pretraining for Biomedical Entity Representations. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4228–4238, Online, June 2021. Association for Computational Linguistics. 10.18653/v1/2021.naacl-main.334. URL https://aclanthology.org/2021.naacl-main.334.
  35. Laurens van der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605, 2008. ISSN 1533-7928. URL http://jmlr.org/papers/v9/vandermaaten08a.html.
  36. On Faithfulness and Factuality in Abstractive Summarization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online, July 2020. Association for Computational Linguistics. 10.18653/v1/2020.acl-main.173. URL https://aclanthology.org/2020.acl-main.173.
  37. Fine-grained Hallucination Detection and Editing for Language Models, January 2024. URL http://arxiv.org/abs/2401.06855. arXiv:2401.06855 [cs].
  38. Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5739–5754, Dublin, Ireland, May 2022. Association for Computational Linguistics. 10.18653/v1/2022.acl-long.394. URL https://aclanthology.org/2022.acl-long.394.
  39. Readability of patient discharge instructions with and without the use of electronically available disease-specific templates. Journal of the American Medical Informatics Association, 22(4):857–863, July 2015. ISSN 1527-974X, 1067-5027. 10.1093/jamia/ocv005. URL https://academic.oup.com/jamia/article/22/4/857/1746200.
  40. Entity-level Factual Consistency of Abstractive Text Summarization. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2727–2733, Online, April 2021. Association for Computational Linguistics. 10.18653/v1/2021.eacl-main.235. URL https://aclanthology.org/2021.eacl-main.235.
  41. GPT-4 Technical Report, December 2023. URL http://arxiv.org/abs/2303.08774. arXiv:2303.08774 [cs].
  42. Patient-Reported Use of the After Visit Summary in a Primary Care Internal Medicine Practice. Journal of Patient Experience, 7(5):703–707, October 2020. ISSN 2374-3735. 10.1177/2374373519879286. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7705830/.
  43. Carswella Phillips. Relationships between workload perception, burnout, and intent to leave among medical–surgical nurses. JBI Evidence Implementation, 18(2):265, June 2020. ISSN 2691-3321. 10.1097/XEB.0000000000000220. URL https://journals.lww.com/ijebh/abstract/2020/06000/relationships_between_workload_perception,.11.aspx.
  44. SemEval-2013 Task 9 : Extraction of Drug-Drug Interactions from Biomedical Texts (DDIExtraction 2013). In Suresh Manandhar and Deniz Yuret, editors, Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 341–350, Atlanta, Georgia, USA, June 2013. Association for Computational Linguistics. URL https://aclanthology.org/S13-2056.
  45. WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia, October 2023. URL http://arxiv.org/abs/2305.14292. arXiv:2305.14292 [cs].
  46. A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems. In Brian Davis, Yvette Graham, John Kelleher, and Yaji Sripada, editors, Proceedings of the 13th International Conference on Natural Language Generation, pages 158–168, Dublin, Ireland, December 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.inlg-1.22.
  47. Generation Challenges: Results of the Accuracy Evaluation Shared Task, August 2021. URL http://arxiv.org/abs/2108.05644. arXiv:2108.05644 [cs].
  48. Evaluating factual accuracy in complex data-to-text. Computer Speech & Language, 80:101482, May 2023. ISSN 0885-2308. 10.1016/j.csl.2023.101482. URL https://www.sciencedirect.com/science/article/pii/S0885230823000013.
  49. Fine-tuning Language Models for Factuality, November 2023. URL http://arxiv.org/abs/2311.08401. arXiv:2311.08401 [cs].
  50. Llama 2: Open Foundation and Fine-Tuned Chat Models, July 2023. URL http://arxiv.org/abs/2307.09288. arXiv:2307.09288 [cs].
  51. RadAdapt: Radiology Report Summarization via Lightweight Domain Adaptation of Large Language Models, May 2023a. URL http://arxiv.org/abs/2305.01146. arXiv:2305.01146 [cs].
  52. Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts, October 2023b. URL http://arxiv.org/abs/2309.07430. arXiv:2309.07430 [cs].
  53. Impact of physician workload on burnout in the emergency department. Psychology, Health & Medicine, 24(4):414–428, April 2019. ISSN 1354-8506. 10.1080/13548506.2018.1539236. URL https://doi.org/10.1080/13548506.2018.1539236. Publisher: Taylor & Francis.
  54. Unsupervised Clinical Language Translation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, pages 3121–3131, New York, NY, USA, July 2019. Association for Computing Machinery. ISBN 978-1-4503-6201-6. 10.1145/3292500.3330710. URL https://dl.acm.org/doi/10.1145/3292500.3330710.
  55. HuggingFace’s Transformers: State-of-the-art Natural Language Processing, July 2020. URL http://arxiv.org/abs/1910.03771. arXiv:1910.03771 [cs].
  56. FactReranker: Fact-guided Reranker for Faithful Radiology Report Summarization, March 2023. URL http://arxiv.org/abs/2303.08335. arXiv:2303.08335 [cs].
  57. Optimizing Statistical Machine Translation for Text Simplification. Transactions of the Association for Computational Linguistics, 4:401–415, 2016. 10.1162/tacl_a_00107. URL https://aclanthology.org/Q16-1029. Place: Cambridge, MA Publisher: MIT Press.
  58. BERTScore: Evaluating Text Generation with BERT, February 2020a. URL http://arxiv.org/abs/1904.09675. arXiv:1904.09675 [cs].
  59. Optimizing the Factual Correctness of a Summary: A Study of Summarizing Radiology Reports. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5108–5120, Online, July 2020b. Association for Computational Linguistics. 10.18653/v1/2020.acl-main.458. URL https://aclanthology.org/2020.acl-main.458.
  60. Detecting Hallucinated Content in Conditional Neural Sequence Generation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1393–1404, Online, August 2021. Association for Computational Linguistics. 10.18653/v1/2021.findings-acl.120. URL https://aclanthology.org/2021.findings-acl.120.
  61. LIMA: Less Is More for Alignment. In Thirty-seventh Conference on Neural Information Processing Systems, November 2023. URL https://openreview.net/forum?id=KBMOKmX2he.

Summary

  • The paper introduces a data-centric method that uses detailed hallucination annotation to curate a high-quality dataset for training LLMs.
  • The study demonstrates that fine-tuning models on hallucination-free training data can substantially reduce hallucinations while preserving essential medical information.
  • The evaluation reveals that automatic metrics like ROUGE often fail to capture human judgments of quality, emphasizing the need for expert evaluation in clinical settings.

This essay provides a detailed summary and practical implications of the paper "A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models" (2402.15422). The authors investigate the application of LLMs to generate patient summaries from doctors' notes, emphasizing the impact of training data quality on faithfulness and overall quality. The study introduces a rigorous protocol for hallucination annotation, which is then used to curate a high-quality dataset for fine-tuning LLMs.

The Imperative for Patient-Oriented Summaries

The medical community faces a persistent challenge wherein a substantial portion of patients struggle to comprehend their hospitalization experiences and subsequent care instructions. This comprehension gap has demonstrable negative consequences, including increased hospital readmission rates and reduced adherence to treatment regimens. Patient-oriented summaries, phrased in layperson language and encapsulating critical medical facts, represent a potential intervention to bridge this gap. However, the manual creation of such high-quality summaries is a labor-intensive endeavor, exacerbating the already heavy workload of healthcare professionals.

LLMs have shown considerable promise in various natural language generation (NLG) tasks, including medical summarization. Nevertheless, a significant obstacle to their deployment in clinical settings is their propensity for generating "hallucinations"—factual inaccuracies or information not supported by the source text. In medical contexts, such inaccuracies are particularly critical due to potential patient harm. The fragmented nature of clinical datasets, which may not always contain the complete patient history, further complicates this issue, as human-written summaries in these datasets may themselves contain unsupported information, leading to the propagation of such artifacts when used for model training.

Data Curation and Hallucination Annotation Protocol

To address the challenge of hallucinations, the authors adopt a data-centric approach, focusing on improving the quality of training data rather than solely refining model architectures or training algorithms. They construct a filtered dataset, MIMIC-IV-Note-DI, from the MIMIC-IV-Note dataset, comprising 100,175 hospital course and patient summary pairs. A crucial aspect of this work is the development and application of a detailed annotation protocol for identifying hallucinations in patient summaries. This protocol, adapted from prior work, uses eleven labels: nine subcategories of unsupported facts (e.g., condition, procedure, medication, time) and two general error labels, contradicted facts and incorrect facts. Unlike previous approaches that might consider external knowledge, the protocol treats the Brief Hospital Course (BHC) section of the doctor's note as the sole ground truth for factual accuracy, which streamlines the annotation process for medical experts.

Figure 1: A synthetic MIMIC example labeled with the developed annotation protocol for hallucinations. The protocol was adapted from [46] and we used eleven different labels.

Two medical students, having completed their second state examination, independently annotated 100 real-world MIMIC summaries and 100 machine-generated summaries using this protocol. The annotation process, including training and consensus discussions, revealed the inherent complexity and subjectivity of hallucination detection. Notably, "word unsupported" was the most frequent hallucination type in both human-written and generated summaries, indicating nuanced errors beyond simple entity misrepresentations. The inter-annotator agreement statistics underscore this difficulty, with F1-scores for span overlap remaining moderate even when ignoring class labels. This highlights the limitations of purely automatic detection and the continued necessity of expert human evaluation for robust quality assessment.
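To make the span-overlap agreement measure concrete, here is a minimal sketch of how an F1-score between two annotators could be computed, with and without class labels. The matching rule (two spans match if they share at least one character), the tuple layout, and the label strings are assumptions for illustration; the summary does not state the exact criterion the authors used.

```python
from typing import List, Tuple

# (start offset, end offset exclusive, label); layout is an assumption
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """True if the two character ranges share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def span_f1(ann_a: List[Span], ann_b: List[Span], use_labels: bool = True) -> float:
    """F1 between two annotators, treating annotator A as the reference."""
    def matched(span: Span, others: List[Span]) -> bool:
        return any(
            overlaps(span, o) and (not use_labels or span[2] == o[2])
            for o in others
        )

    if not ann_a and not ann_b:
        return 1.0  # both annotators found nothing: perfect agreement
    precision = sum(matched(s, ann_a) for s in ann_b) / len(ann_b) if ann_b else 0.0
    recall = sum(matched(s, ann_b) for s in ann_a) / len(ann_a) if ann_a else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations of one summary (offsets and labels are made up).
ann_a = [(10, 25, "unsupported_medication"), (40, 52, "unsupported_word")]
ann_b = [
    (12, 20, "unsupported_condition"),
    (41, 50, "unsupported_word"),
    (70, 80, "unsupported_time"),
]

print(round(span_f1(ann_a, ann_b, use_labels=True), 2))   # 0.4
print(round(span_f1(ann_a, ann_b, use_labels=False), 2))  # 0.8
```

In this toy case the score doubles once labels are ignored, because one pair of spans overlaps but carries different labels. The third span, marked by only one annotator, still counts against agreement in both settings.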

Impact of Data Quality on LLM Performance

The study demonstrates a clear link between the quality of training data and the faithfulness of generated summaries. Fine-tuning Llama 2 70B on a "cleaned" dataset, in which identified hallucinations were manually removed or replaced, reduced hallucinations from 2.60 to 1.55 per summary while preserving the number of key facts.

Figure 2: Generated summaries from Llama 70B trained on 100 original and 100 cleaned examples for synthetic context given in Figure 1.

GPT-4 showed a similar effect: hallucinations fell from 0.70 to 0.40 per summary when the few-shot examples were hallucination-free. Its much lower baseline suggests that larger, more capable models may be less susceptible to data quality issues, possibly because of their extensive pre-training. The study also reports that GPT-4's output is influenced by the number and quality of in-context examples, with a slight decrease in relevance and simplification when more examples are used, indicating a trade-off in prompt-based learning.
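The per-summary hallucination counts quoted for the two models start from very different baselines, so it can help to compare them as relative reductions. The short calculation below uses only the four figures reported in the abstract; the function name and dictionary keys are illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original hallucination rate that was removed."""
    return (before - after) / before


# Hallucinations per summary (original data, cleaned data), from the abstract.
rates = {
    "Llama 2 (fine-tuned)": (2.60, 1.55),
    "GPT-4 (few-shot)": (0.70, 0.40),
}

for model, (before, after) in rates.items():
    print(
        f"{model}: {before - after:.2f} fewer per summary, "
        f"{relative_reduction(before, after):.1%} relative reduction"
    )
# Llama 2 (fine-tuned): 1.05 fewer per summary, 40.4% relative reduction
# GPT-4 (few-shot): 0.30 fewer per summary, 42.9% relative reduction
```

The absolute drop is larger for Llama 2, while the relative drop is roughly the same for both models.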

Quantitative evaluation using metrics such as ROUGE, BERTScore, and SARI did not correlate well with human judgments of faithfulness and quality. For instance, the LED-large model, despite outperforming Llama 2 and GPT-4 on these automated metrics, yielded qualitatively inferior summaries, often mirroring the shortcomings of the original MIMIC summaries. This reinforces a recurring observation in NLP research: automated metrics often fail to capture the nuanced human perception of quality, especially in high-stakes domains like healthcare where factual accuracy and interpretability are paramount.
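A toy example may help show why word-overlap metrics can diverge from faithfulness. The sketch below is a simplified unigram-overlap F1 in the spirit of ROUGE-1 (whitespace tokenization, no stemming), not the official ROUGE package, and the three sentences are invented for illustration rather than taken from the paper's data.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 style F1: clipped unigram overlap on lowercased tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented sentences: the first candidate changes two facts, the second
# paraphrases the reference without changing any fact.
reference = "you were treated for pneumonia with antibiotics and went home after five days"
hallucinated = "you were treated for pneumonia with steroids and went home after two days"
faithful = "we gave you antibiotics for a lung infection and you left the hospital on day five"

print(round(rouge1_f1(hallucinated, reference), 3))  # 0.846
print(round(rouge1_f1(faithful, reference), 3))      # 0.345
```

Here the candidate with the wrong drug and the wrong length of stay scores far higher than the faithful paraphrase, because the metric rewards shared wording rather than supported facts.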

Qualitative Evaluation and Future Directions

A qualitative evaluation, in which medical experts rated summaries for relevance, consistency, simplification, fluency, and coherence, highlighted GPT-4's superior performance across all dimensions, even in a zero-shot setting. It achieved high relevance scores (averaging one missing key fact per summary) and excellent ratings for simplification, fluency, and coherence (above 4.5 on a 5-point Likert scale), often surpassing the quality of human-written summaries. This suggests that LLMs, particularly advanced models like GPT-4, can produce patient-facing summaries that experts rate above the existing human-authored ones, provided they are appropriately guided.

Figure 3: Qualitative evaluation for Llama 70B and GPT-4 5-shot trained and prompted on cleaned and improved data. We compared them to the MIMIC summaries, LED-large trained on all MIMIC data, and GPT-4 0-shot. 20 summaries were evaluated for each model by two medical experts.

The study also explored automatic hallucination detection using MedCAT for entity extraction and SapBERT embeddings. This approach yielded poor performance, underscoring the limitations of current medical entity-based methods for this complex task. GPT-4, however, demonstrated superior performance in automatic hallucination detection, especially with class-aware prompting, although recall for certain hallucination types remained low. This indicates a promising but nascent area of research, emphasizing the need for more sophisticated methods for reliable automatic detection.
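One plausible reading of an entity-based baseline is sketched below: each entity in the summary is compared against the entities in the source note, and it is flagged as unsupported when no source entity is similar enough. This is an assumption about the general technique, not the authors' pipeline. In a real system the entity names would come from an extractor such as MedCAT and the vectors from an encoder such as SapBERT; here both are replaced by hand-made three-dimensional toy vectors, and the 0.8 threshold is arbitrary.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def flag_unsupported(
    summary_entities: Dict[str, Sequence[float]],
    source_entities: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # arbitrary cut-off chosen for this sketch
) -> List[str]:
    """Return summary entities whose best match in the source is below threshold."""
    flagged = []
    for name, vec in summary_entities.items():
        best = max((cosine(vec, s) for s in source_entities.values()), default=0.0)
        if best < threshold:
            flagged.append(name)
    return flagged


# Toy embeddings standing in for real encoder output (values are made up).
source = {"pneumonia": [0.9, 0.1, 0.0], "antibiotics": [0.1, 0.9, 0.1]}
summary = {"lung infection": [0.85, 0.2, 0.05], "steroids": [0.0, 0.2, 0.95]}

print(flag_unsupported(summary, source))  # ['steroids']
```

A check of this kind only sees entity mentions. It cannot catch errors carried by ordinary words, which is consistent with the observation that "word unsupported" was the most frequent hallucination type.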

The findings have significant implications. Practically, the data-centric approach, even with a small number of carefully curated examples, proves effective for reducing hallucinations in LLM-generated medical summaries. This is particularly relevant for clinical applications where trust and accuracy are critical. Theoretically, the disparity between automated metrics and human judgment highlights a fundamental challenge in evaluating NLG systems in specialized domains. Future work should explore the integration of LLMs with different summary formats and conduct further clinical validation to assess their real-world impact on patient understanding and health outcomes. Such efforts will require interdisciplinary collaboration to ensure that AI systems in healthcare are not only performant but also safe, reliable, and genuinely beneficial to patients.
