- The paper introduces FACT PICO, a benchmark that assesses the factual accuracy of plain language summaries of randomized controlled trials (RCTs) at the level of individual PICO elements and added explanations.
- It reveals that traditional factuality metrics weakly correlate with expert judgments, while LLM-based, domain-specific evaluations show promising alignment.
- The study emphasizes the need for explainable, reliable AI methods in medicine to ensure accurate and accessible dissemination of complex medical evidence.
FACT PICO: A Fine-Grained Benchmark for Factuality Evaluation of Plain Language Summarization of Medical Evidence
The rise of large language models (LLMs), particularly in high-stakes domains like medicine, necessitates rigorous factuality checks to ensure the reliability of generated plain language summaries. Understanding the implications of non-factual summaries in medicine is paramount, given their direct impact on patient care and on how knowledge reaches non-experts. The paper introduces FACT PICO, a benchmark for assessing the factuality of plain language summaries of medical texts, focusing on Randomized Controlled Trials (RCTs). This fine-grained benchmark evaluates summaries generated by three well-known LLMs: GPT-4, Llama-2, and Alpaca, scoring them on the key RCT elements captured by the PICO framework (Populations, Interventions, Comparators, Outcomes) and on the correctness of any explanations the models add.
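To make the setup concrete, the sketch below shows how plain language summaries might be generated from RCT abstracts in a pipeline like this one. The prompt wording and the call_llm() helper are illustrative assumptions, not the prompts or code used in the paper.

```python
# Minimal sketch of plain language summary generation, assuming a generic
# LLM wrapper. call_llm() is a hypothetical placeholder for whichever API or
# local model (e.g., GPT-4, Llama-2, Alpaca) is being queried.

PLAIN_LANGUAGE_PROMPT = (
    "Summarize the following randomized controlled trial abstract in plain "
    "language that a layperson can understand. Keep the population, "
    "intervention, comparator, and outcomes factually accurate.\n\n"
    "Abstract:\n{abstract}\n\nPlain language summary:"
)

def call_llm(model_name: str, prompt: str) -> str:
    """Hypothetical wrapper around an LLM API or locally hosted model."""
    raise NotImplementedError("plug in your model client here")

def summarize_abstract(abstract: str, model_name: str = "gpt-4") -> str:
    """Build the summarization prompt and return the model's plain language summary."""
    prompt = PLAIN_LANGUAGE_PROMPT.format(abstract=abstract)
    return call_llm(model_name, prompt)
```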
Rationale and Background
Medical literature, particularly RCTs, forms the cornerstone of evidence-based medicine. However, the technical nature of these texts means they are largely accessible only to healthcare professionals and researchers. There is growing interest in using LLMs to bridge this gap by generating plain language summaries that make complex medical findings comprehensible to lay readers. This approach not only democratizes access to the latest medical research but may also enhance patient understanding of, and involvement in, their own care. Nevertheless, such technology raises pertinent questions about the factual accuracy of automatically generated summaries, a critical concern in the medical domain.
FACT PICO Benchmark Design
FACT PICO comprises 345 plain language summaries of RCT abstracts, each evaluated against fine-grained criteria covering the PICO elements and the factuality of additional explanatory content. The benchmark goes beyond simple factual validation: expert-written rationales accompany the scores, documenting the reasoning behind the factuality judgment for each summary element. This detailed structure is intended to encourage the development of explainable and reliable factuality evaluation methods for medical text summarization.
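For illustration, a single annotated record in a benchmark of this kind could be represented roughly as follows. The field names and rating vocabulary are assumptions made for the sketch; the released FACT PICO data defines its own schema.

```python
# Sketch of a FACT PICO-style annotation record. Field names and the rating
# vocabulary are illustrative assumptions, not the benchmark's actual schema.
from __future__ import annotations

from dataclasses import dataclass, field

@dataclass
class ElementRating:
    element: str    # "Population", "Intervention", "Comparator", "Outcome", or "Added explanation"
    rating: str     # e.g., "correct", "partially correct", "incorrect", "omitted"
    rationale: str  # expert-written free-text justification for the rating

@dataclass
class FactPicoRecord:
    abstract: str   # source RCT abstract
    summary: str    # model-generated plain language summary
    model: str      # e.g., "gpt-4", "llama-2", "alpaca"
    ratings: list[ElementRating] = field(default_factory=list)
```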
Evaluation Framework and Findings
Using FACT PICO, the paper measures how well existing factuality metrics track expert judgment and proposes new LLM-based evaluations as alternatives. The comparative analysis reveals that existing metrics correlate only weakly with expert judgments at the instance level, exposing the shortcomings of current evaluation approaches. Notably, LLM-based metrics that incorporate domain-specific knowledge align more closely with expert evaluations, underscoring the need for targeted evaluation frameworks in specialized fields like medicine.
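The instance-level comparison described above boils down to a rank correlation between automatic metric scores and expert ratings. The snippet below is a minimal sketch of that analysis; the toy numbers and the specific choice of Kendall's tau and Spearman's rho are assumptions for illustration, not the paper's exact protocol.

```python
# Minimal sketch of instance-level correlation between an automatic factuality
# metric and expert ratings, using rank correlation from SciPy.
from scipy.stats import kendalltau, spearmanr

# One score per summary from an automatic metric (higher = judged more factual)...
metric_scores = [0.82, 0.41, 0.67, 0.90, 0.33]
# ...and the corresponding expert factuality ratings on an ordinal scale.
expert_ratings = [3, 1, 2, 3, 2]

tau, tau_p = kendalltau(metric_scores, expert_ratings)
rho, rho_p = spearmanr(metric_scores, expert_ratings)

print(f"Kendall tau = {tau:.2f} (p = {tau_p:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3f})")
```

Weak correlations in such an analysis indicate that a metric's per-summary scores do not reliably track which summaries experts consider more factual, even if its aggregate scores appear reasonable.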
Implications and Future Directions
The FACT PICO benchmark not only sets a new standard for assessing the factuality of medical text summarization but also opens avenues for subsequent research in the field. Its emphasis on explainability and fine-grained analysis paves the way for more nuanced and reliable evaluation methods. Moreover, the insights gained from applying FACT PICO underline the critical need for advancements in summarization technologies that can faithfully represent complex medical information in an accessible format. As the landscape of generative AI continues to evolve, benchmarks like FACT PICO will play a pivotal role in guiding the development of ethically responsible and factually accurate AI applications in medicine.
Conclusion
In conclusion, FACT PICO emerges as a critical tool in the ongoing effort to enhance the accessibility and reliability of medical knowledge dissemination through AI-driven summarization. Its comprehensive evaluation framework not only addresses the immediate need for rigorous factuality checks in medical text summarization but also sets the stage for future advancements in AI's application to medicine. As we move forward, the continued refinement and adoption of such benchmarks will be essential in harnessing the full potential of AI to serve the public good in healthcare and beyond.