
Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation (2505.10409v1)

Published 15 May 2025 in cs.CL

Abstract: Plain language summaries (PLSs) are essential for facilitating effective communication between clinicians and patients by making complex medical information easier for laypeople to understand and act upon. LLMs have recently shown promise in automating PLS generation, but their effectiveness in supporting health information comprehension remains unclear. Prior evaluations have generally relied on automated scores that do not measure understandability directly, or subjective Likert-scale ratings from convenience samples with limited generalizability. To address these gaps, we conducted a large-scale crowdsourced evaluation of LLM-generated PLSs using Amazon Mechanical Turk with 150 participants. We assessed PLS quality through subjective Likert-scale ratings focusing on simplicity, informativeness, coherence, and faithfulness; and objective multiple-choice comprehension and recall measures of reader understanding. Additionally, we examined the alignment between 10 automated evaluation metrics and human judgments. Our findings indicate that while LLMs can generate PLSs that appear indistinguishable from human-written ones in subjective evaluations, human-written PLSs lead to significantly better comprehension. Furthermore, automated evaluation metrics fail to reflect human judgment, calling into question their suitability for evaluating PLSs. This is the first study to systematically evaluate LLM-generated PLSs based on both reader preferences and comprehension outcomes. Our findings highlight the need for evaluation frameworks that move beyond surface-level quality and for generation methods that explicitly optimize for layperson comprehension.

Summary

  • The paper found that while LLM-generated plain language summaries (PLSs) received subjective ratings similar to human-authored ones, objective comprehension tests showed human-authored PLSs led to significantly better understanding.
  • Including background information in plain language summaries is crucial for enhancing layperson comprehension and was a strong predictor of understanding in the study.
  • Traditional automated metrics like BLEU and ROUGE failed to predict reader comprehension, whereas QA-based metrics such as QAEval, which emphasize factual consistency and relevance, aligned more closely with human understanding.

Evaluation of LLM-Generated Plain Language Summaries: Insights and Implications

The paper "Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation" presents a comprehensive analysis of the efficacy of LLMs in generating plain language summaries (PLSs) compared to those written by human experts. The paper undertakes a large-scale crowdsourced evaluation involving 150 participants to examine both subjective and objective metrics—simplicity, informativeness, coherence, and faithfulness—as well as comprehension outcomes and automated evaluation metrics.

Overview of Research Design and Methodology

The authors used Amazon Mechanical Turk to recruit a diverse pool of non-expert participants, emulating the lay audience that PLSs target in medical communication. The research draws on the CELLS dataset, which comprises over 63,000 pairs of scientific abstracts and human-authored PLSs; from this corpus, 50 abstract-PLS pairs were randomly sampled.

Six distinct LLM generation strategies, each optimizing for different criteria, were used with GPT-4 to produce PLSs. Their effectiveness was compared against human-authored PLSs through subjective ratings and objective comprehension tests. Results were analyzed with linear mixed-effects models to examine the relationship between human ratings and comprehension performance, as well as the alignment of automated metrics with human judgment.
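
As a rough illustration of this kind of analysis (a sketch under assumptions, not the authors' actual pipeline), the snippet below fits a linear mixed-effects model that predicts comprehension accuracy from subjective ratings and summary source, with random intercepts per participant; the column names and the synthetic data are invented for illustration.

```python
# Hypothetical sketch of a linear mixed-effects analysis in the spirit of the
# paper's setup; the data layout and column names are assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_participants, n_summaries = 30, 10

# One row per (participant, summary): subjective ratings on 1-5 Likert scales,
# the summary's source, and multiple-choice comprehension accuracy.
rows = []
for p in range(n_participants):
    for s in range(n_summaries):
        rows.append({
            "participant_id": p,
            "summary_id": s,
            "source": "human" if s % 2 == 0 else "llm",
            "simplicity": rng.integers(1, 6),
            "informativeness": rng.integers(1, 6),
            "comprehension_accuracy": rng.uniform(0.4, 1.0),
        })
df = pd.DataFrame(rows)

# Random intercepts per participant handle the repeated-measures design;
# fixed effects test whether ratings and summary source predict comprehension.
model = smf.mixedlm(
    "comprehension_accuracy ~ simplicity + informativeness + C(source)",
    data=df,
    groups=df["participant_id"],
)
print(model.fit().summary())
```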

Key Findings and Insights

Subjective vs. Objective Evaluations:

Participants rated LLM-generated summaries similarly to human-authored ones across dimensions like simplicity and coherence. Yet, objective measures of comprehension, especially multiple-choice question accuracy, revealed that human-authored PLSs facilitated significantly better understanding.

The Role of Background Information:

One striking finding was the importance of including background information in PLSs. This element was crucial for enhancing layperson comprehension and was a strong predictor of comprehension performance in the mixed-effects models. The result underscores the need for PLS authors to shift focus from mere simplification to enriching content with informative context.

Limitations of Automated Metrics:

Traditional automated metrics like BLEU and ROUGE proved inadequate for predicting reader comprehension and aligned poorly with human evaluations. Conversely, QAEval, a QA-based metric, showed a stronger association with comprehension, highlighting the need for metrics that prioritize factual consistency and content relevance over lexical overlap.
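
To make the alignment question concrete, here is a minimal sketch (not from the paper) of how one might test whether a lexical-overlap metric such as ROUGE-L tracks reader comprehension, using a rank correlation; the toy summaries and accuracy values are invented for illustration.

```python
# Illustrative check of metric-comprehension alignment: Spearman correlation
# between per-summary ROUGE-L scores and mean comprehension accuracy.
from rouge_score import rouge_scorer  # pip install rouge-score
from scipy.stats import spearmanr

# Toy stand-ins for human-written references, model-generated PLSs, and the
# mean multiple-choice accuracy observed for each generated PLS.
references = [
    "The drug lowered blood pressure in most participants.",
    "Exercise improved sleep quality for older adults.",
    "The vaccine reduced infection rates in children.",
]
candidates = [
    "Most participants saw lower blood pressure with the drug.",
    "Older adults slept better after starting an exercise program.",
    "Fewer children got infected after vaccination.",
]
comprehension = [0.82, 0.71, 0.90]  # hypothetical mean MCQ accuracy

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = [
    scorer.score(ref, cand)["rougeL"].fmeasure
    for ref, cand in zip(references, candidates)
]

rho, p_value = spearmanr(rouge_l, comprehension)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

Spearman rather than Pearson correlation is used in the sketch because Likert-derived ratings and accuracy scores are ordinal and need not be linearly related to metric scores.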

Implications for Future Research and Practice:

The results point to a clear need for evaluation frameworks that center on comprehension-focused protocols. There is also an evident need for generation methods that explicitly optimize for layperson understanding rather than surface-level linguistic features.

Speculation on Future Developments in AI

As LLMs continue to evolve, integrating advancements in domain-specific comprehension and factual consistency will likely become central to their applicative roles in health communication. Future AI models could benefit from training datasets enriched with contextual information, extending beyond traditional summarization techniques to incorporate reader-centric engagement strategies. The development of automated metrics that effectively capture the nuances of human comprehension without reliance on n-gram overlap is also anticipated to advance, providing more meaningful benchmarks for model assessment and improvement.

Conclusion

In summary, while LLMs hold potential in generating linguistically proficient PLSs, they currently fall short in enhancing true comprehension compared to human-authored summaries. This paper provides crucial insights and proposes a shift toward comprehension-based evaluation paradigms, advancing our understanding of effective communication strategies in health information dissemination. Through rigorous evaluation and innovative generation methods, future research can better align AI capabilities with the needs of lay audiences, ensuring accessibility and empowerment in healthcare knowledge.