Benchmarking Large Language Models for News Summarization (2301.13848v1)

Published 31 Jan 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human written summaries.

Evaluating LLMs for News Summarization

The paper "Benchmarking Large Language Models for News Summarization" provides a comprehensive analysis of the efficacy of LLMs for automatic news summarization. It rigorously evaluates ten diverse LLMs through detailed human evaluations, focusing on the impact of pretraining methodology, model scale, and especially instruction tuning.

Key Findings

The analysis reveals that instruction tuning, rather than sheer model scale, is what drives strong zero-shot summarization. This is demonstrated by a 350M-parameter instruction-tuned GPT-3 performing comparably to a much larger 175B-parameter variant that lacks such tuning, challenging the conventional assumption that larger models naturally yield better results.
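In practice, the zero-shot setting studied here amounts to prompting an instruction-tuned model with a plain-language request. The sketch below illustrates this with the openly available google/flan-t5-base as a hypothetical stand-in for the GPT-3 variants in the paper; the prompt wording is an assumption, not the authors' template.

```python
from transformers import pipeline

# Hypothetical stand-in model: a small, openly available instruction-tuned
# model used purely for illustration (the paper prompts GPT-3 variants).
summarizer = pipeline("text2text-generation", model="google/flan-t5-base")

article = "(full news article text goes here)"  # placeholder input
prompt = f"Article: {article}\n\nSummarize the article above in three sentences."

# Zero-shot: no examples in the prompt, just the instruction.
summary = summarizer(prompt, max_new_tokens=128)[0]["generated_text"]
print(summary)
```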

The paper also highlights that existing benchmarks often suffer from low-quality reference summaries, which leads to misleading performance estimates and understates both human performance and the benefits of few-shot and finetuning approaches. By collecting high-quality summaries written by skilled freelance writers, the paper establishes a more reliable basis for comparing LLM outputs with human-written summaries. Contrary to some current perceptions, LLM-generated summaries are judged to be on par with human-written ones despite noticeable stylistic differences, such as the amount of paraphrasing.

Methodological Approach

The research evaluates LLMs on the CNN/Daily Mail (CNN/DM) and XSUM datasets, noting the pitfalls posed by their poor-quality reference summaries. Instruction-tuned LLMs, such as variants of GPT-3, frequently outperform traditional models fine-tuned on these references, producing more coherent and relevant summaries. However, the paper underscores that such comparisons can be skewed because the reference summaries themselves are suboptimal benchmarks.
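For reference, both benchmarks are available on the Hugging Face Hub; the short sketch below (not the authors' code) loads each test split and prints an article alongside its reference summary, which is where the quality issues noted above become visible.

```python
from datasets import load_dataset

# Public Hub identifiers for the two benchmarks discussed in the paper.
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="test")
xsum = load_dataset("xsum", split="test")

sample = cnn_dm[0]
print(sample["article"][:300])   # source news article
print(sample["highlights"])      # reference "summary": bullet-style highlights

sample = xsum[0]
print(sample["document"][:300])  # source news article
print(sample["summary"])         # single-sentence reference summary
```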

The research adopts a systematic human evaluation protocol, recruiting annotators to judge faithfulness, coherence, and relevance on a Likert scale. This provides a robust view of quality beyond standard automated metrics such as ROUGE-L and METEOR, which correlate poorly with human judgments, especially when the references themselves are poor.
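The weak metric-human correlation described above can be checked with a small meta-evaluation: score each system summary with an automatic metric, then correlate those scores with the annotators' Likert ratings. The sketch below uses the rouge_score and scipy packages; the summaries and ratings are hypothetical placeholders, not data from the paper.

```python
from rouge_score import rouge_scorer
from scipy.stats import kendalltau

# ROUGE-L F-measure between a reference and a candidate summary.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l(reference: str, candidate: str) -> float:
    return scorer.score(reference, candidate)["rougeL"].fmeasure

# Hypothetical placeholders: reference summaries, system summaries, and mean
# 1-5 Likert relevance ratings averaged over annotators.
references = [
    "The city council approved the new transit budget on Tuesday.",
    "Researchers reported a decline in regional bee populations.",
    "The team won the championship after a penalty shootout.",
]
candidates = [
    "On Tuesday the council passed the city's transit budget.",
    "A new study found fewer bees across the region.",
    "Fans celebrated downtown after the match ended.",
]
human_scores = [4.5, 4.0, 2.5]

metric_scores = [rouge_l(r, c) for r, c in zip(references, candidates)]
tau, p_value = kendalltau(metric_scores, human_scores)
print(f"Kendall tau (ROUGE-L vs. human relevance): {tau:.2f}, p={p_value:.2f}")
```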

Implications and Future Directions

The implications of this research extend to both practical and theoretical realms. Practically, it suggests shifting effort toward better instruction tuning rather than merely increasing model size. Theoretically, it challenges the assumption that model scale is the primary determinant of output quality, indicating that the task-specific signal imparted by instruction tuning carries significant weight.

Furthermore, the paper highlights avenues for future work, such as enhancing training datasets and refining evaluation metrics to align more closely with human judgment. It underlines the need to move beyond low-quality references when assessing models, advocating benchmarks built from representative, human-written summaries. How instruction tuning and multi-task learning shape model behavior also warrants further investigation.

In closing, this paper casts a critical light on LLMs' capabilities, emphasizing a nuanced approach to their evaluation, one cognizant of data-quality intricacies and the potential of instruction tuning. As the area develops, these insights will be crucial in guiding the next stages of LLM development and their application to automatic summarization tasks.

Authors (6)
  1. Tianyi Zhang (262 papers)
  2. Faisal Ladhak (31 papers)
  3. Esin Durmus (38 papers)
  4. Percy Liang (239 papers)
  5. Kathleen McKeown (85 papers)
  6. Tatsunori B. Hashimoto (23 papers)
Citations (378)