An Evaluation of LLMs for News Summarization
In the challenging field of news summarization, this paper provides an extensive benchmark of 20 recent large language models (LLMs), with a particular focus on small models. Through a series of well-designed experiments, the authors test these models across three datasets: CNN/Daily Mail (CNN/DM), Newsroom, and Extreme Summarization (XSum). The paper covers both zero-shot and few-shot learning settings and employs diverse evaluation methods—automatic, human, and LLM-based—to comprehensively assess the models' capabilities.
Key Findings and Methodological Approach
Dataset Characteristics and Challenges: The paper identifies several quality and consistency issues within the gold summaries of these datasets, which inhibit accurate performance evaluation and model fine-tuning. Despite these challenges, the datasets—large in scale and commonly used in news summarization research—remain integral benchmarks in the field. Notably, the authors observe that the CNN/DM dataset encourages more extractive summarization styles, while the XSum dataset demands the generation of highly condensed one-sentence summaries.
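To make the stylistic contrast concrete, the sketch below loads one reference summary from each of CNN/DM and XSum via the Hugging Face datasets library; the dataset IDs and field names are the public hub defaults and are assumptions here, not details taken from the paper.

```python
# Minimal sketch: contrasting the reference-summary styles of CNN/DM and XSum.
# Assumes the Hugging Face `datasets` library and the public dataset IDs;
# depending on the `datasets` version, XSum may require trust_remote_code=True.
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="test")
xsum = load_dataset("EdinburghNLP/xsum", split="test", trust_remote_code=True)

# CNN/DM references are multi-sentence "highlights" that track the article closely;
# XSum references are single, highly compressed sentences.
print(cnn_dm[0]["highlights"])
print(xsum[0]["summary"])
```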
Model Performance: The OpenAI GPT models, despite their considerable computational demands, perform consistently well across all experiments. GPT-3.5-Turbo, in particular, achieves the highest automatic metric scores in several instances, although human evaluators often favor GPT-4 for its semantic robustness and overall summary quality. Among the smaller models evaluated, Qwen1.5-7B and SOLAR-Instruct-v1.0 show promising performance, occasionally rivaling larger models in generating coherent and relevant summaries.
Evaluation Metrics: The paper employs ROUGE, METEOR, and BERTScore for automatic evaluation, capturing both surface-level and semantic overlap between generated and reference summaries. The inclusion of human and LLM-based evaluation further strengthens the study, capturing linguistic nuances that automatic metrics might overlook.
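For readers reproducing the automatic evaluation, the snippet below shows how the three metrics can be computed with the Hugging Face evaluate library; the library choice and the example strings are assumptions, not the paper's exact pipeline.

```python
# Sketch of the automatic metrics discussed above, using the Hugging Face `evaluate` library.
import evaluate

predictions = ["The council approved the new transit budget on Tuesday."]
references = ["The city council passed the transit budget on Tuesday."]

rouge = evaluate.load("rouge")          # n-gram overlap (surface-level)
meteor = evaluate.load("meteor")        # unigram matching with stemming and synonyms
bertscore = evaluate.load("bertscore")  # contextual-embedding similarity (semantic)

print(rouge.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```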
Few-Shot Learning Analysis: Notably, few-shot learning does not yield the expected performance improvements. The poor quality of the gold summaries used as demonstrations appears to drag down model performance, suggesting that fine-tuning on higher-quality datasets could yield greater gains.
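A hypothetical sketch of how few-shot demonstrations are typically assembled makes the failure mode concrete: noisy gold summaries placed in the prompt become the very pattern the model imitates. The helper below is illustrative; the paper's actual prompt template is not reproduced here.

```python
# Illustrative few-shot prompt construction (hypothetical helper, not the paper's template).
def build_few_shot_prompt(demos, article, k=2):
    """Prepend k (article, gold summary) demonstrations before the target article."""
    parts = []
    for demo_article, demo_summary in demos[:k]:
        parts.append(f"Article: {demo_article}\nSummary: {demo_summary}\n")
    parts.append(f"Article: {article}\nSummary:")
    return "\n".join(parts)

# If the demonstration summaries are themselves noisy (as the gold summaries often are),
# the model tends to reproduce that noise, consistent with the reported lack of gains.
```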
Implications and Future Directions
Data Quality and Evaluation Techniques: This research identifies data quality as a key barrier to effective summary generation and model training. Future research should focus on curating high-quality, human-verified summaries to enhance model training and evaluation. Moreover, expanding evaluation frameworks to include multilingual datasets could broaden the relevance and applicability of these findings.
Optimization of Smaller Models: While large models currently dominate, smaller models remain appealing for their efficiency and flexibility. Future work should explore optimized generation settings and domain-specific training data to enhance their performance further, potentially offering cost-effective alternatives for real-world applications.
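As one concrete direction, decoding settings can be tuned per model; the sketch below shows illustrative generation parameters for a small open model using Hugging Face transformers. The model ID and parameter values are assumptions, not settings reported in the paper.

```python
# Hedged sketch: tuning generation settings for a small open model with `transformers`.
from transformers import pipeline

summarizer = pipeline("text-generation", model="Qwen/Qwen1.5-7B-Chat")

article = "The city council voted on Tuesday to approve a new transit budget ..."
prompt = f"Summarize the following article in one sentence:\n{article}\nSummary:"

output = summarizer(
    prompt,
    max_new_tokens=64,       # cap summary length to keep outputs concise
    do_sample=False,         # greedy decoding tends to be more faithful
    repetition_penalty=1.1,  # discourage copying loops
    return_full_text=False,  # return only the generated continuation
)
print(output[0]["generated_text"])
```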
Advanced Evaluation Practices: The observed alignment between human judgments and the ratings of advanced LLMs points to AI-based evaluation as a scalable, cost-effective means of assessing new models. Continued exploration of this evaluation method will be crucial to maintaining rigorous standards in the rapidly evolving field of LLM research.
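A minimal sketch of such an LLM-as-judge setup is given below, using the OpenAI chat API; the judge model and rating criteria shown are placeholders rather than the paper's exact protocol.

```python
# Sketch of an LLM-as-judge evaluation call (criteria and judge model are assumptions).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_summary(article: str, summary: str) -> str:
    prompt = (
        "Rate the following summary of the article on a 1-5 scale for "
        "coherence, relevance, and faithfulness. Reply with a single number.\n\n"
        f"Article:\n{article}\n\nSummary:\n{summary}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # deterministic scoring
    )
    return response.choices[0].message.content
```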
Ultimately, this paper makes a significant contribution to the understanding of LLMs' capabilities in news summarization, providing foundational insights and setting a benchmark for future developments in natural language processing and generative AI systems.