Evaluating Abstractive Summarization: Limitations of LLMs as Human-Level Evaluators
The research paper "LLMs are Not Yet Human-Level Evaluators for Abstractive Summarization" rigorously examines the role of LLMs, particularly ChatGPT and GPT-4, as automatic evaluators for summarization. As LLMs gain prominence for their advanced reasoning capabilities, it is crucial to ascertain their reliability as evaluators, especially for abstractive summarization, where traditional metrics such as ROUGE and BERTScore are insufficient.
Overview of the Study
The paper provides a detailed analysis of the inherent limitations of LLM evaluators across four key dimensions: coherence, consistency, fluency, and relevance. The researchers employed three evaluation methods: Likert-scale scoring via Reason-then-Score (RTS) prompting, Multiple-choice Question (MCQ) scoring, and Head-to-Head (H2H) comparison. Although the LLMs outperform existing automatic metrics, the paper characterizes them as promising candidates rather than reliable replacements for human evaluators.
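To make the two Likert-style protocols concrete, here is a minimal sketch of how RTS-style and MCQ-style scoring could be issued for a single dimension. The prompt wording and the `call_llm` helper are illustrative assumptions, not the paper's exact templates or code.

```python
# Illustrative sketch of Reason-then-Score (RTS) and Multiple-choice Question (MCQ)
# prompting for one evaluation dimension. `call_llm` is a hypothetical helper that
# sends a prompt to a chat model and returns its text response.
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wire this to your chat-completion API of choice.")

def rts_score(article: str, summary: str, dimension: str = "consistency") -> int:
    """Ask the model to reason step by step, then give a 1-5 Likert score."""
    prompt = (
        f"Article:\n{article}\n\nSummary:\n{summary}\n\n"
        f"Explain step by step how well the summary satisfies {dimension}, "
        f"then end with a line 'Score: X' where X is an integer from 1 to 5."
    )
    response = call_llm(prompt)
    match = re.search(r"Score:\s*([1-5])", response)
    return int(match.group(1)) if match else -1  # -1 marks an unparseable answer

def mcq_score(article: str, summary: str, dimension: str = "consistency") -> int:
    """Ask the model to pick one of five labelled options instead of scoring freely."""
    options = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}
    prompt = (
        f"Article:\n{article}\n\nSummary:\n{summary}\n\n"
        f"Which option best describes the summary's {dimension}?\n"
        "(A) Excellent (B) Good (C) Fair (D) Poor (E) Very poor\n"
        "Answer with a single letter."
    )
    response = call_llm(prompt).strip()
    return options.get(response[:1].upper(), -1)
```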
Findings and Numerical Results
Several noteworthy observations emerged from the research:
- Correct Preferences: On the standard 66-pair test set, ChatGPT-RTS agreed with human judgment on 88.6% of system comparisons, correctly identifying the better summarization system, or a tie, in each pair.
- Correlation with Human Evaluations: Across 1200 summaries, ChatGPT showed stronger correlations with human judgments, improving on prior automatic metrics such as BARTScore by up to 0.2 in fluency. GPT-4 displayed further improvements, notably in consistency, which the authors attribute to its reduced hallucination rate.
- Single-Candidate Analysis and Meta-Correlation: The paper revealed substantial variation in evaluation reliability across summarization systems, exposing the instability of LLM evaluators. The meta-correlation analysis indicated that alignment with human judgments weakens for higher-quality summarization systems, particularly for ChatGPT-RTS and BARTScore (a sketch of these computations follows this list).
- Score Discrepancy: ChatGPT-RTS scores were consistently lower than human scores across all four dimensions, possibly because incorrect reasoning steps lead the model to apply unjustified penalties.
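For readers who want to reproduce the correlation and meta-correlation analyses in spirit, the sketch below shows one plausible way to compute them with SciPy. The per-system score dictionaries, the choice of Spearman correlation, and the function names are illustrative assumptions, not the paper's exact methodology.

```python
# Sketch (illustrative, not the paper's code) of per-system correlation with human
# judgments and a meta-correlation analysis. `llm_scores` and `human_scores` map each
# summarization system to a list of per-summary scores for one dimension.
import numpy as np
from scipy.stats import spearmanr

def per_system_correlations(llm_scores, human_scores):
    """Correlation between LLM and human scores, computed separately for each system."""
    correlations = {}
    for system in llm_scores:
        rho, _ = spearmanr(llm_scores[system], human_scores[system])
        correlations[system] = rho
    return correlations

def meta_correlation(llm_scores, human_scores):
    """Correlate each system's average human quality with how well the LLM evaluator
    agrees with humans on that system; a negative value means the evaluator drifts
    away from human judgments as system quality increases."""
    systems = list(llm_scores)
    per_system = per_system_correlations(llm_scores, human_scores)
    quality = [np.mean(human_scores[s]) for s in systems]
    agreement = [per_system[s] for s in systems]
    rho, _ = spearmanr(quality, agreement)
    return rho
```

In this framing, a strongly negative meta-correlation is exactly the candidate dependency discussed above: the metric's agreement with humans drops on the systems that matter most.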
Implications and Recommendations
These findings call for caution when using LLMs as standalone evaluators. The paper shows that LLM judgments depend on both the candidate system and the evaluation dimension, making them less reliable than humans, especially for high-quality summarization systems.
As a practical stopgap, the paper proposes a framework that uses the agreement between the two evaluation methods (RTS and MCQ) to gauge the LLM's reliability, helping practitioners decide when supplementary human evaluation is necessary rather than over-relying on the LLM.
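A minimal sketch of this idea, assuming RTS and MCQ scores are collected for the same batch of summaries, is shown below; the Spearman statistic and the 0.5 threshold are illustrative assumptions rather than values prescribed by the paper.

```python
from scipy.stats import spearmanr

def needs_human_evaluation(rts_scores, mcq_scores, threshold=0.5):
    """Flag a batch of summaries for human review when the two LLM scoring
    methods disagree; `threshold` is an illustrative cut-off."""
    agreement, _ = spearmanr(rts_scores, mcq_scores)
    return agreement < threshold, agreement

# Hypothetical scores for one system on one dimension.
flag, rho = needs_human_evaluation([4, 3, 5, 2, 4, 3], [4, 2, 5, 2, 3, 3])
print(f"RTS-MCQ agreement = {rho:.2f}; supplementary human evaluation needed: {flag}")
```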
Speculation on Future AI Developments
The paper anticipates that addressing these limitations will be a cornerstone for advancing LLM applications. Training models so that evaluation quality remains consistent across systems and dimensions could make LLMs more reliable summarization evaluators, and developing evaluation metrics that stay aligned with human expectations will likely remain a focus of AI research.
In conclusion, while LLMs show promise as supplementary evaluators, they cannot yet replace human assessment. As future developments unfold, combining LLM evaluations with explicit checks on their reliability may pave the way toward more autonomous summarization evaluation systems.