Evaluating Abstractive Summarization: Limitations of LLMs as Human-Level Evaluators
The research paper "LLMs are Not Yet Human-Level Evaluators for Abstractive Summarization" rigorously examines the role of LLMs, particularly ChatGPT and GPT-4, as automatic evaluators for summarization. As LLMs gain prominence for their advanced reasoning capabilities, it is crucial to ascertain their reliability as evaluators, especially for abstractive summarization, where traditional metrics such as ROUGE and BERTScore are insufficient.
Overview of the Study
The paper provides a detailed analysis of the inherent limitations of LLM evaluators across four key dimensions: coherence, consistency, fluency, and relevance. The researchers employed three evaluation methods: Likert-scale scoring via Reason-then-Score (RTS) prompting, Multiple-choice Question (MCQ) scoring, and Head-to-Head (H2H) comparison. Although the LLMs outperform existing automatic metrics, the paper characterizes them as promising candidates rather than reliable replacements for human evaluators.
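To make the two Likert-style protocols concrete, here is a minimal sketch of how RTS-style and MCQ-style scoring could be issued for a single dimension. The prompt wording and the `call_llm` helper are illustrative assumptions, not the paper's exact templates or code.

```python
# Illustrative sketch of Reason-then-Score (RTS) and Multiple-choice Question (MCQ)
# prompting for one evaluation dimension. `call_llm` is a hypothetical helper that
# sends a prompt to a chat model and returns its text response.
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wire this to your chat-completion API of choice.")

def rts_score(article: str, summary: str, dimension: str = "consistency") -> int:
    """Ask the model to reason step by step, then give a 1-5 Likert score."""
    prompt = (
        f"Article:\n{article}\n\nSummary:\n{summary}\n\n"
        f"Explain step by step how well the summary satisfies {dimension}, "
        f"then end with a line 'Score: X' where X is an integer from 1 to 5."
    )
    response = call_llm(prompt)
    match = re.search(r"Score:\s*([1-5])", response)
    return int(match.group(1)) if match else -1  # -1 marks an unparseable answer

def mcq_score(article: str, summary: str, dimension: str = "consistency") -> int:
    """Ask the model to pick one of five labelled options instead of scoring freely."""
    options = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}
    prompt = (
        f"Article:\n{article}\n\nSummary:\n{summary}\n\n"
        f"Which option best describes the summary's {dimension}?\n"
        "(A) Excellent (B) Good (C) Fair (D) Poor (E) Very poor\n"
        "Answer with a single letter."
    )
    response = call_llm(prompt).strip()
    return options.get(response[:1].upper(), -1)
```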
Findings and Numerical Results
Several noteworthy observations emerged from the research:
- Correct Preferences: On the standard 66-pair test set, ChatGPT-RTS agreed with human judgment on 88.6% of system comparisons, correctly identifying the better summarization system, or a tie, in each pair.
- Correlation with Human Evaluations: Across 1200 summaries, ChatGPT showed stronger correlations with human judgments, improving on prior automatic metrics such as BARTScore by up to 0.2 in fluency. GPT-4 displayed further improvements, notably in consistency, which the authors attribute to its reduced hallucination rate.
- Single-Candidate Analysis and Meta-Correlation: The paper revealed substantial variation in evaluation reliability across summarization systems, exposing the instability of LLM evaluators. The meta-correlation analysis indicated that alignment with human judgments weakens for higher-quality summarization systems, particularly for ChatGPT-RTS and BARTScore (a sketch of these computations follows this list).
- Score Discrepancy: ChatGPT-RTS scores were consistently lower than human scores across all four dimensions, possibly because incorrect reasoning steps lead the model to apply unjustified penalties.
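For readers who want to reproduce the correlation and meta-correlation analyses in spirit, the sketch below shows one plausible way to compute them with SciPy. The per-system score dictionaries, the choice of Spearman correlation, and the function names are illustrative assumptions, not the paper's exact methodology.

```python
# Sketch (illustrative, not the paper's code) of per-system correlation with human
# judgments and a meta-correlation analysis. `llm_scores` and `human_scores` map each
# summarization system to a list of per-summary scores for one dimension.
import numpy as np
from scipy.stats import spearmanr

def per_system_correlations(llm_scores, human_scores):
    """Correlation between LLM and human scores, computed separately for each system."""
    correlations = {}
    for system in llm_scores:
        rho, _ = spearmanr(llm_scores[system], human_scores[system])
        correlations[system] = rho
    return correlations

def meta_correlation(llm_scores, human_scores):
    """Correlate each system's average human quality with how well the LLM evaluator
    agrees with humans on that system; a negative value means the evaluator drifts
    away from human judgments as system quality increases."""
    systems = list(llm_scores)
    per_system = per_system_correlations(llm_scores, human_scores)
    quality = [np.mean(human_scores[s]) for s in systems]
    agreement = [per_system[s] for s in systems]
    rho, _ = spearmanr(quality, agreement)
    return rho
```

In this framing, a strongly negative meta-correlation is exactly the candidate dependency discussed above: the metric's agreement with humans drops on the systems that matter most.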
Implications and Recommendations
These findings call for caution when using LLMs as standalone evaluators. The paper shows that LLM judgments depend on both the candidate system and the evaluation dimension, making them less reliable than humans, especially for high-quality summarization systems.
As a practical stopgap, the paper proposes a framework that uses the agreement between the two evaluation methods (RTS and MCQ) to gauge the LLM's reliability, helping practitioners decide when supplementary human evaluation is necessary rather than over-relying on the LLM.
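A minimal sketch of this idea, assuming RTS and MCQ scores are collected for the same batch of summaries, is shown below; the Spearman statistic and the 0.5 threshold are illustrative assumptions rather than values prescribed by the paper.

```python
from scipy.stats import spearmanr

def needs_human_evaluation(rts_scores, mcq_scores, threshold=0.5):
    """Flag a batch of summaries for human review when the two LLM scoring
    methods disagree; `threshold` is an illustrative cut-off."""
    agreement, _ = spearmanr(rts_scores, mcq_scores)
    return agreement < threshold, agreement

# Hypothetical scores for one system on one dimension.
flag, rho = needs_human_evaluation([4, 3, 5, 2, 4, 3], [4, 2, 5, 2, 3, 3])
print(f"RTS-MCQ agreement = {rho:.2f}; supplementary human evaluation needed: {flag}")
```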
Speculation on Future AI Developments
The paper anticipates that addressing these limitations will be a cornerstone for advancing LLM applications. Training models so that evaluation quality remains consistent across systems and dimensions could make LLMs more reliable summarization evaluators, and developing evaluation metrics that stay aligned with human expectations will likely remain a focus of AI research.
In conclusion, while LLMs show promise as supplementary evaluators, they cannot yet replace human assessment. As future developments unfold, combining LLM evaluations with explicit checks on their reliability may pave the way toward more autonomous summarization evaluation systems.