The paper "How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs?" explores the trustworthiness of automatic evaluation methods in assessing LLMs that have been fine-tuned to follow instructions. This investigation is crucial because LLMs are widely used in natural language processing to tackle various tasks, and their effectiveness hinges on reliable evaluation methods.
Background and Importance
Instruction-tuned LLMs have transformed natural language processing by demonstrating an impressive ability to learn new tasks from instructions. The evaluation of these models traditionally relies on human judgment, which, although considered the gold standard, is costly and time-intensive. Automatic evaluation methods, such as text-overlap metrics (like ROUGE-L) and LLM-as-a-judge approaches (like GPT-4), are therefore attractive alternatives. Understanding the reliability of these methods is essential to ensure we can efficiently and accurately gauge the performance of LLMs across different tasks and languages.
Study Overview
The paper presents a comprehensive analysis of two prominent evaluation methods, ROUGE-L and GPT-4 as a judge, against human evaluations across multiple tasks and two languages (English and Swedish). The authors find that the reliability of automatic evaluation methods depends heavily on the specific task and language.
- ROUGE-L Performance:
- Short-answer tasks: ROUGE-L correlates strongly with human judgments on simple English tasks that expect brief responses.
- Free-form generation tasks: It proves less reliable for tasks that require longer, more complex responses, because it relies on matching words between the generated and reference responses (see the ROUGE-L sketch after this list).
- Cross-lingual issues: ROUGE-L's effectiveness diminishes on non-English tasks, partly because of differences in language structure and noise introduced by informal translations.
- LLM-as-a-Judge with GPT-4:
- Presence of Reference Answers: Including a reference answer when prompting GPT-4 helps ground its evaluations, although it can lead to stricter assessments when model outputs diverge from the reference (see the judging sketch after this list).
- Task Type Effect: As with ROUGE-L, GPT-4's judgments align more closely with human ratings for tasks with expected short answers. For tasks requiring long, detailed outputs, providing reference answers can sometimes limit GPT-4's ability to judge creativity and correctness comprehensively.
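To make the word-overlap mechanism concrete, here is a minimal sketch of ROUGE-L computed as an LCS-based F-measure. The helper functions and example sentences are illustrative assumptions, not taken from the paper, and simple whitespace tokenization stands in for whatever tokenizer an actual evaluation would use.

```python
# Minimal ROUGE-L sketch: longest-common-subsequence F1 over whitespace tokens.
# Illustrates why literal short answers score highly while valid paraphrases
# of longer references score poorly. Example sentences are invented.

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if tok_a == tok_b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between a candidate and a single reference."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# A short factual answer that matches the reference scores 1.0 ...
print(rouge_l("Paris", "Paris"))
# ... but a correct paraphrase of a longer reference scores low (~0.35 here),
# even though a human would likely judge it acceptable.
print(rouge_l("The film was released in 2001 and became a cult classic.",
              "It came out in 2001 and over time gained a devoted following."))
```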
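For the LLM-as-a-judge setup, a sketch of the judging call shows where the reference answer enters the prompt. This assumes the OpenAI Python client; the `judge` function, prompt wording, and 1–5 rating scale are illustrative choices rather than the paper's exact protocol.

```python
# Minimal LLM-as-a-judge sketch using the OpenAI Python client (v1 API).
# The prompt template and rating scale are illustrative, not the paper's.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def judge(instruction: str, model_output: str, reference: str | None = None) -> str:
    """Ask GPT-4 to rate a model output, optionally conditioned on a reference answer."""
    prompt = (
        f"Instruction:\n{instruction}\n\n"
        f"Model output:\n{model_output}\n\n"
    )
    if reference is not None:
        # Including a reference anchors the judgment, but tends to make the judge
        # stricter when a valid output diverges from the reference wording.
        prompt += f"Reference answer:\n{reference}\n\n"
    prompt += "Rate the model output from 1 (unacceptable) to 5 (excellent) and briefly justify the score."

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judgments make repeated comparisons easier
    )
    return response.choices[0].message.content
```

Running the same model output through `judge` with and without a reference is a quick way to observe the stricter, reference-anchored behavior described above.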
Recommendations and Pitfalls
The paper offers practical guidance on getting the most out of these automatic evaluation methods:
- While both methods can be cost-effective proxies for human evaluation under specific conditions, they should be applied judiciously, with their context-dependent limitations taken into account.
- Effectiveness varies across tasks and languages, so the evaluation method should be chosen to fit the requirements of the task at hand.
- For tasks with many valid possible responses, especially free-form generation, disagreement between automatic and human judgments indicates that automatic evaluations may not fully capture a model's performance, underscoring the need for careful gold-standard design.
Conclusion
Overall, the paper reinforces the view that although automated methods such as ROUGE-L and GPT-4 can mirror human judgment in certain settings, relying on them alone may not provide a full picture of an LLM's capabilities, especially in multilingual and open-ended tasks. Understanding and continually refining the contexts in which they are applied is vital to using them effectively as a substitute for human evaluation.