
Evaluating Large Language Models at Evaluating Instruction Following (2310.07641v2)

Published 11 Oct 2023 in cs.CL and cs.LG

Abstract: As research in LLMs continues to accelerate, LLM-based evaluation has emerged as a scalable and cost-effective alternative to human evaluations for comparing the ever-increasing list of models. This paper investigates the efficacy of these "LLM evaluators", particularly in using them to assess instruction following, a metric that gauges how closely generated text adheres to the given instruction. We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs. We manually curated 419 pairs of outputs, where one output adheres to the instruction while the other diverges from it yet may possess deceptive qualities that mislead an LLM evaluator, e.g., a more engaging tone. Contrary to existing meta-evaluation, we discover that different evaluators (i.e., combinations of LLMs and prompts) exhibit distinct performance on LLMBar, and even the highest-scoring ones have substantial room for improvement. We also present a novel suite of prompting strategies that further close the gap between LLM and human evaluators. With LLMBar, we hope to offer more insight into LLM evaluators and foster future research in developing better instruction-following models.

Evaluating LLMs on Instruction Following

The paper presents an insightful investigation into the evaluation of LLMs with a focus on their capability to follow instructions accurately. The research introduces a novel benchmark named LLMBar, which serves as a meta-evaluation tool to discern the efficacy of different LLM evaluators in assessing instruction-following outputs. At the core of this paper is the acknowledgement that as LLMs grow in complexity and ubiquity, a robust framework to evaluate their adherence to given instructions becomes indispensable.

Benchmark Design and Evaluation

LLMBar is meticulously designed to test LLMs on their ability to distinguish between outputs that genuinely follow instructions and those that diverge despite appearing qualitatively superior. The benchmark includes a collection of 419 instances across two main sets: Natural and Adversarial. The Natural set contains instances from existing preference datasets that have been rigorously filtered and modified to ensure objective preference based on instruction-following quality. The Adversarial set, in contrast, comprises outputs that, while deviating from the intended instructions, possess appealing superficial qualities that could mislead less adept evaluators.
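
To make the setup concrete, the following is a minimal sketch of how an LLMBar-style pairwise instance, and an evaluator's accuracy over a set of them, might be represented in Python. The field names (`instruction`, `output_1`, `output_2`, `label`, `subset`) and the `judge` signature are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PairwiseInstance:
    """One comparison: an instruction plus two candidate outputs (assumed schema)."""
    instruction: str   # the instruction both outputs respond to
    output_1: str      # candidate output A
    output_2: str      # candidate output B
    label: int         # 1 or 2: which output genuinely follows the instruction
    subset: str        # "natural" or "adversarial"

def evaluator_accuracy(instances: List[PairwiseInstance],
                       judge: Callable[[str, str, str], int]) -> float:
    """Fraction of instances where the judge picks the instruction-following output."""
    correct = sum(judge(x.instruction, x.output_1, x.output_2) == x.label
                  for x in instances)
    return correct / len(instances)
```

Reporting accuracy separately on the Natural and Adversarial subsets would mirror how the two sets probe different failure modes.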

The authors evaluate several LLMs—GPT-4, ChatGPT, LLaMA-2, PaLM 2, and Falcon—and pair these models with different prompting strategies to test their performance as evaluators on LLMBar. One key finding is that the performance of these evaluators varies considerably across both Natural and Adversarial sets. Notably, the paper highlights that even the best-performing GPT-4-based evaluators exhibit a significant gap compared to expert human annotators, particularly within the Adversarial set.
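
As an illustration of "an evaluator is an LLM paired with a prompt", the sketch below wraps a hypothetical `call_llm` client (a stand-in for whatever API serves the model) with a vanilla comparison prompt; the prompt wording is an assumption, not the paper's exact template.

```python
VANILLA_PROMPT = """You are comparing two outputs for the instruction below.
Instruction: {instruction}
Output (a): {output_1}
Output (b): {output_2}
Which output better follows the instruction? Answer with "a" or "b"."""

def make_evaluator(call_llm, prompt_template: str = VANILLA_PROMPT):
    """Build a judge(instruction, output_1, output_2) -> 1 or 2 from an LLM client."""
    def judge(instruction: str, output_1: str, output_2: str) -> int:
        reply = call_llm(prompt_template.format(
            instruction=instruction, output_1=output_1, output_2=output_2))
        return 1 if reply.strip().lower().startswith("a") else 2
    return judge
```

A judge built this way can then be scored with `evaluator_accuracy` from the earlier sketch, once per model-and-prompt combination.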

Prompting Strategies and Findings

The paper explores a variety of prompting strategies to enhance the evaluators' performance. Among them, combining self-generated metrics and reference outputs with explicit rules that prioritize instruction adherence showed significant promise, sharpening the evaluators' focus on instruction-following rather than superficial qualities. The paper underscores that, despite these improvements, most LLM evaluators still fall short of human-level accuracy, particularly on challenging instances where outputs are crafted to be superficially appealing yet instruction-divergent.
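
The combined strategy could be wired up roughly as below, again with the hypothetical `call_llm` client; the two-step decomposition and the rule wording are assumptions about the general idea rather than the paper's exact prompts.

```python
def make_metrics_reference_evaluator(call_llm):
    """Judge that first elicits criteria and a reference answer, then compares with rules."""
    def judge(instruction: str, output_1: str, output_2: str) -> int:
        metrics = call_llm(
            "List a few concise criteria for judging whether a response "
            f"follows this instruction:\n{instruction}")
        reference = call_llm(f"Write a high-quality response to:\n{instruction}")
        verdict = call_llm(
            "Rules: prefer the output that follows the instruction; ignore tone, "
            "length, and style unless the instruction asks for them.\n"
            f"Criteria:\n{metrics}\n"
            f"Reference output:\n{reference}\n"
            f"Instruction: {instruction}\n"
            f"Output (a): {output_1}\n"
            f"Output (b): {output_2}\n"
            'Which output better follows the instruction? Answer "a" or "b".')
        return 1 if verdict.strip().lower().startswith("a") else 2
    return judge
```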

Furthermore, the research investigates common biases in LLM evaluations, such as positional bias, and introduces the Swap strategy, which mitigates them by synthesizing evaluations from outputs presented in both orders. The results also show that LLMBar differentiates evaluator performance far more sharply than other meta-evaluation benchmarks, consolidating its role as a more nuanced tool for assessing LLMs' instruction-following capabilities.
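
The Swap idea can be sketched as a thin wrapper around any judge: query it with the outputs in both orders and accept only a position-consistent verdict. The abstain-on-disagreement fallback here is an illustrative choice, not necessarily the paper's exact aggregation rule.

```python
from typing import Optional

def with_swap(judge):
    """Debias a judge by evaluating both output orders and requiring agreement."""
    def swapped_judge(instruction: str, output_1: str, output_2: str) -> Optional[int]:
        first = judge(instruction, output_1, output_2)    # original order
        second = judge(instruction, output_2, output_1)   # swapped order
        second = 1 if second == 2 else 2                  # map back to the original labels
        if first == second:
            return first                                  # consistent across positions
        return None                                       # inconsistent: abstain / treat as a tie
    return swapped_judge
```

A caller could count abstentions as half-credit or as errors, depending on the scoring convention adopted.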

Implications and Future Directions

This paper presents compelling evidence of the limitations current LLM evaluators face, particularly concerning instruction adherence. The findings have practical implications for both AI researchers and practitioners intending to deploy LLMs for tasks requiring high fidelity to given instructions. The benchmark LLMBar offers an innovative pathway for future research aimed at developing more reliable evaluators and benchmarks that can accurately reflect the nuanced capabilities of LLMs in real-world contexts.

Looking ahead, there is potential for expanding LLMBar to encompass other critical evaluation criteria, such as factual correctness and non-toxicity in LLMs, as well as extending its scope to multi-turn evaluations. These future developments could further solidify LLMBar's position as a comprehensive evaluation tool in the ever-evolving landscape of LLMs.

In conclusion, the introduction of LLMBar represents a significant step toward refining the meta-evaluation of LLMs, offering deeper insights and fostering advancements in building models and evaluators that more precisely align with the desirable qualities of instruction-following and beyond.

Authors (6)
  1. Zhiyuan Zeng (23 papers)
  2. Jiatong Yu (5 papers)
  3. Tianyu Gao (35 papers)
  4. Yu Meng (92 papers)
  5. Tanya Goyal (24 papers)
  6. Danqi Chen (84 papers)
Citations (125)