Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions (2408.08781v1)

Published 16 Aug 2024 in cs.AI and cs.CL

Abstract: LLMs-as-a-judge is a recently popularized method which replaces human judgements in task evaluation (Zheng et al. 2024) with automatic evaluation using LLMs. Due to widespread use of RLHF (Reinforcement Learning from Human Feedback), state-of-the-art LLMs like GPT4 and Llama3 are expected to have strong alignment with human preferences when prompted for a quality judgement, such as the coherence of a text. While this seems beneficial, it is not clear whether the assessments by an LLM-as-a-judge constitute only an evaluation based on the instructions in the prompts, or reflect its preference for high-quality data similar to its fine-tune data. To investigate how much influence prompting the LLMs-as-a-judge has on the alignment of AI judgements to human judgements, we analyze prompts with increasing levels of instructions about the target quality of an evaluation, for several LLMs-as-a-judge. Further, we compare to a prompt-free method using model perplexity as a quality measure instead. We aggregate a taxonomy of quality criteria commonly used across state-of-the-art evaluations with LLMs and provide this as a rigorous benchmark of models as judges. Overall, we show that the LLMs-as-a-judge benefit only little from highly detailed instructions in prompts and that perplexity can sometimes align better with human judgements than prompting, especially on textual quality.


Evaluating the Evaluator: Measuring LLMs’ Adherence to Task Evaluation Instructions

The paper "Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions" studies how well LLMs work as automated evaluators, focusing on how closely they follow the evaluation instructions given in a prompt. As LLMs increasingly replace human judges in tasks such as natural language generation (NLG), it matters how these models behave in the judge role. The paper examines how well LLM evaluations agree with human annotations and how much the level of detail in the prompt instructions changes that agreement.

Key Contributions

Development of a Taxonomy for Evaluation Criteria: The authors aggregate a taxonomy of quality criteria commonly used when evaluating with LLMs-as-a-judge. The taxonomy has four primary categories: Content, Relevance, Integrity, and Engagement. It groups 34 metrics drawn from eight benchmark datasets, giving a structured basis for comparing judges across tasks.
Systematic Evaluation Across Models: The paper conducts a meticulous evaluation of several major LLM families, including GPT4, Llama3, Mistral, and Phi3, across four different levels of prompt instructions. This evaluation is intended to measure how increasing the granularity of evaluation rubrics impacts models' performance and alignment with human judgments.
Assessment of Perplexity as an Evaluation Metric: The paper compares prompted judgments with a prompt-free alternative that uses model perplexity as the quality measure. In some cases perplexity correlates with human judgments better than direct prompting does, especially on textual quality criteria.
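Perplexity is conventionally defined as the exponential of the average negative log-likelihood per token. The sketch below illustrates that standard definition only; it is not the authors' implementation. The function name `perplexity` and the log-probability values are made up for illustration, and in practice the values would come from a language model's token scores.

```python
import math


def perplexity(token_logprobs):
    """Return perplexity given per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token.
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Hypothetical token probabilities for two candidate texts (assumed values).
fluent = [math.log(0.5), math.log(0.25), math.log(0.5)]
garbled = [math.log(0.05), math.log(0.01), math.log(0.02)]

fluent_ppl = perplexity(fluent)
garbled_ppl = perplexity(garbled)
```

Lower perplexity means the model finds the text more predictable, so a perplexity-based quality score is usually negated or inverted before it is compared with human ratings where higher means better.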

Results and Findings

The paper finds that detailed prompting does little to improve the correlation of LLM evaluations with human judgments: the gain from full rubric instructions is only about 4%. In many cases, particularly for simple textual quality evaluations, model perplexity matches and sometimes exceeds the alignment achieved by explicitly instructed models. This suggests that for criteria such as coherence and fluency, how closely a text resembles the model's training data (as indicated by perplexity) may already carry enough signal about quality.
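Agreement between a judge and human annotators is typically reported as a correlation between the two sets of scores. The sketch below computes a Spearman rank correlation in plain Python as one common choice; which correlation measure the paper uses is not stated here, so treat that choice as an assumption. The function names and the score lists are hypothetical.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five texts (assumed values, not from the paper).
human_scores = [1, 2, 3, 4, 5]
judge_scores = [2, 1, 3, 5, 4]
rho = spearman(human_scores, judge_scores)
```

When the automatic score is a perplexity, it would be negated first so that higher values mean better quality on both sides of the comparison.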

For categories that are more complex and subjective, such as engagement and relevance, more detailed instructions and rubrics still improve evaluation reliability. Larger models such as GPT-4 outperform the others across most criteria and settings, which points to model scale and capability as a major factor in evaluation accuracy.

Implications and Future Directions

The implications of these findings are twofold. Practically, model perplexity offers a simpler, prompt-free way to evaluate LLM outputs on straightforward textual quality criteria, which makes evaluation more scalable and less resource-intensive. Theoretically, the paper suggests that a model's training data and inherent biases can substantially help or hinder its judgments, which points to future research on how these biases can be explicitly managed and used.

Future developments might focus on refining LLM architectures or enhancing model training processes to inherently support more nuanced and diverse evaluative judgments without extensive external instruction. Additionally, the exploration of integrating these findings into automated systems could lead to more autonomous LLM-based evaluation mechanisms, further reducing the need for human oversight in quality evaluation.

In conclusion, the paper offers a critical evaluation of current LLM capabilities as task evaluators. It highlights both their strengths and their limitations, and it encourages further research on automatic task evaluation.


Authors

Bhuvanashree Murugadoss
Christian Poelitz
Ian Drosos
Vu Le
Nick McKenna
Carina Suzana Negreanu
Chris Parnin
Advait Sarkar
