Papers
Topics
Authors
Recent
Search
2000 character limit reached

Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions

Published 16 Aug 2024 in cs.AI and cs.CL | (2408.08781v1)

Abstract: LLMs-as-a-judge is a recently popularized method which replaces human judgements in task evaluation (Zheng et al. 2024) with automatic evaluation using LLMs. Due to widespread use of RLHF (Reinforcement Learning from Human Feedback), state-of-the-art LLMs like GPT4 and Llama3 are expected to have strong alignment with human preferences when prompted for a quality judgement, such as the coherence of a text. While this seems beneficial, it is not clear whether the assessments by an LLM-as-a-judge constitute only an evaluation based on the instructions in the prompts, or reflect its preference for high-quality data similar to its fine-tune data. To investigate how much influence prompting the LLMs-as-a-judge has on the alignment of AI judgements to human judgements, we analyze prompts with increasing levels of instructions about the target quality of an evaluation, for several LLMs-as-a-judge. Further, we compare to a prompt-free method using model perplexity as a quality measure instead. We aggregate a taxonomy of quality criteria commonly used across state-of-the-art evaluations with LLMs and provide this as a rigorous benchmark of models as judges. Overall, we show that the LLMs-as-a-judge benefit only little from highly detailed instructions in prompts and that perplexity can sometimes align better with human judgements than prompting, especially on textual quality.

Citations (3)

Summary

  • The paper introduces a novel evaluation taxonomy combining perplexity and prompt engineering to assess LLMs' quality judgments.
  • It compares multiple models and prompt settings using metrics like Pearson correlations, highlighting GPT-4’s superior alignment with human annotations.
  • The findings inform automated evaluation and training optimizations to enhance task-specific reliability and unbiased performance in LLMs.

Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions

Introduction

The paper explores the methodology of LLMs-as-a-judge, substituting human involvement in task evaluations with LLM-based automated assessments. LLMs such as GPT-4 and Llama3, trained through RLHF, are expected to emulate human judgement when evaluating quality tasks like text coherence. This research scrutinizes whether LLMs adhere to task-specific prompts or mirror their biases rooted in their high-quality fine-tuned data. It introduces a novel taxonomy of qualitative evaluation criteria and positions perplexity as a benchmark alongside explicit prompts for quality judgement.

Evaluating LLMs-as-a-judge

To critically assess LLMs-as-a-judge, multiple prompt settings are evaluated:

  1. Perplexity-Based Evaluation: Relies on LLM perplexity without explicit prompting, measuring the alignment of task solutions with the model's training data.
  2. Generic Prompt: Basic quality assessment with minimal instruction.
  3. Specific Criteria Prompt: Judgment based on named criteria, counting on the model's intrinsic knowledge of these criteria.
  4. Full Rubric Prompt: Includes comprehensive criteria with detailed guidance on scoring. Figure 1

    Figure 1: Radar chart of average Pearson correlations for the quality criteria groups for each of different settings of prompting (1 - Perplexity / No Prompting, 2 - Generic prompt, 3 - Specific prompt, 4 - Full rubric) over all models.

Model Selection

The research encompasses a range of LLMs, including prominent models like GPT-4, Llama3, Mistral, and Phi3. These models were evaluated across various quality metrics and prompt setups using benchmark datasets, including SummEval and TopicalChat, emphasizing diverse aspects like fluency, coherence, and task relevance.

Results and Analysis

Model Level Performance

Results indicate that larger models like GPT-4 exhibit superior alignment with human annotations across all prompt settings, demonstrating a minimal performance variance when detailed rubrics are provided. In contrast, smaller models benefit more significantly from comprehensive instructions.

Dataset Level Insights

Perplexity notably correlates highly with text quality assessments, outperforming minimal prompting in simpler quality judgment scenarios. However, complex tasks necessitate explicit instruction enrichment, as evidenced in datasets like OpinSummEval where nuanced criteria such as sentiment and aspect extraction demand a structured prompt setup.

Criteria Level Analysis

The taxonomy-driven analysis reveals:

  • Content-Based Criteria: Structural text quality (e.g., fluency) aligns robustly with perplexity, while nuanced attributes like naturalness benefit from detailed instructions.
  • Engagement-Based Criteria: Intrinsically subjective, these criteria (e.g., empathy) are less influenced by instruction detail, reflecting high variability and lower correlation with human judgments.
  • Relevance and Integrity Criteria: Instruction enrichment significantly enhances assessments, particularly when model capability aligns with task-specific accuracy needs.

Conclusion

The empirical analysis underscores nuanced prompt engineering's limited impact on LLM performance in judgment tasks, with perplexity providing a compelling benchmark for certain textual evaluations. These findings inform the practical deployment of LLMs in automatic evaluation contexts, guiding future optimizations in training and task-specific model assessment to improve fidelity in human-LLM alignment.

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 1 tweet with 0 likes about this paper.