Large Language Models are not Fair Evaluators (2305.17926v2)

Published 29 May 2023 in cs.CL, cs.AI, and cs.IR

Abstract: In this paper, we uncover a systematic bias in the evaluation paradigm of adopting large language models (LLMs), e.g., GPT-4, as a referee to score and compare the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 out of 80 tested queries with ChatGPT as an evaluator. To address this issue, we propose a calibration framework with three simple yet effective strategies: 1) Multiple Evidence Calibration, which requires the evaluator model to generate multiple pieces of evaluation evidence before assigning ratings; 2) Balanced Position Calibration, which aggregates results across various orders to determine the final score; 3) Human-in-the-Loop Calibration, which introduces a balanced position diversity entropy to measure the difficulty of each example and seeks human assistance when needed. We also manually annotate the "win/tie/lose" outcomes of responses from ChatGPT and Vicuna-13B in the Vicuna Benchmark's question prompt, and extensive experiments demonstrate that our approach successfully mitigates evaluation bias, resulting in closer alignment with human judgments. We release our code and human annotation at https://github.com/i-Eval/FairEval to facilitate future research.

Summary

The paper identifies a critical positional bias where the order of responses significantly skews LLM evaluation rankings.
Empirical tests show that swapping response order leads to notable fluctuations in judgments by GPT-4 and ChatGPT.
The authors propose a calibration framework (MEC, BPC, HITLC) that improves alignment with human assessments by up to 14.3%.

Large Language Models are not Fair Evaluators: A Critical Examination

The paper "Large Language Models are not Fair Evaluators" identifies a problem that undermines the reliability of using LLMs such as GPT-4 and ChatGPT to judge the quality of model-generated responses. The problem is positional bias: the tendency of an LLM evaluator to change its verdict based solely on the order in which the candidate responses are presented, which compromises the fairness and objectivity of the evaluation.

The central observation is that the quality ranking of candidate models can be substantially changed merely by altering the sequence in which responses are presented. In the comparison of Vicuna-13B against ChatGPT, swapping the response order produced wide swings in the judged outcome. The direction of the bias depends on the evaluator: ChatGPT showed a strong inclination towards the second response, whereas GPT-4 showed a bias towards the first. This implies that current methods that use LLMs as evaluators can yield unreliable results unless response position is controlled for.
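One way to make this order sensitivity concrete is to judge each pair twice, once in each order, and count how often the verdict changes. The sketch below is illustrative only: the `Judge` signature, the verdict labels, and the toy `second_biased_judge` are assumptions made for this example, not code from the paper or its repository.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signature: (question, first_response, second_response)
# -> "first", "second", or "tie". A real judge would call an LLM here.
Judge = Callable[[str, str, str], str]


def flip_verdict(verdict: str) -> str:
    """Map a verdict given on the swapped order back to the original order."""
    return {"first": "second", "second": "first"}.get(verdict, verdict)


def conflict_rate(judge: Judge, examples: List[Tuple[str, str, str]]) -> float:
    """Fraction of examples whose verdict changes when the responses swap places."""
    if not examples:
        return 0.0
    conflicts = 0
    for question, resp_a, resp_b in examples:
        original = judge(question, resp_a, resp_b)
        swapped = flip_verdict(judge(question, resp_b, resp_a))
        if original != swapped:
            conflicts += 1
    return conflicts / len(examples)


def second_biased_judge(question: str, first: str, second: str) -> str:
    """Toy judge that always prefers whichever response is shown second."""
    return "second"


examples = [("q1", "answer a", "answer b"), ("q2", "answer c", "answer d")]
print(conflict_rate(second_biased_judge, examples))  # prints 1.0
```

A judge with no positional bias would score 0.0 on this measure, since its verdict about a given pair would not depend on which response came first.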

In response to the identified bias problem, the authors propose a comprehensive calibration framework consisting of three main strategies: Multiple Evidence Calibration (MEC), Balanced Position Calibration (BPC), and Human-in-the-Loop Calibration (HITLC).

Multiple Evidence Calibration (MEC): The LLM evaluator is prompted to generate multiple pieces of evaluation evidence before it assigns scores to the responses. Grounding the rating in several justifications makes the evaluation more stable and less sensitive to response position.
Balanced Position Calibration (BPC): Each pair is evaluated in both orders, and the scores are aggregated across orders to determine the final score. Because every response appears in each position, a preference for one position affects both candidates equally.
Human-in-the-Loop Calibration (HITLC): A balanced position diversity entropy measures the difficulty of each example, and human assistance is sought for the examples that need it. Involving human evaluators selectively in this way brings the final results closer to human judgments.
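The three strategies can be sketched together under some loudly stated assumptions. The `Scorer` signature is hypothetical, each of the `k` scorer calls stands in for one independently generated piece of evidence plus its scores, and the entropy is computed as plain Shannon entropy over the win/tie/lose labels from both orders, which is one plausible reading of "balanced position diversity entropy" rather than the paper's exact formula.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical scorer: (question, first_response, second_response)
# -> (score_for_first, score_for_second). A real scorer would call an LLM.
Scorer = Callable[[str, str, str], Tuple[float, float]]


def label(score_a: float, score_b: float) -> str:
    """Outcome from response A's point of view."""
    if score_a > score_b:
        return "win"
    if score_a < score_b:
        return "lose"
    return "tie"


def calibrated_eval(scorer: Scorer, question: str, resp_a: str, resp_b: str,
                    k: int = 3) -> Tuple[float, float, float]:
    """Return (mean score of A, mean score of B, entropy of the outcomes)."""
    scores_a: List[float] = []
    scores_b: List[float] = []
    labels: List[str] = []
    for _ in range(k):  # k samples per order (stand-in for multiple evidence)
        # Original order: A is shown first.
        s_a, s_b = scorer(question, resp_a, resp_b)
        scores_a.append(s_a)
        scores_b.append(s_b)
        labels.append(label(s_a, s_b))
        # Swapped order: B is shown first, so the returned pair is (B, A).
        s_b, s_a = scorer(question, resp_b, resp_a)
        scores_a.append(s_a)
        scores_b.append(s_b)
        labels.append(label(s_a, s_b))
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    # Shannon entropy over outcome labels: 0 when all 2k evaluations agree.
    total = len(labels)
    entropy = sum((c / total) * math.log(total / c) for c in Counter(labels).values())
    return mean_a, mean_b, entropy


def needs_human(entropy: float, threshold: float = 0.5) -> bool:
    """Hypothetical routing rule: send high-entropy examples to a human."""
    return entropy > threshold


def toy_scorer(question: str, first: str, second: str) -> Tuple[float, float]:
    """Toy scorer: scores by length and adds a bonus to the second position."""
    return float(len(first)), float(len(second)) + 1.0


mean_a, mean_b, entropy = calibrated_eval(toy_scorer, "q", "aaa", "bb", k=2)
print(mean_a, mean_b, needs_human(entropy))  # prints 3.5 2.5 True
```

In the toy run, the position bonus turns a clear win for A into a tie in one order, so the two orders disagree and the example is flagged for human review, while averaging across orders still ranks A above B.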

The paper reports that these calibration methods improve the alignment of LLM evaluations with human assessments. In the experiments, both the accuracy and the kappa correlation coefficient of GPT-4 and ChatGPT, measured against the human annotations, increased when the calibrations were applied. The reported gains are 9.8% for GPT-4 and 14.3% for ChatGPT, indicating that the framework substantially mitigates the bias.
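For readers unfamiliar with the two agreement measures, the sketch below computes accuracy and a kappa coefficient between evaluator verdicts and human labels. It assumes the kappa in question is Cohen's kappa, which the text does not state explicitly, and the label lists are made-up illustrative data, not results from the paper.

```python
from collections import Counter
from typing import List


def accuracy(pred: List[str], gold: List[str]) -> float:
    """Share of examples where the evaluator's verdict matches the human label."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def cohen_kappa(pred: List[str], gold: List[str]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(gold)
    p_o = accuracy(pred, gold)
    pred_counts, gold_counts = Counter(pred), Counter(gold)
    # Chance agreement: probability both raters pick the same label independently.
    p_e = sum(pred_counts[l] * gold_counts[l] for l in set(pred) | set(gold)) / (n * n)
    if p_e == 1.0:  # both raters always use one identical label
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Made-up labels, purely for illustration.
human = ["win", "tie", "lose", "win"]
model = ["win", "tie", "win", "win"]
print(accuracy(model, human))               # prints 0.75
print(round(cohen_kappa(model, human), 4))  # prints 0.5556
```

Kappa is lower than accuracy here because some of the observed agreement would be expected by chance, given how often each rater says "win".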

The research has both practical and theoretical implications. Practically, the proposed framework makes automated evaluation more reliable, reducing reliance on expensive and time-consuming human assessment while keeping evaluations robust and fair. Theoretically, the paper sheds light on intrinsic biases that may affect LLMs across diverse applications, and it urges developers and researchers to build bias mitigation into LLM-based evaluation from the start.

The paper anticipates that future AI evaluation methods will increasingly include bias calibration mechanisms, making automated assessments more reliable and fair. It emphasizes the importance of continuing to detect and correct biases wherever LLMs are deployed, as a step towards AI systems that are better aligned with human intent.

Overall, the paper contributes to the discourse on AI reliability and fairness, providing actionable strategies to overcome identified biases in LLM evaluation processes. Researchers and developers are encouraged to adopt and build upon these insights to enhance the efficacy and fairness of AI technologies in practice.
