Benchmarking Cognitive Biases in Large Language Models as Evaluators (2309.17012v3)

Published 29 Sep 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs are cognitively biased judges. LLMs have recently been shown to be effective as automatic evaluators with simple prompting and in-context learning. In this work, we assemble 15 LLMs of four different size ranges and evaluate their output responses by preference ranking from the other LLMs as evaluators, such as System Star is better than System Square. We then evaluate the quality of ranking outputs introducing the Cognitive Bias Benchmark for LLMs as Evaluators (CoBBLEr), a benchmark to measure six different cognitive biases in LLM evaluation outputs, such as the Egocentric bias where a model prefers to rank its own outputs highly in evaluation. We find that LLMs are biased text quality evaluators, exhibiting strong indications on our bias benchmark (average of 40% of comparisons across all models) within each of their evaluations that question their robustness as evaluators. Furthermore, we examine the correlation between human and machine preferences and calculate the average Rank-Biased Overlap (RBO) score to be 49.6%, indicating that machine preferences are misaligned with humans. According to our findings, LLMs may still be unable to be utilized for automatic annotation aligned with human preferences. Our project page is at: https://minnesotanlp.github.io/cobbler.

Evaluating Cognitive Biases in LLMs: A Critical Assessment

The paper "Benchmarking Cognitive Biases in LLMs as Evaluators" by Koo et al. presents a thorough examination of inherent biases within LLMs when utilized as automatic evaluators. The research focuses on the susceptibility of LLMs to cognitive biases, potentially compromising their reliability in tasks requiring objective evaluation of textual content.

Methodology and Framework

The paper introduces a benchmarking framework, CoBBLEr, designed to assess six distinct cognitive biases in LLM evaluators: Order Bias, Compassion Fade, Egocentric Bias, Salience Bias, Bandwagon Effect, and Attentional Bias. The experimental setup involves 15 LLMs spanning four size ranges, each instructed to rank the outputs of the other models by perceived quality, with specific prompt configurations designed to isolate each bias.

  • Implicit Biases: The four implicit biases (Order Bias, Compassion Fade, Egocentric Bias, and Salience Bias) manifest without explicit prompting or adversarial interference. Order Bias and Egocentric Bias were of particular concern, with LLMs frequently favoring their own outputs or whichever option was presented first; a minimal probe for Order Bias is sketched after this list.
  • Induced Biases: The Bandwagon Effect and Attentional Bias required intentional modifications to prompts, revealing the models' vulnerabilities to external and irrelevant influences during text evaluation tasks.
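To make the pairwise setup concrete, the sketch below (a rough illustration, not code from the paper or the CoBBLEr repository) shows how an Order Bias probe over "System Star vs. System Square" comparisons could be run: the same pair is judged in both presentation orders, and an inconsistent verdict suggests that position, rather than content, drove the choice. The `judge` callable and the prompt template are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of an Order Bias probe: show the
# judge the same pair of responses in both orders and flag cases where the
# preferred system changes with presentation order.
from typing import Callable

PROMPT_TEMPLATE = (
    "Which response answers the instruction better?\n\n"
    "Instruction: {instruction}\n\n"
    "Response A: {first}\n\n"
    "Response B: {second}\n\n"
    "Answer with exactly 'A' or 'B'."
)

def order_bias_flag(
    judge: Callable[[str], str],  # hypothetical wrapper around one LLM call
    instruction: str,
    response_star: str,
    response_square: str,
) -> bool:
    """Return True if the judge's preference flips when the order is swapped."""
    verdict_fwd = judge(PROMPT_TEMPLATE.format(
        instruction=instruction, first=response_star, second=response_square))
    verdict_rev = judge(PROMPT_TEMPLATE.format(
        instruction=instruction, first=response_square, second=response_star))

    # Map the positional answers ('A'/'B') back to the underlying systems.
    winner_fwd = "star" if verdict_fwd.strip().upper().startswith("A") else "square"
    winner_rev = "square" if verdict_rev.strip().upper().startswith("A") else "star"

    # A consistent judge picks the same system in both orders; an
    # order-biased judge tends to pick whichever system is shown first.
    return winner_fwd != winner_rev
```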

Key Findings

A salient outcome of the paper is the quantitative evidence of bias across all models, particularly for the Bandwagon Effect and Attentional Bias. The average Rank-Biased Overlap (RBO) score between human and machine rankings was approximately 49.6%, indicating substantial misalignment between human judgments and machine preferences.
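For context, Rank-Biased Overlap (Webber et al., 2010) compares two rankings with geometrically decaying weight on deeper ranks, so agreement near the top counts most. The snippet below is a minimal sketch of the truncated prefix form; the sample rankings are illustrative and not drawn from the paper.

```python
# Minimal sketch of truncated Rank-Biased Overlap (RBO) between a human
# ranking and a machine ranking; higher means closer agreement, and the
# truncated sum is a lower bound on the full (extrapolated) RBO.

def rbo(human: list[str], machine: list[str], p: float = 0.9) -> float:
    """RBO ~= (1 - p) * sum over depths d of p^(d-1) * |prefix overlap at d| / d."""
    depth = min(len(human), len(machine))
    score = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(human[:d]) & set(machine[:d])) / d
        score += (p ** (d - 1)) * overlap
    return (1 - p) * score

# Illustrative rankings (not from the paper); compare scores at a fixed depth.
human_rank = ["system_a", "system_b", "system_c", "system_d"]
machine_rank = ["system_b", "system_a", "system_d", "system_c"]
print(f"truncated RBO = {rbo(human_rank, machine_rank):.3f}")
```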

Models exhibited varying degrees of vulnerability to bias. For example, while common biases such as Order Bias notably influenced the majority of models, larger models (e.g., GPT-4, ChatGPT) were better at resisting distractions and external influences in their evaluations. Conversely, smaller models showed increased susceptibility to induced biases, often diverging from human-aligned preferences.

Implications

The practical implications of these findings caution against uncritical reliance on LLMs for automatic text evaluation tasks. The biases identified suggest potential pitfalls in using LLMs for applications that require objectivity, such as quality assurance or content moderation. Specifically, applications like peer review or content evaluation that hinge on unbiased judgment could see compromised quality if LLMs are used without rigorous checks on inherent biases.

Future Directions

To improve reliability and alignment with human preferences, future research should focus on developing bias-mitigation strategies in LLMs, possibly through enhanced fine-tuning or adversarial training methods. The exploration of debiasing techniques, along with leveraging human-in-the-loop approaches, presents a viable pathway towards developing LLMs capable of more accurate and unbiased evaluations.
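As one simple illustration of a human-in-the-loop check (not a mitigation evaluated in the paper), a judge's pairwise verdict could be accepted only when it survives a presentation-order swap, with disagreements deferred to a human annotator; the `judge` callable below is a hypothetical pairwise-preference call.

```python
# Illustrative sketch (not a method from the paper) of a simple
# human-in-the-loop guard against Order Bias: accept the judge's verdict
# only when it is stable under a presentation-order swap, and escalate
# inconsistent cases to a human annotator.
from typing import Callable, Optional

def debiased_preference(
    judge: Callable[[str, str], str],  # hypothetical: returns "first" or "second"
    response_star: str,
    response_square: str,
) -> Optional[str]:
    """Return 'star', 'square', or None when the judge is order-inconsistent."""
    forward = judge(response_star, response_square)   # star shown first
    backward = judge(response_square, response_star)  # square shown first

    winner_forward = "star" if forward == "first" else "square"
    winner_backward = "square" if backward == "first" else "star"

    if winner_forward == winner_backward:
        return winner_forward
    return None  # defer to a human annotator
```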

In conclusion, this paper underscores a critical examination of LLMs as evaluators, presenting vital insights into cognitive biases that affect their performance. The proposed CoBBLEr framework provides an essential tool for further research on bias identification and management, offering directions for future advancements in unbiased LLM evaluation methodologies.

Authors (6)
  1. Ryan Koo (6 papers)
  2. Minhwa Lee (7 papers)
  3. Vipul Raheja (21 papers)
  4. Jong Inn Park (4 papers)
  5. Zae Myung Kim (15 papers)
  6. Dongyeop Kang (72 papers)
Citations (44)