The Comparative Trap: Pairwise Comparisons Amplifies Biased Preferences of LLM Evaluators (2406.12319v3)

Published 18 Jun 2024 in cs.CL

Abstract: As LLMs are increasingly used as evaluators for natural language generation tasks, ensuring unbiased assessments is essential. However, LLM evaluators often display biased preferences, such as favoring verbosity and authoritative tones. Our empirical analysis reveals that these biases are exacerbated in pairwise evaluation, where LLMs directly compare two outputs and easily prioritize superficial attributes. In contrast, pointwise evaluation, which assesses outputs independently, is less susceptible to such bias because each output is judged in isolation. To address the limitations of pairwise evaluation, we introduce a novel evaluation method, PRePair, which integrates pointwise reasoning within a pairwise framework. PRePair effectively alleviates biased preferences, improving performance on the adversarial benchmark (LLMBar) while outperforming pointwise evaluation on the standard benchmark (MT-Bench).
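To make the two-stage idea concrete, here is a minimal sketch of what a PRePair-style judge could look like: each output is first critiqued in isolation (pointwise reasoning), and only then are the two critiques used to produce a pairwise verdict. The `call_llm` function and the prompt wording are hypothetical placeholders, not the paper's actual prompts or implementation.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion client."""
    raise NotImplementedError("plug in your LLM provider here")


def prepair_judge(instruction: str, output_a: str, output_b: str) -> str:
    # Stage 1: pointwise reasoning. Critique each output independently,
    # so superficial attributes of the rival output (length, tone) cannot
    # steer the analysis.
    critiques = {}
    for label, output in (("A", output_a), ("B", output_b)):
        critiques[label] = call_llm(
            f"Instruction:\n{instruction}\n\n"
            f"Candidate output:\n{output}\n\n"
            "Assess, step by step, how well this output follows the "
            "instruction. Do not assume any other candidate exists."
        )

    # Stage 2: pairwise verdict. Compare the outputs conditioned on the
    # independent critiques rather than on the raw outputs alone.
    verdict = call_llm(
        f"Instruction:\n{instruction}\n\n"
        f"Output A:\n{output_a}\nCritique of A:\n{critiques['A']}\n\n"
        f"Output B:\n{output_b}\nCritique of B:\n{critiques['B']}\n\n"
        "Based on the critiques above, which output better follows the "
        "instruction? Answer with exactly 'A' or 'B'."
    )
    return verdict.strip()
```

The design choice this sketch illustrates is the one the abstract argues for: the comparative judgment still happens pairwise, but the reasoning that feeds it is generated pointwise, which is where the bias reduction is claimed to come from.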

Authors (5)
  1. Hawon Jeong (2 papers)
  2. ChaeHun Park (15 papers)
  3. Jimin Hong (9 papers)
  4. Jaegul Choo (161 papers)
  5. Hojoon Lee (22 papers)
Citations (1)