Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
51 tokens/sec
GPT-4o
60 tokens/sec
Gemini 2.5 Pro Pro
44 tokens/sec
o3 Pro
8 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge (2406.07791v7)

Published 12 Jun 2024 in cs.CL and cs.AI

Abstract: LLM-as-a-Judge presents a promising alternative to human evaluators across various tasks, but inherent biases, especially position bias - a tendency to favor solutions based on their position in the prompt - have compromised its effectiveness. Our study introduces a systematic framework to examine position bias in pairwise comparisons, focusing on repetition stability, position consistency, and preference fairness. This research significantly contributes to the field by introducing new concepts for understanding position bias and providing a multi-dimensional framework for evaluations. We conducted experiments with 12 LLM judges across MTBench and DevBench, covering 22 tasks and approximately 40 solution-generating models - candidates, resulting in over 100,000 evaluation instances. Our findings confirm that position bias in capable LLM judges is not due to random chances, along with notable variations observed across judges and tasks. Moreover, position bias is weakly influenced by the length of prompt components but significantly impacted by the quality gap between solutions. These insights can help optimize judge model selections, improve benchmark design, and inform future research on debiasing strategies, ultimately enhancing the reliability of LLM judges.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (5)
  1. Lin Shi (39 papers)
  2. Weicheng Ma (22 papers)
  3. Soroush Vosoughi (90 papers)
  4. Chiyu Ma (8 papers)
  5. Wenhua Liang (3 papers)
X Twitter Logo Streamline Icon: https://streamlinehq.com

Tweets