Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge (2406.07791v7)

Published 12 Jun 2024 in cs.CL and cs.AI

Abstract: LLM-as-a-Judge presents a promising alternative to human evaluators across various tasks, but inherent biases have compromised its effectiveness, especially position bias, the tendency to favor a solution based on its position in the prompt. Our study introduces a systematic framework for examining position bias in pairwise comparisons, focusing on repetition stability, position consistency, and preference fairness. This research contributes new concepts for understanding position bias and a multi-dimensional framework for evaluating it. We conducted experiments with 12 LLM judges across MTBench and DevBench, covering 22 tasks and approximately 40 solution-generating models (candidates), resulting in over 100,000 evaluation instances. Our findings confirm that position bias in capable LLM judges is not due to random chance and that it varies notably across judges and tasks. Moreover, position bias is only weakly influenced by the length of prompt components but significantly affected by the quality gap between solutions. These insights can help optimize judge model selection, improve benchmark design, and inform future research on debiasing strategies, ultimately enhancing the reliability of LLM judges.
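To make the idea of position consistency concrete, here is a minimal sketch under a stated assumption: it treats position consistency as the fraction of solution pairs for which a judge picks the same solution before and after the two solutions swap places in the prompt. This is one plausible reading of the term, not necessarily the paper's exact definition; the function name `position_consistency` and the sample verdicts are hypothetical.

```python
from typing import List, Tuple


def position_consistency(judgments: List[Tuple[str, str]]) -> float:
    """Fraction of pairs where the judge picks the same solution in both orders.

    Each item is (winner_in_original_order, winner_in_swapped_order).
    Winners are labels of the solutions themselves (e.g. "A" or "B"),
    not of their positions in the prompt.
    Assumption: this is an illustrative definition, not the paper's formula.
    """
    if not judgments:
        raise ValueError("need at least one judged pair")
    consistent = sum(1 for first, second in judgments if first == second)
    return consistent / len(judgments)


# Hypothetical verdicts for four solution pairs, each judged in both orders.
verdicts = [("A", "A"), ("A", "B"), ("B", "B"), ("A", "A")]
print(position_consistency(verdicts))  # 0.75
```

A judge whose choice depends only on content would score 1.0 here; a judge that always picks whichever solution is shown first would score 0.0.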

Authors (5)

Lin Shi
Weicheng Ma
Soroush Vosoughi
Chiyu Ma
Wenhua Liang

