
Unclear alignment of accuracy-based rankings with human preferences

Determine how well model rankings induced by overall accuracy on an evaluation dataset align with human preference rankings for large language models, such as those aggregated by the Chatbot Arena platform, in order to assess whether accuracy is a valid summary metric for human-aligned evaluation.
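A natural way to quantify this alignment is a rank-correlation statistic between the accuracy-induced ordering and the human-preference ordering. The sketch below is purely illustrative and not from the paper; the model names and scores are hypothetical placeholders, and it assumes each model has a single Chatbot Arena-style preference score.

```python
# Illustrative sketch (not from the paper): measure how closely an
# accuracy-induced ranking tracks a human-preference ranking using
# Kendall's tau and Spearman's rho. All model names and numbers are
# hypothetical placeholders.
from scipy.stats import kendalltau, spearmanr

# Hypothetical per-model overall accuracy on an evaluation dataset.
accuracy = {"model_a": 0.81, "model_b": 0.78, "model_c": 0.74, "model_d": 0.69}

# Hypothetical human-preference scores (e.g., Chatbot Arena-style ratings).
human_pref = {"model_a": 1250, "model_b": 1290, "model_c": 1180, "model_d": 1210}

models = sorted(accuracy)  # fixed model order so the two score vectors align
acc_scores = [accuracy[m] for m in models]
pref_scores = [human_pref[m] for m in models]

# Correlation of the two induced rankings: values near 1 indicate that
# accuracy orders models much like human preferences do.
tau, tau_p = kendalltau(acc_scores, pref_scores)
rho, rho_p = spearmanr(acc_scores, pref_scores)
print(f"Kendall tau = {tau:.3f} (p = {tau_p:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
```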


Background

The paper investigates whether game-theoretic peer evaluation and rank aggregation can produce model rankings that align with human judgments. While overall accuracy is a common summary metric in benchmark evaluations, the authors note that its relationship to human preferences is not straightforward and may fail to capture subjective or open-ended aspects of performance.

To probe this uncertainty, the paper compares accuracy-induced rankings with rankings obtained via Kemeny-Young aggregation of per-question peer evaluations, and evaluates both against human preferences from Chatbot Arena. This setup underscores the need to quantify how faithfully accuracy reflects human-aligned judgments.
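For concreteness, a minimal brute-force form of Kemeny-Young aggregation over per-question rankings might look like the sketch below. This is a generic illustration, not the paper's implementation; it assumes a small number of models and complete per-question rankings, with hypothetical model names.

```python
# Minimal, generic Kemeny-Young sketch (not the paper's implementation):
# find the consensus ordering that minimizes total pairwise disagreement
# with a set of per-question rankings. Brute force over permutations is
# only feasible for a handful of models.
from itertools import permutations

# Hypothetical per-question rankings, best model first.
per_question_rankings = [
    ["model_b", "model_a", "model_d", "model_c"],
    ["model_a", "model_b", "model_c", "model_d"],
    ["model_b", "model_d", "model_a", "model_c"],
]

models = sorted({m for r in per_question_rankings for m in r})

def disagreements(candidate, rankings):
    """Count pairwise orderings in `rankings` that the candidate ordering reverses."""
    pos = {m: i for i, m in enumerate(candidate)}
    cost = 0
    for r in rankings:
        for i in range(len(r)):
            for j in range(i + 1, len(r)):
                better, worse = r[i], r[j]
                if pos[better] > pos[worse]:  # candidate flips this pair
                    cost += 1
    return cost

kemeny_ranking = min(permutations(models), key=lambda c: disagreements(c, per_question_rankings))
print("Kemeny-Young consensus ranking:", list(kemeny_ranking))
```

The resulting consensus ranking, like the accuracy-induced one, can then be compared against a human-preference ordering with the same rank-correlation statistics shown earlier.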

References

While overall accuracy is commonly used to summarize a model’s performance on a dataset, it remains unclear how well this metric aligns with human preferences.

LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation (2510.15746 - Yang et al., 17 Oct 2025) in Section 4.1 (Q1: Can Game-Theoretic Evaluation Align with Human Judgment?), paragraph preceding Table 2