Unclear alignment of accuracy-based rankings with human preferences
Determine how well the model rankings induced by overall accuracy on an evaluation dataset align with human preference rankings for large language models, such as those aggregated by the Chatbot Arena platform, in order to assess whether accuracy is a valid summary metric for human-aligned evaluation.
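One plausible way to quantify this alignment is a rank-correlation statistic between the two rankings. The sketch below computes a Spearman rank correlation in pure Python; the model names and scores are invented for illustration, and Spearman's rho is one assumed choice of alignment measure, not necessarily the statistic used in the cited paper.

```python
# Hypothetical illustration: Spearman rank correlation between an
# accuracy-induced model ranking and a human-preference (Arena-style) ranking.
# All model names and scores below are invented for the example.

def rank(scores):
    """Map each item to its rank (1 = best), sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}

def spearman(r1, r2):
    """Spearman's rho for two rankings over the same items (no ties)."""
    n = len(r1)
    d2 = sum((r1[m] - r2[m]) ** 2 for m in r1)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Accuracy on some evaluation dataset vs. Arena-style preference scores.
accuracy = {"model_a": 0.82, "model_b": 0.79, "model_c": 0.75, "model_d": 0.70}
arena    = {"model_a": 1210, "model_b": 1180, "model_c": 1195, "model_d": 1105}

rho = spearman(rank(accuracy), rank(arena))
print(round(rho, 2))  # → 0.8: the two rankings agree only partially
```

A rho near 1 would indicate that accuracy ordering closely tracks human preference ordering; lower values would signal the misalignment the research question probes.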
References
While overall accuracy is commonly used to summarize a model's performance on a dataset, it remains unclear how well this metric aligns with human preferences.
— LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation
(2510.15746 - Yang et al., 17 Oct 2025) in Section 4.1 (Q1: Can Game-Theoretic Evaluation Align with Human Judgment?), paragraph preceding Table 2