Overview of am-ELO: A Stable Framework for Arena-based LLM Evaluation
The paper "am-ELO: A Stable Framework for Arena-based LLM Evaluation" addresses the inherent instability in the use of the ELO rating system for evaluating LLMs within competitive environments, or "model arenas". Traditional applications of the ELO system, which were originally designed for dynamic environments such as competitive games, have been shown to produce inconsistent results when applied to the evaluation of static datasets, a challenge compounded by the variability in annotator performance.
To address these issues, the paper proposes an enhanced framework that combines Maximum Likelihood Estimation (MLE) with explicit modeling of annotator ability. The first variant, m-ELO, replaces the iterative update rule of classical ELO with a stable MLE-based estimation procedure, yielding consistent model rankings. The second, am-ELO, extends this by incorporating annotator ability into the ELO probability function, so that model performance and annotator reliability are estimated simultaneously.
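The two ideas can be sketched as a single joint likelihood. The toy implementation below is a hedged illustration rather than the authors' code: m-ELO corresponds to fitting all ratings at once by maximum likelihood instead of applying sequential updates, and am-ELO adds a per-annotator ability parameter that scales the rating gap inside the sigmoid. The exact parameterization and the small ridge penalty are assumptions made for this example.

```python
# Toy joint MLE over model ratings and annotator abilities (illustrative only).
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, battles, n_models, n_annotators):
    """Joint negative log-likelihood over model ratings and annotator abilities."""
    ratings = params[:n_models]
    abilities = params[n_models:n_models + n_annotators]
    nll = 0.0
    for winner, loser, annotator in battles:
        # am-ELO-style win probability: the annotator's ability scales the rating gap.
        gap = ratings[winner] - ratings[loser]
        p_win = 1.0 / (1.0 + np.exp(-abilities[annotator] * gap))
        nll -= np.log(p_win + 1e-12)
    # Small ridge penalty keeps the toy problem identifiable (an added assumption).
    return nll + 0.01 * np.sum(ratings ** 2)

# Toy data: (winner_index, loser_index, annotator_index) triples.
battles = [(0, 1, 0), (0, 1, 1), (1, 2, 0), (0, 2, 1), (2, 1, 1)]
n_models, n_annotators = 3, 2

x0 = np.concatenate([np.zeros(n_models), np.ones(n_annotators)])
result = minimize(neg_log_likelihood, x0,
                  args=(battles, n_models, n_annotators), method="L-BFGS-B")
ratings, abilities = result.x[:n_models], result.x[n_models:]
print("estimated ratings:", ratings)
print("estimated annotator abilities:", abilities)
```

Because every annotation enters one likelihood, the fitted ratings no longer depend on the order in which comparisons are processed, which is the source of the stability gain.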
Key Experimental Outcomes
The experiments reported in the paper show clear gains in the stability of ELO score estimation: am-ELO reduced the inconsistency rate of scores by 30% compared with classical ELO. In simulation experiments, its ability-based adjustments also proved robust for detecting anomalous annotators, identifying them with up to 95% accuracy.
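The detection idea implied here is that annotators whose estimated ability is unusually low (or negative, indicating systematically inverted preferences) can be flagged as anomalous. The snippet below is a hedged illustration building on the sketch above; the threshold is an arbitrary choice, not a value from the paper.

```python
# Hypothetical post-processing step on the estimated annotator abilities.
def flag_anomalous_annotators(abilities, threshold=0.2):
    """Return indices of annotators whose estimated ability falls below the threshold."""
    return [i for i, ability in enumerate(abilities) if ability < threshold]

print(flag_anomalous_annotators([1.1, 0.9, -0.4, 0.05]))  # -> [2, 3]
```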
The authors compare the enhanced framework against classical ELO on a Chatbot Arena dataset of over 33,000 annotated interactions. While all three methods (classical ELO, m-ELO, and am-ELO) produce broadly consistent overall rankings, am-ELO fits the pairwise prediction task best, with lower log-likelihood loss and mean squared error (MSE) and higher area under the curve (AUC).
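As a rough sketch of how such a fit comparison can be scored, assuming each method outputs a win probability for every held-out annotation, the metrics named above can be computed as follows (toy probabilities, not Chatbot Arena results):

```python
# Toy scoring of predicted win probabilities with the metrics cited above.
import numpy as np
from sklearn.metrics import log_loss, mean_squared_error, roc_auc_score

outcomes = np.array([1, 0, 1, 1, 0, 1])                      # 1 = first model won
p_classical = np.array([0.55, 0.48, 0.60, 0.52, 0.45, 0.58])  # hypothetical predictions
p_am_elo = np.array([0.70, 0.30, 0.75, 0.65, 0.25, 0.72])

for name, probs in [("classical ELO", p_classical), ("am-ELO", p_am_elo)]:
    print(name,
          "log-loss:", round(log_loss(outcomes, probs), 3),
          "MSE:", round(mean_squared_error(outcomes, probs), 3),
          "AUC:", round(roc_auc_score(outcomes, probs), 3))
```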
Implications and Future Directions
This research illustrates the methodological advances needed to stabilize model rankings in LLM arenas, with practical implications for the reliability and validity of evaluation scores. As LLMs are deployed across increasingly diverse tasks, a consistent, annotator-aware evaluation framework becomes crucial, especially in high-stakes decision-making scenarios.
While the paper offers a compelling solution to the instability of the ELO method, it acknowledges limitations in how annotator behavior is modeled and identifies this as a direction for future refinement. Richer annotator modeling could make better use of crowd-sourced evaluation, increasing the precision of the feedback loop in model assessment. Future extensions might draw on deeper psychometric models to capture a wider range of annotator characteristics, further strengthening the robustness of the evaluation framework.
In conclusion, the am-ELO framework is a significant contribution to LLM evaluation, offering a more stable, interpretable, and effective assessment tool than existing methods. As AI continues to evolve, methodologies such as am-ELO will help ensure that model evaluations remain accurate and reflective of true model capability across varied annotation environments.