- The paper finds that selective private testing inflates scores by allowing providers to test multiple variants and only report the best, distorting true model performance.
- The paper demonstrates that data access asymmetries heavily favor large proprietary providers, giving them up to 68 times more data than academic labs.
- The paper shows that overfitting to Chatbot Arena data increases benchmark win-rates without general capability improvements, compromising leaderboard reliability.
This paper, "The Leaderboard Illusion" (2504.20879), conducts a systematic review of Chatbot Arena, a prominent leaderboard for ranking LLMs, and identifies several practices and policies that distort its rankings and favor a small group of model providers, primarily large technology companies. The authors argue that these issues lead to models overfitting to Arena-specific dynamics rather than demonstrating genuine improvements in general capabilities, illustrating a potential instance of Goodhart's Law.
The paper is based on an analysis of multiple data sources totaling over 2 million battles, involving 243 models from 42 providers, covering the period from January 2024 to April 2025. The authors categorize models by license type (proprietary, open-weight, open-source) to analyze trends and disparities. The core of the Arena's ranking system is based on the Bradley-Terry (BT) model [bradley1952rank], a probabilistic framework for estimating skill from pairwise comparisons. The paper examines how deviations from the BT model's assumptions, such as unbiased sampling, transitivity, and graph connectivity, contribute to ranking unreliability.
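The ranking mechanism can be made concrete with a minimal Bradley-Terry fit over pairwise battle outcomes. This is an illustrative sketch only: the battle records, optimizer choice, and absence of ties or score rescaling are assumptions, not the Arena's actual implementation.

```python
# Minimal Bradley-Terry fit from pairwise battle outcomes via maximum
# likelihood (illustrative sketch; not Chatbot Arena's actual pipeline,
# score scale, or optimizer).
import numpy as np
from scipy.optimize import minimize

# Hypothetical battle records: (winner_index, loser_index).
battles = [(0, 1), (0, 2), (1, 2), (0, 1), (2, 1), (0, 2)]
n_models = 3

def neg_log_likelihood(beta):
    # Under BT: P(i beats j) = exp(beta_i) / (exp(beta_i) + exp(beta_j)).
    nll = 0.0
    for winner, loser in battles:
        diff = beta[winner] - beta[loser]
        nll += np.log1p(np.exp(-diff))  # -log P(winner beats loser)
    return nll

# Scores are only identified up to a shift, so add a tiny ridge penalty
# and centre the fitted scores afterwards.
result = minimize(lambda b: neg_log_likelihood(b) + 1e-6 * np.dot(b, b),
                  x0=np.zeros(n_models), method="BFGS")
beta_hat = result.x - result.x.mean()
print({f"model_{i}": round(score, 3) for i, score in enumerate(beta_hat)})
```

The fitted scores only determine win probabilities between models that are (directly or indirectly) connected by battles, which is why the sampling and connectivity assumptions discussed below matter.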
Key Findings and Practical Implications:
- Impact of Private Testing and Selective Retraction:
- Finding: Chatbot Arena has an undisclosed policy allowing certain providers to test multiple model variants privately and submit only the best-performing one to the public leaderboard. Providers such as Meta, Google, and Amazon were observed to use this practice extensively (\Cref{fig-private-testing-providers}); Meta alone tested 27 private variants in the lead-up to its Llama 4 release.
- Mechanism: This "best-of-N" strategy violates the unbiased sampling assumption of the BT model. Selecting the maximum score from $N$ noisy estimates systematically inflates the reported score compared to a single, unbiased submission. Theoretically, the expected value of the maximum of $N$ estimates is strictly greater than the expected value of any single estimate: $\mathbb{E}[\hat{\beta}_{\text{Best}}] > \mathbb{E}[\hat{\beta}_k]$ for $N \geq 2$ and non-degenerate distributions (as detailed in \Cref{app:unbiased_sampling}). A toy simulation of this inflation is sketched after this finding.
- Evidence: Simulations show that testing just 10 variants can lead to an approximately 100-point increase in the expected maximum Arena Score (\Cref{fig:number-of-variants}). This can enable a weaker model family to outrank a stronger one if only the former uses this strategy (\Cref{fig-submission-strategy}). Real-world experiments by the authors, testing identical and slightly different model variants, confirmed that submitting multiple models and selecting the best score leads to tangible ranking advantages, even for identical checkpoints (\Cref{sec:real-world-exp}, \Cref{fig:modelcomparisons}).
- Implication: Leaderboard rankings are not solely based on model capability but can be manipulated by providers with the resources and knowledge to test numerous private variants and selectively report results. This distorts the playing field and makes it difficult to gauge true model progress.
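As a complement to the paper's simulations, here is a minimal sketch of the best-of-N inflation mechanism; the true score, noise level, and variant counts are assumed values, not the paper's experimental settings.

```python
# Toy simulation: reporting only the best of N noisy score estimates
# inflates the expected reported score (assumed parameters; not the
# paper's simulation setup).
import numpy as np

rng = np.random.default_rng(0)
true_score = 1200.0   # hypothetical true Arena-style score
noise_sd = 40.0       # hypothetical per-variant estimation noise
n_trials = 100_000

for n_variants in (1, 3, 10):
    # Each trial: N independent noisy score estimates of the same model.
    estimates = rng.normal(true_score, noise_sd, size=(n_trials, n_variants))
    best = estimates.max(axis=1)
    print(f"N={n_variants:2d}  mean reported score {best.mean():7.1f}  "
          f"inflation {best.mean() - true_score:+6.1f}")
```

Under a normal noise model the expected maximum of 10 draws sits roughly 1.5 standard deviations above the mean, so the inflation grows with both the number of private variants and the noise in individual score estimates.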
- Data Access Asymmetries:
- Finding: Proprietary model providers receive a significantly larger share of data (prompts and battle outcomes) from Chatbot Arena than open-weight and open-source providers. This stems from a combination of factors: the number of private variants tested, unequal sampling rates, deprecation policies, and API-based serving that lets these providers log 100% of the prompts sent to their models, versus the roughly 20% of data shared with others under Arena policy. A back-of-envelope illustration of how these factors compound is sketched after this finding.
- Evidence: OpenAI, Google, Meta, and Anthropic collectively received an estimated 62.8% of the total Arena data, approximately 68 times more than major academic and non-profit labs combined (\Cref{fig:public_private_data}). Sampling rates vary drastically; Google and OpenAI models had maximum daily sampling rates up to 34%, much higher than others (\Cref{fig-private-testing-max}).
- Implication: This creates a substantial data advantage for a few large, proprietary labs, enabling them to better understand and potentially optimize for the Arena's specific data distribution. Given Chatbot Arena is a community-driven platform relying on free user feedback, this asymmetry benefits commercial entities disproportionately.
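For intuition on how the factors above multiply, the following back-of-envelope sketch uses entirely hypothetical numbers (battle volume, sampling rates, variant counts, and access fractions are not figures from the paper) to show how they compound into a large asymmetry.

```python
# Back-of-envelope compounding of data asymmetries.
# All inputs below are hypothetical, not figures from the paper.
DAILY_BATTLES = 10_000  # assumed total Arena battles per day

def prompts_seen_per_day(sampling_rate, n_variants, access_fraction):
    """Prompts a provider collects per day under assumed conditions."""
    return DAILY_BATTLES * sampling_rate * n_variants * access_fraction

big_lab = prompts_seen_per_day(sampling_rate=0.10, n_variants=5, access_fraction=1.0)
small_lab = prompts_seen_per_day(sampling_rate=0.02, n_variants=1, access_fraction=0.20)
print(f"big lab: {big_lab:.0f}/day, small lab: {small_lab:.0f}/day, "
      f"ratio: {big_lab / small_lab:.0f}x")
```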
- Risk of Overfitting to Arena Data:
- Finding: Access to Chatbot Arena data can lead to significant performance gains specifically on the Arena distribution, suggesting a risk of overfitting. While the Arena's dynamic nature might seem resistant to overfitting, the data includes both long-term shifts in prompt distribution (\Cref{fig-language-distribution}) and notable levels of prompt duplication and near-duplication over time (\Cref{fig:duplicate_prompt}, \Cref{app:prompt-duplication-headmap}); a simple near-duplicate check is sketched after this finding.
- Evidence: The authors' fine-tuning experiments showed that increasing the proportion of Arena data in a supervised fine-tuning mix significantly improved win-rates on ArenaHard (a benchmark highly correlated with Chatbot Arena outcomes) by up to 112% relative to a model trained without Arena data (\Cref{fig:overfit-sft-exp}). However, these gains did not generalize to an out-of-distribution benchmark like MMLU (\Cref{tab:overfitting_mmlu}), indicating that the improvement is specific to the Arena distribution.
- Implication: Model providers with extensive access to Arena data can potentially fine-tune their models to excel on the Arena distribution, gaining a competitive edge on the leaderboard without necessarily improving general model capabilities. This incentivizes optimizing for the leaderboard metric rather than real-world performance.
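The duplication claim above can be probed with a simple near-duplicate detector; the character n-gram Jaccard similarity and the 0.8 threshold below are illustrative assumptions, not the paper's actual deduplication method.

```python
# Simple near-duplicate prompt detection via character n-gram Jaccard
# similarity (illustrative approach; not the paper's exact method).
def char_ngrams(text, n=5):
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0

prompts = [
    "Write a python function to reverse a string.",
    "Write a Python function to reverse a string",   # near-duplicate
    "Explain the Bradley-Terry model in one paragraph.",
]
grams = [char_ngrams(p) for p in prompts]
for i in range(len(prompts)):
    for j in range(i + 1, len(prompts)):
        sim = jaccard(grams[i], grams[j])
        if sim > 0.8:  # assumed near-duplicate threshold
            print(f"near-duplicate pair ({i}, {j}): similarity {sim:.2f}")
```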
- Impact of Model Deprecation on Ranking Reliability:
- Finding: Model deprecation, particularly silent deprecation (reducing a model's sampling rate to near zero without notification), is widespread on Chatbot Arena, affecting 205 of the 243 public models during the study period, far more than the 47 officially listed as deprecated (\Cref{fig-silent-deprecated}). Open-weight and open-source models are disproportionately affected by deprecation (\Cref{fig-silent-deprecated-license-cat}, \Cref{fig-silent-deprecated_prop_open}).
- Mechanism: Deprecation under a changing task distribution violates the BT model's assumption of constant evaluation conditions. Models evaluated mostly on older task distributions may have rankings that do not reflect performance on current tasks (\Cref{sec:dist-shift}). Excessive or uneven deprecation can also leave the comparison graph sparse or disconnected, violating the BT assumption of connectivity; this makes global rankings unreliable and leaves relative strengths between models in disconnected clusters impossible to estimate (\Cref{sparse_battle_history}, \Cref{fig:graph-connectivity}). A minimal connectivity check is sketched after this finding.
- Implication: The high rate of model deprecation, especially silent and uneven deprecation, undermines the statistical reliability of the BT rankings on Chatbot Arena. Models, particularly open models, may have unreliable or uninterpretable scores if their comparison history is sparse, outdated, or isolated.
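The connectivity assumption referenced above can be checked directly on the battle graph. The sketch below uses networkx with invented model names and battle records purely for illustration.

```python
# Checking whether the model-comparison graph is connected, which the
# Bradley-Terry model needs for a well-identified global ranking
# (hypothetical battle records; illustrative only).
import networkx as nx

# Hypothetical battles after deprecations: pairs of models that were compared.
battles = [("gpt-x", "gemini-y"), ("gemini-y", "llama-z"),
           ("old-model-1", "old-model-2")]  # isolated cluster of deprecated models

graph = nx.Graph()
graph.add_edges_from(battles)

components = list(nx.connected_components(graph))
print(f"connected: {nx.is_connected(graph)}")        # False in this example
print(f"{len(components)} components: {components}")
# Scores across disconnected components are not comparable under BT.
```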
Recommendations for Improvement:
The authors propose several actionable recommendations to restore fairness, transparency, and trust in Chatbot Arena:
- Prohibit Score Retraction: All tested model variants, including private ones, should have their scores permanently published upon submission.
- Establish Transparent Limits on Private Models: Enforce a strict, publicly disclosed limit on the number of private variants a provider can test concurrently (e.g., maximum 3 variants per provider). This should apply equally to all providers.
- Ensure Fair Deprecation: Implement clear, auditable criteria for model removal. A stratified approach that deprecates models proportionally across proprietary, open-weight, and open-source categories (e.g., removing the bottom 30% in each group after convergence) could maintain balance and connectivity.
- Implement Fair Sampling: Adopt an active sampling strategy, potentially the one previously proposed by the Arena organizers [chiang2024chatbot], that prioritizes under-evaluated pairs to reduce uncertainty in rankings rather than favoring specific providers; a toy version of such a strategy is sketched after this list.
- Provide Public Transparency: Publicly disclose information about all tested models (including private aliases once testing concludes), deprecated models (both official and silent), and detailed sampling rates over time.
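To illustrate the fair-sampling recommendation, here is a toy heuristic that routes the next battle to the model pair whose win-rate estimate is most uncertain; the smoothed binomial standard error is an assumed criterion, not the specific algorithm of [chiang2024chatbot] or the Arena's sampler.

```python
# Toy active-sampling heuristic: pick the model pair whose estimated
# win rate is most uncertain (illustrative; not the algorithm proposed
# in chiang2024chatbot or used by Chatbot Arena).
import math

# Hypothetical battle counts per pair: (wins_a, wins_b).
counts = {
    ("model-a", "model-b"): (120, 95),
    ("model-a", "model-c"): (3, 4),     # barely evaluated pair
    ("model-b", "model-c"): (50, 60),
}

def win_rate_std_error(wins_a, wins_b):
    """Standard error of the estimated win rate under a binomial model."""
    n = wins_a + wins_b
    p = (wins_a + 1) / (n + 2)          # smoothed win-rate estimate
    return math.sqrt(p * (1 - p) / n)

next_pair = max(counts, key=lambda pair: win_rate_std_error(*counts[pair]))
print(f"sample next: {next_pair}")       # the under-evaluated pair wins
```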
Limitations:
The authors acknowledge limitations, including lack of access to Chatbot Arena's full raw data (which means issues such as adversarial voting could go undetected), the limited duration of their scraping window (January to March 2025), potential underestimation of overfitting effects due to limited data access, and reliance on model self-identification for attributing private models.
In conclusion, the paper argues that while Chatbot Arena is a valuable community resource, current policies and practices have created a distorted playing field that favors a few large providers, incentivizes overfitting to the benchmark, and compromises the reliability of the rankings. The authors call for urgent reforms centered on transparency and fairness to restore scientific integrity to the leaderboard.