- The paper finds that selective private testing inflates scores by allowing providers to test multiple variants and only report the best, distorting true model performance.
- The paper demonstrates that data access asymmetries heavily favor large proprietary providers, giving them up to 68 times more data than academic labs.
- The paper shows that overfitting to Chatbot Arena data increases benchmark win-rates without general capability improvements, compromising leaderboard reliability.
This paper, "The Leaderboard Illusion" (2504.20879), conducts a systematic review of Chatbot Arena, a prominent leaderboard for ranking LLMs, and identifies several practices and policies that distort its rankings and favor a small group of model providers, primarily large technology companies. The authors argue that these issues lead to models overfitting to Arena-specific dynamics rather than demonstrating genuine improvements in general capabilities, illustrating a potential instance of Goodhart's Law.
The paper is based on an analysis of multiple data sources totaling over 2 million battles, involving 243 models from 42 providers, covering the period from January 2024 to April 2025. The authors categorize models by license type (proprietary, open-weight, open-source) to analyze trends and disparities. The core of the Arena's ranking system is based on the Bradley-Terry (BT) model [bradley1952rank], a probabilistic framework for estimating skill from pairwise comparisons. The paper examines how deviations from the BT model's assumptions, such as unbiased sampling, transitivity, and graph connectivity, contribute to ranking unreliability.
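The ranking mechanism can be made concrete with a minimal Bradley-Terry fit over pairwise battle outcomes. This is an illustrative sketch only: the battle records, optimizer choice, and absence of ties or score rescaling are assumptions, not the Arena's actual implementation.

```python
# Minimal Bradley-Terry fit from pairwise battle outcomes via maximum
# likelihood (illustrative sketch; not Chatbot Arena's actual pipeline,
# score scale, or optimizer).
import numpy as np
from scipy.optimize import minimize

# Hypothetical battle records: (winner_index, loser_index).
battles = [(0, 1), (0, 2), (1, 2), (0, 1), (2, 1), (0, 2)]
n_models = 3

def neg_log_likelihood(beta):
    # Under BT: P(i beats j) = exp(beta_i) / (exp(beta_i) + exp(beta_j)).
    nll = 0.0
    for winner, loser in battles:
        diff = beta[winner] - beta[loser]
        nll += np.log1p(np.exp(-diff))  # -log P(winner beats loser)
    return nll

# Scores are only identified up to a shift, so add a tiny ridge penalty
# and centre the fitted scores afterwards.
result = minimize(lambda b: neg_log_likelihood(b) + 1e-6 * np.dot(b, b),
                  x0=np.zeros(n_models), method="BFGS")
beta_hat = result.x - result.x.mean()
print({f"model_{i}": round(score, 3) for i, score in enumerate(beta_hat)})
```

The fitted scores only determine win probabilities between models that are (directly or indirectly) connected by battles, which is why the sampling and connectivity assumptions discussed below matter.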
Key Findings and Practical Implications:
- Impact of Private Testing and Selective Retraction:
- Finding: Chatbot Arena has an undisclosed policy allowing certain providers to test multiple model variants privately and submit only the best-performing one to the public leaderboard. Providers such as Meta, Google, and Amazon were observed to use this practice extensively (\Cref{fig-private-testing-providers}); Meta alone tested 27 private variants in the lead-up to its Llama 4 release.
- Mechanism: This "best-of-N" strategy violates the unbiased sampling assumption of the BT model. Selecting the maximum score from $N$ noisy estimates systematically inflates the reported score compared to a single, unbiased submission. Theoretically, the expected value of the maximum of $N$ estimates is strictly greater than the expected value of any single estimate: $\mathbb{E}[\hat{\beta}_{\text{Best}}] > \mathbb{E}[\hat{\beta}_k]$ for $N \geq 2$ and non-degenerate distributions (as detailed in \Cref{app:unbiased_sampling}). A toy simulation of this inflation is sketched after this finding.
- Evidence: Simulations show that testing just 10 variants can lead to an approximately 100-point increase in the expected maximum Arena Score (\Cref{fig:number-of-variants}). This can enable a weaker model family to outrank a stronger one if only the former uses this strategy (\Cref{fig-submission-strategy}). Real-world experiments by the authors, testing identical and slightly different model variants, confirmed that submitting multiple models and selecting the best score leads to tangible ranking advantages, even for identical checkpoints (\Cref{sec:real-world-exp}, \Cref{fig:modelcomparisons}).
- Implication: Leaderboard rankings are not solely based on model capability but can be manipulated by providers with the resources and knowledge to test numerous private variants and selectively report results. This distorts the playing field and makes it difficult to gauge true model progress.
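As a complement to the paper's simulations, here is a minimal sketch of the best-of-N inflation mechanism; the true score, noise level, and variant counts are assumed values, not the paper's experimental settings.

```python
# Toy simulation: reporting only the best of N noisy score estimates
# inflates the expected reported score (assumed parameters; not the
# paper's simulation setup).
import numpy as np

rng = np.random.default_rng(0)
true_score = 1200.0   # hypothetical true Arena-style score
noise_sd = 40.0       # hypothetical per-variant estimation noise
n_trials = 100_000

for n_variants in (1, 3, 10):
    # Each trial: N independent noisy score estimates of the same model.
    estimates = rng.normal(true_score, noise_sd, size=(n_trials, n_variants))
    best = estimates.max(axis=1)
    print(f"N={n_variants:2d}  mean reported score {best.mean():7.1f}  "
          f"inflation {best.mean() - true_score:+6.1f}")
```

Under a normal noise model the expected maximum of 10 draws sits roughly 1.5 standard deviations above the mean, so the inflation grows with both the number of private variants and the noise in individual score estimates.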
- Data Access Asymmetries:
- Finding: Proprietary model providers receive a significantly larger share of data (prompts and battle outcomes) from Chatbot Arena than open-weight and open-source providers. This stems from a combination of factors: the number of private variants tested, unequal sampling rates, deprecation policies, and API-based serving that lets these providers log 100% of the prompts sent to their models, versus the roughly 20% of data shared with others under Arena policy. A back-of-envelope illustration of how these factors compound is sketched after this finding.
- Evidence: OpenAI, Google, Meta, and Anthropic collectively received an estimated 62.8% of the total Arena data, approximately 68 times more than major academic and non-profit labs combined (\Cref{fig:public_private_data}). Sampling rates vary drastically; Google and OpenAI models had maximum daily sampling rates up to 34%, much higher than others (\Cref{fig-private-testing-max}).
- Implication: This creates a substantial data advantage for a few large, proprietary labs, enabling them to better understand and potentially optimize for the Arena's specific data distribution. Given Chatbot Arena is a community-driven platform relying on free user feedback, this asymmetry benefits commercial entities disproportionately.
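For intuition on how the factors above multiply, the following back-of-envelope sketch uses entirely hypothetical numbers (battle volume, sampling rates, variant counts, and access fractions are not figures from the paper) to show how they compound into a large asymmetry.

```python
# Back-of-envelope compounding of data asymmetries.
# All inputs below are hypothetical, not figures from the paper.
DAILY_BATTLES = 10_000  # assumed total Arena battles per day

def prompts_seen_per_day(sampling_rate, n_variants, access_fraction):
    """Prompts a provider collects per day under assumed conditions."""
    return DAILY_BATTLES * sampling_rate * n_variants * access_fraction

big_lab = prompts_seen_per_day(sampling_rate=0.10, n_variants=5, access_fraction=1.0)
small_lab = prompts_seen_per_day(sampling_rate=0.02, n_variants=1, access_fraction=0.20)
print(f"big lab: {big_lab:.0f}/day, small lab: {small_lab:.0f}/day, "
      f"ratio: {big_lab / small_lab:.0f}x")
```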
- Risk of Overfitting to Arena Data:
- Finding: Access to Chatbot Arena data can lead to significant performance gains specifically on the Arena distribution, suggesting a risk of overfitting. While the Arena's dynamic nature might seem resistant to overfitting, the data includes both long-term shifts in prompt distribution (\Cref{fig-language-distribution}) and notable levels of prompt duplication and near-duplication over time (\Cref{fig:duplicate_prompt}, \Cref{app:prompt-duplication-headmap}); a simple near-duplicate check is sketched after this finding.
- Evidence: The authors' fine-tuning experiments showed that increasing the proportion of Arena data in a supervised fine-tuning mix significantly improved win-rates on ArenaHard (a benchmark highly correlated with Chatbot Arena outcomes) by up to 112% relative to a model trained without Arena data (\Cref{fig:overfit-sft-exp}). However, these gains did not generalize to an out-of-distribution benchmark like MMLU (\Cref{tab:overfitting_mmlu}), indicating that the improvement is specific to the Arena distribution.
- Implication: Model providers with extensive access to Arena data can potentially fine-tune their models to excel on the Arena distribution, gaining a competitive edge on the leaderboard without necessarily improving general model capabilities. This incentivizes optimizing for the leaderboard metric rather than real-world performance.
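The duplication claim above can be probed with a simple near-duplicate detector; the character n-gram Jaccard similarity and the 0.8 threshold below are illustrative assumptions, not the paper's actual deduplication method.

```python
# Simple near-duplicate prompt detection via character n-gram Jaccard
# similarity (illustrative approach; not the paper's exact method).
def char_ngrams(text, n=5):
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0

prompts = [
    "Write a python function to reverse a string.",
    "Write a Python function to reverse a string",   # near-duplicate
    "Explain the Bradley-Terry model in one paragraph.",
]
grams = [char_ngrams(p) for p in prompts]
for i in range(len(prompts)):
    for j in range(i + 1, len(prompts)):
        sim = jaccard(grams[i], grams[j])
        if sim > 0.8:  # assumed near-duplicate threshold
            print(f"near-duplicate pair ({i}, {j}): similarity {sim:.2f}")
```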
- Impact of Model Deprecation on Ranking Reliability:
- Finding: Model deprecation, particularly silent deprecation (reducing a model's sampling rate to near zero without notification), is widespread on Chatbot Arena, affecting 205 of the 243 public models during the study period, far more than the 47 officially listed as deprecated (\Cref{fig-silent-deprecated}). Open-weight and open-source models are disproportionately affected by deprecation (\Cref{fig-silent-deprecated-license-cat}, \Cref{fig-silent-deprecated_prop_open}).
- Mechanism: Deprecation under a changing task distribution violates the BT model's assumption of constant evaluation conditions. Models evaluated mostly on older task distributions may have rankings that do not reflect performance on current tasks (\Cref{sec:dist-shift}). Excessive or uneven deprecation can also leave the comparison graph sparse or disconnected, violating the BT assumption of connectivity; this makes global rankings unreliable and leaves relative strengths between models in disconnected clusters impossible to estimate (\Cref{sparse_battle_history}, \Cref{fig:graph-connectivity}). A minimal connectivity check is sketched after this finding.
- Implication: The high rate of model deprecation, especially silent and uneven deprecation, undermines the statistical reliability of the BT rankings on Chatbot Arena. Models, particularly open models, may have unreliable or uninterpretable scores if their comparison history is sparse, outdated, or isolated.
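The connectivity assumption referenced above can be checked directly on the battle graph. The sketch below uses networkx with invented model names and battle records purely for illustration.

```python
# Checking whether the model-comparison graph is connected, which the
# Bradley-Terry model needs for a well-identified global ranking
# (hypothetical battle records; illustrative only).
import networkx as nx

# Hypothetical battles after deprecations: pairs of models that were compared.
battles = [("gpt-x", "gemini-y"), ("gemini-y", "llama-z"),
           ("old-model-1", "old-model-2")]  # isolated cluster of deprecated models

graph = nx.Graph()
graph.add_edges_from(battles)

components = list(nx.connected_components(graph))
print(f"connected: {nx.is_connected(graph)}")        # False in this example
print(f"{len(components)} components: {components}")
# Scores across disconnected components are not comparable under BT.
```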
Recommendations for Improvement:
The authors propose several actionable recommendations to restore fairness, transparency, and trust in Chatbot Arena:
- Prohibit Score Retraction: All tested model variants, including private ones, should have their scores permanently published upon submission.
- Establish Transparent Limits on Private Models: Enforce a strict, publicly disclosed limit on the number of private variants a provider can test concurrently (e.g., maximum 3 variants per provider). This should apply equally to all providers.
- Ensure Fair Deprecation: Implement clear, auditable criteria for model removal. A stratified approach that deprecates models proportionally across proprietary, open-weight, and open-source categories (e.g., removing the bottom 30% in each group after convergence) could maintain balance and connectivity.
- Implement Fair Sampling: Adopt an active sampling strategy, potentially the one previously proposed by the Arena organizers [chiang2024chatbot], that prioritizes under-evaluated pairs to reduce uncertainty in rankings rather than favoring specific providers; a toy version of such a strategy is sketched after this list.
- Provide Public Transparency: Publicly disclose information about all tested models (including private aliases once testing concludes), deprecated models (both official and silent), and detailed sampling rates over time.
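To illustrate the fair-sampling recommendation, here is a toy heuristic that routes the next battle to the model pair whose win-rate estimate is most uncertain; the smoothed binomial standard error is an assumed criterion, not the specific algorithm of [chiang2024chatbot] or the Arena's sampler.

```python
# Toy active-sampling heuristic: pick the model pair whose estimated
# win rate is most uncertain (illustrative; not the algorithm proposed
# in chiang2024chatbot or used by Chatbot Arena).
import math

# Hypothetical battle counts per pair: (wins_a, wins_b).
counts = {
    ("model-a", "model-b"): (120, 95),
    ("model-a", "model-c"): (3, 4),     # barely evaluated pair
    ("model-b", "model-c"): (50, 60),
}

def win_rate_std_error(wins_a, wins_b):
    """Standard error of the estimated win rate under a binomial model."""
    n = wins_a + wins_b
    p = (wins_a + 1) / (n + 2)          # smoothed win-rate estimate
    return math.sqrt(p * (1 - p) / n)

next_pair = max(counts, key=lambda pair: win_rate_std_error(*counts[pair]))
print(f"sample next: {next_pair}")       # the under-evaluated pair wins
```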
Limitations:
The authors acknowledge limitations, including lack of access to Chatbot Arena's full raw data (which means issues such as adversarial voting could go undetected), the limited duration of their scraping window (January to March 2025), potential underestimation of overfitting effects due to limited data access, and reliance on model self-identification for attributing private models.
In conclusion, the paper argues that while Chatbot Arena is a valuable community resource, current policies and practices have created a distorted playing field that favors a few large providers, incentivizes overfitting to the benchmark, and compromises the reliability of the rankings. The authors call for urgent reforms centered on transparency and fairness to restore scientific integrity to the leaderboard.