Leaderboard Illusion: Benchmarking Bias in AI

Updated 2 July 2025
  • Leaderboard Illusion is the phenomenon where leaderboard evaluations in AI are distorted by design flaws, selective testing, and data asymmetries.
  • It demonstrates that undisclosed private testing and preferential sampling among proprietary models can artificially boost leaderboard scores, misrepresenting genuine progress.
  • Reforms such as full result transparency, caps on private submissions, and unbiased, active sampling are essential to ensure fair and accurate benchmarking.

The leaderboard illusion describes a fundamental tension at the core of measurement and benchmarking within empirical sciences, particularly in AI and machine learning. While leaderboards are intended to provide objective indicators of progress, systemic vulnerabilities and design flaws can systematically distort what is actually being measured, leading to misperceptions about comparative capability, merit, and scientific advancement.

1. Systematic Biases in Leaderboard Design and Operation

Leaderboards such as Chatbot Arena have become central to comparative evaluation in AI, ranking state-of-the-art (SOTA) LLMs through large-scale, anonymized paired comparisons. However, the paper identifies a cluster of design and operational flaws that generate distorted outcomes:

  • Undisclosed Private Testing: Individual providers (notably Meta, Google, OpenAI, Amazon) privately submit and evaluate numerous pre-release variants of their models on the leaderboard before public announcement, reporting only preferred results and retracting the rest.
  • Score Retraction and Selective Disclosure: Underperforming test scores or entire model variants can be silently withdrawn, biasing the ranking process toward the “winner” among many private attempts.
  • Preferential Sampling and Data Access: Proprietary models are sampled at higher frequencies (i.e., appear more often in head-to-head battles), and have more versions simultaneously hosted, while open-weight and open-source models receive much less exposure.
  • Model Deprecation Policies: Silent and unannounced removal (“deprecation”) of open models further compounds the disadvantage, as comparative connections in the ranking system are lost.
  • Overfitting to Arena-Specific Distribution: The combination of these practices incentivizes tuning models to the evaluation peculiarities of the leaderboard, not to broad, general capability.
  • Statistical Invalidity: The Bradley-Terry model underpinning Arena scores assumes unbiased, connected sampling of comparisons—an assumption violated under selective test and deprecation policies.

These dynamics result in a leaderboard that disproportionately signals the superiority of models from a few dominant providers, rather than providing a fair or general measure of scientific progress.

2. Private Testing and the Bias of Selective Disclosure

A central mechanism of distortion is the undisclosed private testing and selective reporting of models:

  • Practical Example: Before the public release of Llama 4, Meta tested twenty-seven private model variants on the Arena; Google similarly tested ten in the lead-up to its Gemma 3 launch.
  • Simulation Results: If a provider submits N private variants and releases only the highest-performing one, Arena scores are systematically inflated. For example, testing ten identical variants and selecting the maximum yields a mean Arena gain of approximately 100 points over a single submission, as shown in Figure 5 of the paper (a simulation sketch follows at the end of this subsection).
  • Empirical Demonstration: Two identical Aya-Vision-8B submissions entered as separate “private” models received Arena scores of 1069 and 1052, showing that scores for the same model fluctuate across submissions, so disclosing only the best draw inflates the reported rating.
  • Mathematical Statement: Let \hat{\beta}_1, \dots, \hat{\beta}_N be independently estimated skill values; then the expectation of the best (maximum) estimate exceeds the expectation of any individual one:

\mathbb{E}[\hat{\beta}_{\mathrm{Best}}] > \mathbb{E}[\hat{\beta}_k], \qquad \forall\, k = 1, \dots, N

This artificial boosting advantage is inaccessible to open models, which typically submit only one variant, thereby systematically biasing the rankings.
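To make the order-statistics effect concrete, here is a minimal Monte Carlo sketch (an illustration, not code from the paper) in which a provider privately scores N identical variants of the same model and reports only the best one. The true score, noise scale, and units are assumed for illustration; the size of the inflation depends on those assumptions, but its direction does not.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_SCORE = 1200.0   # assumed underlying Arena-style rating of the model
NOISE_SD = 40.0       # assumed spread of score estimates from a finite number of battles
N_VARIANTS = 10       # identical private variants submitted
N_TRIALS = 20_000     # Monte Carlo repetitions

# Each trial draws N noisy score estimates for the *same* underlying model,
# then compares an honest single submission with reporting only the maximum.
estimates = rng.normal(TRUE_SCORE, NOISE_SD, size=(N_TRIALS, N_VARIANTS))
single_submission = estimates[:, 0]   # one disclosed entry
best_of_n = estimates.max(axis=1)     # selectively disclosed "winner"

print(f"mean single submission: {single_submission.mean():.1f}")
print(f"mean best of {N_VARIANTS}:        {best_of_n.mean():.1f}")
print(f"average inflation:      {best_of_n.mean() - single_submission.mean():.1f} points")
```

Every variant has exactly the same true skill, yet the reported maximum is systematically higher; under the paper's own simulation settings, the gain for ten variants is roughly 100 points.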

3. Data Access Asymmetries and Sampling Imbalance

Data exposure is also highly asymmetric:

  • Arena Exposure: Providers such as OpenAI and Google each receive 19–20% of all Arena prompt data; combined, four proprietary vendors receive over 62% of the entire exposure, whereas 41 open-source models share just 8.8%.
  • Sample Frequency: Proprietary models reach sampling rates of up to 34% (frequent pairings), while open-weight models are capped at 3.3% in observed data.
  • Deprecation: 87.8% of open-weight and 89% of open-source models are eventually deprecated (withdrawn from active evaluation), compared with 80% of proprietary models, with no public record of removal.
  • Community vs. Proprietary Data: Proprietary operators collect all user prompts and battle outcomes involving their models (critical for tuning), whereas open models have access to only a capped 20% of such data.

This structural disparity in data access means that proprietary models both participate in and optimize for the evaluation distribution at a much greater scale.
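As a back-of-the-envelope illustration of how per-battle sampling rates compound into data-access gaps, the sketch below converts assumed (not measured) sampling rates into the number of user prompts each model would observe over a fixed battle volume.

```python
# Illustrative sampling rates (made-up values in the spirit of the reported
# 34% vs. 3.3% gap, not the paper's measured figures).
sampling_rate = {
    "proprietary_model_a": 0.34,
    "proprietary_model_b": 0.20,
    "open_weight_model_c": 0.033,
    "open_source_model_d": 0.020,
}

TOTAL_BATTLES = 1_000_000  # assumed evaluation volume

for name, rate in sampling_rate.items():
    prompts_seen = rate * TOTAL_BATTLES
    print(f"{name:22s} ~{prompts_seen:>9,.0f} prompts ({rate:.1%} of battles)")
```

At these rates, the first proprietary model observes roughly ten times the prompt volume of the open-weight model, which is the asymmetry this section describes.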

4. Impact on Performance: Overfitting to Leaderboard Dynamics

The systematic advantage in data and test access drives leaderboard performance, potentially at the expense of true model quality:

  • Arena-Specific Gains: Training on a data mix containing up to 70% Arena data raises the Arena-style win rate from 23.5% to 49.9% (a relative gain of roughly 112%), while yielding negligible improvement, and even small declines, on independent academic benchmarks such as MMLU.
  • Prompt Repetition: Roughly 7–9% of battle prompts recur in the Arena dataset, often with only trivial reformulations, so privileged data access enables near-direct memorization and overfitting.
  • Distribution Shift: As user behavior and prompt language drift in the Arena (e.g., increased Russian or Chinese prompts), providers with full access adapt more quickly, further enhancing in-arena performance relative to general functionality.

The result is optimization for Arena-specific win conditions rather than genuine improvements in broad linguistic, reasoning, or generalization capability—a defining characteristic of the leaderboard illusion.
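The duplication figure suggests a simple check one could run on a battle log. The sketch below estimates a near-duplicate rate using crude text normalization; the prompts are hypothetical, and the paper's actual deduplication methodology may differ (e.g., it could rely on semantic similarity rather than string matching).

```python
import re
from collections import Counter

def normalize(prompt: str) -> str:
    """Crude canonical form: lowercase, drop punctuation, collapse whitespace."""
    prompt = re.sub(r"[^\w\s]", "", prompt.lower())
    return re.sub(r"\s+", " ", prompt).strip()

# Toy battle log (hypothetical prompts, for illustration only).
prompts = [
    "Write a haiku about the sea.",
    "write a haiku about the sea",
    "Explain quicksort in simple terms.",
    "Write a haiku about   the sea!",
    "Translate 'good morning' into French.",
]

counts = Counter(normalize(p) for p in prompts)
duplicates = sum(c - 1 for c in counts.values() if c > 1)
print(f"near-duplicate share: {duplicates / len(prompts):.1%}")  # 40.0% for this toy log
```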

5. Statistical Models and Formulae: Understanding Ranking Bias

Leaderboard rankings in Chatbot Arena are underpinned by the Bradley-Terry model for paired comparisons. The core mechanics:

  • Win Probability:

P(\text{model } i \text{ beats model } j) = \frac{1}{1 + e^{\beta_j - \beta_i}}

where \beta_i denotes the estimated skill of model i.

  • Estimation:

\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^m} \frac{1}{n} \sum_{k=1}^{n} \ell\big( \sigma(X^\top \beta)_k,\; Y_k \big)

where \sigma is the logistic function and \ell is the binary cross-entropy loss.

  • Conversion to Arena Score:

R_m = \text{scale} \times \hat{\beta}_m + \text{initial rating}

  • Sampling Bias: Active (variance-reducing) sampling is recommended by the platform's own methodology, but in practice sampling weights have instead favored particular providers.

Mathematically, the “best-of-N” effect (order statistics) described above means that repeated, undisclosed private testing and selective reporting result in a systematic bias in skill estimates, even absent any genuine model advance.
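A minimal sketch of this estimation pipeline, assuming a toy battle log, a winner-first encoding, and illustrative scale constants (the real Arena pipeline adds tie handling, weighting, and confidence intervals):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # the logistic function sigma

# Toy battle log over m = 4 models: (winner_index, loser_index) per battle.
# Match-ups and outcomes are made up for illustration.
m = 4
battles = [(0, 1), (1, 0), (0, 2), (2, 1), (0, 3),
           (3, 2), (1, 3), (0, 2), (2, 0), (1, 2)]

# Design matrix X: one row per battle, +1 for model i, -1 for model j.
# With the winner always listed as model i, every label Y_k equals 1.
X = np.zeros((len(battles), m))
Y = np.ones(len(battles))
for row, (winner, loser) in enumerate(battles):
    X[row, winner] = 1.0
    X[row, loser] = -1.0

def neg_log_likelihood(beta):
    p = expit(X @ beta)  # P(model i beats model j) under Bradley-Terry
    eps = 1e-12          # numerical guard for the logarithm
    return -np.mean(Y * np.log(p + eps) + (1 - Y) * np.log(1 - p + eps))

beta_hat = minimize(neg_log_likelihood, np.zeros(m), method="BFGS").x
beta_hat -= beta_hat.mean()  # skills are identifiable only up to an additive shift

SCALE, INITIAL_RATING = 400.0, 1000.0  # assumed display constants, not the paper's
arena_scores = SCALE * beta_hat + INITIAL_RATING
print(np.round(arena_scores, 1))
```

Because the estimate depends only on which comparisons are observed, any policy controlling who gets compared, how often, and which results are kept directly shapes the estimated skills; that is the statistical route by which the practices above bias the rankings.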

6. Recommendations for Reform

To address these biases, the paper proposes the following:

  1. Publish All Results: Prohibit post-hoc result retraction. Every evaluation—including failed private tests—should be permanently accessible.
  2. Limit Private Variants: Impose a small, transparent cap (e.g., three concurrent variants per provider) on private model evaluation.
  3. Fair Deprecation: Apply model deprecation equitably across proprietary, open-weight, and open-source models, using uniform criteria (e.g., performance percentile) rather than provider preference.
  4. Active, Unbiased Sampling: Adopt active sampling that prioritizes uncertain match-ups instead of sampling weights that favor particular providers (a toy sketch follows at the end of this section).
  5. Full Transparency: Require public logs of all model versions, sampling frequencies, deprecations, and associated battle histories.

These recommendations are positioned as necessary to restore trust in leaderboards as scientific instruments and to meaningfully reflect model merit.
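The fourth recommendation points toward an uncertainty-driven match-maker. The sketch below is a toy heuristic, not the paper's proposal: it prioritizes pairs that are both close in rating and rarely compared, whereas a production system would use a proper variance estimate from the Bradley-Terry fit. All model names and numbers are hypothetical.

```python
import math

# Hypothetical state: current rating estimates and head-to-head battle counts.
ratings = {"model_a": 1210.0, "model_b": 1195.0, "model_c": 1020.0, "model_d": 1018.0}
battle_counts = {
    ("model_a", "model_b"): 900, ("model_a", "model_c"): 50,
    ("model_a", "model_d"): 40,  ("model_b", "model_c"): 45,
    ("model_b", "model_d"): 35,  ("model_c", "model_d"): 5,
}

def pair_priority(pair):
    """Crude uncertainty proxy: near-tied ratings and few past battles both
    raise priority, so under-compared, hard-to-rank pairs are sampled first."""
    a, b = pair
    closeness = 1.0 / (1.0 + abs(ratings[a] - ratings[b]))
    scarcity = 1.0 / math.sqrt(1 + battle_counts[pair])
    return closeness * scarcity

next_pair = max(battle_counts, key=pair_priority)
print("next battle to sample:", next_pair)  # picks the close, rarely compared pair
```

By contrast, a sampling policy weighted toward particular providers maximizes their exposure rather than the information gained per battle.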


Summary Table: Systematic Leaderboard Illusion Dynamics

| Mechanism | Proprietary Labs | Open Models | Effect on Leaderboard |
|---|---|---|---|
| Private/undisclosed submissions | Dozens per release | One per release | Ranking bias toward proprietary labs |
| Results retraction | Permitted | No data | Overstates progress for select models |
| Data exposure per model | 20%+ (each) | 3–8% (shared among many models) | Data access asymmetry |
| Deprecation/retirement | Disproportionately rare | Common | Severed comparisons, loss of transparency |
| Data-driven in-arena overfitting | Yes | Very restricted | Score increases tied to test distribution |
| Gains on public benchmarks | Not correlated with Arena gains | No privileged access | Illusion of general merit |

Conclusion

The leaderboard illusion arises when the ostensible objectivity of benchmarking systems is undermined by structural, technological, and policy choices, as exemplified by Chatbot Arena. The systematic use of undisclosed private testing, selective score reporting, and preferential sampling for select model providers generates an environment in which leaderboard ranking reflects not genuine scientific or engineering merit, but the ability to manipulate access and measurement. This distorts the research community’s perception of progress, shapes external perceptions of AI capability, and risks misallocation of resources.

The paper concludes that only radical transparency—a combination of result publication, sampling fairness, and capped provider privileges—can restore the leaderboard as a reliable proxy for real progress, and prevent the deepening of the leaderboard illusion.