- The paper finds that most large language model benchmarks contain significant label errors and ambiguities that mask model failures, proposing "platinum benchmarks" as a more reliable evaluation method.
- Systematic cleaning of fifteen benchmarks revealed widespread issues like mislabeled solutions and ambiguous questions, with error rates in original datasets sometimes exceeding observed model errors.
- Testing state-of-the-art LLMs on the proposed platinum benchmarks revealed persistent errors on simple tasks and specific reasoning failures like "First Event Bias" and "Rounding-Up Primes".
The paper presents an in-depth investigation into evaluating the reliability of LLMs, scrutinizing existing benchmarks and introducing a framework of “platinum benchmarks.” It rigorously demonstrates that most widely used benchmarks, even those traditionally considered “saturated” (with accuracies in the 90–95% range), contain pervasive label errors and ambiguities. These issues can mask genuine model failures and lead to overly optimistic conclusions about model reliability.
The paper’s approach is twofold:
- Systematic Benchmark Cleaning: The authors manually revise fifteen existing benchmarks, identifying three recurring classes of flawed examples:
  - Mislabeled Solutions: Instances where the provided answer is incorrect, as evidenced by examples from SVAMP and GSM8K.
  - Question Flaws: Questions that contain logical contradictions, ambiguities, or omissions (e.g., missing equations in math prompts or ambiguous coreference statements in Winograd-style questions).
  - Ambiguity and Ill-Posed Problems: Particularly prevalent in open-ended reading comprehension datasets such as SQuAD2.0 and HotPotQA.
A key aspect of the methodology is the use of multiple frontier LLMs to flag inconsistencies: any example that at least one model fails is manually inspected and either re-labeled or removed. Quantitatively, for several benchmarks (e.g., GSM8K, SVAMP, VQA v2.0, and TabFact), the proportion of flawed examples was substantial, sometimes exceeding the models’ observed error rates. In some benchmarks, over 75% of the errors in the original version could be attributed to label noise.
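The flagging step can be pictured with a minimal sketch; `ask_model` and `answers_match` are hypothetical stand-ins for whatever querying and answer-comparison machinery is actually used, and this is not the paper's own code:

```python
# Hypothetical sketch of the flagging step described above; `ask_model` and
# `answers_match` are assumed stand-ins, not functions from the paper.

def flag_for_review(examples, models, ask_model, answers_match):
    """Return the examples that at least one model answers incorrectly.

    Each flagged example is then inspected by hand and either re-labeled,
    kept (the model was genuinely wrong), or removed as ambiguous.
    """
    flagged = []
    for ex in examples:                       # ex: {"question": ..., "label": ...}
        for model in models:                  # several frontier LLMs
            prediction = ask_model(model, ex["question"])
            if not answers_match(prediction, ex["label"]):
                flagged.append(ex)            # disagreement => possible label error
                break                         # one failing model is enough to flag
    return flagged
```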
- Introducing Platinum Benchmarks for Reliability:
The authors propose “platinum benchmarks” as a new paradigm for quantifying LLM reliability. In these benchmarks, every example is carefully curated so that 100% accuracy is attainable if the model is truly reliable. The platinum framework emphasizes that even when models achieve high accuracy on conventional benchmarks, the remaining performance gaps (e.g., the residual 5% error rate) can hide systematic failures on elementary tasks, which are critical for safety-sensitive applications.
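One way to operationalize this framing is to report raw error counts alongside accuracy, since every surviving example is verified to be solvable. A minimal sketch under that assumption, taking aligned `predictions` and `labels` lists as given:

```python
# Minimal sketch: on a platinum benchmark, every residual error is treated as a
# genuine model failure, so the error count itself is worth reporting and inspecting.

def platinum_report(predictions, labels):
    errors = sum(p != g for p, g in zip(predictions, labels))
    return {
        "errors": errors,                     # each one merits manual inspection
        "error_rate": errors / len(labels),
        "perfectly_reliable": errors == 0,
    }
```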
The experiments reveal that:
- Even state-of-the-art models, both proprietary (e.g., GPT-4 variants, Claude 3.5 models) and open-weight, exhibit non-zero error rates on platinum benchmarks: below 1% on simple arithmetic benchmarks (e.g., SingleOp, SingleEq), but still 2–7% or more on tasks such as high school math or commonsense reasoning.
- More capable models also tend to be more reliable; however, reliability is task dependent. For example, the o1 series and DeepSeek-R1 perform perfectly on some math benchmarks but struggle with coreference tasks in Winograd-style questions.
Beyond aggregate metrics, the paper identifies two notable reasoning failure patterns through a detailed analysis of chain-of-thought outputs:
First Event Bias: When models are asked to determine the chronological order of two events, framed as “what happened second: X or Y”, several models (including Gemini 1.5 Flash, Gemini 1.5 Pro, and Mistral Small) erroneously favor the first event. On synthesized evaluations, these models err on more than 85% of such examples, despite the simplicity of the underlying reasoning.
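A hedged sketch of how such a synthetic probe could be built; the event pairs and prompt template below are invented for illustration and are not the paper's actual items:

```python
# Illustrative probe for "First Event Bias"; the event pairs and prompt template
# are invented for this sketch. The correct answer is always the later event,
# while a biased model tends to answer with the earlier ("first") one.

def make_order_probes(event_pairs):
    """event_pairs: list of (earlier_event, later_event) strings."""
    probes = []
    for earlier, later in event_pairs:
        probes.append({
            "prompt": (f"{earlier} happened before {later}. "
                       f"What happened second: {earlier} or {later}?"),
            "answer": later,
        })
    return probes

# Example with invented events:
probes = make_order_probes([("the rehearsal", "the concert"),
                            ("the storm", "the cleanup")])
```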
Rounding-Up Primes: In elementary arithmetic problems involving division, some models (notably Claude 3.5 Sonnet) incorrectly round up whole-number quotients. The error is especially noticeable when the correct quotient has a prime-like structure (i.e., few non-trivial divisors). Procedural experiments show that, for quotients near prime numbers, such rounding errors occur in approximately 20% of cases.
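A similarly illustrative generator for the division case constructs problems whose exact quotient is prime, so that any rounding up at all is an error; the word-problem template and numbers are assumptions, not the paper's test items:

```python
# Illustrative generator for the division case: the quotient is a prime and the
# division is exact, so any rounding up at all is an error. The word-problem
# template and the numbers are assumptions, not the paper's test items.

def prime_quotient_problems(prime_quotients, divisors):
    problems = []
    for q in prime_quotients:                 # intended whole-number quotient
        for d in divisors:
            problems.append({
                "prompt": (f"A box holds {d} items. How many boxes are needed "
                           f"for {q * d} items?"),
                "answer": q,                  # exact division, no remainder
            })
    return problems

problems = prime_quotient_problems(prime_quotients=[7, 11, 13], divisors=[4, 6, 8])
```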
The paper makes an important distinction between overall capability and deployable reliability. It underscores that although LLMs can solve graduate-level problems (as seen in more advanced benchmarks like GPQA), they continue to slip on elementary tasks—a discrepancy that has significant implications for real-world, safety-critical applications.
The discussion also draws parallels to traditional site reliability engineering, suggesting that LLMs could benefit from reliability metrics analogous to “nines of uptime” (a rough mapping from error rates to “nines” is sketched after the list below). Two key limitations are acknowledged:
- Domain Coverage: The current platinum benchmarks do not include other critical domains such as code generation or tool use.
- Sample Size and Residual Ambiguities: Some revised benchmarks include only a limited number of examples, and there is a possibility that certain subtle ambiguities persist even after rigorous manual review.
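For the “nines” analogy referenced above, a rough mapping from a platinum-benchmark error rate to its number of “nines” can be written directly; this uses the standard site-reliability convention and is not a metric defined in the paper:

```python
import math

# Rough illustration of the "nines of uptime" analogy: map an error rate to its
# number of "nines" (99% -> 2 nines, 99.9% -> 3 nines). This is standard
# site-reliability convention, not a metric defined in the paper.

def nines_of_reliability(error_rate):
    if error_rate <= 0:
        return float("inf")                   # no observed errors on this benchmark
    return -math.log10(error_rate)

print(nines_of_reliability(0.05))             # ~1.3 nines for a "saturated" 95% benchmark
print(nines_of_reliability(0.001))            # 3.0 nines
```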
In summary, the paper argues that current benchmark saturation should not be conflated with reliability. Instead, testing LLMs on platinum benchmarks—designed to eliminate label noise and question ambiguity—reveals persistent failures in simple yet critical tasks. This work therefore lays a foundation for more rigorous evaluation practices that aim to bridge the gap between raw model capability and true, deployable reliability.