Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It

Published 30 Jun 2025 in cs.CL | (2506.23864v1)

Abstract: We conduct a systematic audit of three widely used reasoning benchmarks, SocialIQa, FauxPas-EAI, and ToMi, and uncover pervasive flaws in both benchmark items and evaluation methodology. Using five LLMs (GPT-{3, 3.5, 4, o1}, and LLaMA 3.1) as diagnostic tools, we identify structural, semantic, and pragmatic issues in benchmark design (e.g., duplicated items, ambiguous wording, and implausible answers), as well as scoring procedures that prioritize output form over reasoning process. Through systematic human annotation and re-evaluation on cleaned benchmark subsets, we find that model scores often improve not due to due to erratic surface wording variations and not to improved reasoning. Infact, further analyses show that model performance is highly sensitive to minor input variations such as context availability and phrasing, revealing that high scores may reflect alignment with format-specific cues rather than consistent inference based on the input. These findings challenge the validity of current benchmark-based claims about reasoning in LLMs, and highlight the need for evaluation protocols that assess reasoning as a process of drawing inference from available information, rather than as static output selection. We release audited data and evaluation tools to support more interpretable and diagnostic assessments of model reasoning.

Abstract PDF Upgrade to Chat

Authors (4)

Summary

The paper shows that benchmark scores are unreliable because they reflect superficial lexical patterns instead of genuine reasoning.
It employs human annotation on SocialIQa, FauxPas-EAI, and ToMi benchmarks to identify structural, semantic, and pragmatic flaws.
It recommends shifting to dynamic, process-oriented evaluation methods, such as structured rationale assessment and LLM-as-a-judge.

Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It

This essay presents an extensive audit of reasoning benchmarks used to evaluate LLMs, highlighting pervasive flaws in their design and evaluation methods. The authors argue that current benchmark scores often reflect superficial lexical alignment rather than genuine reasoning capabilities.

Benchmark Analysis

The paper investigates SocialIQa, FauxPas-EAI, and ToMi benchmarks using five LLMs—GPT-3, GPT-3.5, GPT-4, GPT-o1, and LLaMA 3.1—serving as diagnostic tools to identify errors in benchmark data and methodology. Their analysis revealed that inconsistencies in these benchmarks result from duplicated items, ambiguous wording, implausible answers, and methodologies prioritizing form over substance. By conducting a systematic audit with human annotation, the authors reveal critical limitations in scoring metrics that impact reported performance.

Figure 1: Breakdown of total errors for each model on SocialIQA dev set, demonstrating a significant portion of errors due to benchmark flaws.

Dataset- and Evaluation-Level Issues

The audit categorized errors into structural, semantic, and pragmatic flaws within SocialIQa and ToMi datasets. Structural flaws included duplicate or malformed items that hindered the validity of reasoning evaluation, while semantic flaws involved ambiguous or illogical question and answer pairs. Pragmatic flaws highlighted unrealistic scenarios or culturally biased assumptions, impacting the interpretation of model performance.

In ToMi, flaws were also related to context fragmentation, semantics-agnostic scoring, and granularity mismatches in model responses. This distorts evaluation by failing to provide coherent context during multi-step inference tasks, penalizing semantically valid answers due to lexical variations, and overlooking necessary detail in responses.

Figure 2: Breakdown of total errors for each model on ToMi, illustrating where reasoning errors and benchmark flaws impact model scores.

Implications of High Scores

Upon filtering out flawed data and conducting context-aware evaluations, models showed improved performance but demonstrated instability when subjected to surface-level variations. High scores were often a result of alignment with benchmark artifacts rather than inference-based reasoning. The performance fluctuations during semantically equivalent rephrasing trials indicated sensitivity to specific lexical formats and not robust reasoning.

Figure 3: Performance on SocialIQa under semantically equivalent rephrasings, revealing sensitivity to surface-level variation.

Recommendations for Future Evaluation

The findings advocate for a shift from static output metrics to evaluations grounded in reasoning processes. Suggested alternatives include structured rationale evaluation, contradiction detection, counterfactual testing, and interactive diagnostic formats. These approaches emphasize reasoning as a dynamic and explainable process, crucial for accurately assessing model competence.

Use of LLM-as-a-Judge

The paper briefly explores using LLMs as evaluators for meaning-aware scoring as a scalable alternative to semantics-agnostic metrics. The experiment showed that LLMs can provide reasonable evaluations of model predictions, suggesting a potential role for LLMs in diagnostic evaluations alongside human judgments.

Conclusion

The audit reveals fundamental shortcomings in current reasoning benchmarks and highlights the need for more process-oriented evaluation methods that assess inference from context rather than lexical alignment. These findings underscore the importance of transitioning from static benchmarks to dynamic evaluative approaches to better reflect the reasoning and inferential capabilities of LLMs.

By acknowledging the limitations and biases intrinsic to existing benchmarks, this paper suggests imperative reforms toward developing more reliable and contextually grounded evaluation protocols for future LLM research and implementation.

Markdown Report Issue