Does SWE-Bench-Verified Test Agent Ability or Model Memory? (2512.10218v1)

Published 11 Dec 2025 in cs.SE

Abstract: SWE-Bench-Verified, a dataset comprising 500 issues, serves as a de facto benchmark for evaluating various LLMs on their ability to resolve GitHub issues. But this benchmark may overlap with model training data. If that is true, scores may reflect training recall, not issue-solving skill. To study this, we test two Claude models that frequently appear in top-performing agents submitted to the benchmark. We ask them to find relevant files using only issue text, and then issue text plus file paths. We then run the same setup on BeetleBox and SWE-rebench. Despite both benchmarks involving popular open-source Python projects, models performed 3 times better on SWE-Bench-Verified. They were also 6 times better at finding edited files, without any additional context about the projects themselves. This gap suggests the models may have seen many SWE-Bench-Verified tasks during training. As a result, scores on this benchmark may not reflect an agent's ability to handle real software issues, yet it continues to be used in ways that can misrepresent progress and lead to choices that favour agents that use certain models over strong agent design. Our setup tests the localization step with minimal context to the extent that the task should be logically impossible to solve. Our results show the risk of relying on older popular benchmarks and support the shift toward newer datasets built with contamination in mind.

Summary

  • The paper argues that high performance on SWE-Bench-Verified is largely attributable to pretraining data overlap rather than genuine patch synthesis or reasoning ability.
  • The methodology isolates a minimally contextualized file localization task, showing stark accuracy differences: around 65% on SWE-Bench-Verified versus 12–19% on decontaminated datasets.
  • These results imply that current leaderboard metrics may misrepresent true agentic skill, stressing the need for continuously refreshed, contamination-free benchmarks.

Assessing SWE-Bench-Verified: Disentangling Agentic Skill from Model Memorization

Introduction

SWE-Bench-Verified is the dominant benchmark for comparative evaluation of LLM-enabled software agents on real-world GitHub issue resolution. However, its construction from popular open-source repositories—many of which predate LLM training cutoffs—raises the risk that high benchmark scores reflect training-set recall rather than genuine patch synthesis or reasoning ability. This paper conducts a focused forensic examination of SWE-Bench-Verified, probing the extent to which contemporary LLMs—specifically Claude 3.5 Sonnet and Claude 3.7 Sonnet—exhibit recall-based artifacts rather than agentic skill when tasked with minimally contextualized localization on this benchmark. Empirical results are contrasted with performance on BeetleBox and SWE-rebench, two alternative datasets drawn from similar project sources but intentionally curated to limit contamination.

Methodology

The analysis isolates the localization subtask: given only an issue description (with or without the file structure tree, but never file content or additional context), the LLM must identify which files are implicated in a fix. This setup sharply restricts the information available to the model, making successful localization via semantic reasoning essentially infeasible on a novel project or issue. If the LLMs consistently predict the ground-truth files from this deliberately minimal context, that serves as strong evidence of benchmark exposure during training, whether direct or via near-duplicate content.
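To make the probe concrete, the following minimal sketch shows how such a localization query could be assembled and its output parsed. The function names, prompt wording, and parsing heuristic are illustrative assumptions, not the authors' exact harness.

```python
# Minimal sketch of the issue-only vs. issue + file-tree localization probe.
# query_model and the prompt wording are hypothetical stand-ins, not the
# paper's actual evaluation harness.
import re

def build_prompt(issue_text: str, file_tree: str | None = None) -> str:
    """Assemble the minimal-context prompt: issue text alone, or issue text
    plus the repository's file-path listing (never file contents)."""
    prompt = (
        "You are given a GitHub issue. List the repository file paths that "
        "would need to be edited to fix it, one path per line.\n\n"
        f"Issue:\n{issue_text}\n"
    )
    if file_tree is not None:
        prompt += f"\nRepository file structure:\n{file_tree}\n"
    return prompt

def parse_predicted_paths(response: str) -> set[str]:
    """Extract Python file paths from the model's free-form answer
    (a crude path-shaped-token heuristic for illustration)."""
    return set(re.findall(r"[\w./-]+\.py", response))

# Usage with a hypothetical LLM client:
#   response = query_model(build_prompt(issue_text))             # issue-only
#   response = query_model(build_prompt(issue_text, file_tree))  # issue + tree
#   predicted = parse_predicted_paths(response)
```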

The Claude Sonnet models were evaluated on:

  • All 500 SWE-Bench-Verified issues.
  • 500 manually de-duplicated, high-quality issues from BeetleBox (50 unique issues each from five non-SWE-Bench Python repositories).
  • Both the January and September 2025 splits of SWE-rebench.

The study reports on two localization metrics per dataset and input setting: (1) percentage of issues with all ground truth files identified, and (2) percentage with at least one ground truth file identified.
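Both metrics reduce to simple set comparisons between predicted and ground-truth file paths. The sketch below illustrates one way to compute them; the data layout (dicts keyed by issue id) is an assumption for illustration, not the paper's implementation.

```python
# Sketch of the two localization metrics (names and data layout are illustrative).
def localization_metrics(predictions, ground_truth):
    """predictions / ground_truth: dicts mapping issue id -> set of file paths.

    Returns the percentage of issues for which ALL ground-truth files were
    predicted, and the percentage for which AT LEAST ONE was predicted."""
    n = len(ground_truth)
    all_found = sum(
        1 for issue, gt in ground_truth.items()
        if gt <= predictions.get(issue, set())   # every ground-truth file predicted
    )
    any_found = sum(
        1 for issue, gt in ground_truth.items()
        if gt & predictions.get(issue, set())    # non-empty intersection
    )
    return 100 * all_found / n, 100 * any_found / n

# Example:
#   gt   = {"astropy-123": {"astropy/io/fits/card.py"}}
#   pred = {"astropy-123": {"astropy/io/fits/card.py", "setup.py"}}
#   localization_metrics(pred, gt)  ->  (100.0, 100.0)
```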

Results and Analysis

Both Claude models exhibit drastically higher accuracy on SWE-Bench-Verified than on BeetleBox or SWE-rebench, even under severely information-throttled conditions. With issue-only input, Claude 3.5 Sonnet identifies all ground-truth files for 65% of SWE-Bench-Verified issues and Claude 3.7 Sonnet for 63.2%, compared with only 12–19% on the other datasets. On the slightly less stringent at-least-one-file metric, accuracy remains higher by a similar multiplicative factor. When the input comprises the issue text plus the file structure, the gap persists: SWE-Bench-Verified scores are approximately 4x those of BeetleBox for full localization and 2x those of SWE-rebench.

Notably, these disparities persist even though all of the evaluation datasets mine issues from popular Python repositories with similar public exposure. Given such impoverished inputs, semantic cues alone should not allow any model to identify the edited files at such high rates on genuinely unseen projects. Moreover, the performance gap widens as context is reduced: the less information given, the starker the contrast, which further strengthens the case for memorization.

The results have key implications:

  • High performance on SWE-Bench-Verified likely reflects pretraining data overlap, not general or transferable bug localization capability.
  • Since BeetleBox and SWE-rebench draw on repositories of comparable public visibility yet do not yield elevated scores, the effect cannot be ascribed merely to "open-source project bias."
  • The risk is acute for leaderboards and evaluation pipelines that claim to reflect progress in agent capabilities: agents leveraging high-performing LLMs may benefit primarily from pretrained exposure to benchmark artifacts rather than agent design or autonomy.
  • These findings align with prior work revealing significant contamination in SWE-Bench (Aleithan et al., 9 Oct 2024, Zhou et al., 10 Feb 2025), highlighting the necessity for decontaminated and continuously refreshed benchmarks such as SWE-rebench (Badertdinov et al., 26 May 2025).

Relationship to Prior Work

Previous studies have identified evidence of direct data contamination in SWE-Bench, showing that up to 94% of its issues pre-date LLM training cutoffs and demonstrating performance collapse on post-cutoff datasets (Aleithan et al., 9 Oct 2024). Automated de-duplication analyses uncovered nontrivial leakage rates in SWE-Bench-Verified (10.6% for StarCoder) (Zhou et al., 10 Feb 2025). This work extends the argument: rather than simply comparing benchmarks, it directly interrogates the models' abilities on an information-impoverished task that should be logically intractable absent dataset memorization, substantiating stronger claims of memorization-driven evaluation artifacts on SWE-Bench-Verified.

Further, the present results cohere with recently surfaced evidence from independent teams, e.g., high file-path prediction accuracy in issue-only settings on SWE-Bench-Verified but not on held-out repositories (Liang et al., 14 Jun 2025). Other deficiencies—such as weak test sets (Aleithan et al., 9 Oct 2024, Wang et al., 19 Mar 2025) and lack of cross-language coverage (Yang et al., 4 Oct 2024, Zan et al., 26 Aug 2024, Zan et al., 3 Apr 2025)—compound the unreliability of SWE-Bench-Verified as a progress metric for robust agentic systems.

Implications and Future Directions

The findings demonstrate that benchmarking on SWE-Bench-Verified does not adequately distinguish LLMs with genuine reasoning or agentic skill from those with strong corpus recall. Progress claims that rest on agent evaluation therefore require benchmarks insulated from pretraining contamination, most effectively through ongoing, automated harvesting of high-quality, post-cutoff issues (as in SWE-rebench (Badertdinov et al., 26 May 2025)).

Practically, relying on contaminated benchmarks may mislead research prioritization, encouraging optimization for short-term leaderboard gains rather than architectural or methodological generalization. Future work should emphasize robust contamination detection, post-hoc decontamination, and continuous benchmark evolution; a minimal example of one such overlap check is sketched below. Cross-language, multimodal, and context-agnostic settings should become the standard for evaluating true agentic proficiency.
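As an illustration of the kind of lightweight check such pipelines could include, the sketch below flags a benchmark issue whose token n-grams overlap heavily with a candidate training document. This is a generic example, not the procedure used in the paper or in the cited decontamination studies.

```python
# Illustrative contamination check: token n-gram overlap between a benchmark
# issue and a candidate training document. A generic sketch, not the method
# from the cited studies; the 0.5 threshold is an arbitrary example.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(issue_text: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the issue's n-grams that also appear in the corpus document;
    values near 1.0 suggest the issue was seen (near-)verbatim during training."""
    issue_ngrams = ngrams(issue_text, n)
    if not issue_ngrams:
        return 0.0
    return len(issue_ngrams & ngrams(corpus_doc, n)) / len(issue_ngrams)

# Flag an issue if any training document exceeds the chosen threshold:
#   contaminated = any(overlap_ratio(issue, doc) > 0.5 for doc in training_docs)
```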

Conclusion

SWE-Bench-Verified, while widely adopted, shows strong signs of contamination, as evidenced by anomalously high LLM performance under minimal-input settings, an effect not observed on syntactically and semantically similar, decontaminated datasets. This disparity casts doubt on current leaderboard results, which may reflect model memory rather than genuine localization or patch-generation skill. The field must transition to benchmarks engineered around active decontamination and diversification to ensure real progress in autonomous software-agent research.
