The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason (2506.12286v2)

Published 14 Jun 2025 in cs.AI and cs.SE

Abstract: As LLMs become increasingly capable and widely adopted, benchmarks play a central role in assessing their practical utility. For example, SWE-Bench Verified has emerged as a critical benchmark for evaluating LLMs' software engineering abilities, particularly their aptitude for resolving real-world GitHub issues. Recent LLMs show impressive performance on SWE-Bench, leading to optimism about their capacity for complex coding tasks. However, current evaluation protocols may overstate these models' true capabilities. It is crucial to distinguish LLMs' generalizable problem-solving ability from other learned artifacts. In this work, we introduce two diagnostic tasks to probe models' underlying knowledge: file path identification from issue descriptions alone, and ground truth function reproduction given only the current file context and issue description. We present empirical evidence that performance gains on SWE-Bench Verified may be partially driven by memorization rather than genuine problem-solving. We show that state-of-the-art models achieve up to 76% accuracy in identifying buggy file paths using only issue descriptions, without access to repository structure. Accuracy falls to at most 53% on tasks from repositories not included in SWE-Bench, pointing to possible data contamination or memorization. A similar pattern is observed for the function reproduction task, where verbatim similarity is much higher on SWE-Bench Verified than on other comparable coding benchmarks. These findings raise concerns about the validity of existing results and underscore the need for more robust, contamination-resistant benchmarks to reliably evaluate LLMs' coding abilities.
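To make the two diagnostic tasks concrete, here is a minimal sketch of how such a probe might be scored. It is an illustration based only on the abstract, not the authors' implementation: the model receives the issue description alone (no repository structure) and must name the buggy file path, while the function reproduction task is scored with a simple verbatim-similarity proxy. Names such as `query_llm` and the prompt wording are placeholders.

```python
# Hypothetical sketch of the diagnostic probes described in the abstract.
# `query_llm` stands in for a call to the LLM under evaluation; the exact
# prompts, datasets, and similarity metric used by the authors may differ.

from difflib import SequenceMatcher


def query_llm(prompt: str) -> str:
    """Placeholder for a call to the model being evaluated."""
    raise NotImplementedError


def predict_buggy_path(issue_description: str) -> str:
    """Ask the model for the buggy file path from the issue text alone."""
    prompt = (
        "Given only the following GitHub issue description, output the "
        "repository-relative path of the file most likely containing the bug.\n\n"
        f"Issue:\n{issue_description}\n\nPath:"
    )
    return query_llm(prompt).strip()


def path_identification_accuracy(instances: list[dict]) -> float:
    """Fraction of instances whose predicted path exactly matches the
    gold buggy file path derived from the reference patch."""
    hits = sum(
        predict_buggy_path(ex["issue"]) == ex["gold_path"] for ex in instances
    )
    return hits / len(instances)


def verbatim_similarity(generated_fn: str, gold_fn: str) -> float:
    """Character-level similarity, used here as a stand-in for the paper's
    verbatim-similarity measure on the function reproduction task."""
    return SequenceMatcher(None, generated_fn, gold_fn).ratio()
```

Comparing these scores between SWE-Bench Verified instances and tasks drawn from repositories outside SWE-Bench is what surfaces the memorization gap the paper reports.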
