SWE-Bench+: Enhanced Coding Benchmark for LLMs (2410.06992v2)

Published 9 Oct 2024 in cs.SE

Abstract: LLMs in Software Engineering (SE) can offer assistance for coding. To facilitate a rigorous evaluation of LLMs in practical coding contexts, Carlos et al. introduced the SWE-bench dataset, which comprises 2,294 real-world GitHub issues and their corresponding pull requests, collected from 12 widely used Python repositories. Several impressive LLM-based toolkits have recently been developed and evaluated on this dataset. However, a systematic evaluation of the quality of SWE-bench remains missing. In this paper, we address this gap by presenting an empirical analysis of the SWE-bench dataset. We conducted a manual screening of instances where SWE-Agent + GPT-4 successfully resolved issues by comparing the model-generated patches with the actual pull requests. SWE-Agent + GPT-4 was at the top of the SWE-bench leaderboard at the time of our study. Our analysis reveals some critical issues with the SWE-bench dataset: 1) 32.67% of the successful patches involve cheating, as the solutions were directly provided in the issue report or the comments; we refer to this as the solution leakage problem. 2) 31.08% of the passed patches are suspicious due to weak test cases, i.e., the tests were not adequate to verify the correctness of a patch. When we filtered out these problematic issues, the resolution rate of SWE-Agent + GPT-4 dropped from 12.47% to 3.97%. We also observed that the same data quality issues exist in the two variants of SWE-bench, i.e., SWE-bench Lite and SWE-bench Verified. In addition, over 94% of the issues were created before the LLMs' knowledge cutoff dates, posing potential data leakage issues.

This paper presents an empirical analysis of the SWE-bench dataset, a benchmark designed to evaluate the capability of LLMs in resolving real-world GitHub issues. The analysis reveals significant quality issues within the original SWE-bench dataset and its variants (Lite and Verified). Based on these findings, the authors introduce SWE-Bench+, an enhanced version aimed at providing a more rigorous evaluation platform.

Analysis of SWE-Bench:

The authors manually analyzed 251 instances from the SWE-bench Full dataset that were successfully resolved by SWE-Agent + GPT-4 (the top-performing open-source approach on the leaderboard at the time). Their analysis identified several critical problems:

  1. Solution Leakage: In 32.67% of the successfully resolved instances, the exact solution or strong hints were present directly in the GitHub issue description or comments provided as input to the LLM. The model could therefore often copy the solution rather than derive it, which calls into question the benchmark's validity for assessing genuine problem-solving ability (a heuristic pre-filter for this category is sketched after this list).
  2. Weak Test Cases: In 31.08% of the passed instances, the patches generated by the LLM were actually incorrect, incomplete, or modified different files/functions than the ground-truth ("gold") patch. Despite these flaws, the patches passed the associated test suites, indicating the tests were not sufficiently robust to verify correctness.
  3. Potential Data Leakage: Over 94% of the issues in SWE-bench were created before the knowledge cut-off dates of commonly used LLMs (like GPT-4), suggesting the models might have encountered these issues during their pre-training phase.
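
The paper's screening of these 251 instances was manual. Purely as an illustration of the first category, a rough automated pre-filter might check whether the code added by the gold patch is already quoted in the issue text; the function names and the 50% overlap threshold below are assumptions, not the authors' procedure.

```python
# Illustrative heuristic only: the paper's screening was manual. This sketch
# flags *candidate* solution leakage by checking whether the non-trivial lines
# added by the gold patch already appear verbatim in the issue report/comments.
# Function names and the 0.5 overlap threshold are assumptions for illustration.

def added_lines(gold_patch: str) -> list[str]:
    """Collect non-trivial lines added by a unified-diff patch."""
    lines = []
    for line in gold_patch.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            stripped = line[1:].strip()
            if len(stripped) > 10:  # skip blank or trivial additions
                lines.append(stripped)
    return lines

def looks_like_solution_leakage(issue_text: str, gold_patch: str,
                                min_overlap: float = 0.5) -> bool:
    """Flag an instance for manual review if most of the gold patch's
    added code is already quoted in the issue body or comments."""
    added = added_lines(gold_patch)
    if not added:
        return False
    hits = sum(1 for line in added if line in issue_text)
    return hits / len(added) >= min_overlap
```

Any instance flagged this way would still need the kind of manual confirmation the authors performed, since quoting a failing line in an issue is not the same as providing its fix.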

Due to these issues, particularly solution leakage and weak tests (collectively termed "suspicious fixes"), the actual resolution rate of SWE-Agent + GPT-4 on SWE-bench Full dropped significantly from the initially reported 12.47% to just 3.97% when only considering correctly generated, non-suspicious patches. Similar issues regarding solution leakage and weak tests were also found in the SWE-bench Lite and SWE-bench Verified subsets, leading to inflated reported success rates for models evaluated on them.
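
As a rough sanity check on the reported drop (assuming the 251 screened instances and the 2,294 SWE-bench Full tasks are the relevant counts, which is an interpretation rather than the paper's exact accounting), the corrected rate can be reproduced approximately:

```python
# Back-of-the-envelope check of the corrected resolution rate.
screened_resolved = 251                      # resolved instances screened
total_tasks = 2294                           # SWE-bench Full size
suspicious = 0.3267 + 0.3108                 # solution leakage + weak tests
credible_fixes = screened_resolved * (1 - suspicious)   # ~91 patches
print(f"{credible_fixes / total_tasks:.2%}")             # ~3.97%
```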

Introducing SWE-Bench+:

To address the identified limitations, the authors created SWE-Bench+. This new dataset was constructed using the following principles:

  1. Preventing Data Leakage: Issues were collected from 11 of the original 12 Python repositories (excluding Django, which no longer tracks issues on GitHub) and were restricted to those created after the knowledge cutoff dates of the evaluated LLMs (specifically, from November 1, 2023, to August 22, 2024); a minimal date-filter sketch follows this list.
  2. Eliminating Solution Leakage: During curation, instances where the solution was explicitly provided in the issue report or comments were manually filtered out.
  3. Consistent Methodology: The data collection and filtering process otherwise followed the methodology of the original SWE-bench paper to ensure comparability.
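
A minimal sketch of the date-based filter described in item 1, assuming issues have already been fetched as dictionaries with an ISO-8601 `created_at` field (field and variable names are illustrative):

```python
from datetime import datetime, timezone

# Illustrative date filter for the data-leakage criterion: keep only issues
# created inside the paper's collection window (after common LLM knowledge
# cutoffs). The `created_at` field name mirrors the GitHub API; the rest of
# the pipeline (fetching issues, matching pull requests) is omitted.
WINDOW_START = datetime(2023, 11, 1, tzinfo=timezone.utc)
WINDOW_END = datetime(2024, 8, 22, tzinfo=timezone.utc)

def in_collection_window(issue: dict) -> bool:
    created = datetime.fromisoformat(issue["created_at"].replace("Z", "+00:00"))
    return WINDOW_START <= created <= WINDOW_END

# candidates = [issue for issue in issues if in_collection_window(issue)]
```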

SWE-Bench+ comprises 548 task instances. It has no solution leakage issues by design and exhibits a lower proportion of weak test cases compared to the original SWE-bench variants.

Evaluation on SWE-Bench+:

Several state-of-the-art models were evaluated on SWE-Bench+: SWE-RAG + GPT-4, SWE-RAG + GPT-3.5, SWE-Agent + GPT-4, and AutoCodeRover + GPT-4o. The evaluation followed a rigorous process, including manual validation of patches that passed automated tests.

  • While solution leakage was eliminated, the problem of weak tests persisted. A large share of patches that passed the tests (67.72% on average across models) were found on manual inspection to be incorrect or incomplete, or to modify the wrong code sections.
  • The validated resolution rates on SWE-Bench+ were drastically lower than those reported on the original SWE-bench leaderboard:
    • SWE-RAG + GPT-4: 0.73% (vs. 1.31% on SWE-bench)
    • SWE-RAG + GPT-3.5: 0.55% (vs. 0.17% on SWE-bench)
    • SWE-Agent + GPT-4: 0.55% (vs. 3.97% on corrected SWE-bench)
    • AutoCodeRover + GPT-4o: 3.83% (vs. 18.83% on SWE-bench)

This significant drop highlights that the original benchmark substantially overestimated model capabilities due to the identified flaws.

Cost-Effectiveness Analysis:

The paper also analyzes the computational cost and time required by the different models on SWE-Bench+, highlighting a trade-off between performance and cost. For instance, SWE-Agent + GPT-4 was the most expensive ($3.59 per instance on average, about $655 per correct fix) and relatively slow (4 min per instance), while AutoCodeRover + GPT-4o achieved the highest accuracy but still incurred notable cost ($0.46 per instance on average, $12.61 per correct fix) and was slow (4.5 min per instance). RAG-based approaches were generally faster and cheaper per instance, but their cost per correct fix varied with their lower success rates.
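
Under the simple assumption that cost per correct fix is total spend over the 548 tasks divided by the number of validated fixes, the quoted figures can be roughly reconciled (small mismatches are expected because the per-instance averages are rounded):

```python
# Rough reconciliation of "cost per correct fix" from the per-instance
# averages and validated resolution rates on the 548 SWE-Bench+ tasks.
N_TASKS = 548

def cost_per_correct_fix(avg_cost_per_instance: float, resolution_rate: float) -> float:
    total_cost = avg_cost_per_instance * N_TASKS
    correct_fixes = round(resolution_rate * N_TASKS)
    return total_cost / correct_fixes

print(cost_per_correct_fix(3.59, 0.0055))   # SWE-Agent + GPT-4: ~$656
print(cost_per_correct_fix(0.46, 0.0383))   # AutoCodeRover + GPT-4o: ~$12
```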

Conclusion:

The paper concludes that the original SWE-bench benchmark suffers from critical flaws (solution leakage, weak tests, potential data leakage) that lead to inflated performance metrics for LLMs. The proposed SWE-Bench+ dataset provides a more robust and reliable benchmark by addressing solution and data leakage issues. Evaluations on SWE-Bench+ show significantly lower, likely more realistic, resolution rates for current state-of-the-art models, emphasizing the remaining challenges in automated software engineering. Future work should focus on improving test case robustness within benchmarks like SWE-Bench+.

Authors (6)
  1. Reem Aleithan (4 papers)
  2. Haoran Xue (53 papers)
  3. Mohammad Mahdi Mohajer (6 papers)
  4. Elijah Nnorom (1 paper)
  5. Gias Uddin (47 papers)
  6. Song Wang (313 papers)
Citations (1)
