A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models

Published 27 Nov 2024 in cs.SE | (2411.18019v1)

Abstract: Automatically resolving software issues is crucial for software development in practice, impacting the software quality and user experience. The process of resolving real-world issues encompasses tasks such as question-answering (QA), fault localization, and code editing. Existing benchmarks such as HumanEval fall short in their ability to assess LLMs' proficiency in solving issues within a codebase. Although benchmarks like SWE-Bench are designed to evaluate the LLMs' capability to handle real-world GitHub issues, the end-to-end evaluation method cannot provide granular insights on the performance of subtasks involved in issue solving. To address existing deficiencies in benchmarking LLMs for practical software engineering tasks, we introduce FAUN-Eval, a benchmark specifically designed to evaluate the Fine-grAined issUe solviNg capabilities of LLMs. FAUN-Eval systematically assesses LLMs across three distinct tasks: QA, fault localization, and code editing. This benchmark is constructed using a dataset curated from 30 well-known GitHub repositories. For each entry, issue and pull request (PR) pairs are meticulously compiled and validated using cross-referencing and keyword verification methods. FAUN-Eval includes 300 entries and employs both LLM and manual checks to ensure data quality. We evaluate ten LLMs with FAUN-Eval, including four closed-source and six open-source models. Our experimental results reveal several key findings. We find that the top-performing LLMs differ across the different tasks. Additionally, features in issues may lead LLMs to generate incorrect information. Moreover, models may vary in their proficiency with texts of different lengths.