Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents (2505.05283v2)

Published 8 May 2025 in cs.SE and cs.AI

Abstract: Code LLMs (CodeLLMs) and agents have shown great promise in tackling complex software engineering tasks. Compared to traditional software engineering methods, CodeLLMs and agents offer stronger abilities and can flexibly process inputs and outputs in both natural language and code. Benchmarking plays a crucial role in evaluating the capabilities of CodeLLMs and agents, guiding their development and deployment. However, despite their growing significance, there remains a lack of comprehensive reviews of benchmarks for CodeLLMs and agents. To bridge this gap, this paper provides a comprehensive review of existing benchmarks for CodeLLMs and agents, studying and analyzing 181 benchmarks from 461 relevant papers, covering the different phases of the software development life cycle (SDLC). Our findings reveal a notable imbalance in the coverage of current benchmarks, with approximately 60% focused on the software development phase of the SDLC, while the requirements engineering and software design phases receive minimal attention at only 5% and 3%, respectively. Additionally, Python emerges as the dominant programming language across the reviewed benchmarks. Finally, this paper highlights the challenges of current research and proposes future directions, aiming to narrow the gap between the theoretical capabilities of CodeLLMs and agents and their application in real-world scenarios.

Summary

Overview of Benchmarks for CodeLLMs and Agents in Software Engineering

The paper, Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents, provides a systematic review of benchmarks used to evaluate CodeLLMs and agents on software engineering tasks, mapping them onto the phases of the Software Development Life Cycle (SDLC). It assesses 181 benchmarks drawn from 461 papers, analyzing their coverage, usage, and programming-language prevalence, and thereby identifies gaps in current research and proposes directions for future work.

Key Insights and Numerical Highlights

The survey reveals an uneven distribution of benchmarks across the SDLC phases. Notably, approximately 60% of current benchmarks focus on the software development phase, while requirements engineering and software design are significantly underrepresented, accounting for only 5% and 3% of benchmarks, respectively. Python is identified as the dominant programming language among the reviewed benchmarks.
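
To make the dominant benchmark category concrete, the sketch below shows what a typical development-phase benchmark item and its functional-correctness check might look like, written in Python since that is the language the survey finds dominant. The task, the stand-in model completion, and the hidden tests are illustrative assumptions in the style of function-synthesis benchmarks such as HumanEval; they are not taken from the paper or from any specific benchmark it reviews.

```python
# Hypothetical development-phase benchmark item with test-based scoring.
# Everything below (task, completion, tests) is illustrative, not drawn
# from the surveyed benchmarks.

PROMPT = '''
def running_max(values: list[int]) -> list[int]:
    """Return a list where element i is the maximum of values[:i + 1]."""
'''

# A CodeLLM would normally produce the function body; a hand-written
# completion stands in for model output here.
MODEL_COMPLETION = PROMPT + '''
    result, current = [], None
    for v in values:
        current = v if current is None else max(current, v)
        result.append(current)
    return result
'''


def passes_tests(candidate_source: str) -> bool:
    """Execute the candidate and check it against hidden unit tests,
    mirroring the functional-correctness (pass/fail) evaluation common
    to development-phase benchmarks."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)
        fn = namespace["running_max"]
        assert fn([]) == []
        assert fn([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
        assert fn([-2, -5]) == [-2, -2]
        return True
    except Exception:
        return False


if __name__ == "__main__":
    print("pass" if passes_tests(MODEL_COMPLETION) else "fail")
```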

Implications and Challenges

A noteworthy implication of this research is the substantial gap it identifies between the theoretical capabilities of CodeLLMs and their real-world applicability, a gap reflected in the uneven benchmark coverage across the phases of the SDLC. The potential of CodeLLMs to replace human effort in various software engineering activities necessitates more comprehensive, standardized benchmarks.

Challenges identified include insufficient standardization, limited domain diversity, and a lack of end-to-end evaluation frameworks that simulate realistic software engineering workflows. Additionally, the absence of benchmarks addressing non-functional requirements and the reliance on single-modality inputs are highlighted as areas requiring attention.

Future Directions

The paper suggests several future directions:

  1. Standardization Across Phases: Develop benchmarks that provide structured assessments across all tasks within the requirements engineering and design phases.
  2. Realistic Application Scenarios: Enhance the realism of benchmarks, simulating real-world scenarios across software development tasks, such as repository-level code comprehension.
  3. Comprehensive Evaluation Frameworks: Design cross-phase benchmarks that assess the interconnected dependencies across the SDLC, offering end-to-end evaluation in complex software engineering environments (a minimal sketch of this idea follows the list).
  4. Multimodal and Collaborative Benchmarks: Incorporate multimodal inputs and evaluate human–model collaboration to reflect practical software development settings more accurately.
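
As a rough illustration of the third direction, the following skeleton sketches what a cross-phase, end-to-end harness could look like: each phase consumes the artifacts produced by earlier phases and is scored separately. The Task and Phase structures, phase names, and scoring hooks are assumptions made purely for illustration; the paper does not prescribe such an interface.

```python
# Hypothetical skeleton of a cross-phase SDLC evaluation harness.
# The data structures and scoring hooks are illustrative assumptions,
# not an API defined by the paper or by any existing benchmark.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Task:
    requirements: str                                         # natural-language problem statement
    artifacts: Dict[str, str] = field(default_factory=dict)   # outputs accumulated per phase


# A phase: (name, produce an artifact from the task so far, judge its quality in [0.0, 1.0]).
Phase = Tuple[str, Callable[[Task], str], Callable[[Task], float]]


def run_pipeline(task: Task, phases: List[Phase]) -> Dict[str, float]:
    """Run phases in order so that later phases depend on earlier outputs,
    collecting a per-phase score (e.g. design consistency, test pass rate)."""
    scores: Dict[str, float] = {}
    for name, produce, judge in phases:
        task.artifacts[name] = produce(task)   # model/agent output for this phase
        scores[name] = judge(task)             # phase-specific evaluation
    return scores


if __name__ == "__main__":
    # Toy phases with stubbed producers and judges, standing in for a
    # CodeLLM/agent and for real phase-level metrics.
    demo = [
        ("design", lambda t: "module layout for: " + t.requirements, lambda t: 1.0),
        ("implementation", lambda t: "# code derived from " + t.artifacts["design"], lambda t: 1.0),
    ]
    print(run_pipeline(Task("track running maxima of a sensor stream"), demo))
```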

In conclusion, while current benchmarks provide a robust foundation for evaluating CodeLLMs in software engineering, their limitations necessitate strategic expansions in scope and depth to better encompass the entire SDLC. This paper lays essential groundwork for future research, promoting advancements that will elevate the practical impact of CodeLLMs and agents in software engineering practices.
