ProjDevBench: End-to-End AI Coding Evaluation
- ProjDevBench is a comprehensive benchmark that evaluates AI coding agents' ability to generate complete, multi-file software projects from natural language specifications.
- It assesses system-level architectural synthesis, cross-module dependency management, and iterative, feedback-driven refinement using rigorous functional and quality criteria.
- The benchmark covers diverse challenges—from algorithmic tasks to complex system configurations—while employing metrics from online judge execution and LLM-assisted code review.
ProjDevBench is a comprehensive benchmark designed to evaluate AI coding agents on the full spectrum of end-to-end project development. Unlike preceding benchmarks that target patch-level bug fixing or function-level synthesis, ProjDevBench systematically assesses agents’ abilities to autonomously interpret real-world requirements, synthesize runnable codebases with correct system architecture, and iteratively refine their solutions in alignment with rigorous functional and code quality criteria. The suite was introduced to close the gap between short-form code generation tasks and the broader, multi-stage workflows characteristic of professional software development (Lu et al., 2 Feb 2026).
1. Rationale and Distinctive Scope
Contemporary code generation benchmarks—examples include HumanEval, MBPP, SWE-bench, and RepoBench—primarily focus on unit-level or bug-fix tasks. These benchmarks lack coverage of workflows wherein an agent receives a high-level, natural language specification and is expected to deliver a fully executable multi-file software project. To address this limitation, ProjDevBench was conceived to evaluate AI coders holistically, including their capabilities in system-level architectural synthesis, cross-module dependency management, functional correctness across entire executables, and iterative, feedback-driven refinement cycles. ProjDevBench mandates that agents start either from a blank slate or minimal scaffold, requiring the generation and organization of all project artifacts, including directory structures and build configuration files (Lu et al., 2 Feb 2026).
2. Task Format and Benchmark Construction
ProjDevBench comprises 20 meticulously curated programming problems, grouped into eight categories, covering both algorithmically-focused and application-oriented software engineering challenges. Each task is specified using a natural-language prompt delineating required functionalities, input/output formats, explicit resource constraints (time, memory), and any restrictions on libraries or coding patterns. The benchmark supports two evaluation regimes:
- Project-Creation (Hard subset): The agent begins with no starter code and must synthesize the entire project.
- Project-Completion (Easy subset): Agents receive a partial implementation to be completed.
Curated tasks span:
- Pure algorithmic computation (e.g., A+B Problem, Big Integer Arithmetic)
- Data structure libraries (e.g., STLite vector, priority queue)
- Multi-module management systems (e.g., ICPC contest manager, bookstore system)
- Storage backends (including B+ tree implementations)
- Interpreters (BASIC, Python, Scheme)
- System-level infrastructure (assembly challenges, GPU memory optimizers)
- Real-world-style applications requiring nuanced handling of CMake configurations, modular separation, and exception safety.
A high-level task category summary is provided below:
| Category | Example Tasks | Complexity Dimension |
|---|---|---|
| Algorithm | A+B, Big Integer Arithmetic, QOI Codec | Core algorithmic code |
| Data Structure | STLite Vector/Priority Queue/Map | Library API, containers |
| Management Sys | ICPC Manager, Bookstore, Train System | Multi-module, I/O, config |
| Storage | File Storage, B+ Tree Storage | Disk IO, indexing |
| Interpreters | BASIC, Python, Scheme Interpreters | Parsing, evaluation |
| Game | Minesweeper | UI logic, rule handling |
| Assembly | mov-language Challenges | Low-level code |
| Optimization | GPU Memory Opt, Buddy Allocator | Performance, resource mgmt |
All agents are tasked with initializing repositories, managing version control, and integrating feedback from automated test verdicts to incrementally refine their solutions.
3. Evaluation Pipeline and Metrics
ProjDevBench implements a two-stage evaluation framework:
3.1 Stage 1: Online Judge (OJ) Execution
Codebases are built (typically via CMake) and subjected to a battery of test cases with strict resource limits. Agents receive detailed verdicts:
- Compile Error
- Runtime Error
- Wrong Answer
- Time Limit Exceeded
- Memory Limit Exceeded
- Memory Leak (via leak-checking harnesses)
- Others
An execution score aggregating these per-test verdicts is then computed.
3.2 Stage 2: LLM-Assisted Code Review (CR)
Static and LLM-based checks are performed for:
- Rule-based constraints: forbidden headers, required file presence, artifact cleanliness
- System architecture: directory layout, modularity, CMake target structure
- Code clarity and maintainability
A normalized score reflects adherence to these criteria, computed via LLM-based judgements on the submitted repositories.
3.3 Final Acceptance and Scoring
An overall score combines the execution and code-review phases into a single final metric.
The acceptance rate, the proportion of submissions passing all criteria, corresponds to the Accepted row of the breakdown below: 27.38% overall.
Submission-level breakdown (all agents, all tasks):
| Verdict | Count | % |
|---|---|---|
| Accepted | 484 | 27.38 |
| Wrong Answer | 740 | 41.86 |
| Time Limit Exceeded | 246 | 13.91 |
| Runtime Error | 124 | 7.01 |
| Compile Error | 80 | 4.52 |
| Memory Leak | 62 | 3.51 |
| Memory Limit Exceeded | 24 | 1.36 |
| Others | 8 | 0.45 |
4. Experimental Protocol and Agent Performance
Six coding agents—Augment, Codex CLI, Cursor, GitHub Copilot, Gemini CLI, Claude Code—were evaluated, each paired with up to three LLM backends (e.g., GPT-5, Claude Sonnet 4.5, Gemini 3 Pro, with open-source models for Claude Code). Sessions were executed in isolated Docker containers equipped with CMake, gcc-13, g++-13, Python, OJ CLI, and the agent CLIs.
Key empirical results:
- Agents engaged in a median of 138 interaction turns and consumed 4.81 million tokens per problem.
- Average agent scores (across all problems):
| Agent + Model | Exec | Final |
|---|---|---|
| Codex + GPT-5 | 76.73 | 77.85 |
| Cursor + Gemini 3 Pro | 72.52 | 75.32 |
| Augment + GPT-5 | 72.13 | 72.35 |
| Cursor + GPT-5 | 69.26 | 71.85 |
| Copilot + Sonnet 4.5 | 62.48 | 67.18 |
| Claude Code + Sonnet 4.5 | 63.76 | 68.87 |
| Open-source (GLM-4.6) | 52.00 | 57.95 |
Performance on "Hard" (from-scratch) creation tasks declined by 10–20 points relative to the "Easy" completion subset. Agents demonstrated strong competence on tasks emphasizing algorithm and data structure implementation (>80% execution score), but underperformed on multi-module systems that entailed nuanced system design and configuration (often <20% exec).
5. Diagnostic Findings and Common Failure Modes
ProjDevBench analysis revealed characteristic challenges faced by current agent architectures:
- Specification Misalignment: Frequent omission of required modules or inaccurate API usage.
- Edge-case Handling: Deficits in robustly addressing empty inputs, exception propagation, and I/O corner cases.
- Time Complexity and Performance: Suboptimal algorithms resulting in violations of resource requirements (e.g., sorting rather than optimal methods).
- Resource Management: Errors attributable to improper manual memory management (e.g., missing RAII, memory leaks, stack overflows).
- Project Engineering Deficiencies: Misconfigured build systems (CMake), missing essential files (e.g., .gitignore), or broken CI pipelines.
A plausible implication is that further research is needed into bridging the gap between local reasoning (single-function or snippet) and global project synthesis, particularly under the constraints of real-world build environments and complex task specifications.
6. Directions for Benchmark Extension and Future Work
The creators of ProjDevBench outline several axes for future enhancement:
- Broadening language and ecosystem coverage beyond C++/Python to include languages like Java and Rust, along with build systems such as Maven, Gradle, and Cargo.
- Incorporating interactive debugging and human-in-the-loop workflows to more closely mirror practical software engineering processes.
- Expanding the task set to cover approximately 50+ complex real-world challenge projects, including web services, microservices, and full-stack applications.
Additionally, improved agent prompt-grounding for explicit constraint enforcement and deeper integration of performance-aware and resource-safe code generation heuristics (e.g., RAII, smart-pointers) are recommended to systematically reduce prevalent error modes.
ProjDevBench establishes a rigorous diagnostic standard for evaluating full-pipeline AI agent performance in real-world project development, complementing and extending beyond fine-grained, developer-centric benchmarks such as Codev-Bench (Pan et al., 2024; Lu et al., 2 Feb 2026).