
ProjDevBench: End-to-End AI Coding Evaluation

Updated 9 February 2026
  • ProjDevBench is a comprehensive benchmark that evaluates AI coding agents' ability to generate complete, multi-file software projects from natural language specifications.
  • It assesses system-level architectural synthesis, cross-module dependency management, and iterative, feedback-driven refinement using rigorous functional and quality criteria.
  • The benchmark covers diverse challenges—from algorithmic tasks to complex system configurations—while employing metrics from online judge execution and LLM-assisted code review.

ProjDevBench is a comprehensive benchmark designed to evaluate AI coding agents on the full spectrum of end-to-end project development. Unlike preceding benchmarks that target patch-level bug fixing or function-level synthesis, ProjDevBench systematically assesses agents’ abilities to autonomously interpret real-world requirements, synthesize runnable codebases with correct system architecture, and iteratively refine their solutions in alignment with rigorous functional and code quality criteria. The suite was introduced to close the gap between short-form code generation tasks and the broader, multi-stage workflows characteristic of professional software development (Lu et al., 2 Feb 2026).

1. Rationale and Distinctive Scope

Contemporary code generation benchmarks—examples include HumanEval, MBPP, SWE-bench, and RepoBench—primarily focus on unit-level or bug-fix tasks. These benchmarks lack coverage of workflows wherein an agent receives a high-level, natural language specification and is expected to deliver a fully executable multi-file software project. To address this limitation, ProjDevBench was conceived to evaluate AI coders holistically, including their capabilities in system-level architectural synthesis, cross-module dependency management, functional correctness across entire executables, and iterative, feedback-driven refinement cycles. ProjDevBench mandates that agents start either from a blank slate or minimal scaffold, requiring the generation and organization of all project artifacts, including directory structures and build configuration files (Lu et al., 2 Feb 2026).

2. Task Format and Benchmark Construction

ProjDevBench comprises 20 meticulously curated programming problems, grouped into eight categories, covering both algorithmically-focused and application-oriented software engineering challenges. Each task is specified using a natural-language prompt delineating required functionalities, input/output formats, explicit resource constraints (time, memory), and any restrictions on libraries or coding patterns. The benchmark supports two evaluation regimes:

  • Project-Creation (Hard subset): The agent begins with no starter code and must synthesize the entire project.
  • Project-Completion (Easy subset): Agents receive a partial implementation to be completed.

Curated tasks span:

  • Pure algorithmic computation (e.g., A+B Problem, Big Integer Arithmetic)
  • Data structure libraries (e.g., STLite vector, priority queue)
  • Multi-module management systems (e.g., ICPC contest manager, bookstore system)
  • Storage backends (including B+ tree implementations)
  • Interpreters (BASIC, Python, Scheme)
  • System-level infrastructure (assembly challenges, GPU memory optimizers)
  • Real-world-style applications requiring nuanced handling of CMake configurations, modular separation, and exception safety.

A high-level task category summary is provided below:

| Category | Example Tasks | Complexity Dimension |
|---|---|---|
| Algorithm | A+B, Big Integer Arithmetic, QOI Codec | Core algorithmic code |
| Data Structure | STLite Vector/Priority Queue/Map | Library API, containers |
| Management Sys | ICPC Manager, Bookstore, Train System | Multi-module, I/O, config |
| Storage | File Storage, B+ Tree Storage | Disk I/O, indexing |
| Interpreters | BASIC, Python, Scheme Interpreters | Parsing, evaluation |
| Game | Minesweeper | UI logic, rule handling |
| Assembly | mov-language Challenges | Low-level code |
| Optimization | GPU Memory Opt, Buddy Allocator | Performance, resource mgmt |

All agents are tasked with initializing repositories, managing version control, and integrating feedback from automated test verdicts to incrementally refine their solutions.
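The submit–verdict–patch cycle described above can be sketched as a generic loop. This is a simplified illustration, not the benchmark's actual harness: `create`, `judge`, and `patch` are hypothetical stand-ins for the agent's project initialization, the online judge, and the agent's feedback-driven revision step.

```python
def refine_loop(create, judge, patch, max_turns=138):
    """Iteratively submit a project, read the judge's verdict, and patch.

    create()           -> initial project (blank slate or scaffold)
    judge(project)     -> verdict string, e.g. "Accepted" or "Wrong Answer"
    patch(project, v)  -> revised project incorporating the verdict feedback
    Returns (final_project, turns_used).
    """
    project = create()
    for turn in range(1, max_turns + 1):
        verdict = judge(project)
        if verdict == "Accepted":
            return project, turn
        project = patch(project, verdict)
    return project, max_turns


# Toy stand-ins: the "project" is an int that improves by one per patch
# and is judged correct once it reaches 3.
_, turns = refine_loop(
    create=lambda: 0,
    judge=lambda p: "Accepted" if p >= 3 else "Wrong Answer",
    patch=lambda p, v: p + 1,
)
```

The default `max_turns=138` echoes the median interaction-turn count reported in Section 4; a real harness would budget turns per agent and task.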

3. Evaluation Pipeline and Metrics

ProjDevBench implements a two-stage evaluation framework:

3.1 Stage 1: Online Judge (OJ) Execution

Codebases are built (typically via CMake) and subjected to a battery of $N$ test cases with strict resource limits. Agents receive detailed verdicts:

  • Compile Error
  • Runtime Error
  • Wrong Answer
  • Time Limit Exceeded
  • Memory Limit Exceeded
  • Memory Leak (via leak-checking harnesses)
  • Others

An execution score $S_{\mathrm{exec}}$ is computed from per-test weights $w_i$:

$$S_{\mathrm{exec}} = \frac{\sum_{i=1}^{N} w_i\, \mathbf{1}[\text{test}_i\ \text{passed}]}{\sum_{i=1}^{N} w_i} \times 100$$
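The weighted execution score reduces to a few lines of code. A minimal sketch, assuming per-test weights and boolean pass results (the weighting scheme per task is benchmark-internal):

```python
def exec_score(weights, passed):
    """Weighted fraction of passed tests, scaled to [0, 100].

    weights : list of per-test weights w_i
    passed  : list of booleans, True if test_i passed
    """
    total = sum(weights)
    earned = sum(w for w, ok in zip(weights, passed) if ok)
    return 100.0 * earned / total


# Example: three tests with weights 1, 1, 2; the middle test fails.
score = exec_score([1, 1, 2], [True, False, True])  # -> 75.0
```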

3.2 Stage 2: LLM-Assisted Code Review (CR)

Static and LLM-based checks are performed for:

  • Rule-based constraints: forbidden headers, required file presence, artifact cleanliness
  • System architecture: directory layout, modularity, CMake target structure
  • Code clarity and maintainability

A normalized score $S_{\mathrm{cr}} \in [0, 100]$ reflects adherence to these criteria, computed via LLM-based judgements on the submitted repositories.
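The rule-based portion of the review stage can be illustrated as below. The specific forbidden header and required files are hypothetical examples chosen for the sketch, not ProjDevBench's actual rule set, which varies per task:

```python
import os
import re

# Example rules (hypothetical; real rules are task-specific).
FORBIDDEN_HEADERS = {"bits/stdc++.h"}
REQUIRED_FILES = {"CMakeLists.txt", ".gitignore"}


def rule_check(repo_root):
    """Scan a submitted repository for rule violations.

    Returns (violations, missing):
      violations : list of (filename, forbidden_header) pairs found in sources
      missing    : set of required file names absent from the repository
    """
    violations, found = [], set()
    for dirpath, _, files in os.walk(repo_root):
        for name in files:
            found.add(name)
            if name.endswith((".cpp", ".h", ".hpp")):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    text = f.read()
                for hdr in FORBIDDEN_HEADERS:
                    if re.search(rf'#include\s*[<"]{re.escape(hdr)}[>"]', text):
                        violations.append((name, hdr))
    missing = REQUIRED_FILES - found
    return violations, missing
```

Architecture and clarity checks (directory layout, modularity, maintainability) are handled by the LLM judge rather than static rules like these.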

3.3 Final Acceptance and Scoring

An overall score combines both execution and code review phases:

$$S_{\mathrm{final}} = 0.8\, S_{\mathrm{exec}} + 0.2\, S_{\mathrm{cr}}$$

The acceptance rate, the proportion of projects passing all criteria, was observed as:

$$\text{Acceptance Rate} = \frac{\#\,\text{Accepted}}{\#\,\text{Attempts}} = 27.38\%$$
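Both formulas translate directly to code. A minimal sketch; the 1,768-attempt total below is the sum of the verdict counts in the submission breakdown:

```python
def final_score(s_exec, s_cr, w_exec=0.8, w_cr=0.2):
    """Weighted combination of execution and code-review scores."""
    return w_exec * s_exec + w_cr * s_cr


def acceptance_rate(accepted, attempts):
    """Percentage of submissions passing all criteria."""
    return 100.0 * accepted / attempts


# 484 accepted out of 1,768 total submissions -> 27.38%
rate = acceptance_rate(484, 1768)
```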

Submission-level breakdown (all agents, all tasks):

| Verdict | Count | % |
|---|---|---|
| Accepted | 484 | 27.38 |
| Wrong Answer | 740 | 41.86 |
| Time Limit Exceeded | 246 | 13.91 |
| Runtime Error | 124 | 7.01 |
| Compile Error | 80 | 4.52 |
| Memory Leak | 62 | 3.51 |
| Memory Limit Exceeded | 24 | 1.36 |
| Others | 8 | 0.45 |

4. Experimental Protocol and Agent Performance

Six coding agents—Augment, Codex CLI, Cursor, GitHub Copilot, Gemini CLI, Claude Code—were evaluated, each paired with up to three LLM backends (e.g., GPT-5, Claude Sonnet 4.5, Gemini 3 Pro, with open-source models for Claude Code). Sessions were executed in isolated Docker containers equipped with CMake, gcc-13, g++-13, Python, OJ CLI, and the agent CLIs.

Key empirical results:

  • Agents engaged in a median of 138 interaction turns and consumed 4.81 million tokens per problem.
  • Average agent scores (across all problems):
| Agent + Model | Exec | Final |
|---|---|---|
| Codex + GPT-5 | 76.73 | 77.85 |
| Cursor + Gemini 3 Pro | 72.52 | 75.32 |
| Augment + GPT-5 | 72.13 | 72.35 |
| Cursor + GPT-5 | 69.26 | 71.85 |
| Copilot + Sonnet 4.5 | 62.48 | 67.18 |
| Claude Code + Sonnet 4.5 | 63.76 | 68.87 |
| Open-source (GLM-4.6) | 52.00 | 57.95 |

Performance on "Hard" (from-scratch) creation tasks declined by 10–20 points relative to the "Easy" completion subset. Agents demonstrated strong competence on tasks emphasizing algorithm and data structure implementation (execution scores above 80%), but underperformed on multi-module systems entailing nuanced system design and configuration (often below 20% execution score).

5. Diagnostic Findings and Common Failure Modes

ProjDevBench analysis revealed characteristic challenges faced by current agent architectures:

  1. Specification Misalignment: Frequent omission of required modules or inaccurate API usage.
  2. Edge-case Handling: Deficits in robustly addressing empty inputs, exception propagation, and I/O corner cases.
  3. Time Complexity and Performance: Suboptimal algorithms resulting in violations of resource requirements (e.g., $O(N \log N)$ sorting rather than optimal $O(N)$ methods).
  4. Resource Management: Errors attributable to improper manual memory management (e.g., missing RAII, memory leaks, stack overflows).
  5. Project Engineering Deficiencies: Misconfigured build systems (CMake), missing essential files (e.g., .gitignore), or broken CI pipelines.
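Failure mode 3 can be made concrete with a toy example (illustrative only, not a benchmark task): checking whether an array is a permutation of $1..N$. A sorting-based solution is $O(N \log N)$ and may exceed a tight time limit, while a single-pass marker array achieves $O(N)$:

```python
def is_permutation_sorted(a):
    """O(N log N): sort and compare against 1..N."""
    return sorted(a) == list(range(1, len(a) + 1))


def is_permutation_linear(a):
    """O(N): single pass with a seen-marker array, no sort needed."""
    n = len(a)
    seen = [False] * (n + 1)
    for x in a:
        if not 1 <= x <= n or seen[x]:
            return False   # out of range or duplicate
        seen[x] = True
    return True
```

Both are functionally correct, which is exactly why this class of error surfaces as Time Limit Exceeded rather than Wrong Answer in the OJ verdicts.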

A plausible implication is that further research is needed into bridging the gap between local reasoning (single-function or snippet) and global project synthesis, particularly under the constraints of real-world build environments and complex task specifications.

6. Directions for Benchmark Extension and Future Work

The creators of ProjDevBench outline several axes for future enhancement:

  • Broadening language and ecosystem coverage beyond C++/Python to include languages like Java and Rust, along with build systems such as Maven, Gradle, and Cargo.
  • Incorporating interactive debugging and human-in-the-loop workflows to more closely mirror practical software engineering processes.
  • Expanding the task set to cover approximately 50+ complex real-world challenge projects, including web services, microservices, and full-stack applications.

Additionally, improved agent prompt-grounding for explicit constraint enforcement and deeper integration of performance-aware and resource-safe code generation heuristics (e.g., RAII, smart-pointers) are recommended to systematically reduce prevalent error modes.

ProjDevBench establishes a rigorous diagnostic standard for evaluating full-pipeline AI agent performance in real-world project development, complementing and extending beyond fine-grained, developer-centric benchmarks such as Codev-Bench (Pan et al., 2024, Lu et al., 2 Feb 2026).

