ProjDevBench: End-to-End AI Coding Evaluation
- ProjDevBench is a comprehensive benchmark that evaluates AI coding agents' ability to generate complete, multi-file software projects from natural language specifications.
- It assesses system-level architectural synthesis, cross-module dependency management, and iterative, feedback-driven refinement using rigorous functional and quality criteria.
- The benchmark covers diverse challenges—from algorithmic tasks to complex system configurations—while employing metrics from online judge execution and LLM-assisted code review.
ProjDevBench is a comprehensive benchmark designed to evaluate AI coding agents on the full spectrum of end-to-end project development. Unlike preceding benchmarks that target patch-level bug fixing or function-level synthesis, ProjDevBench systematically assesses agents’ abilities to autonomously interpret real-world requirements, synthesize runnable codebases with correct system architecture, and iteratively refine their solutions in alignment with rigorous functional and code quality criteria. The suite was introduced to close the gap between short-form code generation tasks and the broader, multi-stage workflows characteristic of professional software development (Lu et al., 2 Feb 2026).
1. Rationale and Distinctive Scope
Contemporary code generation benchmarks—examples include HumanEval, MBPP, SWE-bench, and RepoBench—primarily focus on unit-level or bug-fix tasks. These benchmarks lack coverage of workflows wherein an agent receives a high-level, natural language specification and is expected to deliver a fully executable multi-file software project. To address this limitation, ProjDevBench was conceived to evaluate AI coders holistically, including their capabilities in system-level architectural synthesis, cross-module dependency management, functional correctness across entire executables, and iterative, feedback-driven refinement cycles. ProjDevBench mandates that agents start either from a blank slate or minimal scaffold, requiring the generation and organization of all project artifacts, including directory structures and build configuration files (Lu et al., 2 Feb 2026).
2. Task Format and Benchmark Construction
ProjDevBench comprises 20 meticulously curated programming problems, grouped into eight categories, covering both algorithmically-focused and application-oriented software engineering challenges. Each task is specified using a natural-language prompt delineating required functionalities, input/output formats, explicit resource constraints (time, memory), and any restrictions on libraries or coding patterns. The benchmark supports two evaluation regimes:
- Project-Creation (Hard subset): The agent begins with no starter code and must synthesize the entire project.
- Project-Completion (Easy subset): Agents receive a partial implementation to be completed.
Curated tasks span:
- Pure algorithmic computation (e.g., A+B Problem, Big Integer Arithmetic)
- Data structure libraries (e.g., STLite vector, priority queue)
- Multi-module management systems (e.g., ICPC contest manager, bookstore system)
- Storage backends (including B+ tree implementations)
- Interpreters (BASIC, Python, Scheme)
- System-level infrastructure (assembly challenges, GPU memory optimizers)
- Real-world-style applications requiring nuanced handling of CMake configurations, modular separation, and exception safety.
A high-level task category summary is provided below:
| Category | Example Tasks | Complexity Dimension |
|---|---|---|
| Algorithm | A+B, Big Integer Arithmetic, QOI Codec | Core algorithmic code |
| Data Structure | STLite Vector/Priority Queue/Map | Library API, containers |
| Management Sys | ICPC Manager, Bookstore, Train System | Multi-module, I/O, config |
| Storage | File Storage, B+ Tree Storage | Disk IO, indexing |
| Interpreters | BASIC, Python, Scheme Interpreters | Parsing, evaluation |
| Game | Minesweeper | UI logic, rule handling |
| Assembly | mov-language Challenges | Low-level code |
| Optimization | GPU Memory Opt, Buddy Allocator | Performance, resource mgmt |
All agents are tasked with initializing repositories, managing version control, and integrating feedback from automated test verdicts to incrementally refine their solutions.
3. Evaluation Pipeline and Metrics
ProjDevBench implements a two-stage evaluation framework:
3.1 Stage 1: Online Judge (OJ) Execution
Codebases are built (typically via CMake) and subjected to a battery of test cases with strict resource limits. Agents receive detailed verdicts:
- Compile Error
- Runtime Error
- Wrong Answer
- Time Limit Exceeded
- Memory Limit Exceeded
- Memory Leak (via leak-checking harnesses)
- Others
An execution score aggregating these per-test verdicts is then computed.
3.2 Stage 2: LLM-Assisted Code Review (CR)
Static and LLM-based checks are performed for:
- Rule-based constraints: forbidden headers, required file presence, artifact cleanliness
- System architecture: directory layout, modularity, CMake target structure
- Code clarity and maintainability
A normalized score reflects adherence to these criteria, computed via LLM-based judgements on the submitted repositories.
3.3 Final Acceptance and Scoring
An overall score combines the execution and code-review phases into a single final metric.
The acceptance rate, the proportion of submissions passing all criteria, corresponds to the Accepted row of the breakdown below: 27.38% overall.
Submission-level breakdown (all agents, all tasks):
| Verdict | Count | % |
|---|---|---|
| Accepted | 484 | 27.38 |
| Wrong Answer | 740 | 41.86 |
| Time Limit Exceeded | 246 | 13.91 |
| Runtime Error | 124 | 7.01 |
| Compile Error | 80 | 4.52 |
| Memory Leak | 62 | 3.51 |
| Memory Limit Exceeded | 24 | 1.36 |
| Others | 8 | 0.45 |
4. Experimental Protocol and Agent Performance
Six coding agents—Augment, Codex CLI, Cursor, GitHub Copilot, Gemini CLI, Claude Code—were evaluated, each paired with up to three LLM backends (e.g., GPT-5, Claude Sonnet 4.5, Gemini 3 Pro, with open-source models for Claude Code). Sessions were executed in isolated Docker containers equipped with CMake, gcc-13, g++-13, Python, OJ CLI, and the agent CLIs.
Key empirical results:
- Agents engaged in a median of 138 interaction turns and consumed 4.81 million tokens per problem.
- Average agent scores (across all problems):
| Agent + Model | Exec | Final |
|---|---|---|
| Codex + GPT-5 | 76.73 | 77.85 |
| Cursor + Gemini 3 Pro | 72.52 | 75.32 |
| Augment + GPT-5 | 72.13 | 72.35 |
| Cursor + GPT-5 | 69.26 | 71.85 |
| Copilot + Sonnet 4.5 | 62.48 | 67.18 |
| Claude Code + Sonnet 4.5 | 63.76 | 68.87 |
| Open-source (GLM-4.6) | 52.00 | 57.95 |
Performance on "Hard" (from-scratch) creation tasks declined by 10–20 points relative to the "Easy" completion subset. Agents demonstrated strong competence on tasks emphasizing algorithm and data structure implementation (>80% execution score), but underperformed on multi-module systems that entailed nuanced system design and configuration (often <20% exec).
5. Diagnostic Findings and Common Failure Modes
ProjDevBench analysis revealed characteristic challenges faced by current agent architectures:
- Specification Misalignment: Frequent omission of required modules or inaccurate API usage.
- Edge-case Handling: Deficits in robustly addressing empty inputs, exception propagation, and I/O corner cases.
- Time Complexity and Performance: Suboptimal algorithms resulting in violations of resource requirements (e.g., sorting rather than optimal methods).
- Resource Management: Errors attributable to improper manual memory management (e.g., missing RAII, memory leaks, stack overflows).
- Project Engineering Deficiencies: Misconfigured build systems (CMake), missing essential files (e.g., .gitignore), or broken CI pipelines.
A plausible implication is that further research is needed into bridging the gap between local reasoning (single-function or snippet) and global project synthesis, particularly under the constraints of real-world build environments and complex task specifications.
6. Directions for Benchmark Extension and Future Work
The creators of ProjDevBench outline several axes for future enhancement:
- Broadening language and ecosystem coverage beyond C++/Python to include languages like Java and Rust, along with build systems such as Maven, Gradle, and Cargo.
- Incorporating interactive debugging and human-in-the-loop workflows to more closely mirror practical software engineering processes.
- Expanding the task set to cover approximately 50+ complex real-world challenge projects, including web services, microservices, and full-stack applications.
Additionally, improved agent prompt-grounding for explicit constraint enforcement and deeper integration of performance-aware and resource-safe code generation heuristics (e.g., RAII, smart-pointers) are recommended to systematically reduce prevalent error modes.
ProjDevBench establishes a rigorous diagnostic standard for evaluating full-pipeline AI agent performance in real-world project development, complementing and extending beyond fine-grained, developer-centric benchmarks such as Codev-Bench (Pan et al., 2024; Lu et al., 2 Feb 2026).