ProgramBench Benchmark Framework
- ProgramBench is a benchmark framework that evaluates LM agents' ability to reconstruct full software projects using documentation and a reference executable.
- It employs a four-phase methodology including repository selection, executable creation, fuzzing-based test generation, and environment assembly.
- Empirical findings reveal that current models struggle with architectural planning and modular decomposition, underscoring challenges in holistic software construction.
ProgramBench is a benchmark framework that measures the ability of software engineering agents, specifically those driven by large language models (LMs), to reconstruct complete software projects from scratch using only documentation and access to a reference executable. Designed to move beyond traditional code-generation micro-benchmarks, ProgramBench targets holistic software construction, demanding that agents make the architectural, organizational, and design choices necessary for real-world software engineering tasks (Yang et al., 5 May 2026).
1. Problem Definition and Benchmark Scope
ProgramBench frames the software engineering task as follows: given a compiled executable and associated documentation, an LM-based agent is required to infer, architect, and implement a functionally equivalent codebase. The agent must select the programming language, project layout, build system, modular decomposition, interfaces, and data structures independently. No internal structure, API, or reference implementation is provided; correctness is judged solely on the agent’s solution behaving like the target executable.
Key characteristics:
- End-to-End Development: The agent performs a “zero-to-one” software re-creation, not just patching, refactoring, or editing code.
- Implementation-Agnostic Evaluation: Any code structure satisfying behavioral requirements passes; solutions in any language are allowed.
- Agent Isolation: Agents lack internet access and cannot decompile or wrap the executable; only documentation and observed behaviors inform their solutions.
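The behavior-only notion of correctness described above can be illustrated with a minimal sketch. It assumes that "behaving like the target executable" is checked by comparing standard output and exit code for the same arguments; the helper names `observe` and `behaves_alike` are hypothetical, and two tiny Python one-liners stand in for the reference and candidate programs. The actual ProgramBench harness may compare more (for example, file-system side effects).

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Externally visible result of one invocation (assumed: stdout + exit code)."""
    stdout: bytes
    exit_code: int


def observe(command: list[str], stdin: bytes = b"") -> Observation:
    """Run a command and keep only what an outside observer can see."""
    proc = subprocess.run(command, input=stdin, capture_output=True, timeout=30)
    return Observation(stdout=proc.stdout, exit_code=proc.returncode)


def behaves_alike(reference: list[str], candidate: list[str],
                  args: list[str], stdin: bytes = b"") -> bool:
    """True if both programs produce the same observation for the same input."""
    return observe(reference + args, stdin) == observe(candidate + args, stdin)


# Stand-ins for a reference executable and a re-implementation (hypothetical).
REFERENCE = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
CANDIDATE = [sys.executable, "-c", "import sys; print(sys.argv[1].swapcase())"]

if __name__ == "__main__":
    print(behaves_alike(REFERENCE, CANDIDATE, ["abc"]))  # True: both print ABC
    print(behaves_alike(REFERENCE, CANDIDATE, ["aBc"]))  # False: ABC vs AbC
```

The second call shows why a single agreeing input proves little: the two stand-ins differ internally and only diverge on mixed-case input, which is the kind of gap a broad behavioral test suite is meant to expose.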
2. Dataset Construction Methodology
ProgramBench’s dataset comprises 200 tasks automatically extracted from public GitHub repositories. The pipeline features four distinct phases:
- Repository Selection: Candidate projects must be written in compiled languages (C, C++, Rust, Go, Java, Haskell), build a standalone executable, and not require online or interactive dependencies.
- Executable Creation: A specialized agent automates project compilation. The outcome includes a reproducible build script (compile.sh), logs, and frozen dependencies.
- Behavioral Test Generation: Coverage-guided agent-driven fuzzing produces test suites asserting on output, exit codes, and observable side effects, never on implementation internals.
- Inference Environment Assembly: The final environment provides the agent with the executable (with execute-only permissions), usage documentation, and required assets, stripping away sources and network capabilities.
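The repository selection criteria listed in the first phase can be read as a simple predicate. The sketch below is illustrative only: the `Candidate` record, its field names, and the example repositories are hypothetical, and the real pipeline presumably derives these properties automatically rather than from hand-filled flags.

```python
from dataclasses import dataclass

# Languages named in the selection criteria above.
COMPILED_LANGUAGES = {"C", "C++", "Rust", "Go", "Java", "Haskell"}


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of a repository under consideration."""
    name: str
    language: str
    builds_standalone_executable: bool
    needs_network: bool
    needs_interactive_input: bool


def is_eligible(repo: Candidate) -> bool:
    """Apply the three stated criteria: compiled language, standalone
    executable, no online or interactive dependencies."""
    return (
        repo.language in COMPILED_LANGUAGES
        and repo.builds_standalone_executable
        and not repo.needs_network
        and not repo.needs_interactive_input
    )


# Made-up examples, not entries from the actual dataset.
candidates = [
    Candidate("json-flattener", "Go", True, False, False),
    Candidate("web-scraper", "Rust", True, True, False),
    Candidate("notebook-tool", "Python", False, False, False),
]
selected = [c.name for c in candidates if is_eligible(c)]

if __name__ == "__main__":
    print(selected)  # ['json-flattener']
```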
Dataset statistics:
| Metric | Median | Min | Max |
|---|---|---|---|
| LOC | 8,635 | 212 | 2.7M |
| Code files | 50 | 1 | 5,342 |
| Runtime packages | 10 | 0 | 113 |
| GitHub stars | 2,124 | 202 | 79,693 |
| Commits | 646 | 13 | 145,991 |
| Project age (years) | 7.9 | 0.3 | 17.8 |

Language distribution: Rust (33%), Go (29%), C/C++ (33%), Java/Haskell (~1%).
3. Fuzzing-Based Behavioral Testing
Behavioral testing leverages an agent-in-the-loop, coverage-guided fuzzing architecture. The process proceeds as follows:
- Iteration: The agent identifies coverage gaps in the gold executable and proposes new tests targeting uncovered paths; a proposed test is added to the suite only if it covers new paths and passes a linter.
- Coverage Metrics: Coverage cov(T) is the fraction of source lines executed by test suite T; generation halts once cov(T) reaches a threshold τ (typically τ = 0.95).
- Assertion Linting: Weak tests—e.g., those only asserting on exit codes, substring matches <15 chars, or containing disjunctions or exception swallowing—are rejected or revised.
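The iteration and stopping rule above can be sketched as a small loop. This is a toy model under stated assumptions, not the authors' implementation: the proposing agent, the coverage measurement, and the linter are passed in as plain callables (`propose`, `lines_hit_by`, `passes_linter` are hypothetical names), and a "test" is just a string naming the source lines it would execute.

```python
from typing import Callable, Iterable


def coverage(covered: set[int], total_lines: int) -> float:
    """cov(T): fraction of source lines executed by the current suite."""
    return len(covered) / total_lines


def generate_suite(
    propose: Callable[[set[int]], Iterable[str]],
    lines_hit_by: Callable[[str], set[int]],
    passes_linter: Callable[[str], bool],
    total_lines: int,
    tau: float = 0.95,
    max_rounds: int = 100,
) -> tuple[list[str], float]:
    """Grow a test suite until cov(T) >= tau or the proposer stalls."""
    suite: list[str] = []
    covered: set[int] = set()
    for _ in range(max_rounds):
        if coverage(covered, total_lines) >= tau:
            break
        uncovered = set(range(1, total_lines + 1)) - covered
        accepted_any = False
        for test in propose(uncovered):
            new_lines = lines_hit_by(test) - covered
            # Keep a test only if it adds coverage and passes the linter.
            if new_lines and passes_linter(test):
                suite.append(test)
                covered |= new_lines
                accepted_any = True
        if not accepted_any:
            break  # no progress this round; stop rather than loop forever
    return suite, coverage(covered, total_lines)


# Toy stand-ins: each proposed test names up to five uncovered lines.
def toy_propose(uncovered: set[int]) -> list[str]:
    return [",".join(str(n) for n in sorted(uncovered)[:5])]


def toy_lines_hit_by(test: str) -> set[int]:
    return {int(n) for n in test.split(",")}


suite, final_cov = generate_suite(
    toy_propose, toy_lines_hit_by, lambda test: True, total_lines=20
)

if __name__ == "__main__":
    print(len(suite), final_cov)  # 4 1.0
```

The early exit when no test is accepted is a design choice of this sketch; the source does not say how the real loop handles a stalled proposer.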
This framework ensures high-quality, nontrivial, and robust testing of candidate solutions purely on external behavior, not source-level similarity (Yang et al., 5 May 2026).
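The assertion-linting rules mentioned above (exit-code-only tests, substring matches under 15 characters, disjunctions, swallowed exceptions) can be approximated with a few syntactic checks. The sketch below assumes, purely for illustration, that tests are written as Python functions whose result object exposes `returncode` and `stdout`; the source does not specify the test format, and `lint_test` is a hypothetical name.

```python
import ast

MIN_SUBSTRING_LEN = 15  # threshold stated in the linting rule


def _attrs(node: ast.AST) -> set[str]:
    """Names of all attributes accessed inside an expression."""
    return {n.attr for n in ast.walk(node) if isinstance(n, ast.Attribute)}


def lint_test(source: str) -> list[str]:
    """Return a list of weaknesses found in one test's source code."""
    tree = ast.parse(source)
    problems: list[str] = []
    asserts = [n for n in ast.walk(tree) if isinstance(n, ast.Assert)]

    # Rule 1: every assertion looks only at the exit code.
    if asserts and all(_attrs(a.test) == {"returncode"} for a in asserts):
        problems.append("asserts only on exit code")

    for a in asserts:
        for n in ast.walk(a.test):
            # Rule 2: `or` inside an assertion lets wrong behavior pass.
            if isinstance(n, ast.BoolOp) and isinstance(n.op, ast.Or):
                problems.append("disjunction in assertion")
            # Rule 3: `"short" in output` matches too easily.
            if (
                isinstance(n, ast.Compare)
                and any(isinstance(op, ast.In) for op in n.ops)
                and isinstance(n.left, ast.Constant)
                and isinstance(n.left.value, str)
                and len(n.left.value) < MIN_SUBSTRING_LEN
            ):
                problems.append("substring match shorter than 15 characters")

    # Rule 4: `except ...: pass` hides failures.
    for n in ast.walk(tree):
        if isinstance(n, ast.ExceptHandler) and all(
            isinstance(stmt, ast.Pass) for stmt in n.body
        ):
            problems.append("swallowed exception")
    return problems


WEAK = """
def test_help(run):
    result = run(["--help"])
    assert result.returncode == 0
"""

LOOSE = """
def test_version(run):
    result = run(["--version"])
    assert "v1" in result.stdout or result.returncode == 0
"""

STRONG = """
def test_echo(run):
    result = run(["echo", "hello"])
    assert result.stdout.strip() == "hello"
    assert result.returncode == 0
"""

if __name__ == "__main__":
    for name, src in [("WEAK", WEAK), ("LOOSE", LOOSE), ("STRONG", STRONG)]:
        print(name, lint_test(src))
```

These checks are heuristics on syntax only; a production linter would likely need to reason about what each assertion actually constrains.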
4. Evaluation Metrics and Experimental Setup
Performance is measured using several metrics:
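- Per-task Pass Rate: For model M and task t with nₜ tests, pass(M, t) = (number of tests passed by M's solution to t) / nₜ.
- Aggregate Scores: Mean of pass(M, t) over the 200 tasks.
- Resolution Fractions:
  - Fully Resolved: Fraction of tasks with pass(M, t) = 1.
  - Almost Resolved: Fraction of tasks with pass(M, t) ≥ 0.95.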
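A short sketch of how these metrics could be computed from raw test counts follows. It assumes that "fully resolved" means every test passes and that "almost resolved" uses the ≥95% threshold shown in the results table header; the function names and the four example tasks are hypothetical and do not come from the benchmark.

```python
from statistics import mean


def pass_rate(passed: int, total: int) -> float:
    """Per-task pass rate: tests passed divided by the n_t tests for that task."""
    return passed / total


def summarize(
    results: dict[str, tuple[int, int]], almost_threshold: float = 0.95
) -> dict[str, float]:
    """Aggregate (passed, total) pairs per task into the three reported scores."""
    n_tasks = len(results)
    return {
        "mean_pass_rate": mean(pass_rate(p, n) for p, n in results.values()),
        # Compare integers so "fully resolved" does not depend on float rounding.
        "fully_resolved": sum(p == n for p, n in results.values()) / n_tasks,
        # Assumption: tasks that are fully resolved also count as almost resolved.
        "almost_resolved": sum(
            pass_rate(p, n) >= almost_threshold for p, n in results.values()
        ) / n_tasks,
    }


# Made-up counts for four imaginary tasks: (tests passed, tests total).
example = {
    "task_a": (40, 40),
    "task_b": (97, 100),
    "task_c": (10, 50),
    "task_d": (0, 20),
}
scores = summarize(example)

if __name__ == "__main__":
    print(scores)
```

On this toy input one of four tasks is fully resolved and two of four clear the 95% bar, while the mean pass rate (about 0.54) shows how partial credit on unresolved tasks still contributes to the aggregate score.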
Evaluation harness:
- Nine state-of-the-art LMs are tested: Claude Opus 4.7, Opus 4.6, Sonnet 4.6, and Haiku 4.5; Gemini 3.1 Pro and Gemini 3 Flash; and GPT 5.4, GPT 5.4 mini, and GPT 5 mini.
- Agents are isolated in Ubuntu-based containers (20 CPUs, 60GB RAM), receive only documentation and the gold executable, and operate under tight token and time budgets (≤1,000 steps or 6h per task).
| Model | Resolved | Almost (≥95%) | Mean API Calls | Mean Cost (USD) |
|---|---|---|---|---|
| Claude Opus 4.7 | 0.0% | 3.0% | 93 | 3.81 |
| Claude Opus 4.6 | 0.0% | 2.5% | 260 | 11.38 |
| Claude Sonnet 4.6 | 0.0% | 1.6% | 475 | 27.09 |
| Claude Haiku 4.5 | 0.0% | 0.0% | 124 | 0.80 |
| Gemini 3.1 Pro | 0.0% | 0.0% | 94 | 1.51 |
| Gemini 3 Flash | 0.0% | 0.0% | 89 | 0.33 |
| GPT 5.4 | 0.0% | 0.0% | 16 | 0.33 |
| GPT 5.4 mini | 0.0% | 0.0% | 18 | 0.04 |
| GPT 5 mini | 0.0% | 0.0% | 15 | 0.03 |
No model fully resolved any task; the peak “almost resolved” is 3.0%, i.e., 6 tasks out of 200 (Yang et al., 5 May 2026).
5. Empirical Findings and Failure Modes
Codebase Structural Analysis
- Monolithic Implementations: 85% of model solutions are significantly smaller than references (median 1,173 vs. 3,068 LOC). Median file count is 3 (reference: 15), and most solutions use a flat directory structure.
- Function Granularity: Models generate fewer, longer functions (e.g., Sonnet 4.6: 24 functions vs. 44 in reference, average length 35 vs. 24 lines).
- Trajectories: The GPT 5.4 models tend to emit most of the code in a single step, whereas the Claude models iterate over hundreds of steps, interleaving probing and writing.
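Structural statistics such as function count and average function length can be measured mechanically. The sketch below shows one way to do so for Python source using the standard `ast` module; it is an assumption for illustration, since the source does not describe the authors' measurement tooling and benchmark solutions may be written in any language. `function_stats` and `SAMPLE` are hypothetical names.

```python
import ast
from statistics import mean


def function_stats(source: str) -> tuple[int, float]:
    """Return (number of functions, mean function length in lines)."""
    tree = ast.parse(source)
    lengths = [
        node.end_lineno - node.lineno + 1  # inclusive span of the def block
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return len(lengths), (mean(lengths) if lengths else 0.0)


SAMPLE = """\
def parse(line):
    key, _, value = line.partition("=")
    return key.strip(), value.strip()

def main(lines):
    out = {}
    for line in lines:
        k, v = parse(line)
        out[k] = v
    return out
"""

count, avg_len = function_stats(SAMPLE)

if __name__ == "__main__":
    print(count, avg_len)  # 2 4.5
```

Applied to a candidate and a reference codebase, the same two numbers would give a comparison of the kind reported above (fewer, longer functions in model-written code).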
Task Difficulty Distribution
- Simple CLI Tools (e.g., nnn, fzf, gron) see pass rates of 60–80%.
- Complex Systems (FFmpeg, PHP, Typst) typically result in <10% pass rates.
Failure Modes
- Architectural Collapses: Agents prefer monolithic, single-file implementations, eschewing modular decomposition, explicit interfaces, or file hierarchies.
- Missing Edge Cases: Solutions frequently lack coverage for boundary behaviors and error modes.
- Underspecified Resource Handling: Long functions and absence of defensive programming patterns limit robustness.
6. Benchmark Design Philosophy and Recommendations
ProgramBench’s design emphasizes behavioral evaluation, extensibility, and modular, unbiased assessment of agent capabilities:
- Implementation-Agnosticism: All correct solutions—regardless of design choices or language—are valid if they pass behavioral tests.
- Agent-Driven Test Generation: The coverage- and assertion-focused generation loop ensures test quality and mitigates trivial solution strategies.
- Isolation and Anti-Cheating Controls: Strict environment constraints prevent use of source lookup, wrapping, or other shortcut solutions.
The authors recommend several directions for future research:
- Explicit Architectural Planning: Prompting agents to plan modules and interfaces prior to coding.
- Multi-Agent Decomposition: Tasking specialized sub-agents (planner, coder, tester) with subtasks to tackle project complexity.
- Feedback-Integrated Development: Leveraging static analysis and code complexity metrics to encourage modularity.
- Iterative Human-In-The-Loop Guidance: Allowing for human feedback prior to code generation.
- Richer and More Challenging Specifications: Extending benchmarks with performance and resource constraints and contributions from new domains.
7. Comparative Context and Impact
ProgramBench occupies a distinct niche among software engineering and program synthesis benchmarks:
- Compared to micro-benchmarks or patch-oriented benchmarks (e.g., HumanEval, MBPP, SWE-bench), ProgramBench targets the holistic regeneration of full projects, demanding architectural reasoning (Yang et al., 5 May 2026).
- ProjDevBench, ProBench, and PBEBench provide complementary measurements targeting end-to-end project completion (from spec to codebase), competitive-programming-level algorithmic reasoning, and fine-grained programming-by-example synthesis, respectively (Lu et al., 2 Feb 2026, Yang et al., 28 Feb 2025, Naik et al., 29 May 2025).
- ProgramBench’s methodology serves as a model for large-scale, implementation-agnostic benchmarking using real-world projects, agent-based environment isolation, and behavioral fuzzing for evaluation.
A plausible implication is that, as LMs are increasingly deployed as autonomous coding agents, benchmarks such as ProgramBench become essential for exposing deficiencies in system-level reasoning, modular decomposition, and robust software construction. Such benchmarks can in turn drive both the development and the evaluation of next-generation AI programming agents (Yang et al., 5 May 2026).