ProgramBench Benchmark Framework

Updated 8 May 2026

ProgramBench is a benchmark framework that evaluates LM agents' ability to reconstruct full software projects using documentation and a reference executable.
It employs a four-phase methodology including repository selection, executable creation, fuzzing-based test generation, and environment assembly.
Empirical findings reveal that current models struggle with architectural planning and modular decomposition, underscoring challenges in holistic software construction.

ProgramBench is a benchmark framework that measures the ability of LM-based software engineering agents to reconstruct complete software projects from scratch using only documentation and access to a reference executable. Designed to move beyond traditional code-generation micro-benchmarks, ProgramBench targets holistic software construction and requires agents to make the architectural, organizational, and design choices that real-world software engineering tasks demand (Yang et al., 5 May 2026).

1. Problem Definition and Benchmark Scope

ProgramBench frames the software engineering task as follows: given a compiled executable and associated documentation, an LM-based agent is required to infer, architect, and implement a functionally equivalent codebase. The agent must independently select the programming language, project layout, build system, modular decomposition, interfaces, and data structures. No internal structure, API, or reference implementation is provided; correctness is judged solely on whether the agent’s solution behaves like the target executable.

Key characteristics:

End-to-End Development: The agent performs a “zero-to-one” software re-creation, not just patching, refactoring, or editing code.
Implementation-Agnostic Evaluation: Any code structure satisfying behavioral requirements passes; solutions in any language are allowed.
Agent Isolation: Agents lack internet access and cannot decompile or wrap the executable; only documentation and observed behaviors inform their solutions.
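
The implementation-agnostic criterion above can be illustrated with a minimal sketch. It assumes that "behaving like the target" is checked by comparing standard output and exit code for the same arguments and input; the helper names (`Observation`, `observe`, `behaves_alike`) are hypothetical and not part of ProgramBench, and ignoring stderr is a simplification made here.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Externally visible result of one invocation."""
    stdout: bytes
    stderr: bytes
    exit_code: int


def observe(command: list[str], stdin: bytes = b"", timeout: float = 10.0) -> Observation:
    """Run a command and capture only what an outside observer can see."""
    proc = subprocess.run(command, input=stdin, capture_output=True, timeout=timeout)
    return Observation(proc.stdout, proc.stderr, proc.returncode)


def behaves_alike(reference: list[str], candidate: list[str],
                  args: list[str], stdin: bytes = b"") -> bool:
    """Compare stdout and exit code; stderr is ignored (an assumption of this sketch)."""
    ref = observe(reference + args, stdin)
    cand = observe(candidate + args, stdin)
    return (ref.stdout, ref.exit_code) == (cand.stdout, cand.exit_code)


if __name__ == "__main__":
    # Two stand-in "programs" with different implementations but the same behavior.
    ref = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
    cand = [sys.executable, "-c", "import sys; sys.stdout.write(sys.argv[1].upper() + '\\n')"]
    print(behaves_alike(ref, cand, ["hello"]))
```

Because only observable behavior is compared, the candidate could be written in any language or structured in any way without affecting the verdict.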

2. Dataset Construction Methodology

ProgramBench’s dataset comprises 200 tasks automatically extracted from public GitHub repositories. The pipeline features four distinct phases:

Repository Selection: Candidate projects must be written in compiled languages (C, C++, Rust, Go, Java, Haskell), build a standalone executable, and not require online or interactive dependencies.
Executable Creation: A specialized agent automates project compilation. The outcome includes a reproducible build script (compile.sh), logs, and frozen dependencies.
Behavioral Test Generation: Coverage-guided, agent-driven fuzzing produces test suites that assert on output, exit codes, and observable side effects, never on implementation internals.
Inference Environment Assembly: The final environment provides the agent with the executable (with execute-only permissions), usage documentation, and required assets, stripping away sources and network capabilities.
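
The execute-only permission mentioned in the last phase can be sketched as follows, assuming a POSIX file system. The function name `make_execute_only` and the file name are hypothetical; the source does not describe how ProgramBench sets permissions, so this only shows what mode `--x--x--x` looks like in practice.

```python
import os
import stat
import tempfile


def make_execute_only(path: str) -> int:
    """Set mode --x--x--x on path and return the resulting permission bits."""
    os.chmod(path, stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        binary = os.path.join(workdir, "reference_executable")  # hypothetical name
        with open(binary, "wb") as handle:
            handle.write(b"\x7fELF")  # placeholder bytes, not a real binary
        print(oct(make_execute_only(binary)))
```

With read permission removed, an unprivileged process can still run a native binary but cannot open it to inspect its bytes, which is consistent with the stated goal of preventing decompilation.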

Dataset statistics:

Metric	Median	Min	Max
LOC	8,635	212	2.7M
Code files	50	1	5,342
Runtime packages	10	0	113
GitHub stars	2,124	202	79,693
Commits	646	13	145,991
Project age (years)	7.9	0.3	17.8

Language distribution: Rust (33%), Go (29%), C/C++ (33%), Java/Haskell (~1%)

(Yang et al., 5 May 2026)

3. Fuzzing-Based Behavioral Testing

Behavioral testing leverages an agent-in-the-loop, coverage-guided fuzzing architecture. The process proceeds as follows:

Iteration: The agent identifies coverage gaps in the gold executable and proposes new tests targeting the uncovered paths; a proposed test is added to the suite only if it covers new paths and passes a linter.
Coverage Metrics: Coverage cov(T) is the fraction of source lines executed by test suite T; generation halts once cov(T) reaches a threshold τ (typically τ = 0.95).
Assertion Linting: Weak tests are rejected or revised, e.g., those that assert only on exit codes, match substrings shorter than 15 characters, contain disjunctions, or swallow exceptions.

This framework ensures high-quality, nontrivial, and robust testing of candidate solutions purely on external behavior, not source-level similarity (Yang et al., 5 May 2026).
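
The iteration described in this section can be sketched as a simple acceptance loop. The callables `propose`, `run_with_coverage`, and `passes_lint` are hypothetical stand-ins for the LM agent, the instrumented gold executable, and the assertion linter; the stopping rule on stalled progress and the `max_rounds` cap are assumptions added here so the loop always terminates.

```python
from typing import Callable, Iterable

Test = str              # opaque test identifier
Lines = frozenset[int]  # source lines a test executes


def grow_suite(
    propose: Callable[[set[int]], Iterable[Test]],
    run_with_coverage: Callable[[Test], Lines],
    passes_lint: Callable[[Test], bool],
    total_lines: int,
    tau: float = 0.95,
    max_rounds: int = 100,
) -> tuple[list[Test], float]:
    """Accept proposed tests until cov(T) >= tau or no progress is made."""
    suite: list[Test] = []
    covered: set[int] = set()
    for _ in range(max_rounds):
        if len(covered) / total_lines >= tau:
            break
        uncovered = set(range(1, total_lines + 1)) - covered
        progressed = False
        for test in propose(uncovered):
            lines = run_with_coverage(test)
            # Keep a test only if it passes the linter and reaches new lines.
            if passes_lint(test) and not lines <= covered:
                suite.append(test)
                covered |= lines
                progressed = True
        if not progressed:
            break  # proposer is stuck; stop rather than loop forever
    return suite, len(covered) / total_lines


if __name__ == "__main__":
    # Toy 20-line "program": each test maps to the lines it executes.
    toy = {
        "t_basic": frozenset(range(1, 11)),
        "t_dup": frozenset(range(1, 6)),      # adds nothing new
        "t_errors": frozenset(range(11, 20)),
        "t_weak": frozenset({20}),            # rejected by the linter below
    }
    queue = list(toy)

    def propose(uncovered: set[int]) -> list[str]:
        batch = list(queue)
        queue.clear()
        return batch

    print(grow_suite(propose, toy.__getitem__, lambda t: t != "t_weak", total_lines=20))
```

In the toy run, the redundant test and the lint-failing test are both discarded, and generation stops once 19 of 20 lines are covered.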

4. Evaluation Metrics and Experimental Setup

Performance is measured using several metrics:

Per-task Pass Rate: For model M and task t with nₜ tests:

$\mathrm{PR}(M,t) = \frac{1}{n_t}\sum_{i=1}^{n_t} \mathbf{1}(\text{test}_i~\text{passes})$

Aggregate Scores: Mean pass rate over the 200 tasks, $\overline{\mathrm{PR}(M)} = \frac{1}{200}\sum_{t} \mathrm{PR}(M,t)$.
Resolution Fractions:
- Fully Resolved: Fraction of tasks with $\mathrm{PR}(M,t)=1$.
- Almost Resolved: Fraction of tasks with $\mathrm{PR}(M,t)\ge0.95$.
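
The metrics above translate directly into a few lines of code. In this sketch, the function names and the toy per-task results are illustrative assumptions; only the definitions of pass rate, mean pass rate, and the two resolution fractions come from the text.

```python
from statistics import mean


def pass_rate(results: list[bool]) -> float:
    """PR(M, t): fraction of a task's tests that pass."""
    return sum(results) / len(results)


def summarize(per_task_results: dict[str, list[bool]], almost: float = 0.95) -> dict[str, float]:
    """Aggregate per-task pass rates into the benchmark-level scores."""
    rates = [pass_rate(r) for r in per_task_results.values()]
    return {
        "mean_pass_rate": mean(rates),
        "fully_resolved": sum(r == 1.0 for r in rates) / len(rates),
        "almost_resolved": sum(r >= almost for r in rates) / len(rates),
    }


if __name__ == "__main__":
    # Hypothetical outcomes for four tasks (True = test passed).
    toy = {
        "task_a": [True] * 20,
        "task_b": [True] * 19 + [False],
        "task_c": [True] * 10 + [False] * 10,
        "task_d": [False] * 4,
    }
    print(summarize(toy))
```

Under these definitions a fully resolved task also counts as almost resolved, since a pass rate of 1 satisfies the 0.95 threshold.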

Evaluation harness:

Nine state-of-the-art LMs are tested (Claude Opus 4.7, Opus 4.6, Sonnet 4.6, and Haiku 4.5; Gemini 3.1 Pro and Gemini 3 Flash; GPT 5.4, GPT 5.4 mini, and GPT 5 mini).
Agents are isolated in Ubuntu-based containers (20 CPUs, 60GB RAM), receive only documentation and the gold executable, and operate under tight step and time budgets (≤1,000 steps or 6h per task).

Model	Resolved	Almost (≥95%)	Mean API Calls	Mean Cost (USD)
Claude Opus 4.7	0.0%	3.0%	93	3.81
Claude Opus 4.6	0.0%	2.5%	260	11.38
Claude Sonnet 4.6	0.0%	1.6%	475	27.09
Claude Haiku 4.5	0.0%	0.0%	124	0.80
Gemini 3.1 Pro	0.0%	0.0%	94	1.51
Gemini 3 Flash	0.0%	0.0%	89	0.33
GPT 5.4	0.0%	0.0%	16	0.33
GPT 5.4 mini	0.0%	0.0%	18	0.04
GPT 5 mini	0.0%	0.0%	15	0.03

No model fully resolved any task; the peak “almost resolved” is 3.0%, i.e., 6 tasks out of 200 (Yang et al., 5 May 2026).

5. Empirical Findings and Failure Modes

Codebase Structural Analysis

Monolithic Implementations: 85% of model solutions are significantly smaller than references (median 1,173 vs. 3,068 LOC). Median file count is 3 (reference: 15), and most solutions use a flat directory structure.
Function Granularity: Models generate fewer, longer functions (e.g., Sonnet 4.6: 24 functions vs. 44 in reference, average length 35 vs. 24 lines).
Trajectories: GPT 5.4 models tend to output most of the code in one step; Claude models iterate over hundreds of steps, interleaving probing and writing.
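
Structural statistics such as function count and average function length can be gathered with a short static analysis. The sketch below is an illustration only: it handles Python sources via the standard `ast` module, whereas the benchmark's solutions span several languages, and the source does not say which tooling the authors used. The name `function_stats` is hypothetical.

```python
import ast
from statistics import mean


def function_stats(source: str) -> tuple[int, float]:
    """Return (number of functions, mean function length in lines)."""
    tree = ast.parse(source)
    lengths = [
        node.end_lineno - node.lineno + 1
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return len(lengths), (mean(lengths) if lengths else 0.0)


if __name__ == "__main__":
    sample = (
        "def a():\n"
        "    return 1\n"
        "\n"
        "def b(x):\n"
        "    y = x + 1\n"
        "    z = y * 2\n"
        "    return z\n"
    )
    print(function_stats(sample))
```

Comparing these two numbers between a generated solution and its reference gives a rough signal of the "fewer, longer functions" pattern reported above.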

Task Difficulty Distribution

Simple CLI Tools (e.g., nnn, fzf, gron) reach pass rates of 60–80%.
Complex Systems (FFmpeg, PHP, Typst) typically yield pass rates below 10%.

Failure Modes

Architectural Collapses: Agents prefer monolithic, single-file implementations, eschewing modular decomposition, explicit interfaces, or file hierarchies.
Missing Edge Cases: Solutions frequently lack coverage for boundary behaviors and error modes.
Underspecified Resource Handling: Long functions and absence of defensive programming patterns limit robustness.

6. Benchmark Design Philosophy and Recommendations

ProgramBench’s design emphasizes behavioral evaluation, extensibility, and modular, unbiased assessment of agent capabilities:

Implementation-Agnosticism: Any solution that passes the behavioral tests is valid, regardless of design choices or language.
Agent-Driven Test Generation: The coverage- and assertion-focused generation loop ensures test quality and mitigates trivial solution strategies.
Isolation and Anti-Cheating Controls: Strict environment constraints prevent use of source lookup, wrapping, or other shortcut solutions.

The authors recommend several directions for future research:

Explicit Architectural Planning: Prompting agents to plan modules and interfaces prior to coding.
Multi-Agent Decomposition: Tasking specialized sub-agents (planner, coder, tester) with subtasks to tackle project complexity.
Feedback-Integrated Development: Leveraging static analysis and code complexity metrics to encourage modularity.
Iterative Human-In-The-Loop Guidance: Allowing for human feedback prior to code generation.
Richer and More Challenging Specifications: Extending benchmarks with performance and resource constraints and contributions from new domains.

7. Comparative Context and Impact

ProgramBench occupies a distinct niche among software engineering and program synthesis benchmarks:

Compared to micro-benchmarks or patch-oriented benchmarks (e.g., HumanEval, MBPP, SWE-bench), ProgramBench targets the holistic regeneration of full projects, demanding architectural reasoning (Yang et al., 5 May 2026).
ProjDevBench, ProBench, and PBEBench provide complementary measurements targeting end-to-end project completion (from spec to codebase), competitive-programming-level algorithmic reasoning, and fine-grained programming-by-example synthesis, respectively (Lu et al., 2 Feb 2026, Yang et al., 28 Feb 2025, Naik et al., 29 May 2025).
ProgramBench’s methodology serves as a model for large-scale, implementation-agnostic benchmarking using real-world projects, agent-based environment isolation, and behavioral fuzzing for evaluation.

A plausible implication is that, as LMs are increasingly deployed as autonomous coding agents, benchmarks such as ProgramBench become essential for exposing deficiencies in system-level reasoning, modular decomposition, and robust software construction, and thereby for guiding both the development and the evaluation of next-generation AI programming agents (Yang et al., 5 May 2026).