ProgramBench Benchmark Framework

Updated 8 May 2026
  • ProgramBench is a benchmark framework that evaluates LM agents' ability to reconstruct full software projects using documentation and a reference executable.
  • It employs a four-phase methodology including repository selection, executable creation, fuzzing-based test generation, and environment assembly.
  • Empirical findings reveal that current models struggle with architectural planning and modular decomposition, underscoring challenges in holistic software construction.

ProgramBench is a benchmark framework that measures the ability of software engineering agents built on large language models (LMs) to reconstruct complete software projects from scratch using only documentation and access to a reference executable. Designed to move beyond traditional code-generation micro-benchmarks, ProgramBench targets holistic software construction: agents must make the architectural, organizational, and design choices that real-world software engineering tasks require (Yang et al., 5 May 2026).

1. Problem Definition and Benchmark Scope

ProgramBench frames the software engineering task as follows: given a compiled executable and associated documentation, an LM-based agent is required to infer, architect, and implement a functionally equivalent codebase. The agent must select the programming language, project layout, build system, modular decomposition, interfaces, and data structures independently. No internal structure, API, or reference implementation is provided; correctness is judged solely on the agent’s solution behaving like the target executable.

Key characteristics:

  • End-to-End Development: The agent performs a “zero-to-one” software re-creation, not just patching, refactoring, or editing code.
  • Implementation-Agnostic Evaluation: Any code structure satisfying behavioral requirements passes; solutions in any language are allowed.
  • Agent Isolation: Agents lack internet access and cannot decompile or wrap the executable; only documentation and observed behaviors inform their solutions.
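
The notion of "behaving like the target executable" can be illustrated with a small side-by-side comparison. The sketch below is only an illustration under stated assumptions: the helper names `run` and `behaves_like` are hypothetical, the two "programs" are stand-in Python one-liners, and only exit code and standard output are compared. ProgramBench itself scores solutions with a pre-generated behavioral test suite rather than with a live comparison of this kind.

```python
import subprocess
import sys


def run(cmd, stdin_text=""):
    """Run a command and capture only externally observable behavior."""
    proc = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=30
    )
    return proc.returncode, proc.stdout


def behaves_like(reference_cmd, candidate_cmd, args, stdin_text=""):
    """True if both programs give the same exit code and stdout for one input."""
    return run(reference_cmd + args, stdin_text) == run(candidate_cmd + args, stdin_text)


if __name__ == "__main__":
    # Stand-ins: two differently written programs that both upper-case stdin.
    reference = [sys.executable, "-c",
                 "import sys; print(sys.stdin.read().upper(), end='')"]
    candidate = [sys.executable, "-c",
                 "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())"]
    print(behaves_like(reference, candidate, [], "hello\nworld\n"))
```

The two stand-ins are structured differently but are indistinguishable from the outside, which is the sense in which the evaluation is implementation-agnostic.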

2. Dataset Construction Methodology

ProgramBench’s dataset comprises 200 tasks automatically extracted from public GitHub repositories. The pipeline features four distinct phases:

  1. Repository Selection: Candidate projects must be written in compiled languages (C, C++, Rust, Go, Java, Haskell), build a standalone executable, and not require online or interactive dependencies.
  2. Executable Creation: A specialized agent automates project compilation. The outcome includes a reproducible build script (compile.sh), logs, and frozen dependencies.
  3. Behavioral Test Generation: Coverage-guided agent-driven fuzzing produces test suites asserting on output, exit codes, and observable side effects, never on implementation internals.
  4. Inference Environment Assembly: The final environment provides the agent with the executable (with execute-only permissions), usage documentation, and required assets, stripping away sources and network capabilities.
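
As a rough illustration of the execute-only permission mentioned in phase 4, the following sketch sets mode 0o111 on a file on a POSIX system. The function name `make_execute_only` and the file name are hypothetical, and the source does not describe how the real environment is assembled. Note that a process running as root can still read such a file, so file modes alone would not be sufficient isolation.

```python
import os
import stat
import tempfile


def make_execute_only(path):
    """Drop read and write bits, keeping only execute for user, group, and others."""
    os.chmod(path, stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)  # mode 0o111
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        binary = os.path.join(workdir, "reference_tool")  # hypothetical file name
        with open(binary, "wb") as f:
            f.write(b"\x7fELF")  # placeholder bytes, not a real executable
        print(oct(make_execute_only(binary)))
```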

Dataset statistics:

Metric                Median   Min    Max
LOC                   8,635    212    2.7M
Code files            50       1      5,342
Runtime packages      10       0      113
GitHub stars          2,124    202    79,693
Commits               646      13     145,991
Project age (years)   7.9      0.3    17.8

Language distribution: Rust (33%), Go (29%), C/C++ (33%), Java/Haskell (~1%)

(Yang et al., 5 May 2026)

3. Fuzzing-Based Behavioral Testing

Behavioral testing leverages an agent-in-the-loop, coverage-guided fuzzing architecture. The process proceeds as follows:

  1. Iteration: The agent identifies coverage gaps in the gold executable, proposes new tests targeting uncovered paths, and these are added if they cover new paths and pass a linter.
  2. Coverage Metrics: Coverage cov(T) is the fraction of source lines executed by test suite T; generation halts once cov(T) reaches the threshold τ (typically τ = 0.95).
  3. Assertion Linting: Weak tests—e.g., those only asserting on exit codes, substring matches <15 chars, or containing disjunctions or exception swallowing—are rejected or revised.
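
The iteration and stopping rule above can be sketched as a greedy loop. This is a simplified model, not the benchmark's implementation: the function name `grow_suite` is hypothetical, candidate tests arrive as a fixed list rather than being proposed by an agent, and each candidate's covered lines are given directly instead of being measured.

```python
def grow_suite(candidates, total_lines, tau=0.95, lint=lambda test: True):
    """Greedily accept candidate tests until coverage reaches tau.

    candidates: iterable of (test_name, set_of_covered_line_numbers) pairs.
    Returns the accepted test names and the final coverage fraction.
    """
    suite, covered = [], set()
    for name, lines in candidates:
        if len(covered) / total_lines >= tau:
            break  # coverage target reached, stop generating
        if not lint(name):
            continue  # weak assertion: reject the test
        if lines - covered:  # the test reaches at least one new line
            suite.append(name)
            covered |= lines
    return suite, len(covered) / total_lines


if __name__ == "__main__":
    candidates = [
        ("t_help", set(range(0, 8))),
        ("t_help_again", set(range(0, 5))),  # adds no new lines, skipped
        ("t_parse", set(range(8, 16))),
        ("t_errors", set(range(16, 20))),
        ("t_extra", {1, 2}),  # never considered: target already reached
    ]
    print(grow_suite(candidates, total_lines=20))
```

In this toy run, three of the five candidates are accepted and coverage reaches 1.0, so the loop stops before the last candidate is examined.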

This framework ensures high-quality, nontrivial, and robust testing of candidate solutions purely on external behavior, not source-level similarity (Yang et al., 5 May 2026).
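
To make the linting rules concrete, here is a minimal sketch of a static check over test source. It rests on several assumptions that are not in the source: tests are written as Python `assert` statements, they refer to the observed behavior through variables named `exit_code` and `stdout`, and the function name `lint_test` is hypothetical. Only the 15-character threshold and the four rule categories come from the description above.

```python
import ast

MIN_SUBSTRING_LEN = 15  # threshold from the linting rule described above


def lint_test(source):
    """Return reasons a test is weak; an empty list means it passes the lint."""
    tree = ast.parse(source)
    problems = []
    asserts = [n for n in ast.walk(tree) if isinstance(n, ast.Assert)]

    # Rule: assertions that look at nothing except the exit code.
    referenced = {
        n.id for a in asserts for n in ast.walk(a.test) if isinstance(n, ast.Name)
    }
    if asserts and referenced <= {"exit_code"}:
        problems.append("asserts only on exit code")

    for a in asserts:
        for node in ast.walk(a.test):
            # Rule: "A or B" lets a test pass for the wrong reason.
            if isinstance(node, ast.BoolOp) and isinstance(node.op, ast.Or):
                problems.append("disjunction in assertion")
            # Rule: '"short" in stdout' with a literal under the threshold.
            if (isinstance(node, ast.Compare)
                    and any(isinstance(op, ast.In) for op in node.ops)
                    and isinstance(node.left, ast.Constant)
                    and isinstance(node.left.value, str)
                    and len(node.left.value) < MIN_SUBSTRING_LEN):
                problems.append("substring match shorter than 15 characters")

    # Rule: "except ...: pass" hides failures.
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler) and all(
            isinstance(stmt, ast.Pass) for stmt in node.body
        ):
            problems.append("exception swallowed")
    return problems


if __name__ == "__main__":
    print(lint_test("assert exit_code == 0\n"))
    print(lint_test('assert exit_code == 2\nassert stdout == "usage: tool FILE\\n"\n'))
```

A test that pins down the full output string passes, while one that checks only the exit code is reported as weak.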

4. Evaluation Metrics and Experimental Setup

Performance is measured using several metrics:

  • Per-task Pass Rate: For model $M$ and task $t$ with $n_t$ tests:

$$\mathrm{PR}(M,t) = \frac{1}{n_t}\sum_{i=1}^{n_t} \mathbf{1}(\text{test}_i~\text{passes})$$

  • Aggregate Scores: Mean pass rate $\overline{\mathrm{PR}(M)}$ over 200 tasks.
  • Resolution Fractions:
    • Fully Resolved: Fraction of tasks with $\mathrm{PR}(M,t)=1$.
    • Almost Resolved: Fraction with $\mathrm{PR}(M,t)\ge0.95$.
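
The metrics above translate directly into a few lines of code. The sketch below assumes per-task results are available as lists of booleans keyed by task name; the function names and the toy data are hypothetical.

```python
def pass_rate(results):
    """PR(M, t): fraction of passing tests for one task (results: list of bools)."""
    return sum(results) / len(results)


def summarize(per_task_results, almost_threshold=0.95):
    """Aggregate per-task pass rates into the three reported numbers."""
    rates = [pass_rate(r) for r in per_task_results.values()]
    n = len(rates)
    return {
        "mean_pass_rate": sum(rates) / n,
        "fully_resolved": sum(r == 1.0 for r in rates) / n,
        "almost_resolved": sum(r >= almost_threshold for r in rates) / n,
    }


if __name__ == "__main__":
    toy = {
        "task_a": [True] * 20,                 # PR = 1.00
        "task_b": [True] * 19 + [False],       # PR = 0.95
        "task_c": [True] * 5 + [False] * 15,   # PR = 0.25
        "task_d": [False] * 10,                # PR = 0.00
    }
    print(summarize(toy))
```

Because the "almost resolved" condition is a threshold on the pass rate, every fully resolved task also counts as almost resolved under this definition.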

Evaluation harness:

  • Nine state-of-the-art LMs are tested: Claude Opus 4.7, Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, Gemini 3.1 Pro, Gemini 3 Flash, GPT 5.4, GPT 5.4 mini, and GPT 5 mini.
  • Agents are isolated in Ubuntu-based containers (20 CPUs, 60 GB RAM), receive only documentation and the gold executable, and operate under tight step and time budgets (at most 1,000 steps or 6 h per task).

Model               Resolved   Almost (≥95%)   Mean API Calls   Mean Cost (USD)
Claude Opus 4.7     0.0%       3.0%            93               3.81
Claude Opus 4.6     0.0%       2.5%            260              11.38
Claude Sonnet 4.6   0.0%       1.6%            475              27.09
Claude Haiku 4.5    0.0%       0.0%            124              0.80
Gemini 3.1 Pro      0.0%       0.0%            94               1.51
Gemini 3 Flash      0.0%       0.0%            89               0.33
GPT 5.4             0.0%       0.0%            16               0.33
GPT 5.4 mini        0.0%       0.0%            18               0.04
GPT 5 mini          0.0%       0.0%            15               0.03

No model fully resolved any task; the peak “almost resolved” is 3.0%, i.e., 6 tasks out of 200 (Yang et al., 5 May 2026).

5. Empirical Findings and Failure Modes

Codebase Structural Analysis

  • Monolithic Implementations: 85% of model solutions are significantly smaller than references (median 1,173 vs. 3,068 LOC). Median file count is 3 (reference: 15), and most solutions use a flat directory structure.
  • Function Granularity: Models generate fewer, longer functions (e.g., Sonnet 4.6: 24 functions vs. 44 in reference, average length 35 vs. 24 lines).
  • Trajectories: GPT 5.4 models tend to output most of the code in one step, whereas Claude models iterate over hundreds of steps, interleaving probing and writing.
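
Structural metrics such as function count and average function length can be measured with a short script. The source does not say which tooling was used, and submitted solutions may be in any language; the sketch below handles Python source only, using the standard `ast` module (Python 3.8 or later for `end_lineno`), and the function name `function_stats` is hypothetical.

```python
import ast


def function_stats(source):
    """Return (function_count, mean_length_in_lines) for Python source text."""
    tree = ast.parse(source)
    lengths = [
        node.end_lineno - node.lineno + 1  # lines from "def" to last body line
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    mean = sum(lengths) / len(lengths) if lengths else 0.0
    return len(lengths), mean


if __name__ == "__main__":
    sample = (
        "def a():\n"
        "    return 1\n"
        "\n"
        "def b(x):\n"
        "    y = x + 1\n"
        "    return y\n"
    )
    print(function_stats(sample))
```

Comparing these two numbers between a generated solution and its reference gives a simple signal of the "fewer, longer functions" pattern described above.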

Task Difficulty Distribution

  • Simple CLI Tools (e.g., nnn, fzf, gron) see pass rates of 60–80%.
  • Complex Systems (FFmpeg, PHP, Typst) typically result in <10% pass rates.

Failure Modes

  • Architectural Collapses: Agents prefer monolithic, single-file implementations, eschewing modular decomposition, explicit interfaces, or file hierarchies.
  • Missing Edge Cases: Solutions frequently lack coverage for boundary behaviors and error modes.
  • Underspecified Resource Handling: Long functions and absence of defensive programming patterns limit robustness.

6. Benchmark Design Philosophy and Recommendations

ProgramBench’s design emphasizes behavioral evaluation, extensibility, and modular, unbiased assessment of agent capabilities:

  • Implementation-Agnosticism: All correct solutions—regardless of design choices or language—are valid if they pass behavioral tests.
  • Agent-Driven Test Generation: The coverage- and assertion-focused generation loop ensures test quality and mitigates trivial solution strategies.
  • Isolation and Anti-Cheating Controls: Strict environment constraints prevent use of source lookup, wrapping, or other shortcut solutions.

The authors recommend several directions for future research:

  1. Explicit Architectural Planning: Prompting agents to plan modules and interfaces prior to coding.
  2. Multi-Agent Decomposition: Tasking specialized sub-agents (planner, coder, tester) with subtasks to tackle project complexity.
  3. Feedback-Integrated Development: Leveraging static analysis and code complexity metrics to encourage modularity.
  4. Iterative Human-In-The-Loop Guidance: Allowing for human feedback prior to code generation.
  5. Richer and More Challenging Specifications: Extending benchmarks with performance and resource constraints and contributions from new domains.

7. Comparative Context and Impact

ProgramBench occupies a distinct niche among software engineering and program synthesis benchmarks:

  • Compared to micro-benchmarks or patch-oriented benchmarks (e.g., HumanEval, MBPP, SWE-bench), ProgramBench targets the holistic regeneration of full projects, demanding architectural reasoning (Yang et al., 5 May 2026).
  • ProjDevBench, ProBench, and PBEBench provide complementary measurements targeting end-to-end project completion (from spec to codebase), competitive-programming-level algorithmic reasoning, and fine-grained programming-by-example synthesis, respectively (Lu et al., 2 Feb 2026, Yang et al., 28 Feb 2025, Naik et al., 29 May 2025).
  • ProgramBench’s methodology serves as a model for large-scale, implementation-agnostic benchmarking using real-world projects, agent-based environment isolation, and behavioral fuzzing for evaluation.

A plausible implication is that as LMs are increasingly deployed as autonomous coding agents, benchmarks such as ProgramBench are essential for exposing deficiencies in system-level reasoning, modular decomposition, and robust software construction, driving both the development and evaluation of next-generation AI programming agents (Yang et al., 5 May 2026).
