ProgramBench: Can Language Models Rebuild Programs From Scratch?

ProgramBench introduces a rigorous new benchmark that challenges language models to reconstruct entire software projects from scratch, given only an executable and documentation. Unlike prior benchmarks focused on localized tasks like bug fixes, ProgramBench requires holistic software design, architectural reasoning, and module decomposition without any structural hints or language constraints. An evaluation of nine frontier models on 200 real-world tasks spanning diverse languages and domains reveals a stark capability gap: no model successfully reconstructs a single complete task, and most solutions exhibit monolithic, non-modular code that diverges sharply from human design patterns.
Script
State-of-the-art language models can fix bugs and add features, but can they design and build an entire software project from nothing but a binary and its documentation? ProgramBench puts this question to the test.
The benchmark transforms 200 real-world GitHub repositories into reconstruction challenges. The pipeline compiles the original project into a reference binary, generates comprehensive behavioral tests through coverage-guided fuzzing, then strips away all source code and build artifacts, leaving only the executable and documentation.
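The paper's exact tooling is not shown here, so the Python sketch below is illustrative only: the file filters, the plain `make` build, and the fixed seed inputs are assumptions standing in for the real per-language build logic and coverage-guided fuzzer.

```python
import subprocess
from pathlib import Path

SOURCE_SUFFIXES = {".c", ".cc", ".cpp", ".h", ".rs", ".go", ".py"}  # assumed filter
BUILD_FILES = {"Makefile", "CMakeLists.txt", "Cargo.toml", "go.mod"}  # assumed filter

def build_reference_binary(repo: Path) -> Path:
    """Compile the original project; assumes a plain `make` build."""
    subprocess.run(["make", "-C", str(repo)], check=True)
    return repo / "a.out"  # placeholder output name

def generate_behavioral_tests(binary: Path, out_dir: Path, seeds) -> None:
    """Record stdin -> stdout pairs from the reference binary as oracle tests.
    A real pipeline would derive the inputs with a coverage-guided fuzzer
    rather than iterating over a fixed seed list."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, seed in enumerate(seeds):
        result = subprocess.run([str(binary)], input=seed,
                                capture_output=True, timeout=10)
        (out_dir / f"case_{i}.in").write_bytes(seed)
        (out_dir / f"case_{i}.out").write_bytes(result.stdout)

def strip_sources(repo: Path) -> None:
    """Delete source files and build scripts, keeping docs and the binary."""
    for path in repo.rglob("*"):
        if path.is_file() and (path.suffix in SOURCE_SUFFIXES
                               or path.name in BUILD_FILES):
            path.unlink()
```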
The results are unequivocal. Nine frontier models were evaluated, and not one successfully reconstructed a single complete task. Even partial solutions are rare, with the best model reaching a 95 percent test pass rate on only 3 percent of tasks.
When models do produce working code, the structure tells a revealing story. Agent-generated programs collapse into monolithic single files with fewer, longer functions, and they consistently use only about a third of the code volume of human implementations, suggesting that robustness and modularity are missing.
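As a rough illustration of this kind of structural comparison, here is a minimal sketch, restricted to Python sources for simplicity; the metric names and repository paths are hypothetical, not the paper's actual analysis code.

```python
import ast
from pathlib import Path

def structure_metrics(root: Path) -> dict:
    """Count files, functions, and lines for all Python sources under root."""
    func_lengths, total_loc, n_files = [], 0, 0
    for f in root.rglob("*.py"):
        text = f.read_text()
        n_files += 1
        total_loc += len(text.splitlines())
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                func_lengths.append(node.end_lineno - node.lineno + 1)
    return {
        "files": n_files,
        "functions": len(func_lengths),
        "avg_func_len": sum(func_lengths) / max(len(func_lengths), 1),
        "loc": total_loc,
    }

# human = structure_metrics(Path("original_repo"))  # hypothetical paths
# agent = structure_metrics(Path("agent_repo"))
# A monolithic reconstruction shows fewer files, fewer but longer functions,
# and a loc ratio of roughly one third relative to the human code.
```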
Models diverge sharply in strategy. Some agents incrementally write, compile, and debug through hundreds of edit cycles, while others emit nearly the entire codebase in a single bulk write with minimal subsequent iteration or exploration.
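The incremental strategy amounts to a write-compile-test feedback loop. The sketch below shows one plausible shape for it, under stated assumptions: the `agent.propose_edit` interface, the `make` build, and the test layout are all hypothetical, not the benchmark's actual harness.

```python
import subprocess
from pathlib import Path

def run_tests(binary: Path, tests: Path) -> list[str]:
    """Replay recorded stdin -> stdout pairs against the rebuilt binary."""
    failures = []
    for case in sorted(tests.glob("*.in")):
        expected = case.with_suffix(".out").read_bytes()
        got = subprocess.run([str(binary)], input=case.read_bytes(),
                             capture_output=True, timeout=10).stdout
        if got != expected:
            failures.append(case.name)
    return failures

def incremental_loop(agent, workdir: Path, tests: Path,
                     max_iters: int = 300) -> bool:
    """Write-compile-test loop; `agent.propose_edit` is a hypothetical API."""
    feedback = "No source available; start from the documentation."
    for _ in range(max_iters):
        agent.propose_edit(workdir, feedback)  # hypothetical agent call
        build = subprocess.run(["make", "-C", str(workdir)],
                               capture_output=True, text=True)
        if build.returncode != 0:
            feedback = "Build failed:\n" + build.stderr
            continue
        failures = run_tests(workdir / "a.out", tests)
        if not failures:
            return True
        feedback = f"{len(failures)} tests failing, e.g. {failures[:3]}"
    return False
```

A bulk-write agent, by contrast, would make a single `propose_edit`-style call producing the whole codebase and rarely revisit it after the first build.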
ProgramBench reveals that generating functions and fixing bugs is a fundamentally different challenge from designing complete software systems. The benchmark sets a high bar for future research in autonomous software engineering, and you can explore the full findings and create your own video explainers at EmergentMind.com.