E2EDevBench: End-to-End LLM Dev Evaluation
- E2EDevBench is a comprehensive framework that evaluates LLM-based autonomous agents on realistic, project-level software development from requirements to fully tested code.
- It simulates authentic development workflows by providing only natural-language specifications and using robust, hybrid evaluation protocols to assess requirement fulfillment and planning.
- Empirical results reveal that agents struggle more with requirement comprehension and planning than with code synthesis, emphasizing the need for enhanced verification and traceability.
E2EDevBench is a comprehensive, project-level benchmark and evaluation framework for assessing LLM-based autonomous agents in the context of realistic end-to-end software development. It is designed to address critical limitations of prior benchmarks—such as insufficient challenge, impoverished evaluation metrics, and weak realism—by providing curated tasks, rigorous hybrid evaluation protocols, and controlled experimental designs. E2EDevBench supports reproducible, fine-grained comparisons of LLM-powered development agents, focusing on requirement comprehension, planning, and functional fulfillment in authentic software engineering scenarios.
1. Rationale and Scope
E2EDevBench was introduced to resolve three principal deficiencies prevalent in previous end-to-end software development (E2ESD) benchmarks:
- Challenge Deficit: Existing benchmarks such as SoftwareDev and ProjectDev rely on “toy” tasks with minimal codebases (typically ≈300–400 LOC), failing to capture complexities inherent to real-world projects.
- Evaluation Limitations: Traditionally employed metrics—such as code similarity, absence of TODOs, or non-crashing executables—do not robustly indicate accurate fulfillment of natural-language requirements.
- Workflow Non-Authenticity: Systems are frequently provided with test suites or skeleton code at input (as in DevEval, Commit0), diverging from true end-to-end workflows that start from requirements alone.
E2EDevBench explicitly simulates a realistic software development process by presenting agents solely with a human-refined natural-language requirements document and an empty working directory. It selects medium-complexity open-source projects (average 19.2 files, 2,011 LOC, 119.7 tests) sampled quarterly from recent PyPI releases to ensure relevance and diversity (Zeng et al., 6 Nov 2025).
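To make the task setup concrete, the sketch below shows one plausible representation of a benchmark instance; the class and field names are illustrative assumptions, not the released data format.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class E2EDevBenchTask:
    """Hypothetical shape of a single benchmark instance (field names are assumptions)."""
    project_name: str        # PyPI project the task was derived from
    requirements_doc: str    # human-refined natural-language specification
    workdir: Path            # empty working directory handed to the agent
    hidden_tests_dir: Path   # ground-truth pytest suite, withheld from the agent

    def agent_inputs(self) -> dict:
        # Only the specification and an empty working directory are exposed to the agent;
        # the ground-truth tests are reserved for post-hoc evaluation.
        return {"requirements": self.requirements_doc, "workdir": self.workdir}
```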
2. Benchmark Construction and Annotation Workflow
E2EDevBench comprises two core instantiations arising from complementary research efforts (Liu et al., 16 Oct 2025, Zeng et al., 6 Nov 2025): E2EDev, with its semi-automatic BDD annotation and automated harness, and the dynamically curated E2EDevBench for controlled agent experimentation.
Pipeline Stages
- Source-Level Filtering: candidates are drawn from the PyPI BigQuery dataset and must contain more than 5 Python files, an acceptable code-to-comment ratio, and a robust pytest suite.
- LLM-Based Filtering: Candidates exhibiting trivial functionality, OS-specific reliance, or extensive external APIs are excluded by LLM vetting.
- Execution-Based Filtering: Projects are validated for buildability and test reproducibility in sandboxed environments.
- Time-Sliced Sampling, Requirement Generation: Every quarter, 10 new projects are sampled and a natural-language spec is drafted and refined by human experts.
- Human-in-the-Loop Multi-Agent Annotation Framework (HITL-MAA) (E2EDev):
  - Test-ID Annotation: LLMs assign unique data-test-id attributes; humans check for logic and name conflicts.
  - User Requirement Extraction: LLMs parse UI/event structure; requirements are proposed, then refined for clarity.
  - BDD Scenario Generation: LLMs produce Gherkin-style scenarios; humans ensure error and edge-case coverage.
  - Step Definition Implementation: LLMs generate Selenium/Behave Python code; iterative bug-fixing is partly human-in-the-loop.
Empirically, only 20 % of requirements and 50 % of scenarios necessitate human editing; less than 20 % of automation scripts require manual fixes (Liu et al., 16 Oct 2025).
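To make the HITL-MAA output concrete, the following is a minimal sketch of a Behave step definition in the style described above; the Gherkin step text, data-test-id values, and URL are hypothetical and illustrative, not excerpts from the released benchmark.

```python
# Hypothetical steps/login_steps.py in the style of E2EDev's generated step definitions.
# The Gherkin text, element IDs, and base URL are illustrative assumptions.
from behave import given, when, then
from selenium.webdriver.common.by import By

@given('the user is on the login page')
def step_open_login_page(context):
    # context.browser is assumed to be a Selenium WebDriver created in environment.py
    context.browser.get(context.base_url + "/login")

@when('the user submits valid credentials')
def step_submit_credentials(context):
    # Elements are located via the data-test-id attributes annotated in the pipeline
    context.browser.find_element(By.CSS_SELECTOR, '[data-test-id="username-input"]').send_keys("alice")
    context.browser.find_element(By.CSS_SELECTOR, '[data-test-id="password-input"]').send_keys("secret")
    context.browser.find_element(By.CSS_SELECTOR, '[data-test-id="login-button"]').click()

@then('the dashboard is displayed')
def step_check_dashboard(context):
    assert context.browser.find_element(By.CSS_SELECTOR, '[data-test-id="dashboard"]').is_displayed()
```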
3. Evaluation Framework
E2EDevBench employs a rigorous, hybrid evaluation comprising both functional and requirement-level measurement, with a strong emphasis on reproducibility and objectivity.
Hybrid Assessment Approach
- Automated Test-Case Migration: a "Test Migration Agent" adapts the original ground-truth test suites, which are never shown to agents during code generation, to each agent-produced codebase without modifying that codebase, thereby expanding the validation surface beyond agent-generated tests.
- LLM-Based Requirement Verification: an independent LLM judge (Gemini-2.5-Pro) receives the requirements document, the produced code, and the test-run results, and labels each requirement as implemented or not implemented; each judgment is repeated three times and only unanimous verdicts are accepted (see the sketch below).
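A compact sketch of this unanimous-vote protocol follows; `call_llm_judge` is a hypothetical stand-in for the Gemini-2.5-Pro judging call, and the handling of disagreements is an assumption.

```python
def verify_requirement(requirement: str, code_context: str, test_report: str,
                       call_llm_judge, runs: int = 3) -> str:
    """Query the LLM judge `runs` times and accept a verdict only if all runs agree.

    `call_llm_judge` is a hypothetical callable returning "implemented" or
    "not implemented" given the requirement, the produced code, and test results.
    """
    verdicts = {call_llm_judge(requirement, code_context, test_report) for _ in range(runs)}
    if len(verdicts) == 1:
        return verdicts.pop()   # unanimous: accept the shared verdict
    return "undetermined"       # disagreement: flag for re-run or human review (assumed policy)
```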
Core Metrics
| Metric | Definition |
|---|---|
| Pass-Rate | Fraction of migrated ground-truth test cases that pass against the agent-produced codebase (E2EDevBench) |
| Impl-Rate | Fraction of requirements judged implemented by the LLM judge (E2EDevBench) |
| Req.Acc | Fraction of user requirements whose BDD scenarios all pass (E2EDev) |
| Test.Acc | Fraction of individual BDD test scenarios that pass (E2EDev) |
| Balanced | Combined score balancing requirement-level and test-level fulfillment |
| Cost (USD) | Total LLM API cost per project |
| Footprint | Token consumption (prompt and completion) per project |
| Duration (s) | Wall-clock runtime per project |
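Under the working definitions above, the per-project scores reduce to simple ratios. The sketch below assumes the relevant counts are already available; it is not the benchmark's reference implementation, and the weighting used for the balanced score is an assumption.

```python
def project_metrics(n_tests_passed: int, n_tests_total: int,
                    n_reqs_implemented: int, n_reqs_total: int) -> dict:
    """Compute per-project scores under the working definitions above."""
    pass_rate = n_tests_passed / n_tests_total if n_tests_total else 0.0
    impl_rate = n_reqs_implemented / n_reqs_total if n_reqs_total else 0.0
    return {
        "pass_rate": pass_rate,                   # migrated ground-truth tests that pass
        "impl_rate": impl_rate,                   # requirements judged implemented by the LLM judge
        "balanced": (pass_rate + impl_rate) / 2,  # simple average; exact weighting is an assumption
    }
```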
Empirical testing confirms high alignment between LLM and human judgment (high Pearson correlation; 76–84 % agreement) (Zeng et al., 6 Nov 2025).
4. Agent Frameworks, Experimental Design, and Comparative Results
E2EDevBench supports controlled evaluation of diverse agent architectures by standardizing experimental conditions and LLM backbones.
Agent Workflows
- SDAgent-Single: Monolithic; the agent handles the full loop from requirements to code, tests, and debugging.
- SDAgent-DT (Developer → Tester): A two-agent pipeline in which one agent implements the code and a separate agent generates and runs tests, debugging as needed.
- SDAgent-DDT (Designer → Developer → Tester): A three-agent pipeline introducing a Designer who authors a high-level plan, a Developer who produces code from this plan, and a Tester responsible for validation and repair.
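The workflows differ mainly in how responsibilities are split across roles. The sketch below outlines the SDAgent-DDT hand-off at a high level; `designer`, `developer`, and `tester` are hypothetical callables wrapping role-specific agents, and the debugging-loop structure is an assumption rather than the exact SWE-Agent implementation.

```python
def run_sdagent_ddt(requirements: str, designer, developer, tester, max_debug_rounds: int = 3):
    """High-level sketch of the Designer -> Developer -> Tester hand-off.

    `designer`, `developer`, and `tester` are hypothetical callables wrapping
    role-specific LLM agents; their signatures are illustrative only.
    """
    plan = designer(requirements)                  # Designer: high-level architecture / file plan
    codebase = developer(requirements, plan)       # Developer: implement code from the plan
    for _ in range(max_debug_rounds):              # Tester: run tests and request fixes
        report = tester.run_tests(codebase)        # hypothetical report object with .all_passed
        if report.all_passed:
            break
        codebase = developer(requirements, plan, feedback=report)
    return codebase
```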
Experimental Setup
- Unified SWE-Agent-based codebase; LLMs: Gemini-2.5-Pro and Gemini-2.5-Flash (temperature 0.2, max 200 steps/task); evaluation repeated over 50 projects sampled from Jan 2024–Jan 2025, with each run under fixed hardware and prompt protocols (Zeng et al., 6 Nov 2025).
- For E2EDev, the evaluated frameworks include Vanilla LLM (one-shot prompting), GPT-Engineer, and multi-agent frameworks such as Self-Collaboration, MapCoder, ChatDev, and MetaGPT, all evaluated with GPT-4o and Qwen-family LLMs (Liu et al., 16 Oct 2025).
Key Findings (Selected Results)
| Framework/Agent | Impl-Rate/Req.Acc (%) | Pass-Rate/Test.Acc (%) | Cost (USD) | Duration (s) |
|---|---|---|---|---|
| SDAgent-DT + Gemini-Pro | 53.50 | 79.95 | 7.05 | - |
| GPT-Engineer | 50.37 | 66.68 | 0.0198 | 21 |
| MapCoder | 47.92 | 63.89 | 0.1091 | 93 |
| MetaGPT | ≈0 | 0.18 | 0.0951 | 66 |
- No evaluated agent—across both one-shot and multi-agent strategies—achieves greater than 60 % on either requirement or implementation fulfillment. Even GPT-4o and Gemini-2.5-Pro, at their best, satisfy approximately half of all requirements.
- Single-agent (e.g. GPT-Engineer) and Developer–Tester agent workflows consistently outperform more complex multi-agent (e.g. DDT, MapCoder, ChatDev) pipelines in both cost and reliability.
- The resource overhead for multi-agent workflows is high (4–21M tokens, $4.85–$7.05 per repo), with marginal or negative returns. Multi-agent systems incur 5–15× more prompts/tokens for comparable or worse coverage (Liu et al., 16 Oct 2025, Zeng et al., 6 Nov 2025).
5. Error Taxonomy and Bottlenecks
Systematic manual annotation of 1,000 unfulfilled requirements yields the following profile:
| Error Type | Proportion (%) |
|---|---|
| Missing Component/Feature | 32.4 |
| Incomplete Implementation | 18.1 |
| Incorrect Implementation | 34.1 |
| Dependent/Upstream Failure | 15.3 |
Root-cause taxonomy further reveals:
- Task Planning (55.8 %): omissions (27.9 %), misinterpretation (22.2 %), and poor architectural decisions (5.6 %); these failures stem from requirement analysis and planning rather than code synthesis.
- Task Execution (38.6 %): shallow implementations (12.5 %), dependency issues (21.0 %), core LLM/agent capability failures (3.2 %), and context loss (1.9 %).
- Task Verification (5.7 %): self-generated test coverage is insufficient to catch the remaining gaps.
The key bottleneck in agent workflows is not code generation capability but requirement omission/misunderstanding and insufficient systematic self-verification. A plausible implication is that enhancing structured requirement parsing, traceability, and plan↔code cross-checking could yield significant progress (Zeng et al., 6 Nov 2025).
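One way to operationalize such traceability, sketched below, is to have the agent maintain an explicit requirement-to-artifact map and block submission while any requirement is unmapped; the data layout and function are hypothetical, not part of E2EDevBench.

```python
def unaddressed_requirements(requirement_ids: list[str],
                             trace_map: dict[str, list[str]]) -> list[str]:
    """Return requirement IDs with no associated code or test artifact.

    `trace_map` is a hypothetical structure the agent would maintain, e.g.
    {"REQ-3": ["src/parser.py::parse", "tests/test_parser.py::test_parse"]}.
    """
    return [rid for rid in requirement_ids if not trace_map.get(rid)]

# Example: block submission while coverage gaps remain.
gaps = unaddressed_requirements(["REQ-1", "REQ-2"], {"REQ-1": ["src/cli.py::main"]})
assert gaps == ["REQ-2"]   # REQ-2 has no linked code or test yet
```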
6. Usage, Accessibility, and Community Impact
E2EDevBench and E2EDev are publicly available:
- Source code and the fully annotated benchmark: https://github.com/SCUNLP/E2EDev
- Dataset via HuggingFace: https://huggingface.co/datasets/GuanZhiZhao/E2EDev
Practical evaluation protocol:
- Generate a codebase under a working directory.
- Access the corresponding `features/` and `steps/` directories from the E2EDev repository.
- Run the `behave` command with ChromeDriver for headless testing.
- Review the Behave summary for per-requirement pass/fail outcomes.
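For the headless-browser step, a minimal sketch of a Behave `environment.py` that sets up Selenium's ChromeDriver is shown below; the specific Chrome options and the `base_url` value are assumptions, not the released configuration.

```python
# environment.py: a sketch of headless Chrome setup for a Behave run.
# Option values and base_url are assumptions, not E2EDev's released configuration.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def before_all(context):
    options = Options()
    options.add_argument("--headless=new")       # run Chrome without a display
    options.add_argument("--no-sandbox")
    context.browser = webdriver.Chrome(options=options)
    context.base_url = "http://localhost:8000"   # assumed address of the generated app

def after_all(context):
    context.browser.quit()
```

With this file placed under `features/`, invoking `behave` from the project root executes every scenario against the generated codebase and prints the per-scenario summary referenced above.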
E2EDevBench thus enables rigorous, automated, reproducible, and project-level benchmarking targeting the actual fulfillment of fine-grained user requirements. This suggests its adoption may drive future benchmarks away from superficial metrics and toward operationally meaningful criteria.
7. Research Implications and Directions
The empirical findings and methodological advances of E2EDevBench inform several research priorities:
- Workflow Orchestration: The two-agent Developer–Tester split achieves the best performance–cost balance; adding further workflow granularity (e.g., a Designer) can introduce harmful bias through rigid adherence to the initial blueprint.
- Enhancing Requirement Coverage: Omission/misinterpretation of requirements remains the dominant failure driver; promising strategies include strengthened requirement analysis modules, specification→code traceability, and explicit LLM-based plan verification sub-agents.
- Self-Verification: Embedding termination checks (“Have you covered every spec bullet?”) and automated cross-referencing between plan and output may mitigate “premature submission.”
- Cost-Effective Ensembles: Best-of-N approaches (majority vote, low-cost runner sub-sampling) could leverage variance in agent performance while containing resource demands.
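As a concrete, hypothetical instance of such an ensemble, one could generate N candidate codebases with a low-cost agent and keep the one with the highest self-test pass rate; both callables in the sketch below are placeholders, not part of any released framework.

```python
def best_of_n(generate_candidate, score_candidate, n: int = 5):
    """Generate `n` candidate codebases and keep the best-scoring one.

    `generate_candidate` and `score_candidate` are hypothetical callables, e.g. a
    low-cost agent run and a pass rate computed from the agent's own test suite.
    """
    candidates = [generate_candidate(seed=i) for i in range(n)]
    return max(candidates, key=score_candidate)
```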
In summary, E2EDevBench constitutes a challenging, realistic, and methodologically robust foundation for measuring and improving LLM-based agents in end-to-end software development. Quantitative and error analyses indicate that requirement comprehension and cross-validated planning, rather than generative code synthesis alone, remain the key obstacles to autonomous E2ESD agents (Liu et al., 16 Oct 2025, Zeng et al., 6 Nov 2025).