AgentDAM: Scalable LLM Code Benchmarking
- AgentDAM is an integrated framework that automates project-level code benchmarking by using LLM-driven agents to generate and annotate product requirement documents (PRDs) and to evaluate solutions against them.
- It employs a five-stage workflow that generates PRDs, refines test plans, and executes multimodal tests including integration and shell-based checks, reducing reliance on high-cost experts.
- PRDBench, developed with AgentDAM, benchmarks 50 real-world Python projects across 20 domains and demonstrates high evaluation efficiency with strong alignment to human scoring.
AgentDAM refers to an agent-driven annotation and evaluation methodology that enables scalable, low-cost, and realistic benchmarking of LLM code agents at the project level. It operationalizes agent-centered workflows—code agents generate, annotate, and judge tasks—substantially reducing the dependency on high-cost domain experts and rigid unit-test metrics. The principal instantiation of AgentDAM in code generation benchmarking is PRDBench, a suite built to capture the true demands and complexities of end-to-end software development across diverse domains (Fu et al., 28 Oct 2025).
1. Core Definition and Motivation
AgentDAM is an integrated framework for both constructing project-level code benchmarks and automating their evaluation via advanced LLM-based code agents. Both task annotation (through structured Product Requirement Documents, PRDs) and solution assessment (through agent-as-a-judge protocols) are delegated to code agents, limiting human involvement to lightweight interface checks. The explicit motivations are:
- Annotation Cost Reduction: Traditional benchmarks, such as PaperBench, necessitate extensive expert input, incurring multi-day annotation cycles per task.
- Metric Flexibility: Existing benchmarks mainly rely on unit tests; AgentDAM targets richer metrics including integration, shell-based, and file-differencing tests.
- Pace with Advancement: Code agents are evolving more rapidly than human-maintained evaluation pipelines can accommodate, necessitating scalable and adaptable benchmarking ecosystems.
AgentDAM achieves scalable dataset curation and multi-modal testing, and lowers annotation cost to approximately eight hours per task, typically achievable by undergraduate-level annotators (Fu et al., 28 Oct 2025).
2. Annotation and Evaluation Pipeline
The construction pipeline proceeds through five tightly controlled stages:
- PRD & Test-Plan Initialization: An LLM (e.g., GPT-4.1) drafts a PRD with Overview, Functional Requirements, and Data Requirements sections. It then elaborates a test plan using the Arrange-Act-Assert format.
- Scaffold & Criteria Generation: The agent expands the PRD into module/interface scaffolds and a detailed metrics scheme specifying test interfaces and tangible outputs.
- Human Inspection: Annotators execute the generated scaffolding/tests, verifying alignment with PRD-defined interfaces and outputs.
- Agent-Based Iterative Refinement: Discrepancies prompt annotator feedback; the agent revises artifacts accordingly. This loop iterates until passing.
- Scaffold Removal: The final benchmark archives only the PRD, the distilled criteria scheme, test artifacts, and a reference implementation. All scaffolding is excised.
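The Arrange-Act-Assert format used in the test plans can be illustrated with a small pytest-style unit test. The Huffman example below is hypothetical (inspired by a metric name appearing later in PRDBench's criteria), not code from the benchmark itself:

```python
import heapq

def huffman_code_lengths(freqs):
    """Return the Huffman code length for each symbol in a frequency table."""
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freqs}
    if len(heap) == 1:
        return {next(iter(freqs)): 1}
    counter = len(heap)  # tie-breaker so merged nodes compare deterministically
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        for s in a + b:  # every symbol under the merged node gains one bit
            lengths[s] += 1
        heapq.heappush(heap, (fa + fb, counter, a + b))
        counter += 1
    return lengths

def test_generate_huffman_codes():
    # Arrange: fix a small symbol-frequency table.
    freqs = {"a": 5, "b": 2, "c": 1}
    # Act: run the unit under test.
    lengths = huffman_code_lengths(freqs)
    # Assert: more frequent symbols never receive longer codes.
    assert lengths["a"] <= lengths["b"] <= lengths["c"]
```

Each test-plan entry spells out exactly these three phases, which is what lets a later agent (or human) re-derive the test from the PRD alone.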
Formal pseudocode for this process, in LaTeX algorithmic notation:

\begin{algorithm}[h]
\caption{Agent-Driven Task Annotation}
\begin{algorithmic}[1]
\Require SeedTasks $\mathcal{T}$, humanAnnotators $H$, codeAgent $A$
\Ensure PRDBench tasks $\mathcal{D}$
\State $\mathcal{D} \gets \emptyset$
\For{each $t \in \mathcal{T}$ satisfying \text{Python-implementable} and \text{public-data}}
    \State $\text{PRD} \gets A.\text{DraftPRD}(t)$
    \State $\text{TestPlan} \gets A.\text{DraftTestPlan}(\text{PRD})$
    \State $\text{Scaffold} \gets A.\text{ExpandScaffold}(\text{PRD})$
    \State $\text{Criteria} \gets A.\text{GenerateCriteria}(\text{PRD}, \text{TestPlan})$
    \Repeat
        \State $\text{Feedback} \gets H.\text{Inspect}(\text{Scaffold}, \text{Criteria})$
        \If{$\text{Feedback} \neq \emptyset$}
            \State $A.\text{Refine}(\text{PRD}, \text{TestPlan}, \text{Scaffold}, \text{Criteria}, \text{Feedback})$
        \EndIf
    \Until{$\text{Feedback} = \emptyset$}
    \State Remove(\text{Scaffold})
    \State $\mathcal{D} \gets \mathcal{D} \cup \{(\text{PRD}, \text{Criteria}, \text{Tests}, \text{Reference})\}$
\EndFor
\end{algorithmic}
\end{algorithm}
This process standardizes benchmarks and aligns evaluation criteria strictly with PRD specifications.
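The five-stage loop can also be sketched as a Python driver. The agent and annotator interfaces below (draft_prd, inspect, refine, and so on) are invented stand-ins for whatever tooling an adopter wires in, not AgentDAM's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    prd: str = ""
    test_plan: str = ""
    scaffold: str = ""
    criteria: dict = field(default_factory=dict)

def annotate(seed, agent, annotator, max_rounds=5):
    """Hypothetical driver for the five-stage annotation pipeline."""
    task = Task()
    # Stage 1: PRD & test-plan initialization.
    task.prd = agent.draft_prd(seed)
    task.test_plan = agent.draft_test_plan(task.prd)
    # Stage 2: scaffold & criteria generation.
    task.scaffold = agent.expand_scaffold(task.prd)
    task.criteria = agent.generate_criteria(task.prd, task.test_plan)
    # Stages 3-4: human inspection with agent-based iterative refinement.
    for _ in range(max_rounds):
        feedback = annotator.inspect(task)
        if not feedback:          # no discrepancies: the loop terminates
            break
        agent.refine(task, feedback)
    # Stage 5: scaffold removal; only PRD, criteria, and tests survive.
    task.scaffold = ""
    return task
```

The max_rounds cap is a defensive assumption of this sketch; the paper describes the loop simply as iterating until the annotator's checks pass.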
3. PRDBench Dataset Structure and Content
PRDBench, constructed using AgentDAM, comprises:
- 50 real-world Python projects
- 20 application domains (e.g., data processing, ML, web scraping)
- 1,262 scoring points, distributed as:
- 409 Unit Test points
- 729 Shell Interaction points
- 124 File Comparison points
Each task features a structured PRD and a criteria scheme encoded as a JSON plan, which for every metric specifies the name (e.g., “3.2 Unit Test – Generate Huffman Codes”), an AAA description, the expected commands or function signatures, and the required outputs. The dataset ensures coverage across all 20 targeted domains and exhibits diverse scaffold and task sizes.
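A criteria-scheme entry in that shape might look like the following. The field names and values here are illustrative guesses at the structure described above, not PRDBench's exact schema:

```python
import json

# Hypothetical criteria-scheme entry; field names and values are
# illustrative, not PRDBench's exact schema.
criterion = {
    "name": "3.2 Unit Test - Generate Huffman Codes",
    "modality": "unit_test",
    "aaa": {
        "arrange": "Load the sample frequency table shipped with the task",
        "act": "Call the documented build_huffman_codes(freqs) interface",
        "assert": "Codes are prefix-free; rarer symbols never get shorter codes",
    },
    "command": "pytest tests/test_huffman.py -q",
    "expected_output": "1 passed",
    "max_score": 1,
}

serialized = json.dumps(criterion, indent=2)
```

Keeping each metric self-describing in this way is what lets an evaluation agent run it without access to the original scaffolding.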
4. Evaluation Methodology: Agent-as-a-Judge Paradigm
The Agent-as-a-Judge protocol underpins autonomous, flexible evaluation. The EvalAgent is a lightweight LLM-powered agent equipped with:
- File read/write operations
- Shell execution
- Multimodal inputs (with GPT-4o)
- Internal judge tools that can process user-simulated inputs
EvalAgent executes each test from the criteria scheme, compares generated outputs to expectations, and produces structured JSON assessment reports.
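The mechanical core of that loop might look like the sketch below. The real EvalAgent is LLM-driven and multimodal, so this only shows the run-compare-report shape, assuming the hypothetical criteria format from above:

```python
import json
import subprocess

def run_eval(criteria, workdir):
    """Sketch of an EvalAgent core loop: run each test command from the
    criteria scheme, compare output to expectations, and emit a JSON
    report. Criteria fields (command, expected_output, max_score) are a
    hypothetical schema, not AgentDAM's actual format."""
    results = []
    for c in criteria:
        proc = subprocess.run(
            c["command"], shell=True, cwd=workdir,
            capture_output=True, text=True, timeout=300,
        )
        # Naive check: award full marks when the expected output appears.
        passed = c["expected_output"] in proc.stdout
        results.append({
            "name": c["name"],
            "score": c["max_score"] if passed else 0,
            "max_score": c["max_score"],
        })
    return json.dumps({"results": results}, indent=2)
```

An LLM judge replaces the naive substring check with semantic comparison, which is what makes shell-interaction and file-comparison metrics tractable.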
Formal metrics include:
- Overall Correctness Score: $\mathrm{Score} = \sum_i s_i / \sum_i m_i$, with $s_i$ the awarded score and $m_i$ the maximum score for metric $i$.
- Coverage and Robustness: additional aggregate metrics computed over the same scoring points.
- Metric Modalities: Metrics extend to unit tests (pytest), shell interaction (CLI outputs vs. reference), and file comparison (artifact diffing), capturing the complexities of true software QA pipelines including integration and end-to-end testing.
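Given per-metric scores and maxima, the overall correctness aggregation (total awarded over total maximum) can be sketched as a small helper; this mirrors the definition above but is not AgentDAM's actual implementation:

```python
def correctness_score(results):
    """Overall correctness: total awarded score divided by total maximum
    score across all metrics in an evaluation report. A sketch of the
    earned-over-maximum aggregation, not AgentDAM's exact code."""
    earned = sum(r["score"] for r in results)
    maximum = sum(r["max_score"] for r in results)
    return earned / maximum if maximum else 0.0
```

Weighting every metric by its maximum score lets tasks mix unit-test, shell, and file-comparison points in a single scale.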
5. Experimental Validation and Agent Performance
A suite of both minimal (ADK-framework) and commercial agents was evaluated:
| Agent | DEV Pass Rate (%) | DEBUG Pass Rate (%) |
|---|---|---|
| GPT-5 | 55.8 | 60.2 |
| Claude | 45.5 | 49.1 |
| Gemini | 14.3 | 16.0 |
| Qwen3-Coder | 37.6 | 47.0 |
| CodeX | 56.2 | 50.2 |
| Claude Code | 36.6 | 45.5 |
| Gemini CLI | 16.4 | 21.6 |
| Qwen Code | 39.6 | 35.7 |
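The table's numbers can be aggregated directly; for instance, a few lines of Python identify which agents regress from DEV to DEBUG (pass rates copied verbatim from the table above):

```python
# Pass rates from the results table (DEV, DEBUG), in percent.
rates = {
    "GPT-5": (55.8, 60.2),
    "Claude": (45.5, 49.1),
    "Gemini": (14.3, 16.0),
    "Qwen3-Coder": (37.6, 47.0),
    "CodeX": (56.2, 50.2),
    "Claude Code": (36.6, 45.5),
    "Gemini CLI": (16.4, 21.6),
    "Qwen Code": (39.6, 35.7),
}

# Agents whose DEBUG pass rate drops below their DEV pass rate.
regressions = [name for name, (dev, dbg) in rates.items() if dbg < dev]
print(regressions)  # -> ['CodeX', 'Qwen Code']
```

Most agents improve when given a buggy starting point, so the two regressions stand out as cases where framework behavior, not model quality, drives the result.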
Key observations:
- Pass rates demonstrate strong dependence on underlying LLM quality.
- Framework adaptation impacts results: some agent architectures amplify or degrade performance on debugging tasks, particularly when agent-driven refactoring breaks interface preservation.
- EvalAgent completes a full-task evaluation in ∼425 s at a cost of $2.68, compared with 0.5–1 hour for human annotators.
- EvalAgent–human scoring alignment is high (81.6% perfect match across 282 cases).
6. Practical Adoption and Future Development
AgentDAM’s guidelines for effective adoption emphasize:
- Proactive seed task filtering for language/data compliance.
- Utilization of strong LLMs for PRD/test-plan authoring.
- Reliance on scaffolding expansion tools for standardized annotation.
- Limiting human involvement to essential interface checks; all criterion writing and metric elaboration are agent-driven.
- Maintaining multi-modal test coverage to mirror real-world software complexity.
- Automated scoring via EvalAgent for cost and scalability advantages.
Open challenges and future research priorities include:
- Extending the framework to non-Python ecosystems (e.g., Java, JavaScript).
- Incorporating additional dimensions such as performance and security evaluation.
- Enhancing robustness of agent-based judging, particularly to chain-of-thought drift.
- Investigating multi-agent judge architectures and adversarial alignments.
- Dynamically evolving task suites as underlying code agents improve, enabling perpetual benchmarking.
A plausible implication is that as the capabilities of LLM-based agents increase, future evaluation frameworks will need to incorporate both greater automation and more sophisticated, diverse evaluation criteria, including adversarial and context-sensitive judgment mechanisms (Fu et al., 28 Oct 2025).
7. Context Within Benchmarking Ecosystem
PRDBench, enabled by AgentDAM, specifically addresses the limitations of prior benchmarks such as high expert annotation cost and inflexible, test-only scoring metrics. In contrast to agent–expert collaboration frameworks in FDABench (for data agents over heterogeneous data) (Wang et al., 2 Sep 2025), AgentDAM leverages nearly end-to-end agent annotation and evaluation, aligning evaluation cost and granularity with the actual development pace and demands of modern LLM-powered code-generation systems.
By aligning benchmark construction and evaluation with automated agent tools, AgentDAM and PRDBench present a model for future project-level benchmarks required to keep pace with foundation-model-driven advances in software engineering research and practice.