AgentDAM: Scalable LLM Code Benchmarking

Updated 3 February 2026
  • AgentDAM is an integrated framework that automates project-level code benchmarking by using LLM-driven agents to generate, annotate, and evaluate product requirement documents.
  • It employs a five-stage workflow that generates PRDs, refines test plans, and executes multimodal tests including integration and shell-based checks, reducing reliance on high-cost experts.
  • PRDBench, developed with AgentDAM, benchmarks 50 real-world Python projects across 20 domains and demonstrates high evaluation efficiency with strong alignment to human scoring.

AgentDAM refers to an agent-driven annotation and evaluation methodology that enables scalable, low-cost, and realistic benchmarking of LLM code agents at the project level. It operationalizes agent-centered workflows—code agents generate, annotate, and judge tasks—substantially reducing the dependency on high-cost domain experts and rigid unit-test metrics. The principal instantiation of AgentDAM in code generation benchmarking is PRDBench, a suite built to capture the true demands and complexities of end-to-end software development across diverse domains (Fu et al., 28 Oct 2025).

1. Core Definition and Motivation

AgentDAM is an integrated framework for both constructing project-level code benchmarks and automating their evaluation via advanced LLM-based code agents. Both task annotation (through structured Product Requirement Documents, PRDs) and solution assessment (through agent-as-a-judge protocols) are delegated to code agents, limiting human involvement to lightweight interface checks. The explicit motivations are:

  • Annotation Cost Reduction: Traditional benchmarks, such as PaperBench, necessitate extensive expert input, incurring multi-day annotation cycles per task.
  • Metric Flexibility: Existing benchmarks mainly rely on unit tests; AgentDAM targets richer metrics including integration, shell-based, and file-differencing tests.
  • Pace with Advancement: Code agents are evolving more rapidly than human-maintained evaluation pipelines can accommodate, necessitating scalable and adaptable benchmarking ecosystems.

AgentDAM achieves scalable dataset curation and multi-modal testing, and lowers annotation cost to approximately eight hours per task, a workload typically manageable by undergraduate-level annotators (Fu et al., 28 Oct 2025).

2. Annotation and Evaluation Pipeline

The construction pipeline proceeds through five tightly controlled stages:

  1. PRD & Test-Plan Initialization: An LLM (e.g., GPT-4.1) drafts a PRD with Overview, Functional Requirements, and Data Requirements sections. It then elaborates a test plan using the Arrange-Act-Assert format.
  2. Scaffold & Criteria Generation: The agent expands the PRD into module/interface scaffolds and a detailed metrics scheme specifying test interfaces and tangible outputs.
  3. Human Inspection: Annotators execute the generated scaffolding/tests, verifying alignment with PRD-defined interfaces and outputs.
  4. Agent-Based Iterative Refinement: Discrepancies prompt annotator feedback; the agent revises artifacts accordingly. This loop iterates until passing.
  5. Scaffold Removal: The final benchmark archives only the PRD, the distilled criteria scheme, test artifacts, and a reference implementation. All scaffolding is excised.

The process can be sketched in LaTeX algorithmic notation (the agent and annotator operation names below are illustrative, derived from the five stages above):

\begin{algorithm}[h]
\caption{Agent-Driven Task Annotation}
\begin{algorithmic}[1]
\Require SeedTasks $\mathcal{T}$, humanAnnotators $H$, codeAgent $A$
\Ensure PRDBench tasks $\mathcal{B}$
\State $\mathcal{B} \gets \emptyset$
\For{each $t \in \mathcal{T}$ satisfying \text{Python-implementable} and \text{public-data}}
  \State $\mathrm{PRD} \gets A.\mathrm{DraftPRD}(t)$
  \State $\mathrm{TestPlan} \gets A.\mathrm{DraftTestPlan}(\mathrm{PRD})$
  \State $\mathrm{Scaffold} \gets A.\mathrm{ExpandScaffold}(\mathrm{PRD})$
  \State $\mathrm{Criteria} \gets A.\mathrm{GenerateCriteria}(\mathrm{PRD}, \mathrm{TestPlan})$
  \Repeat
    \State $\mathrm{Feedback} \gets H.\mathrm{Inspect}(\mathrm{Scaffold}, \mathrm{Criteria})$
    \If{$\mathrm{Feedback} \neq \emptyset$}
      \State $A.\mathrm{Refine}(\mathrm{PRD}, \mathrm{Scaffold}, \mathrm{Criteria}, \mathrm{Feedback})$
    \EndIf
  \Until{$\mathrm{Feedback} = \emptyset$}
  \State Remove(\text{Scaffold})
  \State $\mathcal{B} \gets \mathcal{B} \cup \{(\mathrm{PRD}, \mathrm{Criteria}, \mathrm{Tests})\}$
\EndFor
\end{algorithmic}
\end{algorithm}

This process standardizes benchmarks and aligns evaluation criteria strictly with PRD specifications.
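The annotate-inspect-refine loop at the heart of this pipeline can be sketched in Python. All function names here (`draft_prd`, `expand`, `refine`, and the annotator callback) are illustrative stand-ins for the agent's operations, not AgentDAM's actual API:

```python
def annotate_task(seed_task, agent, annotator, max_rounds=5):
    """Run one seed task through the five-stage pipeline (sketch)."""
    prd, test_plan = agent["draft_prd"](seed_task)        # Stage 1: PRD + test plan
    scaffold, criteria = agent["expand"](prd, test_plan)  # Stage 2: scaffold + criteria
    for _ in range(max_rounds):
        feedback = annotator(prd, scaffold, criteria)     # Stage 3: human inspection
        if not feedback:                                  # passes inspection
            break
        # Stage 4: agent revises artifacts from annotator feedback
        prd, scaffold, criteria = agent["refine"](prd, scaffold, criteria, feedback)
    # Stage 5: scaffold is discarded; only PRD and criteria are archived
    return {"prd": prd, "criteria": criteria}

# Toy stand-ins illustrating the control flow (one refinement round).
agent = {
    "draft_prd": lambda task: (f"PRD for {task}", "AAA test plan"),
    "expand": lambda prd, plan: ("scaffold", ["metric 1"]),
    "refine": lambda prd, sc, cr, fb: (prd + " (rev)", sc, cr + ["metric 2"]),
}
rounds = iter([["interface mismatch"], []])  # first inspection fails, second passes
task = annotate_task("huffman-coding", agent, lambda *a: next(rounds))
print(task["prd"])  # -> PRD for huffman-coding (rev)
```

The key design point mirrored here is that the human annotator only produces feedback signals; every artifact revision is performed by the agent.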

3. PRDBench Dataset Structure and Content

PRDBench, constructed using AgentDAM, comprises:

  • 50 real-world Python projects
  • 20 application domains (e.g., data processing, ML, web scraping)
  • 1,262 scoring points, distributed as:
    • 409 Unit Test points
    • 729 Shell Interaction points
    • 124 File Comparison points

Each task features a structured PRD (average length $\overline{\ell_{\rm PRD}} = 105.22$ lines) and a criteria scheme encoded as a JSON plan, which for every metric specifies the name (e.g., "3.2 Unit Test – Generate Huffman Codes"), an AAA description, the expected commands or function signatures, and the required outputs. The dataset covers all 20 targeted domains and exhibits diverse scaffold and task sizes.
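A criteria-scheme entry might look like the following. The field names (`aaa`, `command`, `expected_output`, `max_score`) are assumptions based on the description above, not the exact PRDBench schema:

```python
import json

# Hypothetical criteria-scheme entry for one scoring point.
metric = {
    "name": "3.2 Unit Test – Generate Huffman Codes",
    "aaa": {
        "arrange": "Build a frequency table from the sample corpus",
        "act": "Call build_huffman_codes(freq_table)",
        "assert": "Returned codes form a prefix-free binary code",
    },
    "command": "pytest tests/test_huffman.py::test_generate_codes",
    "expected_output": "1 passed",
    "max_score": 1,
}
print(json.dumps(metric, indent=2))
```

Encoding each scoring point as structured data like this is what lets an evaluation agent execute checks mechanically rather than re-interpreting free-text requirements.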

4. Evaluation Methodology: Agent-as-a-Judge Paradigm

The Agent-as-a-Judge protocol underpins autonomous, flexible evaluation. The EvalAgent is a lightweight LLM-powered agent equipped with:

  • File read/write operations
  • Shell execution
  • Multimodal inputs (with GPT-4o)
  • Internal judge tools that can process user-simulated inputs

EvalAgent executes each test from the criteria scheme, compares generated outputs to expectations, and produces structured JSON assessment reports.
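The judging loop can be sketched as follows. The command runner is injected here for illustration (in a real system it would wrap shell execution), and the record fields are assumptions consistent with the criteria-scheme description:

```python
import json

def evaluate(criteria, run_command):
    """Sketch of an agent-as-a-judge pass: execute each metric's check
    and score the observed output against the expectation."""
    report = []
    for metric in criteria:
        try:
            observed = run_command(metric["command"])
            score = metric["max_score"] if observed == metric["expected_output"] else 0
            report.append({"name": metric["name"], "score": score, "executed": True})
        except Exception:
            # a check the judge could not run at all
            report.append({"name": metric["name"], "score": 0, "executed": False})
    return json.dumps(report)  # structured JSON assessment report

criteria = [
    {"name": "unit", "command": "pytest -q", "expected_output": "1 passed", "max_score": 1},
    {"name": "cli", "command": "app --help", "expected_output": "usage", "max_score": 2},
]
fake_shell = {"pytest -q": "1 passed", "app --help": "error"}.get  # stand-in runner
report = json.loads(evaluate(criteria, lambda c: fake_shell(c, "")))
print([r["score"] for r in report])  # -> [1, 0]
```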

Formal metrics include:

  • Overall Correctness Score:

\[
\mathrm{Score} = \frac{\sum_{m\in\mathcal M} s_m}{\sum_{m\in\mathcal M} k_m} \in [0,1]
\]

with $s_m$ the awarded score and $k_m$ the maximum attainable score for metric $m$.

  • Coverage:

\[
\mathrm{Coverage} = \frac{|\{m \in \mathcal M : m \text{ executed}\}|}{|\mathcal M|}
\]

  • Robustness:

\[
\mathrm{Robustness} = 1 - \frac{\#\,\text{failures due to agent hallucinations}}{\#\,\text{total tests}}
\]

  • Metric Modalities: Metrics extend to unit tests (pytest), shell interaction (CLI outputs vs. reference), and file comparison (artifact diffing), capturing the complexities of true software QA pipelines including integration and end-to-end testing.
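Given per-metric judging records, the three aggregates above are straightforward to compute. The record field names here are illustrative:

```python
def aggregate(results):
    """Compute Score, Coverage, and Robustness from per-metric records.
    Each record: {"score": s_m, "max": k_m, "executed": bool, "hallucinated": bool}."""
    score = sum(r["score"] for r in results) / sum(r["max"] for r in results)
    coverage = sum(r["executed"] for r in results) / len(results)
    robustness = 1 - sum(r["hallucinated"] for r in results) / len(results)
    return score, coverage, robustness

results = [
    {"score": 1, "max": 1, "executed": True,  "hallucinated": False},
    {"score": 0, "max": 2, "executed": True,  "hallucinated": True},
    {"score": 1, "max": 1, "executed": False, "hallucinated": False},
]
score, coverage, robustness = aggregate(results)
print(score)  # -> 0.5  (2 points earned out of 4 attainable)
```

Here Coverage and Robustness both come out to 2/3: one metric was never executed, and one failure is attributed to a judge hallucination rather than the solution.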

5. Experimental Validation and Agent Performance

A suite of agents, both minimal (built on the ADK framework) and commercial, was evaluated:

| Agent       | DEV Pass Rate (%) | DEBUG Pass Rate (%) |
|-------------|-------------------|---------------------|
| GPT-5       | 55.8              | 60.2                |
| Claude      | 45.5              | 49.1                |
| Gemini      | 14.3              | 16.0                |
| Qwen3-Coder | 37.6              | 47.0                |
| CodeX       | 56.2              | 50.2                |
| Claude Code | 36.6              | 45.5                |
| Gemini CLI  | 16.4              | 21.6                |
| Qwen Code   | 39.6              | 35.7                |

Key observations:

  • Pass rates demonstrate strong dependence on underlying LLM quality.
  • Framework adaptation impacts results: some agent architectures amplify or degrade performance on debugging tasks, particularly if agent-driven refactoring impacts interface preservation.
  • EvalAgent achieves a full-task evaluation in ∼425 s at a cost of $2.68, compared with 0.5–1 hour for human annotators.
  • EvalAgent–human scoring alignment is high (81.6% perfect match across 282 cases).

6. Practical Adoption and Future Development

AgentDAM’s guidelines for effective adoption emphasize:

  • Proactive seed task filtering for language/data compliance.
  • Utilization of strong LLMs for PRD/test-plan authoring.
  • Reliance on scaffolding expansion tools for standardized annotation.
  • Limiting human involvement to essential interface checks; all criterion writing and metric elaboration is agent-driven.
  • Maintaining multi-modal test coverage to mirror real-world software complexity.
  • Automated scoring via EvalAgent for cost and scalability advantages.

Open challenges and future research priorities include:

  • Extending the framework to non-Python ecosystems (e.g., Java, JavaScript).
  • Incorporating additional dimensions such as performance and security evaluation.
  • Enhancing robustness of agent-based judging, particularly to chain-of-thought drift.
  • Investigating multi-agent judge architectures and adversarial alignments.
  • Dynamically evolving task suites as underlying code agents improve, enabling perpetual benchmarking.

A plausible implication is that as the capabilities of LLM-based agents increase, future evaluation frameworks will need to incorporate both greater automation and more sophisticated, diverse evaluation criteria, including adversarial and context-sensitive judgment mechanisms (Fu et al., 28 Oct 2025).

7. Context Within Benchmarking Ecosystem

PRDBench, enabled by AgentDAM, specifically addresses the limitations of prior benchmarks such as high expert annotation cost and inflexible, test-only scoring metrics. In contrast to agent–expert collaboration frameworks in FDABench (for data agents over heterogeneous data) (Wang et al., 2 Sep 2025), AgentDAM leverages nearly end-to-end agent annotation and evaluation, aligning evaluation cost and granularity with the actual development pace and demands of modern LLM-powered code-generation systems.

By aligning benchmark construction and evaluation with automated agent tools, AgentDAM and PRDBench present a model for future project-level benchmarks required to keep pace with foundation-model-driven advances in software engineering research and practice.
