DevAI: Benchmarking Agentic Code Generation
- DevAI is a purpose-built benchmark designed for evaluating agentic code-generation systems across realistic AI-development pipelines with multi-step dependency requirements.
- It comprises 55 tasks spanning diverse subfields such as supervised learning, natural language processing, and computer vision, each annotated with hierarchical and structured requirements.
- Its evaluation metrics—including independent and dependency-aware requirement coverage and task solve rate—enable nuanced assessment of multi-stage reasoning and coding performance.
DevAI is a purpose-built benchmark designed to facilitate rigorous evaluation of agentic code-generation systems within realistic AI-development pipelines. It comprises 55 tasks that span core subfields of machine learning and data science, each annotated with detailed hierarchical requirements and optional preferences. Its structure, annotation scheme, and evaluation metrics enable nuanced measurement of agents’ multi-stage reasoning, coding ability, and adherence to complex dependencies. DevAI is publicly available and openly licensed for redistribution and adaptation (Zhuge et al., 2024).
1. Dataset Composition and Scope
DevAI consists of 55 AI-development tasks representing a broad spectrum of real-world applications, selected to cover supervised learning, reinforcement learning, computer vision, NLP, generative modeling, and audio processing. Each task is tagged categorically (e.g., “Supervised Learning,” “Classification,” “Image Processing”) and is formulated to require multi-step pipelines involving data acquisition, preprocessing, model definition, training, evaluation, artifact management, and sometimes deployment interfaces.
Python is the standard programming language throughout, and tasks demand nuanced usage of libraries including PyTorch, TensorFlow, scikit-learn, Flask, and Streamlit. The computational scale is modest, but the complexity arises from interdependent requirements and realistic toolchain emulation, such as specific file/directory layouts and interactions with platforms like Kaggle and HuggingFace.
2. Task Specification and Structure
Each DevAI task is defined by three principal components:
- Query: A free-form paragraph detailing the user's goal, often hyperlinked to relevant publications or datasets.
- Requirements: Binary criteria (exists/non-empty), each assigned a unique integer ID, textual description, prerequisite requirement IDs (forming a directed acyclic graph), and categorical label (Dataset/Environment, Preprocessing, Model Definition, Metrics, Visualization, HCI/API, Other).
- Preferences: Optional, non-binary criteria (totaling 125) expressing desirable but non-essential behaviors.
Agents are presented with the query, a 30-minute time constraint, and instructions to persist all code, data, figures, and models in strict folder structures (e.g., src/, results/, models/saved_models/). Constraint prompts further guide artifact placement and discourage trivial solutions.
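The required artifact layout can be prepared up front. A minimal sketch follows, assuming the folder names given above (`src/`, `results/`, `models/saved_models/`); the `data/` entry is an assumption based on the sample task's `data/content.jpg` path, and `prepare_workspace` is an illustrative helper, not part of DevAI.

```python
from pathlib import Path

# Directory layout drawn from the constraint examples in the text;
# "data" is an assumption inferred from the sample task's artifact paths.
LAYOUT = ["src", "data", "results", "models/saved_models"]

def prepare_workspace(root: str = ".") -> list[Path]:
    """Create each required directory (idempotent) and return the paths."""
    created = []
    for rel in LAYOUT:
        path = Path(root) / rel
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created

dirs = prepare_workspace("workspace")
print([str(d) for d in dirs])
```

Because `mkdir(..., exist_ok=True)` is idempotent, an agent can safely re-run this at the start of every attempt.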
Example Task Configuration (excerpt)
| Component | Example (Style Transfer) |
|---|---|
| Query | Build a PyTorch perceptual-loss style-transfer pipeline mixing Mona Lisa and Starry Night; save results, log time |
| Requirements | R0: Download Mona Lisa → data/content.jpg<br>R1: Download Starry Night → data/style.jpg<br>R2: Implement model in src/model.py<br>R3–R6: Save stylized images, expose hyperparameters, log time, save intermediates (with dependencies) |
| Preferences | System adapts to unfamiliar tools, optional platform usage |
3. Annotation Scheme
DevAI's annotation protocol draws on established AI-workflow methodologies (KDD, CRISP-DM, AutoML). The 365 requirements are hierarchically partitioned into seven sequential phases:
- Data acquisition & environment
- Preprocessing & feature extraction
- Model definition & training
- Model saving & snapshotting
- Performance metrics recording
- Visualization & reporting
- Human-computer interface / APIs
Each requirement is binary, minimizing ambiguity and annotation drift. Dependency links enforce logical ordering and reflect realistic project constraints: a downstream requirement cannot be met unless all of its prerequisites are fulfilled, so difficulty compounds along the directed acyclic graph.
Preferences were intentionally broad and aspirational, capturing "nice-to-have" user signals. Two rounds of expert review ensured clarity and robustness across all annotations.
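Because prerequisites form a directed acyclic graph, any valid execution order can be recovered by topological sorting. A minimal sketch using Kahn's algorithm follows; the requirement IDs loosely mirror the style-transfer example, but the concrete graph is illustrative.

```python
from collections import deque

# Illustrative prerequisite graph in the style of the example task:
# R2 (model) depends on R0 and R1 (image downloads); R3/R4 depend on R2.
requirements = {
    0: [],        # R0: download content image
    1: [],        # R1: download style image
    2: [0, 1],    # R2: implement model
    3: [2],       # R3: save stylized images
    4: [2],       # R4: expose hyperparameters
}

def topological_order(prereqs: dict[int, list[int]]) -> list[int]:
    """Return a prerequisite-respecting order, or raise if a cycle exists."""
    indegree = {r: len(ps) for r, ps in prereqs.items()}
    dependents = {r: [] for r in prereqs}
    for req, ps in prereqs.items():
        for p in ps:
            dependents[p].append(req)
    ready = deque(r for r, d in indegree.items() if d == 0)
    order = []
    while ready:
        req = ready.popleft()
        order.append(req)
        for d in dependents[req]:
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    if len(order) != len(prereqs):
        raise ValueError("prerequisite graph contains a cycle")
    return order

order = topological_order(requirements)
print(order)  # → [0, 1, 2, 3, 4] for this graph
```

The same check doubles as annotation-time validation: a cycle in the prerequisite links would make a task unsatisfiable.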
4. Data Representation and Storage
DevAI is organized on disk as a directory of JSON files, with each task housed in its own file. The canonical schema is as follows:
- task_XX.json:
  - name: string
  - query: string
  - tags: array[string]
  - is_kaggle_api_needed: boolean
  - is_training_needed: boolean
  - is_web_navigation_needed: boolean
  - requirements: array of objects
    - requirement_id: int
    - prerequisites: array[int]
    - criteria: string
    - category: enum
    - satisfied: null
  - preferences: array of objects
    - preference_id: int
    - criteria: string
    - satisfied: null
A separate constraints.json defines global artifact location rules. Trajectories log agents’ stepwise internal state, shell actions, environment outputs, and resource usage (token-cost, time) in JSON arrays, enabling gray-box evaluation.
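A consumer of this schema can determine which requirements are currently actionable. The sketch below parses an in-memory stand-in for a `task_XX.json` file (field names follow the schema above; the sample values are invented) and lists unsatisfied requirements whose prerequisites are all met; `runnable_requirements` is a hypothetical helper, not part of the benchmark's harness.

```python
import json

# Stand-in for the contents of a task_XX.json file; field names follow
# the schema described in the text, values are illustrative only.
sample = {
    "name": "style_transfer",
    "query": "Build a perceptual-loss style-transfer pipeline ...",
    "tags": ["Generative Modeling"],
    "is_kaggle_api_needed": False,
    "is_training_needed": True,
    "is_web_navigation_needed": True,
    "requirements": [
        {"requirement_id": 0, "prerequisites": [], "criteria": "...",
         "category": "Dataset/Environment", "satisfied": True},
        {"requirement_id": 2, "prerequisites": [0], "criteria": "...",
         "category": "Model Definition", "satisfied": None},
    ],
    "preferences": [],
}

def runnable_requirements(task: dict) -> list[int]:
    """IDs of unsatisfied requirements whose prerequisites are all satisfied."""
    status = {r["requirement_id"]: r["satisfied"] for r in task["requirements"]}
    return [
        r["requirement_id"]
        for r in task["requirements"]
        if not r["satisfied"]
        and all(status.get(p) for p in r["prerequisites"])
    ]

task = json.loads(json.dumps(sample))  # stands in for reading a task file
print(runnable_requirements(task))  # → [2]
```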
5. Evaluation Metrics
DevAI employs three primary agentic performance metrics:
- Requirement Coverage (independent): the fraction of a task's requirements marked satisfied, with prerequisites ignored — |{r : s(r) = 1}| / |R|, where s(r) ∈ {0, 1} indicates satisfaction of requirement r.
- Requirement Coverage (dependency-aware): a requirement counts as covered only if it and all of its prerequisites are satisfied — |{r : s(r) = 1 and s(p) = 1 for all p ∈ pre(r)}| / |R|.
- Task Solve Rate: the fraction of tasks in which every requirement is satisfied.
These metrics facilitate granular distinction between superficial and dependency-respecting task completion.
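The three metrics above can be sketched directly on the requirement schema. In the toy example below, the satisfaction flags and prerequisite IDs are invented for illustration (they are not drawn from the paper's results); the last requirement is itself satisfied but has an unmet prerequisite, which is exactly the case the dependency-aware variant penalizes.

```python
# Toy requirement set: "satisfied" flags and prerequisites are illustrative.
requirements = [
    {"id": 0, "prerequisites": [], "satisfied": True},
    {"id": 1, "prerequisites": [], "satisfied": False},
    {"id": 2, "prerequisites": [0], "satisfied": True},
    {"id": 3, "prerequisites": [1], "satisfied": True},  # prerequisite unmet
]

def independent_coverage(reqs):
    """Fraction of requirements satisfied, ignoring prerequisites."""
    return sum(r["satisfied"] for r in reqs) / len(reqs)

def dependency_aware_coverage(reqs):
    """Fraction of requirements satisfied with all prerequisites satisfied."""
    status = {r["id"]: r["satisfied"] for r in reqs}
    def met(r):
        return r["satisfied"] and all(status[p] for p in r["prerequisites"])
    return sum(met(r) for r in reqs) / len(reqs)

def task_solved(reqs):
    """A task counts as solved only when every requirement is satisfied."""
    return all(r["satisfied"] for r in reqs)

print(independent_coverage(requirements))       # → 0.75
print(dependency_aware_coverage(requirements))  # → 0.5
print(task_solved(requirements))                # → False
```

Note the gap between 0.75 and 0.5: requirement 3 inflates the independent score while contributing nothing once its dependency is checked.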
Coverage Example
| Task | Requirements | Independent Coverage | Dependency-aware Coverage |
|---|---|---|---|
| Style Transfer | 7 | 6/7 (85.7%) | 5/7 (71.4%) |
| Sales Forecast | 7 | 6/7 (85.7%) | 5/7 (71.4%) |
6. Illustrative Tasks and Agent Outputs
Sample tasks exemplify DevAI's capacity to model realistic agentic workflows:
- Style Transfer: Agents must download canonical images, implement a perceptual loss pipeline in PyTorch, log operational metrics, and expose stylistic parameters.
- Sales Forecasting: Requires automated data loading from Kaggle, sequential LSTM modeling, model persistence, visual output generation, and interactive HTML reporting.
Agent outputs are assessed for compliance with both requirement satisfaction and directory placement constraints. Coverage computations clarify partial versus fully realized pipelines.
7. Access and Licensing
DevAI is distributed under a permissive MIT-style license. The benchmark, evaluation harness, and sample codebases are accessible via HuggingFace and GitHub, allowing unrestricted use, redistribution, and adaptation contingent on attribution.
A plausible implication is that DevAI's openly available structure and detailed annotation regime position it as a robust foundation for benchmarking and developing next-generation agentic systems and their evaluators, such as the Agent-as-a-Judge framework (Zhuge et al., 2024).