DevAI: Benchmarking Agentic Code Generation
- DevAI is a purpose-built benchmark designed for evaluating agentic code-generation systems across realistic AI-development pipelines with multi-step dependency requirements.
- It comprises 55 tasks spanning diverse subfields such as supervised learning, natural language processing, and computer vision, each annotated with hierarchical and structured requirements.
- Its evaluation metrics—including independent and dependency-aware requirement coverage and task solve rate—enable nuanced assessment of multi-stage reasoning and coding performance.
DevAI is a purpose-built benchmark designed to facilitate rigorous evaluation of agentic code-generation systems within realistic AI-development pipelines. It comprises 55 tasks that span core subfields of machine learning and data science, each annotated with detailed hierarchical requirements and optional preferences. Its structure, annotation scheme, and evaluation metrics enable nuanced measurement of agents’ multi-stage reasoning, coding ability, and adherence to complex dependencies. DevAI is publicly available and openly licensed for redistribution and adaptation (Zhuge et al., 2024).
1. Dataset Composition and Scope
DevAI consists of 55 AI-development tasks representing a broad spectrum of real-world applications, selected to cover supervised learning, reinforcement learning, computer vision, NLP, generative modeling, and audio processing. Each task is tagged categorically (e.g., “Supervised Learning,” “Classification,” “Image Processing”) and is formulated to require multi-step pipelines involving data acquisition, preprocessing, model definition, training, evaluation, artifact management, and sometimes deployment interfaces.
Python is the standard programming language throughout, and tasks demand nuanced usage of libraries including PyTorch, TensorFlow, scikit-learn, Flask, and Streamlit. The computational scale is modest, but the complexity arises from interdependent requirements and realistic toolchain emulation, such as specific file/directory layouts and interactions with platforms like Kaggle and HuggingFace.
2. Task Specification and Structure
Each DevAI task is defined by three principal components:
- Query: A free-form paragraph detailing the user's goal, often hyperlinked to relevant publications or datasets.
- Requirements: Binary criteria (exists/non-empty), each assigned a unique integer ID, textual description, prerequisite requirement IDs (forming a directed acyclic graph), and categorical label (Dataset/Environment, Preprocessing, Model Definition, Metrics, Visualization, HCI/API, Other).
- Preferences: Optional, non-binary criteria (totaling 125) expressing desirable but non-essential behaviors.
Agents are presented with the query, a 30-minute time constraint, and instructions to persist all code, data, figures, and models in strict folder structures (e.g., src/, results/, models/saved_models/). Constraint prompts further guide artifact placement and discourage trivial solutions.
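The required artifact layout can be prepared up front. A minimal sketch follows, assuming the folder names given above (`src/`, `results/`, `models/saved_models/`); the `data/` entry is an assumption based on the sample task's `data/content.jpg` path, and `prepare_workspace` is an illustrative helper, not part of DevAI.

```python
from pathlib import Path

# Directory layout drawn from the constraint examples in the text;
# "data" is an assumption inferred from the sample task's artifact paths.
LAYOUT = ["src", "data", "results", "models/saved_models"]

def prepare_workspace(root: str = ".") -> list[Path]:
    """Create each required directory (idempotent) and return the paths."""
    created = []
    for rel in LAYOUT:
        path = Path(root) / rel
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created

dirs = prepare_workspace("workspace")
print([str(d) for d in dirs])
```

Because `mkdir(..., exist_ok=True)` is idempotent, an agent can safely re-run this at the start of every attempt.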
Example Task Configuration (excerpt)
| Component | Example (Style Transfer) |
|---|---|
| Query | Build a PyTorch perceptual-loss style-transfer pipeline mixing Mona Lisa and Starry Night; save results, log time |
| Requirements | R0: Download Mona Lisa → data/content.jpg<br>R1: Download Starry Night → data/style.jpg<br>R2: Implement model in src/model.py<br>R3–R6: Save stylized images, expose hyperparameters, log time, save intermediates (with dependencies) |
| Preferences | System adapts to unfamiliar tools, optional platform usage |
3. Annotation Scheme
DevAI's annotation protocol draws on established AI-workflow methodologies (KDD, CRISP-DM, AutoML). The 365 requirements are hierarchically partitioned into seven sequential phases:
- Data acquisition & environment
- Preprocessing & feature extraction
- Model definition & training
- Model saving & snapshotting
- Performance metrics recording
- Visualization & reporting
- Human-computer interface / APIs
Each requirement is binary, minimizing ambiguity and annotation drift. Dependency links enforce logical ordering and reflect realistic project constraints: a downstream requirement cannot be met unless all of its prerequisites are fulfilled, so difficulty compounds along the directed acyclic graph.
Preferences were intentionally broad and aspirational, capturing "nice-to-have" user signals. Two rounds of expert review ensured clarity and robustness across all annotations.
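Because prerequisites form a directed acyclic graph, any valid execution order can be recovered by topological sorting. A minimal sketch using Kahn's algorithm follows; the requirement IDs loosely mirror the style-transfer example, but the concrete graph is illustrative.

```python
from collections import deque

# Illustrative prerequisite graph in the style of the example task:
# R2 (model) depends on R0 and R1 (image downloads); R3/R4 depend on R2.
requirements = {
    0: [],        # R0: download content image
    1: [],        # R1: download style image
    2: [0, 1],    # R2: implement model
    3: [2],       # R3: save stylized images
    4: [2],       # R4: expose hyperparameters
}

def topological_order(prereqs: dict[int, list[int]]) -> list[int]:
    """Return a prerequisite-respecting order, or raise if a cycle exists."""
    indegree = {r: len(ps) for r, ps in prereqs.items()}
    dependents = {r: [] for r in prereqs}
    for req, ps in prereqs.items():
        for p in ps:
            dependents[p].append(req)
    ready = deque(r for r, d in indegree.items() if d == 0)
    order = []
    while ready:
        req = ready.popleft()
        order.append(req)
        for d in dependents[req]:
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    if len(order) != len(prereqs):
        raise ValueError("prerequisite graph contains a cycle")
    return order

order = topological_order(requirements)
print(order)  # → [0, 1, 2, 3, 4] for this graph
```

The same check doubles as annotation-time validation: a cycle in the prerequisite links would make a task unsatisfiable.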
4. Data Representation and Storage
DevAI is organized on disk as a directory of JSON files, with each task housed in its own file. The canonical schema is as follows:
- task_XX.json:
  - name: string
  - query: string
  - tags: array[string]
  - is_kaggle_api_needed: boolean
  - is_training_needed: boolean
  - is_web_navigation_needed: boolean
  - requirements: array of objects
    - requirement_id: int
    - prerequisites: array[int]
    - criteria: string
    - category: enum
    - satisfied: null
  - preferences: array of objects
    - preference_id: int
    - criteria: string
    - satisfied: null
A separate constraints.json defines global artifact location rules. Trajectories log agents’ stepwise internal state, shell actions, environment outputs, and resource usage (token-cost, time) in JSON arrays, enabling gray-box evaluation.
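A consumer of this schema can determine which requirements are currently actionable. The sketch below parses an in-memory stand-in for a `task_XX.json` file (field names follow the schema above; the sample values are invented) and lists unsatisfied requirements whose prerequisites are all met; `runnable_requirements` is a hypothetical helper, not part of the benchmark's harness.

```python
import json

# Stand-in for the contents of a task_XX.json file; field names follow
# the schema described in the text, values are illustrative only.
sample = {
    "name": "style_transfer",
    "query": "Build a perceptual-loss style-transfer pipeline ...",
    "tags": ["Generative Modeling"],
    "is_kaggle_api_needed": False,
    "is_training_needed": True,
    "is_web_navigation_needed": True,
    "requirements": [
        {"requirement_id": 0, "prerequisites": [], "criteria": "...",
         "category": "Dataset/Environment", "satisfied": True},
        {"requirement_id": 2, "prerequisites": [0], "criteria": "...",
         "category": "Model Definition", "satisfied": None},
    ],
    "preferences": [],
}

def runnable_requirements(task: dict) -> list[int]:
    """IDs of unsatisfied requirements whose prerequisites are all satisfied."""
    status = {r["requirement_id"]: r["satisfied"] for r in task["requirements"]}
    return [
        r["requirement_id"]
        for r in task["requirements"]
        if not r["satisfied"]
        and all(status.get(p) for p in r["prerequisites"])
    ]

task = json.loads(json.dumps(sample))  # stands in for reading a task file
print(runnable_requirements(task))  # → [2]
```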
5. Evaluation Metrics
DevAI employs three primary agentic performance metrics:
- Requirement Coverage (independent): the fraction of a task's requirements marked satisfied, with prerequisites ignored — |{r : s(r) = 1}| / |R|, where s(r) ∈ {0, 1} indicates satisfaction of requirement r.
- Requirement Coverage (dependency-aware): a requirement counts as covered only if it and all of its prerequisites are satisfied — |{r : s(r) = 1 and s(p) = 1 for all p ∈ pre(r)}| / |R|.
- Task Solve Rate: the fraction of tasks in which every requirement is satisfied.
These metrics facilitate granular distinction between superficial and dependency-respecting task completion.
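The three metrics above can be sketched directly on the requirement schema. In the toy example below, the satisfaction flags and prerequisite IDs are invented for illustration (they are not drawn from the paper's results); the last requirement is itself satisfied but has an unmet prerequisite, which is exactly the case the dependency-aware variant penalizes.

```python
# Toy requirement set: "satisfied" flags and prerequisites are illustrative.
requirements = [
    {"id": 0, "prerequisites": [], "satisfied": True},
    {"id": 1, "prerequisites": [], "satisfied": False},
    {"id": 2, "prerequisites": [0], "satisfied": True},
    {"id": 3, "prerequisites": [1], "satisfied": True},  # prerequisite unmet
]

def independent_coverage(reqs):
    """Fraction of requirements satisfied, ignoring prerequisites."""
    return sum(r["satisfied"] for r in reqs) / len(reqs)

def dependency_aware_coverage(reqs):
    """Fraction of requirements satisfied with all prerequisites satisfied."""
    status = {r["id"]: r["satisfied"] for r in reqs}
    def met(r):
        return r["satisfied"] and all(status[p] for p in r["prerequisites"])
    return sum(met(r) for r in reqs) / len(reqs)

def task_solved(reqs):
    """A task counts as solved only when every requirement is satisfied."""
    return all(r["satisfied"] for r in reqs)

print(independent_coverage(requirements))       # → 0.75
print(dependency_aware_coverage(requirements))  # → 0.5
print(task_solved(requirements))                # → False
```

Note the gap between 0.75 and 0.5: requirement 3 inflates the independent score while contributing nothing once its dependency is checked.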
Coverage Example
| Task | Requirements | Independent Coverage | Dependency-aware Coverage |
|---|---|---|---|
| Style Transfer | 7 | 6/7 (85.7%) | 5/7 (71.4%) |
| Sales Forecast | 7 | 6/7 (85.7%) | 5/7 (71.4%) |
6. Illustrative Tasks and Agent Outputs
Sample tasks exemplify DevAI's capacity to model realistic agentic workflows:
- Style Transfer: Agents must download canonical images, implement a perceptual loss pipeline in PyTorch, log operational metrics, and expose stylistic parameters.
- Sales Forecasting: Requires automated data loading from Kaggle, sequential LSTM modeling, model persistence, visual output generation, and interactive HTML reporting.
Agent outputs are assessed for compliance with both requirement satisfaction and directory placement constraints. Coverage computations clarify partial versus fully realized pipelines.
7. Access and Licensing
DevAI is distributed under a permissive MIT-style license. The benchmark, evaluation harness, and sample codebases are accessible via HuggingFace and GitHub, allowing unrestricted use, redistribution, and adaptation contingent on attribution.
A plausible implication is that DevAI's openly available structure and detailed annotation regime position it as a robust foundation for benchmarking and developing next-generation agentic systems and their evaluators, such as the Agent-as-a-Judge framework (Zhuge et al., 2024).