AIDev Dataset: Autonomous Code Agents

Updated 10 November 2025
  • AIDev dataset is a large-scale empirical dataset capturing autonomous coding agents' PR contributions with detailed metrics on code changes, review latency, and complexity.
  • It systematically aggregates over 456K pull requests from five state-of-the-art agents across 61K repositories, offering both quantitative and qualitative insights.
  • The dataset underpins SE 3.0 research by facilitating comparisons of agent and human behaviors, thus enhancing our understanding of human–AI collaboration in software engineering.

AIDev is a large-scale empirical dataset designed to capture and characterize the activities of autonomous coding agents—goal-driven artificial intelligence systems that autonomously contribute pull requests (PRs) within the open-source software development ecosystem. By systematically collecting and annotating 456,535 PRs from five state-of-the-art agents (OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code) across 61,453 GitHub repositories, AIDev forms a foundational resource for research in real-world, “agentic” software engineering and the emerging paradigm of SE 3.0. The dataset enables rigorous quantitative and qualitative study of agent behaviors, human–agent collaboration, and code quality outcomes, supporting research directions beyond static benchmarks.

1. Dataset Structure and Agent Coverage

AIDev encompasses data from five prominent autonomous code-generation and modification agents, each responsible for varying numbers of PRs and repositories within the data collection window. Table 1 summarizes the high-level scope:

| Agent | PRs | Developers | Repositories |
|---|---|---|---|
| OpenAI Codex | 411,621 | 41,619 | 53,702 |
| Devin (Cognition Labs) | 24,893 | 2,897 | 3,857 |
| GitHub Copilot | 16,531 | 1,916 | 3,097 |
| Cursor (Anysphere) | 1,981 | 753 | 828 |
| Claude Code | 1,509 | 585 | 645 |

Total coverage includes 47,303 unique developers and 61,453 repositories. The dataset also defines AIDev-pop, a filtered subset with PRs limited to repositories with ≥500 stargazers (7,122 PRs, 1,240 developers, 856 repositories), and a human-authored PR baseline in those projects (6,628 PRs), enabling matched comparative analysis (Li et al., 20 Jul 2025).

2. Data Schema and Metadata

AIDev is distributed as a set of CSV tables organized in a relational schema, together with Python scripts for loading and analysis. Key tables and representative fields include:

  • pull_request: pr_id, repo_id, author_id, is_agentic, agent_name, state, created_at, merged_at, additions, deletions, files_changed, cyclomatic_delta (ΔCC), loc_delta (ΔLOC), merged
  • repository: repo_id, name, primary_language, stargazers_count, forks_count, created_at
  • user: user_id, login, type (User/Bot), company, location, followers
  • pr_timeline: event_id, pr_id, event_type (opened, review_requested, etc.), actor_id, actor_type, created_at
  • pr_reviews: review_id, pr_id, reviewer_id, state (approved, changes_requested, commented), submitted_at
  • pr_comments: comment_id, pr_id, author_id, body, created_at
  • pr_commits: commit_id, pr_id, sha, author_date, commit_date
  • pr_commit_details: detail_id, commit_id, file_path, additions, deletions, patch
  • related_issue: pr_id, issue_id, relation_type (closes, references)
  • issue: issue_id, repo_id, title, state, created_at, closed_at

All timestamps use ISO 8601, and all code-change-related metrics are reported as integers/floats according to semantic type.
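
Since the schema maps one table per CSV file, a typical analysis session begins by loading and joining these tables. The snippet below is a minimal sketch, assuming the CSV file names mirror the table names under data/ and that is_agentic is stored as a boolean or 0/1 flag; it reconstructs an AIDev-pop-style slice by joining pull_request with repository and filtering on stargazer count.

```python
import pandas as pd

# Assumed layout: one CSV per schema table under data/, named after the table.
prs = pd.read_csv("data/pull_request.csv", parse_dates=["created_at", "merged_at"])
repos = pd.read_csv("data/repository.csv", parse_dates=["created_at"])

# Attach repository popularity metadata to each PR.
prs = prs.merge(
    repos[["repo_id", "name", "primary_language", "stargazers_count"]],
    on="repo_id",
    how="left",
)

# AIDev-pop-style subset: agentic PRs in repositories with >= 500 stargazers.
pop = prs[prs["is_agentic"].astype(bool) & (prs["stargazers_count"] >= 500)]
print(pop.groupby("agent_name")["pr_id"].count())
```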

3. Code Complexity and Review Metrics

AIDev quantifies structural and process-oriented dimensions of code contributions using standard static analysis and process event modeling:

  • Cyclomatic Complexity Change: $\Delta CC = CC_1 - CC_0$, where $CC_1$ and $CC_0$ are the McCabe cyclomatic complexity summed across all functions post- and pre-PR, respectively.
  • LOC Change: $\Delta LOC = LOC_1 - LOC_0$, with LOC the count of non-blank, non-comment lines modified.
  • Acceptance (Merge) Rate:

$r = \frac{\text{number of merged PRs}}{\text{total PRs}}$

  • Mean Review Latency:

$T_{\mathrm{review}} = \frac{1}{N} \sum_{i=1}^{N} \left( t_{\mathrm{merged},i} - t_{\mathrm{opened},i} \right)$

with $t_{\mathrm{opened},i}$ and $t_{\mathrm{merged},i}$ the $i$-th PR's open and merge timestamps.

These metrics facilitate precise comparison of agent behavior, code changes, and review processes, both between agents and against human baselines.
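
As an illustration of how these definitions translate into code, the following sketch computes $\Delta CC$, the merge rate, and the mean review latency from the fields described in Section 2. It is not the dataset's own tooling: radon's cc_visit is used here as one convenient implementation of the McCabe metric, and the timestamp columns are assumed to be parsed as datetimes.

```python
import pandas as pd
from radon.complexity import cc_visit  # one convenient McCabe-complexity implementation

def delta_cc(source_before: str, source_after: str) -> int:
    """Delta CC = CC_1 - CC_0: McCabe complexity summed over all functions, post- minus pre-PR."""
    cc0 = sum(block.complexity for block in cc_visit(source_before))
    cc1 = sum(block.complexity for block in cc_visit(source_after))
    return cc1 - cc0

def merge_rate(prs: pd.DataFrame) -> float:
    """Acceptance rate r = number of merged PRs / total PRs."""
    return prs["merged"].astype(bool).mean()

def mean_review_latency_hours(prs: pd.DataFrame) -> float:
    """Mean review latency T_review = mean(t_merged - t_opened) over merged PRs, in hours."""
    merged = prs.loc[prs["merged"].astype(bool), ["created_at", "merged_at"]]
    latency = merged["merged_at"] - merged["created_at"]
    return latency.dt.total_seconds().mean() / 3600.0
```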

4. Empirical Findings

Analysis of AIDev-pop and matched human PRs reveals several salient empirical patterns:

  • PR Size (lines added/deleted, mean per PR):
    • OpenAI Codex: ~32 added, 8 deleted
    • Devin: ~45 added, 12 deleted
    • GitHub Copilot: ~27 added, 24 deleted
  • Acceptance Rates (merge rate):
    • Human PRs: 76.8%
    • OpenAI Codex: 64%
    • Devin: 49%
    • GitHub Copilot: 35%
    • Cursor: 51.4%
    • Claude Code: 52.5%
  • Median Review Latencies (accepted PRs, hours):
    • Humans: 3.9
    • OpenAI Codex: 0.3 (18 min)
    • Devin: 2.2
    • Copilot: 17.2
    • Cursor: 2.4
    • Claude Code: 6.9
  • Speed vs. Acceptance Trade-Off: Codex PRs are reviewed quickly but accepted less frequently, whereas Copilot PRs take longer to reach acceptance even though rejections arrive quickly.
  • Structural Simplicity: In a case study, only 9.1% of agentic PRs changed cyclomatic complexity vs. 23.3% of human PRs, indicating agents tend to submit smaller, less structurally invasive changes (see the sketch below).

A plausible implication is that current autonomous agents prioritize rapid, low-risk modifications over complex refactoring or architectural contributions.
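
The structural-simplicity comparison can be approximated directly from the precomputed cyclomatic_delta column. The snippet below is a rough sketch under the assumption that the AIDev-pop PRs and the matched human baseline share one pull_request table distinguished by the is_agentic flag.

```python
import pandas as pd

prs = pd.read_csv("data/pull_request.csv")

# Share of PRs whose change altered cyclomatic complexity at all (Delta CC != 0),
# split into agentic vs. human-authored PRs.
touched_cc = (
    prs.assign(changed_cc=prs["cyclomatic_delta"] != 0)
       .groupby(prs["is_agentic"].astype(bool))["changed_cc"]
       .mean()
)
print(touched_cc)  # the case study reports roughly 9.1% (agentic) vs. 23.3% (human)
```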

5. Methodological Implications and Use Cases

AIDev enables numerous research directions:

  • Benchmarking: Supports moving beyond artificial evaluations (e.g., SWE-bench) by enabling "living" assessment based on real PR integration and review.
  • Agent Readiness: Direct empirical comparison of agent performance on feature, bug-fix, and documentation tasks informs tool selection.
  • Collaboration Modeling: Rich timeline, review, and comment data allow reconstruction of interaction networks and sentiment analysis in human–AI code review.
  • Optimization and Latency Analysis: GitHub Actions logs facilitate identification of agent and infrastructure-driven sources of process delay.
  • Governance and Accountability: Authorship metadata supports traceability and authorship standards, e.g., via "Co-Authored-By: Claude" commit labeling.
  • Future Research Directions: Proposals include quantifying "trust gap" via review depth and merge likelihood, longitudinal analysis of code quality, and extraction of multi-step agent reasoning from PR descriptions.

This breadth of use cases distinguishes AIDev from purely synthetic or task-oriented benchmarks.

6. Access, Extensibility, and Licensing

AIDev is available at https://github.com/SAILResearch/AI_Teammates_in_SE3. The repository includes:

  • data/: CSV files for all schema tables
  • schema.sql: DDL for relational schema
  • notebooks/: Jupyter notebooks for analysis (acceptance, latency, complexity, language prevalence)
  • scripts/: Python scripts for data loading, join operations, and filtering (including AIDev-pop subset creation)

The data collection pipeline supports incremental extension through continued GitHub REST API queries. Researchers may augment the dataset with custom metadata (e.g., test coverage, static analysis results) and cross-link to other systems (Jira, CI/CD data, bug reports) for expanded investigation.
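
A minimal sketch of such an incremental extension is shown below, using only the standard GitHub REST API endpoint for listing a repository's pull requests; the agent bot logins are illustrative placeholders rather than values taken from the dataset, which records the actual agent identities in its user table.

```python
import requests

# Illustrative placeholders; real agent account names should be taken from AIDev's user table.
AGENT_LOGINS = {"example-codex[bot]", "example-devin[bot]"}

def fetch_recent_prs(owner: str, repo: str, token: str, per_page: int = 100) -> list[dict]:
    """Fetch one page of the most recently updated PRs for a repository."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/pulls",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        params={"state": "all", "sort": "updated", "direction": "desc", "per_page": per_page},
    )
    resp.raise_for_status()
    return resp.json()

def agentic_only(prs: list[dict]) -> list[dict]:
    """Keep only PRs authored by a known agent account."""
    return [pr for pr in prs if pr["user"]["login"] in AGENT_LOGINS]
```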

The dataset is released under a permissive open license. By enabling detailed, evidence-based investigations at scale, AIDev is positioned as an empirical foundation for the study of SE 3.0 workflows and the evolving dynamics of human–AI code collaboration (Li et al., 20 Jul 2025).
