AIDev Dataset: Autonomous Code Agents
- AIDev dataset is a large-scale empirical dataset capturing autonomous coding agents' PR contributions with detailed metrics on code changes, review latency, and complexity.
- It systematically aggregates over 456K pull requests from five state-of-the-art agents across 61K repositories, offering both quantitative and qualitative insights.
- The dataset underpins SE 3.0 research by facilitating comparisons of agent and human behaviors, thus enhancing our understanding of human–AI collaboration in software engineering.
AIDev is a large-scale empirical dataset designed to capture and characterize the activities of autonomous coding agents—goal-driven artificial intelligence systems that autonomously contribute pull requests (PRs) within the open-source software development ecosystem. By systematically collecting and annotating 456,535 PRs from five state-of-the-art agents (OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code) across 61,453 GitHub repositories, AIDev forms a foundational resource for research in real-world, “agentic” software engineering and the emerging paradigm of SE 3.0. The dataset enables rigorous quantitative and qualitative study of agent behaviors, human–agent collaboration, and code quality outcomes, supporting research directions beyond static benchmarks.
1. Dataset Structure and Agent Coverage
AIDev encompasses data from five prominent autonomous code-generation and modification agents, each responsible for varying numbers of PRs and repositories within the data collection window. Table 1 summarizes the high-level scope:
| Agent | PRs | Developers | Repositories |
|---|---|---|---|
| OpenAI Codex | 411,621 | 41,619 | 53,702 |
| Devin (Cognition Labs) | 24,893 | 2,897 | 3,857 |
| GitHub Copilot | 16,531 | 1,916 | 3,097 |
| Cursor (Anysphere) | 1,981 | 753 | 828 |
| Claude Code | 1,509 | 585 | 645 |
Total coverage includes 47,303 unique developers and 61,453 repositories. The dataset also defines AIDev-pop, a filtered subset with PRs limited to repositories with ≥500 stargazers (7,122 PRs, 1,240 developers, 856 repositories), and a human-authored PR baseline in those projects (6,628 PRs), enabling matched comparative analysis (Li et al., 20 Jul 2025).
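To illustrate how such a popularity-filtered subset can be reproduced from the distributed tables, the sketch below applies the ≥500-stargazer cut with pandas. It assumes the `pull_request` and `repository` CSV files and the column names listed in Section 2; the file paths and column dtypes are illustrative assumptions, and the dataset's own `scripts/` directory contains the authoritative subset-creation code.

```python
import pandas as pd

# Load the relational CSV tables (paths are illustrative).
prs = pd.read_csv("data/pull_request.csv")
repos = pd.read_csv("data/repository.csv")

# Repositories with at least 500 stargazers, mirroring the AIDev-pop cut.
popular_ids = repos.loc[repos["stargazers_count"] >= 500, ["repo_id"]]

# Agentic PRs in popular repositories form AIDev-pop; human-authored PRs
# in the same repositories form the matched comparison baseline.
pop = prs.merge(popular_ids, on="repo_id")
aidev_pop = pop[pop["is_agentic"].astype(bool)]
human_baseline = pop[~pop["is_agentic"].astype(bool)]

print(len(aidev_pop), "agentic PRs,", len(human_baseline), "human PRs")
```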
2. Data Schema and Metadata
AIDev is distributed as a set of CSV tables and Python scripts in relational schema form. Key tables and representative fields include:
- pull_request: pr_id, repo_id, author_id, is_agentic, agent_name, state, created_at, merged_at, additions, deletions, files_changed, cyclomatic_delta (ΔCC), loc_delta (ΔLOC), merged
- repository: repo_id, name, primary_language, stargazers_count, forks_count, created_at
- user: user_id, login, type (User/Bot), company, location, followers
- pr_timeline: event_id, pr_id, event_type (opened, review_requested, etc.), actor_id, actor_type, created_at
- pr_reviews: review_id, pr_id, reviewer_id, state (approved, changes_requested, commented), submitted_at
- pr_comments: comment_id, pr_id, author_id, body, created_at
- pr_commits: commit_id, pr_id, sha, author_date, commit_date
- pr_commit_details: detail_id, commit_id, file_path, additions, deletions, patch
- related_issue: pr_id, issue_id, relation_type (closes, references)
- issue: issue_id, repo_id, title, state, created_at, closed_at
All timestamps use ISO 8601, and all code-change-related metrics are reported as integers/floats according to semantic type.
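To make the relational layout concrete, the following sketch reconstructs a single PR's lifecycle by joining the `pull_request`, `pr_timeline`, and `pr_reviews` tables. Column names follow the schema above; the file locations and the choice of example PR are assumptions for illustration only.

```python
import pandas as pd

# Load the tables needed to trace one PR's history.
prs = pd.read_csv("data/pull_request.csv", parse_dates=["created_at", "merged_at"])
timeline = pd.read_csv("data/pr_timeline.csv", parse_dates=["created_at"])
reviews = pd.read_csv("data/pr_reviews.csv", parse_dates=["submitted_at"])

pr_id = prs["pr_id"].iloc[0]  # illustrative: pick any PR

# Timeline events (opened, review_requested, ...) for that PR, in order.
events = timeline[timeline["pr_id"] == pr_id].sort_values("created_at")

# Formal review verdicts (approved / changes_requested / commented).
verdicts = reviews[reviews["pr_id"] == pr_id].sort_values("submitted_at")

print(events[["event_type", "actor_type", "created_at"]])
print(verdicts[["reviewer_id", "state", "submitted_at"]])
```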
3. Code Complexity and Review Metrics
AIDev quantifies structural and process-oriented dimensions of code contributions using standard static analysis and process event modeling:
- Cyclomatic Complexity Change: $\Delta CC = CC_{\text{post}} - CC_{\text{pre}}$, where $CC_{\text{post}}$ and $CC_{\text{pre}}$ are the McCabe cyclomatic complexity summed across all functions after and before the PR, respectively.
- LOC Change: $\Delta \mathrm{LOC} = \mathrm{LOC}_{\text{post}} - \mathrm{LOC}_{\text{pre}}$, with LOC counted over non-blank, non-comment lines.
- Acceptance (Merge) Rate: $\text{Acceptance Rate} = \dfrac{\#\,\text{merged PRs}}{\#\,\text{opened PRs}}$
- Mean Review Latency: $\bar{T}_{\text{review}} = \frac{1}{N}\sum_{i=1}^{N}\left(t_i^{\text{merge}} - t_i^{\text{open}}\right)$, where $t_i^{\text{open}}$ and $t_i^{\text{merge}}$ are the $i$-th merged PR's open and merge timestamps.
These metrics facilitate precise comparison of agent behavior, code changes, and review processes, both between agents and against human baselines.
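As a worked example of these definitions, the sketch below computes per-agent acceptance rates and review latencies from the `pull_request` table. It assumes the `merged`, `created_at`, and `merged_at` fields from Section 2; dtypes and file paths are assumptions, and the released notebooks contain the reference analysis.

```python
import pandas as pd

prs = pd.read_csv("data/pull_request.csv", parse_dates=["created_at", "merged_at"])
agentic = prs[prs["is_agentic"].astype(bool)]

# Acceptance (merge) rate: merged PRs over all opened PRs, per agent.
acceptance = agentic.groupby("agent_name")["merged"].mean()

# Review latency for merged PRs: t_merge - t_open, in hours.
merged = agentic[agentic["merged"].astype(bool)].copy()
merged["latency_h"] = (
    merged["merged_at"] - merged["created_at"]
).dt.total_seconds() / 3600

latency_median = merged.groupby("agent_name")["latency_h"].median()
latency_mean = merged.groupby("agent_name")["latency_h"].mean()

print(acceptance.round(3), latency_median.round(1), sep="\n")
```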
4. Empirical Findings
Analysis of AIDev-pop and matched human PRs reveals several salient empirical patterns:
- PR Size (lines added/deleted, mean per PR):
- OpenAI Codex: ~32 added, 8 deleted
- Devin: ~45 added, 12 deleted
- GitHub Copilot: ~27 added, 24 deleted
- Acceptance Rates (merge rate):
- Human PRs: 76.8%
- OpenAI Codex: 64%
- Devin: 49%
- GitHub Copilot: 35%
- Cursor: 51.4%
- Claude Code: 52.5%
- Median Review Latencies (accepted PRs, hours):
- Humans: 3.9
- OpenAI Codex: 0.3 (18 min)
- Devin: 2.2
- Copilot: 17.2
- Cursor: 2.4
- Claude Code: 6.9
- Speed vs. Acceptance Trade-Off: Codex PRs are reviewed almost immediately but merged less often than human PRs; Copilot PRs wait longest before acceptance, while their rejections tend to come quickly.
- Structural Simplicity: In a case study, only 9.1% of agentic PRs changed cyclomatic complexity versus 23.3% of human PRs, indicating agents tend to submit smaller, less structurally invasive changes (see the sketch below).
A plausible implication is that current autonomous agents prioritize rapid, low-risk modifications over complex refactoring or architectural contributions.
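The structural-simplicity observation above can be re-derived directly from the `cyclomatic_delta` field. A minimal sketch follows, computed over all agentic and human PRs in the table; the exact subset used in the original case study may differ, and the file path and dtypes are assumptions.

```python
import pandas as pd

prs = pd.read_csv("data/pull_request.csv")

def share_changing_cc(df: pd.DataFrame) -> float:
    # Fraction of PRs with any cyclomatic-complexity change (delta CC != 0).
    return float((df["cyclomatic_delta"] != 0).mean())

agentic = prs[prs["is_agentic"].astype(bool)]
human = prs[~prs["is_agentic"].astype(bool)]

print(f"agentic PRs changing CC: {share_changing_cc(agentic):.1%}")
print(f"human PRs changing CC:   {share_changing_cc(human):.1%}")
```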
5. Methodological Implications and Use Cases
AIDev enables numerous research directions:
- Benchmarking: Supports moving beyond artificial evaluations (e.g., SWE-bench) by enabling "living" assessment based on real PR integration and review.
- Agent Readiness: Direct empirical comparison of agent performance on feature, bug-fix, and documentation tasks informs tool selection.
- Collaboration Modeling: Rich timeline, review, and comment data allow reconstruction of interaction networks and sentiment analysis in human–AI code review (a graph-building sketch follows this list).
- Optimization and Latency Analysis: GitHub Actions logs facilitate identification of agent and infrastructure-driven sources of process delay.
- Governance and Accountability: Authorship metadata supports traceability and authorship standards, e.g., via "Co-Authored-By: Claude" commit labeling.
- Future Research Directions: Proposals include quantifying "trust gap" via review depth and merge likelihood, longitudinal analysis of code quality, and extraction of multi-step agent reasoning from PR descriptions.
This breadth of use cases distinguishes AIDev from purely synthetic or task-oriented benchmarks.
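As one concrete illustration of the collaboration-modeling use case, the sketch below builds a reviewer-to-author interaction graph from the `pr_reviews` and `pull_request` tables using networkx. The edge semantics (reviewer reviewed that author's PR), file paths, and weighting scheme are assumptions for illustration, not part of the released notebooks.

```python
import pandas as pd
import networkx as nx

prs = pd.read_csv("data/pull_request.csv")
reviews = pd.read_csv("data/pr_reviews.csv")

# Pair each review with the author of the PR it targets.
pairs = reviews.merge(prs[["pr_id", "author_id"]], on="pr_id")

# Directed edges reviewer -> PR author, weighted by review count.
g = nx.DiGraph()
counts = pairs.groupby(["reviewer_id", "author_id"]).size()
for (reviewer, author), count in counts.items():
    g.add_edge(reviewer, author, weight=int(count))

# Most-reviewed PR authors (agent or human) by weighted in-degree.
in_degree = sorted(g.in_degree(weight="weight"), key=lambda x: x[1], reverse=True)
print(in_degree[:10])
```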
6. Access, Extensibility, and Licensing
AIDev is available at https://github.com/SAILResearch/AI_Teammates_in_SE3. The repository includes:
- `data/`: CSV files for all schema tables
- `schema.sql`: DDL for the relational schema
- `notebooks/`: Jupyter notebooks for analysis (acceptance, latency, complexity, language prevalence)
- `scripts/`: Python scripts for data loading, join operations, and filtering (including AIDev-pop subset creation)
The data collection pipeline supports incremental extension through continued GitHub REST API queries. Researchers may augment the dataset with custom metadata (e.g., test coverage, static analysis results) and cross-link to other systems (Jira, CI/CD data, bug reports) for expanded investigation.
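A hedged sketch of what one step of such incremental extension could look like, querying the public GitHub REST search API for recent PRs opened by an agent account: the agent login, date cutoff, and exact search-qualifier form are illustrative assumptions, the authoritative account list lives in the AIDev pipeline, and real use would need authentication and pagination.

```python
import requests

# Illustrative agent account; substitute the logins tracked by the AIDev pipeline.
AGENT_LOGIN = "devin-ai-integration[bot]"

resp = requests.get(
    "https://api.github.com/search/issues",
    params={"q": f"is:pr author:{AGENT_LOGIN} created:>=2025-07-01", "per_page": 100},
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
resp.raise_for_status()

for item in resp.json()["items"]:
    # Search returns PRs in the issue representation; the owning repository
    # can be recovered from the repository_url field.
    print(item["repository_url"], item["number"], item["title"])
```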
The dataset is released under a permissive open license. By enabling detailed, evidence-based investigations at scale, AIDev is positioned as an empirical foundation for the study of SE 3.0 workflows and the evolving dynamics of human–AI code collaboration (Li et al., 20 Jul 2025).