AIDev Dataset: Autonomous Code Agents

Updated 10 November 2025
  • AIDev dataset is a large-scale empirical dataset capturing autonomous coding agents' PR contributions with detailed metrics on code changes, review latency, and complexity.
  • It systematically aggregates over 456K pull requests from five state-of-the-art agents across 61K repositories, offering both quantitative and qualitative insights.
  • The dataset underpins SE 3.0 research by facilitating comparisons of agent and human behaviors, thus enhancing our understanding of human–AI collaboration in software engineering.

AIDev is a large-scale empirical dataset designed to capture and characterize the activities of autonomous coding agents—goal-driven artificial intelligence systems that autonomously contribute pull requests (PRs) within the open-source software development ecosystem. By systematically collecting and annotating 456,535 PRs from five state-of-the-art agents (OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code) across 61,453 GitHub repositories, AIDev forms a foundational resource for research in real-world, “agentic” software engineering and the emerging paradigm of SE 3.0. The dataset enables rigorous quantitative and qualitative study of agent behaviors, human–agent collaboration, and code quality outcomes, supporting research directions beyond static benchmarks.

1. Dataset Structure and Agent Coverage

AIDev encompasses data from five prominent autonomous code-generation and modification agents, each responsible for varying numbers of PRs and repositories within the data collection window. Table 1 summarizes the high-level scope:

| Agent | PRs | Developers | Repositories |
|---|---|---|---|
| OpenAI Codex | 411,621 | 41,619 | 53,702 |
| Devin (Cognition Labs) | 24,893 | 2,897 | 3,857 |
| GitHub Copilot | 16,531 | 1,916 | 3,097 |
| Cursor (Anysphere) | 1,981 | 753 | 828 |
| Claude Code | 1,509 | 585 | 645 |

Total coverage includes 47,303 unique developers and 61,453 repositories. The dataset also defines AIDev-pop, a filtered subset with PRs limited to repositories with ≥500 stargazers (7,122 PRs, 1,240 developers, 856 repositories), and a human-authored PR baseline in those projects (6,628 PRs), enabling matched comparative analysis (Li et al., 20 Jul 2025).

2. Data Schema and Metadata

AIDev is distributed as a set of CSV tables organized in a relational schema, together with Python scripts for loading and analysis. Key tables and representative fields include:

  • pull_request: pr_id, repo_id, author_id, is_agentic, agent_name, state, created_at, merged_at, additions, deletions, files_changed, cyclomatic_delta (ΔCC), loc_delta (ΔLOC), merged
  • repository: repo_id, name, primary_language, stargazers_count, forks_count, created_at
  • user: user_id, login, type (User/Bot), company, location, followers
  • pr_timeline: event_id, pr_id, event_type (opened, review_requested, etc.), actor_id, actor_type, created_at
  • pr_reviews: review_id, pr_id, reviewer_id, state (approved, changes_requested, commented), submitted_at
  • pr_comments: comment_id, pr_id, author_id, body, created_at
  • pr_commits: commit_id, pr_id, sha, author_date, commit_date
  • pr_commit_details: detail_id, commit_id, file_path, additions, deletions, patch
  • related_issue: pr_id, issue_id, relation_type (closes, references)
  • issue: issue_id, repo_id, title, state, created_at, closed_at

All timestamps use ISO 8601, and all code-change-related metrics are reported as integers/floats according to semantic type.
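
Since the schema maps one table per CSV file, a typical analysis session begins by loading and joining these tables. The snippet below is a minimal sketch, assuming the CSV file names mirror the table names under data/ and that is_agentic is stored as a boolean or 0/1 flag; it reconstructs an AIDev-pop-style slice by joining pull_request with repository and filtering on stargazer count.

```python
import pandas as pd

# Assumed layout: one CSV per schema table under data/, named after the table.
prs = pd.read_csv("data/pull_request.csv", parse_dates=["created_at", "merged_at"])
repos = pd.read_csv("data/repository.csv", parse_dates=["created_at"])

# Attach repository popularity metadata to each PR.
prs = prs.merge(
    repos[["repo_id", "name", "primary_language", "stargazers_count"]],
    on="repo_id",
    how="left",
)

# AIDev-pop-style subset: agentic PRs in repositories with >= 500 stargazers.
pop = prs[prs["is_agentic"].astype(bool) & (prs["stargazers_count"] >= 500)]
print(pop.groupby("agent_name")["pr_id"].count())
```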

3. Code Complexity and Review Metrics

AIDev quantifies structural and process-oriented dimensions of code contributions using standard static analysis and process event modeling:

  • Cyclomatic Complexity Change: $\Delta CC = CC_1 - CC_0$, where $CC_1$ and $CC_0$ are the McCabe cyclomatic complexity summed across all functions post- and pre-PR, respectively.
  • LOC Change: $\Delta LOC = LOC_1 - LOC_0$, with LOC the count of non-blank, non-comment lines modified.
  • Acceptance (Merge) Rate:

$r = \frac{\text{number of merged PRs}}{\text{total PRs}}$

  • Mean Review Latency:

$T_{\mathrm{review}} = \frac{1}{N} \sum_{i=1}^{N} \left( t_{\mathrm{merged},i} - t_{\mathrm{opened},i} \right)$

with $t_{\mathrm{opened},i}$ and $t_{\mathrm{merged},i}$ the $i$-th PR's open and merge timestamps.

These metrics facilitate precise comparison of agent behavior, code changes, and review processes, both between agents and against human baselines.
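
As an illustration of how these definitions translate into code, the following sketch computes $\Delta CC$, the merge rate, and the mean review latency from the fields described in Section 2. It is not the dataset's own tooling: radon's cc_visit is used here as one convenient implementation of the McCabe metric, and the timestamp columns are assumed to be parsed as datetimes.

```python
import pandas as pd
from radon.complexity import cc_visit  # one convenient McCabe-complexity implementation

def delta_cc(source_before: str, source_after: str) -> int:
    """Delta CC = CC_1 - CC_0: McCabe complexity summed over all functions, post- minus pre-PR."""
    cc0 = sum(block.complexity for block in cc_visit(source_before))
    cc1 = sum(block.complexity for block in cc_visit(source_after))
    return cc1 - cc0

def merge_rate(prs: pd.DataFrame) -> float:
    """Acceptance rate r = number of merged PRs / total PRs."""
    return prs["merged"].astype(bool).mean()

def mean_review_latency_hours(prs: pd.DataFrame) -> float:
    """Mean review latency T_review = mean(t_merged - t_opened) over merged PRs, in hours."""
    merged = prs.loc[prs["merged"].astype(bool), ["created_at", "merged_at"]]
    latency = merged["merged_at"] - merged["created_at"]
    return latency.dt.total_seconds().mean() / 3600.0
```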

4. Empirical Findings

Analysis of AIDev-pop and matched human PRs reveals several salient empirical patterns:

  • PR Size (lines added/deleted, mean per PR):
    • OpenAI Codex: ~32 added, 8 deleted
    • Devin: ~45 added, 12 deleted
    • GitHub Copilot: ~27 added, 24 deleted
  • Acceptance Rates (merge rate):
    • Human PRs: 76.8%
    • OpenAI Codex: 64%
    • Devin: 49%
    • GitHub Copilot: 35%
    • Cursor: 51.4%
    • Claude Code: 52.5%
  • Median Review Latencies (accepted PRs, hours):
    • Humans: 3.9
    • OpenAI Codex: 0.3 (18 min)
    • Devin: 2.2
    • Copilot: 17.2
    • Cursor: 2.4
    • Claude Code: 6.9
  • Speed vs. Acceptance Trade-Off: Codex PRs are reviewed quickly but accepted less frequently, whereas Copilot PRs take longer to reach acceptance even though rejections arrive quickly.
  • Structural Simplicity: In a case study, only 9.1% of agentic PRs changed cyclomatic complexity vs. 23.3% of human PRs, indicating agents tend to submit smaller, less structurally invasive changes (see the sketch below).

A plausible implication is that current autonomous agents prioritize rapid, low-risk modifications over complex refactoring or architectural contributions.
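
The structural-simplicity comparison can be approximated directly from the precomputed cyclomatic_delta column. The snippet below is a rough sketch under the assumption that the AIDev-pop PRs and the matched human baseline share one pull_request table distinguished by the is_agentic flag.

```python
import pandas as pd

prs = pd.read_csv("data/pull_request.csv")

# Share of PRs whose change altered cyclomatic complexity at all (Delta CC != 0),
# split into agentic vs. human-authored PRs.
touched_cc = (
    prs.assign(changed_cc=prs["cyclomatic_delta"] != 0)
       .groupby(prs["is_agentic"].astype(bool))["changed_cc"]
       .mean()
)
print(touched_cc)  # the case study reports roughly 9.1% (agentic) vs. 23.3% (human)
```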

5. Methodological Implications and Use Cases

AIDev enables numerous research directions:

  • Benchmarking: Supports moving beyond artificial evaluations (e.g., SWE-bench) by enabling "living" assessment based on real PR integration and review.
  • Agent Readiness: Direct empirical comparison of agent performance on feature, bug-fix, and documentation tasks informs tool selection.
  • Collaboration Modeling: Rich timeline, review, and comment data allow reconstruction of interaction networks and sentiment analysis in human–AI code review.
  • Optimization and Latency Analysis: GitHub Actions logs facilitate identification of agent and infrastructure-driven sources of process delay.
  • Governance and Accountability: Authorship metadata supports traceability and authorship standards, e.g., via "Co-Authored-By: Claude" commit labeling.
  • Future Research Directions: Proposals include quantifying "trust gap" via review depth and merge likelihood, longitudinal analysis of code quality, and extraction of multi-step agent reasoning from PR descriptions.

This breadth of use cases distinguishes AIDev from purely synthetic or task-oriented benchmarks.

6. Access, Extensibility, and Licensing

AIDev is available at https://github.com/SAILResearch/AI_Teammates_in_SE3. The repository includes:

  • data/: CSV files for all schema tables
  • schema.sql: DDL for relational schema
  • notebooks/: Jupyter notebooks for analysis (acceptance, latency, complexity, language prevalence)
  • scripts/: Python scripts for data loading, join operations, and filtering (including AIDev-pop subset creation)

The data collection pipeline supports incremental extension through continued GitHub REST API queries. Researchers may augment the dataset with custom metadata (e.g., test coverage, static analysis results) and cross-link to other systems (Jira, CI/CD data, bug reports) for expanded investigation.
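
A minimal sketch of such an incremental extension is shown below, using only the standard GitHub REST API endpoint for listing a repository's pull requests; the agent bot logins are illustrative placeholders rather than values taken from the dataset, which records the actual agent identities in its user table.

```python
import requests

# Illustrative placeholders; real agent account names should be taken from AIDev's user table.
AGENT_LOGINS = {"example-codex[bot]", "example-devin[bot]"}

def fetch_recent_prs(owner: str, repo: str, token: str, per_page: int = 100) -> list[dict]:
    """Fetch one page of the most recently updated PRs for a repository."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/pulls",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        params={"state": "all", "sort": "updated", "direction": "desc", "per_page": per_page},
    )
    resp.raise_for_status()
    return resp.json()

def agentic_only(prs: list[dict]) -> list[dict]:
    """Keep only PRs authored by a known agent account."""
    return [pr for pr in prs if pr["user"]["login"] in AGENT_LOGINS]
```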

The dataset is released under a permissive open license. By enabling detailed, evidence-based investigations at scale, AIDev is positioned as an empirical foundation for the study of SE 3.0 workflows and the evolving dynamics of human–AI code collaboration (Li et al., 20 Jul 2025).
