AgentPack: Collaborative Code-Editing Corpus
- AgentPack is a large-scale, real-world corpus of 1,337,012 co-authored code edits, integrating contributions from agents like Claude Code, OpenAI Codex, and Cursor Agent with human oversight.
- It employs a multi-stage pipeline—comprising event detection, merge filtering, patch extraction, and metadata integration—to ensure high-quality, naturally labeled code-editing examples.
- Fine-tuning experiments with AgentPack demonstrate notable performance gains in LLM-based code editors, emphasizing the value of multi-file edits and complex, agent-generated rationales.
AgentPack is a large-scale corpus comprising 1,337,012 real-world code edits co-authored by software-engineering agents (Claude Code from Anthropic, OpenAI Codex, and Cursor Agent) and humans in public GitHub repositories between April 1 and August 15, 2025. The primary objective of AgentPack is to provide high-quality, naturally labeled code-editing exemplars suitable for fine-tuning and analyzing LLMs on code-editing tasks. The dataset uniquely focuses on human–agent collaborative activity filtered via maintainer approval, distinguished by semantically scoped patches with detailed, agent-generated rationales and natural-language intentions (Zi et al., 26 Sep 2025).
1. Source Agents and Dataset Scope
AgentPack identifies code-editing events through agent-specific commit and pull request signatures:
- Claude Code: Commits contain the annotation `Co-Authored-By: Claude <[email protected]>`.
- OpenAI Codex: Pull request descriptions include links to `chatgpt.com/codex/tasks`.
- Cursor Agent: Commits bear the author line `Cursor Agent <[email protected]>` (see the detection sketch below).
The corpus spans 59 GB, drawn from material accepted into default branches ("main" or "master"). The time window commences one week after the Claude Code launch (2025-04-01) and ends on 2025-08-15. This interval captures the initial widespread adoption of these agents in open-source workflows.
2. Identification, Curation, and Quality Control Pipeline
AgentPack employs a multi-stage pipeline for data identification, curation, and quality assurance:
- Event Detection: Public GitHub Archive (GH Archive) events are queried for push and pull_request activity, matching agent-specific signatures.
- Repository Cloning and Merge Filtering: For each repository containing candidate agent–human co-authored edits, a shallow bare clone is performed, retaining only those commits merged into the default branch, thus leveraging the project maintainers' review process as an implicit human-in-the-loop quality control mechanism.
- Patch Extraction and Noise Reduction: The complete git diff (all hunks) is extracted per commit, with patches referencing files under `node_modules/` removed to avoid contaminating the corpus with third-party vendor code (see the sketch after this list).
- Metadata Integration: Commit metadata, including timestamp, agent label, and the original agent-generated description, is joined with patch content to form individual AgentPack items.
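A sketch of the patch-extraction and vendor-filter step (the git invocation and helper name are assumptions for illustration; only the `node_modules/` filter comes from the paper):

```python
import subprocess

def extract_patch(repo_dir: str, sha: str) -> str | None:
    """Extract a commit's full diff, dropping per-file patches under node_modules/."""
    diff = subprocess.run(
        ["git", "-C", repo_dir, "show", "--format=", sha],
        capture_output=True, text=True, check=True,
    ).stdout
    # Each per-file patch starts with a "diff --git a/... b/..." header line.
    kept = [
        p for p in diff.split("diff --git ")[1:]
        if "/node_modules/" not in p.split("\n", 1)[0]
    ]
    return "diff --git " + "diff --git ".join(kept) if kept else None
```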
An illustrative scoring function for prioritizing high-signal commits is defined as

$$S(c) = \alpha \cdot \frac{1}{f(c)} + \beta \cdot \log\bigl(1 + |m(c)|\bigr),$$

where $f(c)$ is the number of files touched by commit $c$ and $|m(c)|$ the length of its message: the $1/f(c)$ term favors edits affecting fewer files, the log term rewards message length, and $\alpha, \beta$ are tunable weights.
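A minimal sketch of this scoring function under the reconstruction above (the default weights are placeholders, not values from the paper):

```python
import math

def commit_score(num_files: int, msg_len: int,
                 alpha: float = 1.0, beta: float = 0.5) -> float:
    """Illustrative S(c): 1/files favors narrowly scoped edits; the log term
    rewards longer, more descriptive commit messages."""
    return alpha / max(num_files, 1) + beta * math.log1p(msg_len)
```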
3. Adoption Trends and Quantitative Metrics
AgentPack documents rapid adoption of LLM-based agents in software-engineering practice:
- Aggregate Commits Over April–August 2025:
- Claude Code: ~854,946 commits
- Codex: ~372,006 commits
- Cursor Agent: ~110,060 commits
Cumulative commit-rate functions $N_a(t)$ track agent-specific uptake, and the adoption ratio is given by

$$r_a(t) = \frac{N_a(t)}{\sum_{a'} N_{a'}(t)},$$

with $a \in \{\text{Claude Code}, \text{Codex}, \text{Cursor Agent}\}$. Claude Code exhibited the fastest growth (steep cumulative increases over May–June), with weekly volume peaking mid-May (~10,000 commits/week) before plateauing. Codex peaked in June, whereas Cursor Agent usage increased steadily but at a lower rate, suggesting more targeted deployment scenarios.
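As a snapshot over the full window, the adoption ratio can be computed directly from the aggregate counts above (the helper is illustrative):

```python
def adoption_ratio(counts: dict[str, int]) -> dict[str, float]:
    """r_a = N_a / sum over a' of N_a': each agent's share of attributed commits."""
    total = sum(counts.values())
    return {agent: n / total for agent, n in counts.items()}

print(adoption_ratio({"Claude Code": 854_946,
                      "Codex": 372_006,
                      "Cursor Agent": 110_060}))
# {'Claude Code': 0.639..., 'Codex': 0.278..., 'Cursor Agent': 0.082...}
```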
4. Structural Properties of Edits
AgentPack's code edits display distinct structural characteristics in comparison to historical human-only corpora:
- Median Per-Edit Statistics:
- Files touched: 2
- Patch size: 70 lines (added + removed)
- Hunks per file: 1.5
- Commit message length: 323 characters
Comparative benchmarks include CommitPackFT (single-file edits: patch size 4 lines, message length 43 chars) and CanItEdit (patch size 7 lines, message length 57 chars); thus, AgentPack edits are approximately 10× larger, with commit messages 6–10× longer.
Code-edit complexity is quantified as a weighted combination of patch size and scope,

$$C(e) = \ell(e) + \lambda \cdot f(e),$$

where $\ell(e)$ is the number of changed lines, $f(e)$ the number of files touched, and $\lambda$ a tunable weight. AgentPack's complexity distribution is bimodal, with modes at approximately 30 and 120, corresponding to a spectrum from rapid single-file fixes to multi-file refactorings.
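A toy instantiation of this metric (the linear form and the weight λ = 25 are assumptions chosen so that the two reported modes are reproduced, not values from the paper):

```python
def edit_complexity(lines_changed: int, files_touched: int,
                    lam: float = 25.0) -> float:
    """Composite complexity C(e) = lines + lam * files (lam is assumed).

    A 5-line single-file fix scores 30; a 70-line, 2-file edit scores 120,
    matching the two modes of the distribution described above.
    """
    return lines_changed + lam * files_touched
```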
| Dataset | Median Patch Size | Median Message Length |
|---|---|---|
| AgentPack | 70 lines | 323 chars |
| CommitPackFT | 4 lines | 43 chars |
| CanItEdit | 7 lines | 57 chars |
5. Fine-Tuning Experiments and Benchmarking
A subset of 118,848 Python-only edits (≤4096 tokens, ~120M tokens total) was used to fine-tune DeepSeekCoder 1.3B and 6.7B models from the EditCoder family. The prompt template was:
```
## Instruction:
{message}

## Code Before:
{old}

## Code After:
```
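A small helper (hypothetical, not from the paper's released code; `message` and `old` follow the template's placeholders) showing how one training example is rendered:

```python
def build_prompt(message: str, old: str) -> str:
    """Render one AgentPack example in the fine-tuning format above.

    During training, the post-edit code is appended after "## Code After:"
    as the completion target.
    """
    return (
        "## Instruction:\n"
        f"{message}\n\n"
        "## Code Before:\n"
        f"{old}\n\n"
        "## Code After:\n"
    )
```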
Training parameters included the AdamW optimizer, learning rate 2e-5, batch size 64, 3 epochs, and cosine decay with 10% warmup. Benchmark evaluations involved HumanEvalFix (bug-fixing) and CanItEdit tasks, using pass@1 (20 samples, temperature=0.2, top-p=0.95).
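For reference, pass@1 over 20 samples can be computed with the standard unbiased pass@k estimator of Chen et al. (2021); the helper below is a sketch, not the paper's evaluation harness:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n generated samples per task, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 20 samples per task, as in the evaluation above; e.g., with 5 correct samples:
print(pass_at_k(n=20, c=5, k=1))  # 0.25
```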
Results (HEF = HumanEvalFix pass@1; CI = CanItEdit pass@1; Δ = AgentPack − Base):
| Model | Base HEF | Base CI | EditCoder HEF | EditCoder CI | AgentPack HEF | AgentPack CI | Δ_HEF | Δ_CI |
|---|---|---|---|---|---|---|---|---|
| DeepSeekCoder-1.3B | 0.19 | 0.11 | 0.20 | 0.29 | 0.32 | 0.32 | 0.13 | 0.21 |
| DeepSeekCoder-6.7B | 0.39 | 0.30 | 0.45 | 0.42 | 0.48 | 0.41 | 0.09 | 0.11 |
AgentPack-driven fine-tuning produced significant gains relative to both the base models and the existing EditCoder models (significance assessed via paired bootstrapping). Ablations demonstrate that removing multi-file edits reduced pass@1 by ≈0.04, underscoring the value of complex, multi-file examples.
6. Distinctive Advantages and Limitations
Advantages:
- Rich natural-language intent articulation: commit messages in AgentPack are 6–10× longer than those in human-only datasets such as CommitPackFT and CanItEdit.
- Broader operational scope: includes multi-file changes, test additions, refactorings, and documentation edits.
- Human-vetted commit quality: inclusion is restricted to changes merged to mainline branches, leveraging existing project governance processes.
Limitations:
- Absence of original user prompts; only agent-generated rationales are present.
- Uncertainty regarding Cursor Agent's backend model identity due to lack of disclosure.
- Possible human modification of some agent-authored commits post-merge.
- Exclusively public-project data; excludes unmerged or private edits, introducing a potential repository bias.
7. Implications and Future Directions
AgentPack establishes a paradigm for constructing high-quality code-editing datasets grounded in real-world, human-vetted agent–human collaboration. A plausible implication is that the combination of human-in-the-loop quality control, enriched linguistic context, and increased structural scope has direct performance benefits for downstream LLM-based code editors.
Future avenues include extending coverage to additional programming languages and low-resource ecosystems, reconstructing original prompt–response dialogue contexts, and deploying AgentPack as an RL training environment for software-engineering agents in open-world settings. AgentPack serves as both a benchmark and resource for systematically studying agent integration into software development workflows and informs the next generation of code-editing system development (Zi et al., 26 Sep 2025).