
Agentic Pull Requests in Autonomous Software Contributions

Updated 9 January 2026
  • Agentic pull requests are autonomous, end-to-end code contributions by AI agents that plan, code, validate, and submit changes without human prompting.
  • They combine multi-step reasoning with evolving process-centric metrics, reshaping practice in testing, dependency management, refactoring, and security.
  • Empirical studies reveal significant impacts on software quality, including higher test inclusion rates, increased code churn, and distinct review dynamics that inform best practices.

Agentic pull requests (PRs) are autonomous, end-to-end code contributions authored by AI coding agents and submitted to collaborative version control platforms such as GitHub. Distinguished from human-driven PRs by their independent planning, code transformation, validation, and submission, agentic PRs manifest a distinct paradigm in contemporary software engineering, where AI teammates execute software development tasks with minimal human intervention. These workflows are characterized by multi-step agentic reasoning, stochastic trajectories, and evolving process-centric metrics. The empirical characterization of agentic PRs reveals substantial implications across software quality dimensions, including testing, refactoring, dependency hygiene, security, performance, trust, and review dynamics.

1. Formal Definitions and Empirical Scope

An agentic pull request is any GitHub PR whose commits and diffs are generated end-to-end by an autonomous coding agent, such as Claude Code, OpenAI Codex, Devin, Cursor, or GitHub Copilot (Twist, 12 Dec 2025, Watanabe et al., 18 Sep 2025, Horikawa et al., 6 Nov 2025, Haque et al., 7 Jan 2026). The agent plans, writes, and submits code changes (spanning imports, dependency manifest updates, tests, documentation, refactorings, and configuration) without stepwise human prompting or editing. The AIDev dataset is the canonical resource, encompassing over 930,000 PRs from more than 116,000 repositories and supporting statistically robust comparative studies across agents, languages, and task types. Agentic PRs are identified by PR metadata, the author field, and the absence of preceding human commits, which enforces the formal autonomy criterion (Liu et al., 2 Dec 2025, Siddiq et al., 1 Jan 2026).

2. Testing Practices and Metrics in Agentic Pull Requests

Testing is a critical aspect of agent-driven workflows. The prevalence of test code, timing of test introduction, and the review dynamics of test-inclusive PRs are rigorously quantified within the AIDev-pop subset (Haque et al., 7 Jan 2026). The principal metrics are:

  • Test Inclusion Rate:

$$\text{test\_inclusion\_rate} = \frac{\#\,\text{PRs with test code}}{\#\,\text{total agentic PRs}}$$

This rate grew from 31% (Jan 2025) to 52% (Jul 2025) overall, with agent-specific divergence (e.g., Claude: 37% → 55%, Codex: 31% → 58%, Cursor: 14% → 23%, Devin: ~31% flat).

  • Code Churn:

For each PR, $\text{Churn(PR)} = \sum_{c \in PR} (\text{additions}_c + \text{deletions}_c)$. Test PRs consistently exhibit higher code churn than non-test PRs (medians: 133–1,736 LOC vs. 39–183 LOC).

  • Test-to-Code Churn Ratio:

$R_{tc} = \frac{\text{test churn}}{\text{non-test churn}}$. Copilot exhibits near parity (0.87), whereas Claude and Cursor are more production-centric (0.42).

  • Turnaround Time:

$\text{Turnaround(PR)} = t_{closed\_at}(PR) - t_{created\_at}(PR)$. Test PRs require longer review cycles (up to ~38 h median for Devin), whereas non-test PRs complete faster (mostly <7 h).

  • Merge Rate:

$\text{merge\_rate} = \frac{\#\,\text{merged PRs}}{\#\,\text{closed PRs}}$. Merge likelihood is broadly stable between test and non-test PRs across agents, except Devin (test PRs merge less often).

Tests are introduced predominantly at PR inception, but ~30% arise post-initial commit, often necessitating human revision—indicative of an evolving collaborative feedback loop. The systematic upward trend in test adoption and its correlation with PR size and review time reflect the maturation of agentic testing practices.
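
The metrics above can be computed directly from PR metadata. The sketch below is illustrative only; the record fields (test_churn, nontest_churn, created_at, closed_at, merged) are hypothetical stand-ins for whatever a mined dataset such as AIDev actually exposes.

```python
# Illustrative computation of the Section 2 metrics from PR records.
# Field names are hypothetical stand-ins, not the AIDev schema.
from statistics import median

def pr_metrics(prs):
    """prs: list of dicts with test_churn, nontest_churn (LOC), created_at,
    closed_at (datetime or None), and merged (bool)."""
    with_tests = [pr for pr in prs if pr["test_churn"] > 0]
    closed = [pr for pr in prs if pr["closed_at"] is not None]

    test_inclusion_rate = len(with_tests) / len(prs)
    churn = [pr["test_churn"] + pr["nontest_churn"] for pr in prs]
    test_to_code_ratio = (sum(pr["test_churn"] for pr in with_tests)
                          / sum(pr["nontest_churn"] for pr in with_tests))
    turnaround_h = [(pr["closed_at"] - pr["created_at"]).total_seconds() / 3600
                    for pr in closed]
    merge_rate = sum(pr["merged"] for pr in closed) / len(closed)

    return {"test_inclusion_rate": test_inclusion_rate,
            "median_churn": median(churn),
            "test_to_code_churn_ratio": test_to_code_ratio,
            "median_turnaround_h": median(turnaround_h),
            "merge_rate": merge_rate}
```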

3. Library Usage, Dependency Hygiene, and Supply-Chain Security

Agentic PRs are characterized by ecosystem-aware import strategies and a cautious approach to dependency expansion (Twist, 12 Dec 2025, Singla et al., 1 Jan 2026). Metrics include:

  • Library Import Rate:

$$r_{import} = \frac{N_{import}}{N}$$

(29.5% of PRs add at least one import).

  • New Dependency Introduction Rate:

$$r_{new} = \frac{N_{new}}{N}$$

(1.3% of PRs modify dependency manifests).

  • Version Specification Rate:

$$r_{version} = \frac{N_{version}}{N_{new}}$$

(75% of new manifest edits specify an explicit version, far exceeding the pinning rates observed with direct LLM prompting).

Agents select from a highly diverse set of libraries (>3,900 distinct across Python, JS, Go, and C#) and prefer context-appropriate utilities (React for JS, pytest for Python). However, agentically authored dependency changes have increased net vulnerability exposure: 2.46% select known-vulnerable versions, and 36.8% of remediations demand major version jumps (disruptive), versus human rates of 1.64% vulnerable and 12.9% major jumps (Singla et al., 1 Jan 2026). Practical mitigation requires PR-time vulnerability screening, registry-aware guardrails, and explicit advisory integration.
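
As one concrete shape such a guardrail could take, the sketch below screens dependencies added in a PR against an advisory set before merge. It is a minimal illustration under stated assumptions, not the tooling used in the cited studies; the advisory table and the manifest-diff format are placeholders.

```python
# Minimal PR-time dependency guardrail sketch; the advisory set and the
# (package, version) diff format are hypothetical placeholders.
KNOWN_VULNERABLE = {
    ("examplelib", "1.2.3"),   # stand-in entries; a real check would query an advisory feed
    ("otherlib", "0.9.0"),
}

def screen_added_dependencies(added):
    """added: iterable of (package, pinned_version_or_None) parsed from the
    PR's dependency-manifest diff (package.json, requirements.txt, etc.)."""
    findings = []
    for pkg, version in added:
        if version is None:
            findings.append((pkg, "unpinned: cannot be checked against advisories"))
        elif (pkg, version) in KNOWN_VULNERABLE:
            findings.append((pkg, f"{version} matches a known advisory"))
    return findings

# A non-empty result would block the agentic PR or request changes.
print(screen_added_dependencies([("examplelib", "1.2.3"), ("newlib", None)]))
```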

4. Refactoring, Code Quality, and Maintenance Dynamics

Refactoring is prevalent and intentional in agentic PRs. Across Java PRs, agents dedicated 38.6% of commits to explicit refactoring, with a further 53.9% containing incidental cleanup (Horikawa et al., 6 Nov 2025). Agentic refactoring skews towards low-level, consistency-centric edits (Change Variable Type: 11.8%, Rename Parameter: 10.4%, Rename Variable: 8.5%), prioritizing maintainability (52.5%) and readability (28.1%) over duplication removal and architectural change (the inverse of human trends).

Quality impact assessment yields statistically significant but modest improvements in structural metrics:

  • Class LOC: median Δ = –15.25
  • Weighted Methods/Class: median Δ = –2.07
  • Fan-Out, Fan-In: negligible effect size
  • No discernible impact on design smell counts.

Medium-level refactorings confer maximal structural improvement. Low-level edits marginally increase cyclomatic complexity, yielding a trade-off between readability and control-flow simplicity.
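
As a sketch of how such quality deltas can be aggregated, the snippet below computes the median per-class change in a structural metric together with a paired non-parametric significance test. The toy data and the choice of the Wilcoxon test are assumptions for illustration, not necessarily the cited study's exact statistics.

```python
# Sketch: aggregate per-class structural deltas (e.g. Class LOC) around a
# refactoring PR and test whether the shift is significant. Metric extraction
# itself is assumed to come from an external static-analysis tool.
from statistics import median
from scipy.stats import wilcoxon

def structural_delta(before, after):
    """before/after: metric values for the same classes, pre- and post-PR."""
    deltas = [a - b for b, a in zip(before, after)]
    stat, p = wilcoxon(before, after)   # paired, non-parametric test
    return {"median_delta": median(deltas), "p_value": p}

# Hypothetical Class LOC values for four classes touched by a refactoring PR.
print(structural_delta([320, 210, 145, 98], [300, 195, 140, 99]))
```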

5. Performance, Energy, and Security Contributions

Agentic PRs span a broad spectrum of optimization behaviors.

  • Performance PRs: BERTopic analysis clusters 1,221 PRs into 52 topics covering compiler, build chain, caching, database, network, hardware, analytics, UI, AI inference, and infrastructure (Opu et al., 31 Dec 2025). Acceptance rates and merge times vary by layer: low-level changes (>85% acceptance, <20 h review latency) are trusted, while UI, analytics, and AI-specific optimizations (<65% acceptance, >48 h latency) face scrutiny (a minimal topic-modeling sketch follows this list).
  • Energy-Aware PRs: Among 216 manually confirmed PRs (Mitul et al., 31 Dec 2025), work is classified as Insight (36.6%), Setup (6.5%), Optimization (36.6%), Trade-off (4.6%), and Maintenance (16.2%). Optimization PRs have lower acceptance (83%) and longer merge times, largely due to maintainability overhead (median change size of 45 lines vs. 12 for other categories, with more files touched).
  • Security PRs: Security-relevant agentic PRs constitute ≈4% of activity (Siddiq et al., 1 Jan 2026). Agents perform supporting actions (testing, documentation, error handling, config), with merge rates at 61.5% (vs. 77.3% for non-security PRs). Security PRs are reviewed more slowly (median 3.92 h), and rejection odds increase with PR complexity and verbosity rather than with explicit keyword usage.
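
The topic-modeling sketch referenced in the performance bullet above can be as small as the following; it uses the open-source BERTopic library, but the input file and default parameters are placeholders rather than the cited study's pipeline.

```python
# Illustrative BERTopic clustering over performance-PR descriptions; the input
# file and the default parameters are placeholders, not the cited configuration.
from bertopic import BERTopic

# One PR title/description per line; "pr_descriptions.txt" is a hypothetical export.
with open("pr_descriptions.txt") as f:
    pr_texts = [line.strip() for line in f if line.strip()]

topic_model = BERTopic()                          # default embedding model + HDBSCAN
topics, probs = topic_model.fit_transform(pr_texts)
print(topic_model.get_topic_info().head(10))      # largest clusters with keyword summaries
```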

6. Review Effort, Process-Centric Metrics, and Trust

Agentic PRs trigger distinctive review regimes. Analysis reveals a bimodal split: 28.3% are instant merges (narrow-scope changes), while 71.7% enter iterative review loops with notable ghosting (rejection without agent follow-up) (Minh et al., 2 Jan 2026). Predictive triage (Circuit Breaker, LightGBM) based on static features (additions, total changes, files touched, body length, plan presence, agent identity) achieves high AUC (0.957), intercepting up to 69% of total review load at a 20% budget. Semantic content (TF-IDF, CodeBERT) offers negligible incremental value.
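
A minimal version of such a triage model might look like the sketch below, assuming a pandas DataFrame of static PR features plus a binary label for whether the PR entered an iterative review loop; the column names, label definition, and hyperparameters are illustrative assumptions, not the published Circuit Breaker configuration.

```python
# Sketch of static-feature review triage in the spirit of the LightGBM baseline;
# the DataFrame columns and the label definition are assumptions for illustration.
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

FEATURES = ["additions", "total_changes", "files_touched",
            "body_length", "has_plan", "agent_id"]

def train_triage(df):
    """df: pandas DataFrame with the FEATURES columns plus a binary
    'iterative_review' label (1 = long review loop, 0 = instant merge)."""
    X = df[FEATURES].copy()
    X["agent_id"] = X["agent_id"].astype("category")  # LightGBM handles categories natively
    y = df["iterative_review"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = lgb.LGBMClassifier()
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    return model, auc   # rank incoming PRs by predicted review effort
```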

Process-centric graph metrics (Node Count, Loop Count, Structural Breadth) derived from Graphectory (Liu et al., 2 Dec 2025) expose inefficiencies: chaotic backtracking, file churn, prolonged localization, and patching loops. Anti-patterns (RepeatedView, Scroll, UnresolvedRetry, EditReversion) are prevalent even in successful agentic runs, suggesting the need for structure-aware navigation, syntax-guided editing, and efficiency-aware prompting.
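
Reading such metrics off a trajectory graph is straightforward once the trajectory has been mapped to a directed graph; the sketch below assumes nodes are agent actions and edges are transitions, and the metric definitions shown are plausible readings of the metric names rather than Graphectory's exact formulas.

```python
# Sketch of process-centric metrics over an agent trajectory graph; the graph
# construction and the metric definitions are assumptions, not Graphectory's schema.
import networkx as nx

def trajectory_metrics(g: nx.DiGraph):
    node_count = g.number_of_nodes()
    loop_count = sum(1 for _ in nx.simple_cycles(g))          # retry / backtracking loops
    avg_breadth = (sum(d for _, d in g.out_degree()) / node_count) if node_count else 0.0
    return {"node_count": node_count, "loop_count": loop_count, "breadth": avg_breadth}

# Toy trajectory: view -> edit -> test -> edit (retry loop), then test -> submit.
g = nx.DiGraph([("view", "edit"), ("edit", "test"), ("test", "edit"), ("test", "submit")])
print(trajectory_metrics(g))
```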

Message–code inconsistency in agentic PRs remains a critical trust risk (Gong et al., 8 Jan 2026). High inconsistency (1.7% of PRs) yields a 51.7% absolute drop in acceptance rate and a 3.5× increase in merge time, disproportionately affecting reviewer confidence.
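
One simple screening signal for this risk, distinct from the measurement used in the cited work, is an embedding similarity between the PR message and the diff text; the model choice and the interpretation of a "low" score below are assumptions.

```python
# Heuristic message/diff consistency signal; NOT the cited study's measurement,
# just one plausible screening check a review pipeline could run.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # small general-purpose encoder

def message_diff_similarity(pr_message: str, diff_text: str) -> float:
    emb = model.encode([pr_message, diff_text], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# A low score could route the PR to closer human review.
score = message_diff_similarity(
    "Fix null-pointer crash in session handler",
    "+ if token is None:\n+     raise AuthError('missing token')",
)
print(f"message/diff cosine similarity: {score:.2f}")
```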

7. Implications for Autonomous Software Development Practices

Agentic pull requests have transitioned from experimental workflows to mainstream developer practice in open-source and industrial contexts. Key empirical implications include:

  • Autonomous agents are capable of substantive testing, refactoring, documentation, and infrastructure contributions, with acceptance rates of 44–85% for test PRs and 83.8% overall in specific agent studies (Haque et al., 7 Jan 2026, Watanabe et al., 18 Sep 2025).
  • Review and merge outcomes are largely determined by structural and process signals (size, code churn, scope), agent identity, and task type—rather than PR description semantics or explicit security terms.
  • Persistent challenges include inconsistent messaging, vulnerability-aware dependency practice, maintainability trade-offs in energy and performance optimizations, and suboptimal process trajectories.
  • Best practices mandate integrated quality assurance pipelines: automated testing, vulnerability screening, commit hygiene enforcement, intent labeling, and review-aware prompting.

Ongoing monitoring and empirical analysis are essential to ensure agentic workflows maintain or improve upon the standards established by human-centric development. Researchers and practitioners are advised to combine quantitative metrics, process-centric analyses, and human–AI collaboration studies to further refine autonomous contribution models and address the challenges surfaced in recent empirical literature.
