Coding Agents in Software Engineering
- Coding Agents are autonomous AI tools powered by LLMs that automate multi-step software tasks including code generation, repository navigation, testing, and PR submissions.
- They use an iterative closed-loop interaction approach, generating artifacts like configuration files, commit messages, and PR labels to ensure reproducibility and transparency.
- Empirical studies reveal significant efficiency gains and widespread adoption across projects, while also highlighting challenges in integration and socio-technical collaboration.
AI coding agents are autonomous software engineering tools powered by LLMs that automate multi-step development tasks, such as code generation, repository navigation, compilation, test execution, and submitting pull requests (PRs) to version-control platforms like GitHub. Unlike traditional code-completion LLMs (e.g., Copilot’s inline suggestions), coding agents operate as independent entities capable of understanding high-level tasks, engaging in iterative tool invocation, and generating persistent machine-readable artifacts—including configuration files (AGENTS.md, CLAUDE.md), commit messages, branches, and PRs. Their rapid adoption and distinctive behavioral traces have made them prominent actors in modern software engineering workflows, introducing new empirical opportunities and challenges for mining, measuring, and integrating agentic contributions (Robbes et al., 26 Jan 2026, Lulla et al., 28 Jan 2026, Matricon et al., 26 Jan 2026).
1. Architectural Principles and Operational Workflows
Contemporary coding agents (Codex, Claude Code, Cursor, Copilot agent-mode, etc.) leverage a closed-loop interaction paradigm. Provided with a natural-language task specification—often in the form of a GitHub issue—they:
- Ingest contextual artifacts (README, AGENTS.md, existing codebase, test suites).
- Iteratively select actions (file navigation, edit, compile, test, git operations) using LLM-generated tool calls.
- Evaluate progress via intermediate results, completion criteria (e.g., tests passing), and user/CI feedback.
- Synthesize multi-file code changes, update configuration, and submit PRs or direct commits.
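The closed-loop workflow above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the tool names and the `llm_choose_action` helper are hypothetical stand-ins for an agent's LLM-driven tool selection.

```python
def llm_choose_action(task, context, history):
    """Placeholder for an LLM call that picks the next tool invocation.

    A real agent would prompt the model with the task, repository
    context, and prior tool outputs, then parse a structured tool call.
    """
    if not history:
        return ("read", "README.md")          # ingest contextual artifacts
    if ("test", None) not in history:
        return ("test", None)                 # evaluate progress via tests
    return ("submit_pr", "Fix issue as specified")

def run_agent(task, tools, max_steps=20):
    """Iterate: choose a tool, execute it, feed the result back."""
    context, history = {}, []
    for _ in range(max_steps):
        action, arg = llm_choose_action(task, context, history)
        result = tools[action](arg)           # execute the chosen tool
        history.append((action, arg))
        context[action] = result              # feedback for the next step
        if action == "submit_pr":             # completion criterion reached
            return result
    return None

# Toy tool implementations standing in for file I/O, test runners, and git.
tools = {
    "read": lambda path: f"<contents of {path}>",
    "test": lambda _: "all tests passed",
    "submit_pr": lambda msg: f"PR opened: {msg}",
}

print(run_agent("resolve issue #123", tools))
```

The loop terminates either on the completion criterion or on a step budget, mirroring how deployed agents bound their tool-call iterations.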
Agents can be single-entity or part of multi-agent frameworks (e.g., MAGIS, CodeR), which assign specialized roles—manager, developer, QA engineer, repository custodian—and structure the workflow via planning and task graphs. Explicit protocols guide agents through retrieval (BM25, SBFL), role assignment, developer–QA feedback loops, evidence synthesis, and verification (Tao et al., 2024, Chen et al., 2024).
The critical distinction with line-oriented completion models is agentic autonomy: the ability to reason about and implement composite tasks that span multiple files, navigate codebases, and manage execution environments (Robbes et al., 26 Jan 2026). Agents maintain explicit traces, which support reproducibility and empirical auditing.
2. Artifacts, Traces, and Detection Heuristics
Coding agents produce a wide array of repository-level artifacts:
- Guidance/Manifest Files: AGENTS.md, CLAUDE.md, CURSOR.md encode architecture, build/test commands, coding conventions, operational rules, and agent personas. These files exhibit shallow markdown hierarchies (typically one H1 with several H2s and H3s), with content dominated by technical details, operational instructions, and architecture summaries (Lulla et al., 28 Jan 2026, Chatlatanagulchai et al., 18 Sep 2025).
- Configuration Files: Tool- and agent-specific config files (e.g., .aider.conf.yml, .cursor/, .claude/) signal agent presence and are crucial for agentic operation.
- Commit Metadata: “Co-authored-by” trailers naming agent noreply accounts, dedicated agent user accounts, and branch prefixes denote agentic contributions.
- PR Labels and Branches: Explicit branch name patterns (“codex/”, “cursor/”), PR labels (“ai-generated”, “codex”) provide further evidence of agentic activity.
- Task Results: Multi-file git diffs, test results, coverage reports, and detailed commit messages reflecting agent rationale.
Heuristics for agent detection rely on matched filename patterns, commit trailers, PR/branch labels, and user accounts, collectively covering >60 artifact types. Cross-validation confirms partial observability—40% of config file adopters lack signed agentic commits—necessitating multi-faceted approaches for robust agent identification (Matricon et al., 26 Jan 2026). These heuristics enable mining software repositories (MSR) studies for agentic impact, collaboration, and adoption trends.
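These heuristics are straightforward to operationalize. The sketch below combines three of the signal families named above (guidance filenames, commit trailers, branch prefixes); the specific pattern lists are illustrative, not the published heuristic set.

```python
import re

# Illustrative signal lists; real heuristics cover >60 artifact types.
GUIDANCE_FILES = {"AGENTS.md", "CLAUDE.md", "CURSOR.md", ".aider.conf.yml"}
BRANCH_PREFIXES = ("codex/", "cursor/")
TRAILER_RE = re.compile(
    r"^Co-authored-by:.*\b(claude|codex|copilot|cursor)\b",
    re.IGNORECASE | re.MULTILINE,
)

def agent_signals(commit):
    """Return the set of heuristic signals that fire for one commit.

    `commit` is a dict with 'files', 'message', and 'branch' keys.
    """
    signals = set()
    if GUIDANCE_FILES & set(commit.get("files", [])):
        signals.add("guidance_file")
    if TRAILER_RE.search(commit.get("message", "")):
        signals.add("co_author_trailer")
    if commit.get("branch", "").startswith(BRANCH_PREFIXES):
        signals.add("branch_prefix")
    return signals

commit = {
    "files": ["src/app.py", "AGENTS.md"],
    "message": "Add retry logic\n\nCo-authored-by: Claude <agent@example.com>",
    "branch": "codex/fix-retry",
}
print(agent_signals(commit))
```

Because any single signal is only partially observable (e.g., config-file adopters without signed commits), an MSR pipeline would typically union the fired signals per repository rather than rely on one detector.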
3. Adoption, Task Scope, and Impact on Software Projects
Empirical studies estimate agentic adoption across 129,134 active GitHub projects at 15.85%–22.60% within one year of widespread agent release (Robbes et al., 26 Jan 2026). Adoption is broad, covering the full spectrum of project maturity, organization size, programming languages, and topics. Agents are more prevalent in younger and larger projects, but all deciles show substantial adoption.
Commit-level analysis reveals that AI-assisted commits, compared to human-only and rule-based bot commits, are:
- Significantly larger (median additions: 34 vs 10 lines), touch more files, and have higher churn rates.
- More likely to be feature-oriented (35.7% vs 17% in humans) and bug-fix focused (29.9% vs 27%), with lower rates of maintenance chores (7.1% vs 31%).
- Distributed across major languages (TypeScript, Python, JavaScript) and unique low-resource languages.
Agentic PRs typically bundle changes into single or few commits, focusing modifications into narrowly scoped, self-contained units. Structural organization differs from human PRs in commit count (large effect) and files touched (medium effect), as well as in description alignment: agentic PRs exhibit slightly higher semantic similarity between description and diff, reflecting LLM-driven coherence (Ogenrwot et al., 24 Jan 2026).
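The description-diff alignment measure can be approximated cheaply. The sketch below uses a bag-of-words cosine as a stand-in for the embedding-based similarity such studies compute; the sample description and diff text are invented for illustration.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercased word counts as a crude bag-of-words vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two texts' bag-of-words vectors."""
    ta, tb = tokens(a), tokens(b)
    dot = sum(ta[w] * tb[w] for w in ta)
    na = math.sqrt(sum(v * v for v in ta.values()))
    nb = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb) if na and nb else 0.0

description = "Add input validation to the login handler"
diff_text = "def login_handler(request): validate input before auth"
print(round(cosine(description, diff_text), 2))
```

A production pipeline would substitute sentence embeddings for the token counts, but the comparison logic (PR description versus concatenated diff text) is the same.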
4. Efficiency, Guidance Engineering, and Repository Integration
The introduction of AGENTS.md (and analogous manifest files) yields substantial improvements in agent runtime and output efficiency. In controlled experiments:
- AGENTS.md presence reduced median agent wall-clock runtime from 98.57 s to 70.34 s and output token consumption from 2,925 to 2,440 tokens (Lulla et al., 28 Jan 2026).
- Efficiency gains arise from reduced exploratory navigation, fewer redundant context-retrieval calls (“ls”, “grep”), and clearer build/test workflow execution.
Best practices for crafting effective guidance files include:
| Content Category | Required Features | Example/Notes |
|---|---|---|
| Project Description | Brief summary and primary goals | One H1, short paragraph |
| Architecture | High-level structure, key modules, entry points | Module/package diagram or bullet list |
| Build/Test/Lint Commands | Exact shell invocations, integration/test steps | Markdown fenced code blocks |
| Coding Conventions | Style guides (indentation, naming, file org) | Bullet lists, explicit examples |
| Agent Role | Persona, operational rules, scope of decision-making | Separate section (AI Integration) |
Guidance files are most effective when kept under version control, updated regularly, and referenced in developer documentation and CI checks. Their integration into distributed build systems ensures human–agent consistency and supports agentic triage and automation protocols.
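Putting the table's categories together, a minimal AGENTS.md might look like the following. This is an illustrative sketch: the project, commands, and directory names are placeholders, not a prescribed template.

```markdown
# Payments Service

Small web service that processes payment webhooks (example project).

## Architecture
- `app/`: request handlers and routing
- `core/`: payment-provider integrations
- `tests/`: pytest suite

## Build / Test / Lint
- Install: `pip install -e .[dev]`
- Test: `pytest -q`
- Lint: `ruff check .`

## Coding Conventions
- 4-space indentation; type hints on public functions
- One module per provider integration

## AI Integration
- Agents may edit `app/` and `core/`; never touch `migrations/`
- Run the full test suite before opening a PR
```

Note the shallow hierarchy (one H1, a few H2s) and the exact shell invocations, which are the properties the efficiency findings above attribute gains to.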
5. Acceptance, Failure Modes, and Socio-Technical Dynamics
Coding agent pull requests exhibit high but variable acceptance rates:
- Claude Code–generated PRs: 83.8% merged (54.9% as-is, 45.1% after human revision), versus a 91% acceptance rate for matched human PRs (Watanabe et al., 18 Sep 2025).
- Revision requirements center on bug fixes (45.1%), documentation and refactoring (~25%), and style/CI adjustments (15–22%). Reviewer engagement is critical; 38% of rejected agentic PRs are abandoned pre-review (Ehsani et al., 21 Jan 2026).
- Merge success rates are highest for documentation (84%), CI (79%), and build updates (74%), lowest for performance (55%) and bug-fix tasks (64%) (Ehsani et al., 21 Jan 2026).
- Larger PRs and those failing CI/CD tests are less likely to merge; each failed CI check reduces the odds of merging by ~15% (Cliff’s δ).
- Key rejection patterns include lack of reviewer engagement (38%), duplicate solutions (23%), unwanted features, CI/test failures, and technical misalignment.
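The ~15% per-check odds reduction compounds multiplicatively. A quick sketch of the arithmetic, assuming a hypothetical baseline merge probability and independence between checks (an assumption the cited study does not make):

```python
def merge_probability(base_prob, failed_checks, odds_ratio=0.85):
    """Apply a per-failed-check multiplicative odds penalty.

    Each failed CI check multiplies the merge odds by `odds_ratio`,
    so k failures scale the odds by odds_ratio ** k.
    """
    odds = base_prob / (1 - base_prob)      # probability -> odds
    odds *= odds_ratio ** failed_checks     # apply per-check penalty
    return odds / (1 + odds)                # odds -> probability

# Hypothetical 70% baseline merge probability, 0..3 failed checks.
for k in range(4):
    print(k, round(merge_probability(0.70, k), 3))
```

The odds-to-probability conversion matters here: a 15% odds reduction is a smaller absolute drop in merge probability when the baseline is high.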
These findings underscore the centrality of socio-technical integration: agents must decompose large tasks into minimal units, validate changes against CI before PR submission, and connect task selection to repository context (issues, roadmaps). Legal and governance compliance (CLAs, licensing) remains an unresolved challenge for agentic automation.
Security-related PRs constitute ~4% of agentic PRs, with lower merge rates (61.5% vs 77.3%) and greater review latency (median 3.92h vs 0.11h). PR complexity and verbosity, rather than explicit security terminology, are early predictors of rejection (Siddiq et al., 1 Jan 2026).
6. Evaluation Benchmarks, Model Training, and Limitations
Agent performance on repository-level tasks is benchmarked using datasets such as SWE-Bench, SWA-Bench, SWEE-Bench, and SWE-PolyBench:
- SWE-Bench (12 repos): agents pass ~4.6–8.2% of issues (AutoCodeRover v2), but these benchmarks have distributional mismatch—higher description quality, lower fix complexity—leading to performance overestimation (Vergopoulos et al., 10 Mar 2025).
- SWEE-Bench and SWA-Bench (hundreds of diverse repos): lower pass rates (~4–8%), confirming up to 40% accuracy drop on real-world, more complex tasks. Success rates negatively correlate with number of files and lines edited.
- SWE-PolyBench (multi-language): pass rates top out at ~24% in Python, much lower on JS/TS/Java. Agents struggle on multi-file edits, cross-file context, and deep syntax-tree retrieval (Rashid et al., 11 Apr 2025).
- AgentPack corpus (1.3M agent-human edits): co-authored commits are longer, multi-file, better documented, and span diverse languages and tasks (Zi et al., 26 Sep 2025). Models fine-tuned on AgentPack outperform prior human-only datasets on code-editing benchmarks.
Recent work (UTBoost) highlights test suite insufficiency: LLM-driven test augmentation uncovers 345 erroneous patches in SWE-Bench, inducing up to 41% leaderboard ranking changes. This motivates rigorous intramorphic testing and parser correction in benchmarking pipelines (Yu et al., 10 Jun 2025).
7. Fingerprinting, Attribution, and Empirical Research Implications
AI coding agents leave robust, discriminative behavioral fingerprints at the commit, PR, and code-patch level:
- Multi-class agent classifiers (XGBoost, 41 features) achieve 97.2% F1-score in agent identification; top features include multiline commit ratio (44.7% importance) and change-concentration Gini (10.1%) (Ghaleb, 24 Jan 2026).
- Codex: distinguished by multiline commit ratio and message length.
- Claude Code: unique density of conditional statements and comments per patch.
- Copilot: PR body length and distributed changes.
- Cursor: bullet points and checklist items.
- These fingerprints support governance (pre-merge checks), research validity (contaminant detection in human PRs), and agent design transparency.
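Two of the top-ranked features above are easy to compute from raw commit data. The definitions below are plausible reconstructions for illustration, not the paper's exact feature formulas.

```python
def multiline_ratio(messages):
    """Fraction of commit messages spanning more than one line."""
    return sum("\n" in m.strip() for m in messages) / len(messages)

def gini(changes):
    """Gini coefficient over per-file changed-line counts.

    0 means changes are spread evenly across files; values near 1 mean
    changes are concentrated in a single file.
    """
    xs = sorted(changes)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n

# Toy inputs: 2 of 3 messages are multiline; edits concentrated in one file.
msgs = ["Fix typo", "Add caching\n\nDetailed rationale", "Refactor\n\nNotes"]
print(multiline_ratio(msgs))
print(round(gini([100, 2, 1, 1]), 2))
```

Feeding such per-commit features into a gradient-boosted classifier is what yields the multi-class agent identification results reported above.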
Adversarial adaptation, temporal drift, and cross-platform generalization remain open areas for fingerprinting research.
8. Open Challenges and Future Directions
Research priorities include:
- Generalization of efficiency gains to broader agent architectures, larger/multi-module code changes, and non-code/dynamic tasks (Lulla et al., 28 Jan 2026).
- Correctness and semantic alignment measurement, beyond token/resource efficiency.
- Design of instruction files—optimal granularity, standardization, governance—to support portability and evolution.
- Improved detection heuristics for agent contributions, closing gaps in partial observability and evolving formats (Matricon et al., 26 Jan 2026).
- Benchmarking with diverse, frequently refreshed datasets capturing distributional realities of open-source repositories.
- Integration of agent-generated test suites for self-validation and performance improvement (Yu et al., 10 Jun 2025).
- Socio-technical factors: human-in-the-loop collaboration, reviewer engagement, and process integration.
- Attribution standards to distinguish agentic from human activity, facilitating empirical research and stable workflows (Robbes et al., 26 Jan 2026).
Persistent guidance artifacts, robust benchmarks, and transparent integration protocols will be central for advancing agentic coding practice and research.
References
- (Lulla et al., 28 Jan 2026) On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents
- (Ehsani et al., 21 Jan 2026) Where Do AI Coding Agents Fail? An Empirical Study of Failed Agentic Pull Requests in GitHub
- (Siddiq et al., 1 Jan 2026) Security in the Age of AI Teammates: An Empirical Study of Agentic Pull Requests on GitHub
- (Ogenrwot et al., 24 Jan 2026) How AI Coding Agents Modify Code: A Large-Scale Study of GitHub Pull Requests
- (Robbes et al., 26 Jan 2026) Agentic Much? Adoption of Coding Agents on GitHub
- (Matricon et al., 26 Jan 2026) Promises, Perils, and (Timely) Heuristics for Mining Coding Agent Activity
- (Zi et al., 26 Sep 2025) AgentPack: A Dataset of Code Changes, Co-Authored by Agents and Humans
- (Chatlatanagulchai et al., 18 Sep 2025) On the Use of Agentic Coding Manifests: An Empirical Study of Claude Code
- (Watanabe et al., 18 Sep 2025) On the Use of Agentic Coding: An Empirical Study of Pull Requests on GitHub
- (Tao et al., 2024) MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution
- (Chen et al., 2024) CodeR: Issue Resolving with Multi-Agent and Task Graphs
- (Vergopoulos et al., 10 Mar 2025) Automated Benchmark Generation for Repository-Level Coding Tasks
- (Rashid et al., 11 Apr 2025) SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents
- (Ghaleb, 24 Jan 2026) Fingerprinting AI Coding Agents on GitHub
- (Yu et al., 10 Jun 2025) UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench
- (Sayagh, 24 Dec 2025) What Makes a GitHub Issue Ready for Copilot?