The Impact of Large Language Models (LLMs) on Code Review Process
Abstract: LLMs have recently gained prominence in software development, significantly boosting productivity and simplifying teamwork. Although prior studies have examined task-specific applications, the phase-specific effects of LLM assistance on the efficiency of the code review process remain underexplored. This research investigates the effect of GPT on GitHub pull request (PR) workflows, focusing on reducing resolution time, optimizing phase-specific performance, and assisting developers. We curated a dataset of 25,473 PRs from 9,254 GitHub projects and identified GPT-assisted PRs using a semi-automated heuristic approach that combines keyword-based detection, regular-expression filtering, and manual verification, reaching 95% labeling accuracy. We then applied statistical modeling, including multiple linear regression and the Mann-Whitney U test, to evaluate differences between GPT-assisted and non-assisted PRs, both at the overall resolution level and across distinct review phases. Our results show that early adoption of GPT can substantially improve the effectiveness of the PR process, yielding considerable time savings at several stages. Our findings suggest that GPT-assisted PRs reduced median resolution time by more than 60% (9 hours versus 23 hours for non-assisted PRs), that utilizing GPT can reduce review time by 33%, and that it can cut the waiting time before acceptance by 87%. Analyzing a sample of 300 GPT-assisted PRs, we found that developers predominantly use GPT for code optimization (60%), bug fixing (26%), and documentation updates (12%). This research sheds light on the impact of GPT on the code review process, offering actionable insights for software teams seeking to enhance workflows and promote seamless collaboration.
Practical Applications
Immediate Applications
The following items can be implemented with current tools and workflows to improve pull-request (PR) reviews, based on the paper’s empirical findings of reduced resolution time (≈61%), review time (≈66.7%), and waiting time before acceptance (≈87.5%) in GPT-assisted PRs.
- Phase-aware LLM reviewer for GitHub/GitLab (Software, DevOps)
- Use case: A GitHub Action or GitLab CI job that triggers on PR submission and first review to generate structured, phase-specific assistance: code optimization suggestions, bug-fix hints, and documentation updates.
- Potential tools/products/workflows:
- “AI Review Triage” GitHub Action that posts initial comments within 1 hour (targeting the At Review phase).
- “Waiting-time reducer” bot that proposes actionable changes during iterative review cycles (At Waiting before Change phase).
- Prompt libraries tailored to refactoring, error-handling insertion, and doc updates.
- Assumptions/dependencies: Reliable access to LLM APIs; repository context retrieval; human-in-the-loop gating; adherence to project coding standards; maintainers’ acceptance of AI-generated comments; guardrails to prevent hallucinations and irrelevant suggestions.
- PR summarization and checklist generation (Software, DevOps, Education)
- Use case: Automated summaries of diffs and a checklist of reviewer concerns, reducing cognitive load and speeding first-response times.
- Potential tools: “PR Summary Bot” that produces rationale, risks, test impacts, and documentation changes; integration with tools like Carllm for comprehensibility (a minimal summarization sketch appears after this list).
- Assumptions/dependencies: Sufficient context window; consistent codebase conventions; privacy-safe handling of code snippets.
- Documentation assistant during reviews (Software, OSS communities)
- Use case: Automate docstring updates, README changes, and API docs alignment when code changes are suggested.
- Potential tools: “AI Doc Updater” that proposes synchronized documentation edits with code diffs.
- Assumptions/dependencies: Human review required for accuracy; project-specific documentation formats supported.
- Review bottleneck dashboards and SLAs (DevOps, Engineering Management)
- Use case: Instrumentation to track PR phase timings, flag bottlenecks, and set internal SLAs for first review and change iteration.
- Potential tools: Dashboards using the GitHub API to show time-to-first-review and waiting time before change; alerts for stalled PRs (a timing-extraction sketch appears after this list).
- Assumptions/dependencies: Correct phase mapping; stable event timelines; team buy-in for SLA enforcement.
- AI-use disclosure and audit trails in PRs (Policy, Compliance; applicable in Healthcare, Finance, Energy)
- Use case: Require a PR template checkbox or tag indicating where GPT assistance was used (review vs. implementation), enabling auditability in regulated environments.
- Potential tools: PR templates with “AI-assisted” tags; storage of AI suggestion diffs; on-prem or private LLMs for sensitive code (a disclosure-check sketch appears after this list).
- Assumptions/dependencies: Organizational policies; secure data handling; license and IP reviews for AI-generated code; model deployment constraints (on-prem vs. cloud).
- Developer training modules on AI-assisted code review (Academia, Corporate L&D)
- Use case: Courses and workshops that teach effective prompting, phase-specific use, and human-in-the-loop review patterns.
- Potential tools: Curricula using the paper’s public dataset and scripts; exercises focusing on refactoring and error-handling enhancements.
- Assumptions/dependencies: Updated teaching materials; availability of sandbox repos; instructor familiarity with LLM tooling.
- Freelancer and small-team workflows (Daily life, Software)
- Use case: Individual developers adopt GPT to pre-review PRs, add missing error handling, and prepare documentation, shortening the time to reviewer acceptance.
- Potential tools: Local scripts or VS Code extensions integrated with PR platforms; lightweight “AI pre-review” checklist.
- Assumptions/dependencies: LLM access; cost constraints; disciplined application of AI suggestions; peer review still required for quality assurance.
- OSS contribution guidelines with AI guardrails (Policy, OSS)
- Use case: Project maintainers publish guidelines for acceptable AI-assisted review contributions and mandate human verification.
- Potential tools: CONTRIBUTING.md updates; automated linting to detect trivial or irrelevant AI comments; enforcement of disclosure tags.
- Assumptions/dependencies: Community consensus; moderation capacity; safeguards against spammy or low-signal AI feedback.
- Sector-specific adoption with guardrails (Healthcare, Finance, Energy, Robotics)
- Use case: Teams building regulated or safety-critical software use LLMs for review-phase refactoring and documentation—never to bypass human approval.
- Potential tools: On-prem LLMs with strict access controls; integration with static analysis tools to cross-check AI suggestions.
- Assumptions/dependencies: Regulatory compliance; robust audit logs; integration with existing secure CI/CD pipelines; risk assessment frameworks.
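For the PR summarization and checklist item above, a minimal sketch is shown below. It assumes the openai Python client (v1+) with an OPENAI_API_KEY available in the environment; the model name, prompt wording, and diff-truncation limit are illustrative choices, not tooling from the paper.

```python
# Minimal sketch: summarize a PR diff and draft a reviewer checklist.
# Assumes the openai Python client (>=1.0) and OPENAI_API_KEY in the
# environment; model name and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Summarize this pull request diff for a reviewer. List: "
    "(1) intent, (2) risks, (3) test impact, (4) documentation changes.\n\n"
)

def summarize_diff(diff_text: str, model: str = "gpt-4o-mini") -> str:
    # Truncate very large diffs to stay within the model's context window.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT + diff_text[:40_000]}],
    )
    return resp.choices[0].message.content

# The returned text would be posted as a PR comment via the platform API,
# gated behind human review before any suggestion is acted on.
```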
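For the review bottleneck dashboard item, the following sketch pulls two phase timings per PR from the GitHub REST API; the token handling, example repository, and choice of metrics are assumptions for illustration.

```python
# Minimal sketch: pull per-PR phase timings from the GitHub REST API.
# Assumes a token in the GITHUB_TOKEN environment variable; the example
# repository and the two reported metrics are illustrative.
import os
from datetime import datetime

import requests

API = "https://api.github.com"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def parse_ts(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

def phase_timings(owner: str, repo: str, number: int) -> dict:
    base = f"{API}/repos/{owner}/{repo}/pulls/{number}"
    pr = requests.get(base, headers=HEADERS, timeout=30).json()
    reviews = requests.get(f"{base}/reviews", headers=HEADERS, timeout=30).json()

    created = parse_ts(pr["created_at"])
    closed = parse_ts(pr["closed_at"]) if pr.get("closed_at") else None
    first_review = min(
        (parse_ts(r["submitted_at"]) for r in reviews if r.get("submitted_at")),
        default=None,
    )
    return {
        "time_to_first_review_h": (
            (first_review - created).total_seconds() / 3600 if first_review else None
        ),
        "resolution_time_h": (
            (closed - created).total_seconds() / 3600 if closed else None
        ),
    }

if __name__ == "__main__":
    # Hypothetical repository and PR number for illustration.
    print(phase_timings("octocat", "hello-world", 1))
```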
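For the AI-use disclosure item, a minimal CI-side check is sketched below; the "AI-assisted:" tag convention and exit-code behavior are hypothetical and would need to match the project's own PR template.

```python
# Minimal sketch: CI check that a PR body carries an AI-use disclosure.
# The "AI-assisted:" tag convention is hypothetical; align it with the
# project's PR template and CONTRIBUTING.md policy.
import re
import sys

TAG = re.compile(r"^AI-assisted:\s*(yes|no)\b", re.IGNORECASE | re.MULTILINE)

def has_disclosure(pr_body: str) -> bool:
    return bool(TAG.search(pr_body or ""))

if __name__ == "__main__":
    body = sys.stdin.read()  # e.g. piped from the PR description in CI
    if not has_disclosure(body):
        print("Missing 'AI-assisted: yes/no' disclosure tag in the PR body.")
        sys.exit(1)
```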
Long-Term Applications
These opportunities require further research, scaling, integration work, and/or standardization before broad deployment.
- Phase-aware autonomous review agents (Software, DevOps; cross-sector)
- Use case: Agents that orchestrate PR workflows—prioritize reviews, propose changes, track waiting states, and adapt prompts based on phase and project history.
- Potential tools: “PR Phase Orchestrator” that blends LLMs with static analysis and test coverage signals; dynamic assignment of reviewers to reduce latency.
- Assumptions/dependencies: Reliable context retrieval across large repos; mitigation of the irrelevant suggestions highlighted by prior industry studies; human governance; robust evaluation against quality, not just speed.
- Comprehensive quality impact studies and benchmarks (Academia, Industry R&D)
- Use case: Randomized trials measuring defect rates, post-merge incidents, reviewer workload, and acceptance quality—beyond time reductions.
- Potential tools: Public benchmarks for PR-phase assistance; shared datasets annotating AI vs. human suggestions; reproducible experimental pipelines (a statistical-comparison sketch appears after this list).
- Assumptions/dependencies: Access to diverse repositories; agreed-upon quality metrics; IRB/ethics for developer studies.
- Predictive review-time estimation and scheduling (Engineering Management, DevOps)
- Use case: Models predicting time-to-first-review and waiting time that dynamically route PRs to available reviewers and trigger AI triage.
- Potential tools: Integrations combining latency-prediction models with LLM-assisted triage; SLA-backed queue management (a baseline prediction sketch appears after this list).
- Assumptions/dependencies: Accurate historical telemetry; fair load balancing; buy-in from teams; alignment with working hours and reviewer expertise.
- Standardization and regulation of AI-in-the-loop code review (Policy, Compliance; highly relevant to Healthcare, Finance, Energy)
- Use case: Industry standards for disclosing AI assistance, retaining audit trails, and controlling data flows for sensitive code.
- Potential tools: Compliance frameworks and certification programs; standardized PR metadata for AI usage; secure model hosting guidelines.
- Assumptions/dependencies: Multi-stakeholder consensus; evolving legal landscape; enforceability within diverse toolchains.
- Secure, domain-adapted LLMs and retrieval (Healthcare, Finance, Energy, Robotics)
- Use case: On-prem, domain-specialized models that understand project-specific APIs and constraints, reducing hallucinations and enhancing relevance.
- Potential tools: Fine-tuned models with retrieval-augmented generation (RAG) over internal codebases; policy-driven prompt filters (a retrieval sketch appears after this list).
- Assumptions/dependencies: High-quality domain corpora; compute budgets; MLOps maturity; ongoing evaluation against security/privacy requirements.
- Workforce evolution: the “AI pair reviewer” role (Industry)
- Use case: Dedicated roles or responsibilities to curate AI prompts, validate suggestions, and ensure consistency with coding standards and security practices.
- Potential tools: Role definitions, performance metrics, and guidelines that institutionalize human-AI collaboration in reviews.
- Assumptions/dependencies: Training programs; acceptance by engineering leadership; clear accountability boundaries.
- Marketplace of review policies and prompt packs (Software ecosystem)
- Use case: Shareable, project-specific review policies and prompt templates that encode best practices (error handling, refactoring, documentation consistency).
- Potential tools: Registries of prompt packs; policy engines that enforce phase-specific gates; integrations with linters and SAST tools.
- Assumptions/dependencies: Interoperability across platforms; maintenance of prompt packs; governance to avoid drift or bias.
- Multi-metric optimization for PRs (Industry R&D)
- Use case: Systems that optimize both speed and quality—balancing time-to-merge against defect density, test coverage changes, and reviewer satisfaction.
- Potential tools: Reinforcement learning or multi-objective optimization atop PR telemetry; feedback loops that adjust AI behavior.
- Assumptions/dependencies: Reliable, multi-dimensional telemetry; careful objective design to prevent gaming; ethical considerations around developer monitoring.
- Education: longitudinal curricula and capstones (Academia)
- Use case: Programs that develop expertise in building, evaluating, and governing AI-assisted review systems, including ethics and compliance.
- Potential tools: Capstone projects using the released dataset and scripts; partnerships with industry to test tools on real repositories.
- Assumptions/dependencies: Stable access to datasets; institutional support; collaboration with open-source communities.
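For the quality impact studies item above, the sketch below applies the Mann-Whitney U comparison the paper uses for resolution times to any per-PR metric; the CSV layout and column names are hypothetical.

```python
# Minimal sketch: compare a per-PR metric between GPT-assisted and
# non-assisted PRs with the Mann-Whitney U test, as in the paper.
# The CSV file and column names are hypothetical.
import pandas as pd
from scipy.stats import mannwhitneyu

df = pd.read_csv("labeled_prs.csv")  # one row per PR, with a boolean label
assisted = df.loc[df["gpt_assisted"], "resolution_time_h"]
control = df.loc[~df["gpt_assisted"], "resolution_time_h"]

stat, p_value = mannwhitneyu(assisted, control, alternative="two-sided")
print(
    f"median assisted={assisted.median():.1f}h, control={control.median():.1f}h, "
    f"U={stat:.0f}, p={p_value:.4f}"
)
```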
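For the predictive review-time estimation item, a minimal baseline is sketched below with scikit-learn; the telemetry export and feature names are hypothetical, and a production model would add reviewer-load, calendar, and file-type signals.

```python
# Minimal sketch: predict time-to-first-review from simple PR features.
# The telemetry export and feature names are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("pr_telemetry.csv")
features = ["files_changed", "lines_added", "lines_deleted", "author_prior_prs"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["time_to_first_review_h"], test_size=0.2, random_state=42
)

model = GradientBoostingRegressor().fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"MAE: {mae:.1f} hours")  # estimates can drive routing and AI triage
```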
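For the secure, domain-adapted retrieval item, the sketch below uses a plain TF-IDF index as a stand-in for embedding-based RAG; the source directory, file glob, and example query are illustrative.

```python
# Minimal sketch: retrieve project-specific context for a review prompt.
# Uses a TF-IDF index as a stand-in for embedding-based RAG; the source
# directory, file glob, and query are illustrative.
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = {p: p.read_text(errors="ignore") for p in Path("src").rglob("*.py")}
paths, texts = list(docs), list(docs.values())

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(texts)

def top_context(query: str, k: int = 3):
    """Return the k most relevant files to splice into the review prompt."""
    scores = cosine_similarity(vectorizer.transform([query]), matrix)[0]
    return sorted(zip(paths, scores), key=lambda x: x[1], reverse=True)[:k]

# Example: top_context("retry logic in the payments API client")
```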
Notes on feasibility and external validity
- The study’s identification of GPT-assisted PRs relies on heuristics with a reported labeling accuracy of ≈95%; generalization beyond GitHub and beyond GPT to other LLMs requires validation (a sketch of the keyword/regex pass appears after these notes).
- Time reductions are correlational; deeper causality and quality outcomes (defect rates, maintainability) need controlled follow-up studies.
- Prior work observed occasional irrelevant suggestions and longer closure times with automated review tools, underscoring the need for human oversight and context-aware integration.
- Security, privacy, licensing, and IP concerns are critical in regulated sectors; on-prem deployments and strict audit trails may be necessary.
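As a reference point for the first note above, a keyword-plus-regex pass of the kind the paper describes might look like the following sketch; the keyword list and pattern are illustrative, and the paper's pipeline adds manual verification on top of the automated pass.

```python
# Minimal sketch of a keyword-plus-regex pass for flagging candidate
# GPT-assisted PRs from titles, descriptions, and comments. The keyword
# list and pattern are illustrative; flagged PRs still need manual review.
import re

KEYWORDS = ("chatgpt", "gpt-4", "gpt-3.5", "openai", "generated with gpt")
PATTERN = re.compile(r"\b(chat\s*gpt|gpt[- ]?[34](\.5)?|openai)\b", re.IGNORECASE)

def looks_gpt_assisted(text: str) -> bool:
    lowered = text.lower()
    return any(k in lowered for k in KEYWORDS) or bool(PATTERN.search(text))

# Candidates flagged here would go to manual verification before being
# labeled, mirroring the semi-automated approach described in the paper.
```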