
The Impact of Large Language Models (LLMs) on Code Review Process

Published 14 Aug 2025 in cs.SE (arXiv:2508.11034v2)

Abstract: LLMs have recently gained prominence in the field of software development, significantly boosting productivity and simplifying teamwork. Although prior studies have examined task-specific applications, the phase-specific effects of LLM assistance on the efficiency of code review processes remain underexplored. This research investigates the effect of GPT on GitHub pull request (PR) workflows, with a focus on reducing resolution time, optimizing phase-specific performance, and assisting developers. We curated a dataset of 25,473 PRs from 9,254 GitHub projects and identified GPT-assisted PRs using a semi-automated heuristic approach that combines keyword-based detection, regular expression filtering, and manual verification until achieving 95% labeling accuracy. We then applied statistical modeling, including multiple linear regression and Mann-Whitney U test, to evaluate differences between GPT-assisted and non-assisted PRs, both at the overall resolution level and across distinct review phases. Our research has revealed that early adoption of GPT can substantially boost the effectiveness of the PR process, leading to considerable time savings at various stages. Our findings suggest that GPT-assisted PRs reduced median resolution time by more than 60% (9 hours compared to 23 hours for non-assisted PRs). We discovered that utilizing GPT can reduce the review time by 33% and the waiting time before acceptance by 87%. Analyzing a sample dataset of 300 GPT-assisted PRs, we discovered that developers predominantly use GPT for code optimization (60%), bug fixing (26%), and documentation updates (12%). This research sheds light on the impact of the GPT model on the code review process, offering actionable insights for software teams seeking to enhance workflows and promote seamless collaboration.

Summary

  • The paper demonstrates that GPT assistance reduces median PR resolution time by roughly 61% and waiting time before acceptance by up to 87.5%, streamlining code review efficiency.
  • It employs a robust methodology with 25,473 GitHub PRs and statistical models to validate improvements in review, merge, and waiting phases.
  • GPT is primarily used for code enhancement and bug fixing, underscoring its role in optimizing collaboration and code quality.

The Impact of LLMs on Code Review Process

This study explores the integration of LLMs, particularly GPT, into the code review process within collaborative software development. By analyzing a substantial dataset of GitHub pull requests (PRs), the research examines the phase-specific influence of GPT assistance on PR efficiency, resolution times, and developer collaboration dynamics.

Dataset Curation and Methodology

The study begins with the curation of a dataset encompassing 25,473 PRs across 9,254 GitHub projects. A combination of keyword-based detection, regular-expression filtering, and manual verification ensures 95% accuracy in distinguishing GPT-assisted PRs from non-assisted ones. The heuristics examine project names, PR titles, file modifications, and PR body content to filter the dataset effectively, and they distinguish PRs that merely integrate GPT into a project from those in which GPT actually assists the code review.
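
The paper's exact heuristics are not reproduced here; the following is a minimal sketch, assuming hypothetical keyword and regex patterns and field names, of how such a keyword- and regex-based pre-filter might look before manual verification.

```python
import re

# Hypothetical keyword/regex patterns; the paper's actual heuristics combine
# project names, PR titles, changed files, and PR body content, followed by
# manual verification until ~95% labeling accuracy is reached.
GPT_PATTERNS = [
    re.compile(r"\bchat\s*gpt\b", re.IGNORECASE),
    re.compile(r"\bgpt[- ]?(3(\.5)?|4o?)\b", re.IGNORECASE),
    re.compile(r"\b(generated|suggested|reviewed)\s+(with|by)\s+(chatgpt|gpt)\b", re.IGNORECASE),
]

# Patterns hinting that the PR integrates GPT into the project (API client,
# GPT-powered feature) rather than using GPT to assist the review itself.
INTEGRATION_PATTERNS = [
    re.compile(r"\bopenai\s+(api|sdk|client)\b", re.IGNORECASE),
    re.compile(r"\bgpt\s+(integration|endpoint|wrapper)\b", re.IGNORECASE),
]

def candidate_label(pr: dict) -> str:
    """Coarsely label one PR record; ambiguous cases go to manual verification."""
    text = " ".join([pr.get("title", ""), pr.get("body", ""), " ".join(pr.get("files", []))])
    if any(p.search(text) for p in INTEGRATION_PATTERNS):
        return "likely-integration"       # GPT as a product feature, not review assistance
    if any(p.search(text) for p in GPT_PATTERNS):
        return "candidate-gpt-assisted"   # forward to manual verification
    return "non-assisted"

# Toy usage
print(candidate_label({"title": "Refactor parser as suggested by ChatGPT", "body": "", "files": []}))
```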

Phase-Specific Effects and Statistical Modeling

The core investigation centers on the PR lifecycle phases: review, waiting before change, change, and waiting after acceptance. GPT assistance is associated with substantially shorter merge times, and statistical models (multiple linear regression and Mann-Whitney U tests) confirm these observations. Notably, GPT assistance reduces median resolution time by 61%, review time by 66.7%, and waiting time before acceptance by 87.5% compared to non-assisted PRs. Logarithmic transformations and robust statistical measures underpin these findings, supporting the conclusion that LLMs streamline the code review process.
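
The paper's exact model specification and covariates are not reproduced here; the following is a minimal, self-contained sketch of the kind of comparison described above, using toy lognormal data whose medians roughly match the reported 9-hour and 23-hour resolution times.

```python
import numpy as np
from scipy.stats import mannwhitneyu
import statsmodels.api as sm

# Toy placeholder data: resolution times in hours for GPT-assisted vs. non-assisted PRs.
rng = np.random.default_rng(0)
assisted = rng.lognormal(mean=2.2, sigma=1.0, size=500)      # median ~9 h
non_assisted = rng.lognormal(mean=3.1, sigma=1.0, size=500)  # median ~23 h

# Non-parametric comparison of the two distributions (the paper's Mann-Whitney U test).
u_stat, p_value = mannwhitneyu(assisted, non_assisted, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.3g}")
print(f"Median reduction: {1 - np.median(assisted) / np.median(non_assisted):.0%}")

# Log-transformed linear regression with a GPT-assistance indicator; the paper's
# multiple linear regression includes additional covariates not modeled here.
y = np.log(np.concatenate([assisted, non_assisted]))
gpt_flag = np.concatenate([np.ones_like(assisted), np.zeros_like(non_assisted)])
X = sm.add_constant(gpt_flag)
model = sm.OLS(y, X).fit()
print(model.summary().tables[1])
```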

Developer Use of GPT Across PR Phases

Through an analysis of 310 GPT-assisted PRs, the study categorizes the primary tasks where GPT models contribute:

  • Enhancement: 60.26% of activities in the review phase involve code optimization, error handling, and performance improvements.
  • Bug Fixing: Accounts for 25.64% of review-phase tasks, where GPT aids in diagnosing and resolving code errors.
  • Documentation: Occurs in 11.54% of PRs, where GPT assists in creating or refining both new and existing documentation.
  • Implementation and Testing: Less frequent, highlighting GPT's support role rather than primary generation of new code components.

The systematic application of LLMs to these tasks underscores their utility in enhancing code quality and review efficiency, particularly in time-intensive phases such as review and waiting before change.

Implications and Future Directions

For Practitioners

The results advocate for the integration of LLMs like GPT into software development workflows to improve collaboration and efficiency. By standardizing practices for GPT use in specific scenarios—such as code refactoring and documentation—teams can maximize productivity and maintain high-quality outcomes.

For Researchers

The study prompts further research on balancing GPT reliance and human judgment in code review. Exploring frameworks that ensure effective GPT integration and minimize unproductive interactions could bolster AI-driven development processes. Future studies might focus on long-term productivity impacts and develop metrics to assess LLM effectiveness in varied software engineering environments.

Conclusion

The investigation into LLM-driven code reviews reveals significant efficiencies in PR processes, particularly in reducing review and waiting times. Although still limited in some early and late lifecycle phases, GPT's contributions substantiate its role as an integral tool in software engineering, fostering improved collaboration, code quality, and development workflows. As LLMs advance, they are poised to drive innovation and efficiency across software development landscapes.

Practical Applications

Immediate Applications

The following items can be implemented with current tools and workflows to improve pull-request (PR) reviews, based on the paper’s empirical findings of reduced resolution time (≈61%), review time (≈66.7%), and waiting time before acceptance (≈87.5%) in GPT-assisted PRs.

  • Phase-aware LLM reviewer for GitHub/GitLab (Software, DevOps)
    • Use case: A GitHub Action or GitLab CI job that triggers on PR submission and first review to generate structured, phase-specific assistance: code optimization suggestions, bug-fix hints, and documentation updates.
    • Potential tools/products/workflows:
      • “AI Review Triage” GitHub Action that posts initial comments within 1 hour (targeting the At Review phase).
      • “Waiting-time reducer” bot that proposes actionable changes during iterative review cycles (At Waiting before Change phase).
      • Prompt libraries tailored to refactoring, error-handling insertion, and doc updates.
    • Assumptions/dependencies: Reliable access to LLM APIs; repository context retrieval; human-in-the-loop gating; adherence to project coding standards; maintainers’ acceptance of AI-generated comments; guardrails to prevent hallucinations and irrelevant suggestions.
  • PR summarization and checklist generation (Software, DevOps, Education)
    • Use case: Automated summaries of diffs and a checklist of reviewer concerns, reducing cognitive load and speeding first-response times.
    • Potential tools: “PR Summary Bot” that produces rationale, risks, test impacts, and documentation changes; integration with tools like Carllm for comprehensibility.
    • Assumptions/dependencies: Sufficient context window; consistent codebase conventions; privacy-safe handling of code snippets.
  • Documentation assistant during reviews (Software, OSS communities)
    • Use case: Automate docstring updates, README changes, and API docs alignment when code changes are suggested.
    • Potential tools: “AI Doc Updater” that proposes synchronized documentation edits with code diffs.
    • Assumptions/dependencies: Human review required for accuracy; project-specific documentation formats supported.
  • Review bottleneck dashboards and SLAs (DevOps, Engineering Management)
    • Use case: Instrumentation to track PR phase timings, flag bottlenecks, and set internal SLAs for first review and change iteration.
    • Potential tools: Dashboards using the GitHub API to show time-to-first-review and waiting time before change; alerts for stalled PRs (a minimal sketch of the underlying timing computation appears after this list).
    • Assumptions/dependencies: Correct phase mapping; stable event timelines; team buy-in for SLA enforcement.
  • AI-use disclosure and audit trails in PRs (Policy, Compliance; applicable in Healthcare, Finance, Energy)
    • Use case: Require a PR template checkbox or tag indicating where GPT assistance was used (review vs. implementation), enabling auditability in regulated environments.
    • Potential tools: PR templates with “AI-assisted” tags; storage of AI suggestion diffs; on-prem or private LLMs for sensitive code.
    • Assumptions/dependencies: Organizational policies; secure data handling; license and IP reviews for AI-generated code; model deployment constraints (on-prem vs. cloud).
  • Developer training modules on AI-assisted code review (Academia, Corporate L&D)
    • Use case: Courses and workshops that teach effective prompting, phase-specific use, and human-in-the-loop review patterns.
    • Potential tools: Curricula using the paper’s public dataset and scripts; exercises focusing on refactoring and error-handling enhancements.
    • Assumptions/dependencies: Updated teaching materials; availability of sandbox repos; instructor familiarity with LLM tooling.
  • Freelancer and small-team workflows (Daily life, Software)
    • Use case: Individual developers adopt GPT to pre-review PRs, add missing error handling, and prepare documentation, shortening the time to reviewer acceptance.
    • Potential tools: Local scripts or VS Code extensions integrated with PR platforms; lightweight “AI pre-review” checklist.
    • Assumptions/dependencies: LLM access; cost constraints; consistent application of results; peer review still required for quality assurance.
  • OSS contribution guidelines with AI guardrails (Policy, OSS)
    • Use case: Project maintainers publish guidelines for acceptable AI-assisted review contributions and mandate human verification.
    • Potential tools: CONTRIBUTING.md updates; automated linting to detect trivial/irrelevant AI comments; enforce disclosure tags.
    • Assumptions/dependencies: Community consensus; moderation capacity; safeguards against spammy or low-signal AI feedback.
  • Sector-specific adoption with guardrails (Healthcare, Finance, Energy, Robotics)
    • Use case: Teams building regulated or safety-critical software use LLMs for review-phase refactoring and documentation—never to bypass human approval.
    • Potential tools: On-prem LLMs with strict access controls; integration with static analysis tools to cross-check AI suggestions.
    • Assumptions/dependencies: Regulatory compliance; robust audit logs; integration with existing secure CI/CD pipelines; risk assessment frameworks.
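
Relating to the “Review bottleneck dashboards and SLAs” item above, the following is a minimal sketch of computing time-to-first-review and total resolution time for a single PR via the public GitHub REST API using the requests library. The endpoints are standard GitHub API v3 routes; the repository, token, and phase definitions are hypothetical, and a real dashboard would also need pagination, error handling, and the paper's full phase mapping.

```python
from datetime import datetime
import requests

GITHUB_API = "https://api.github.com"

def _parse(ts: str) -> datetime:
    """GitHub timestamps are ISO 8601 with a trailing 'Z'."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def pr_phase_metrics(owner: str, repo: str, number: int, token: str) -> dict:
    """Compute simple phase timings (in hours) for one pull request."""
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/vnd.github+json"}
    pr = requests.get(f"{GITHUB_API}/repos/{owner}/{repo}/pulls/{number}", headers=headers).json()
    reviews = requests.get(
        f"{GITHUB_API}/repos/{owner}/{repo}/pulls/{number}/reviews", headers=headers
    ).json()

    created = _parse(pr["created_at"])
    closed = _parse(pr["closed_at"]) if pr.get("closed_at") else None
    first_review = min(
        (_parse(r["submitted_at"]) for r in reviews if r.get("submitted_at")), default=None
    )

    def hours(delta):
        return delta.total_seconds() / 3600

    return {
        "time_to_first_review_h": hours(first_review - created) if first_review else None,
        "resolution_time_h": hours(closed - created) if closed else None,
    }

# Example usage (hypothetical repository and token):
# print(pr_phase_metrics("octocat", "hello-world", 42, token="ghp_..."))
```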

Long-Term Applications

These opportunities require further research, scaling, integration work, and/or standardization before broad deployment.

  • Phase-aware autonomous review agents (Software, DevOps; cross-sector)
    • Use case: Agents that orchestrate PR workflows—prioritize reviews, propose changes, track waiting states, and adapt prompts based on phase and project history.
    • Potential tools: “PR Phase Orchestrator” that blends LLMs with static analysis and test coverage signals; dynamic assignment of reviewers to reduce latency.
    • Assumptions/dependencies: Reliable context retrieval across large repos; reducing irrelevant suggestions highlighted by prior industry studies; human governance; robust evaluation against quality, not just speed.
  • Comprehensive quality impact studies and benchmarks (Academia, Industry R&D)
    • Use case: Randomized trials measuring defect rates, post-merge incidents, reviewer workload, and acceptance quality—beyond time reductions.
    • Potential tools: Public benchmarks for PR-phase assistance; shared datasets annotating AI vs. human suggestions; reproducible experimental pipelines.
    • Assumptions/dependencies: Access to diverse repositories; agreed-upon quality metrics; IRB/ethics for developer studies.
  • Predictive review-time estimation and scheduling (Engineering Management, DevOps)
    • Use case: Models predicting time-to-first-review and waiting time that dynamically route PRs to available reviewers and trigger AI triage.
    • Potential tools: Integrations combining latency-prediction models with LLM-assisted triage; SLA-backed queue management (a baseline prediction sketch appears after this list).
    • Assumptions/dependencies: Accurate historical telemetry; fair load balancing; buy-in from teams; alignment with working hours and reviewer expertise.
  • Standardization and regulation of AI-in-the-loop code review (Policy, Compliance; highly relevant to Healthcare, Finance, Energy)
    • Use case: Industry standards for disclosing AI assistance, retaining audit trails, and controlling data flows for sensitive code.
    • Potential tools: Compliance frameworks and certification programs; standardized PR metadata for AI usage; secure model hosting guidelines.
    • Assumptions/dependencies: Multi-stakeholder consensus; evolving legal landscape; enforceability within diverse toolchains.
  • Secure, domain-adapted LLMs and retrieval (Healthcare, Finance, Energy, Robotics)
    • Use case: On-prem, domain-specialized models that understand project-specific APIs and constraints, reducing hallucinations and enhancing relevance.
    • Potential tools: Fine-tuned models with retrieval augmented generation (RAG) over internal codebases; policy-driven prompt filters.
    • Assumptions/dependencies: High-quality domain corpora; compute budgets; MLOps maturity; ongoing evaluation against security/privacy requirements.
  • Workforce evolution: the “AI pair reviewer” role (Industry)
    • Use case: Dedicated roles or responsibilities to curate AI prompts, validate suggestions, and ensure consistency with coding standards and security practices.
    • Potential tools: Role definitions, performance metrics, and guidelines that institutionalize human-AI collaboration in reviews.
    • Assumptions/dependencies: Training programs; acceptance by engineering leadership; clear accountability boundaries.
  • Marketplace of review policies and prompt packs (Software ecosystem)
    • Use case: Shareable, project-specific review policies and prompt templates that encode best practices (error handling, refactoring, documentation consistency).
    • Potential tools: Registries of prompt packs; policy engines that enforce phase-specific gates; integrations with linters and SAST tools.
    • Assumptions/dependencies: Interoperability across platforms; maintenance of prompt packs; governance to avoid drift or bias.
  • Multi-metric optimization for PRs (Industry R&D)
    • Use case: Systems that optimize both speed and quality—balancing time-to-merge against defect density, test coverage changes, and reviewer satisfaction.
    • Potential tools: Reinforcement learning or multi-objective optimization atop PR telemetry; feedback loops that adjust AI behavior.
    • Assumptions/dependencies: Reliable, multi-dimensional telemetry; careful objective design to prevent gaming; ethical considerations around developer monitoring.
  • Education: longitudinal curricula and capstones (Academia)
    • Use case: Programs that develop expertise in building, evaluating, and governing AI-assisted review systems, including ethics and compliance.
    • Potential tools: Capstone projects using the released dataset and scripts; partnerships with industry to test tools on real repositories.
    • Assumptions/dependencies: Stable access to datasets; institutional support; collaboration with open-source communities.
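
Relating to the “Predictive review-time estimation and scheduling” item above, the following is a minimal sketch, assuming a hypothetical feature table of historical PR telemetry, of fitting a baseline model to predict time-to-first-review. The feature set, model choice, and synthetic data are illustrative only and are not taken from the paper.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical historical PR telemetry; real features would come from a team's own tooling.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "lines_changed": rng.integers(5, 2000, size=n),
    "files_changed": rng.integers(1, 40, size=n),
    "author_prior_prs": rng.integers(0, 300, size=n),
    "gpt_assisted": rng.integers(0, 2, size=n),
    "time_to_first_review_h": rng.lognormal(2.5, 1.0, size=n),  # skewed, like real review latencies
})

X = df.drop(columns="time_to_first_review_h")
y = np.log1p(df["time_to_first_review_h"])  # log-transform the skewed target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = GradientBoostingRegressor().fit(X_train, y_train)

pred_hours = np.expm1(model.predict(X_test))
print(f"MAE on held-out PRs: {mean_absolute_error(np.expm1(y_test), pred_hours):.1f} hours")
```

Predicted latencies from such a model could feed the reviewer-routing and AI-triage triggers described above, but the objective design and fairness of any routing policy would need separate evaluation.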

Notes on feasibility and external validity

  • The study’s identification of GPT-assisted PRs relies on heuristics with reported labeling accuracy of ≈95%; generalization beyond GitHub and beyond GPT to other LLMs requires validation.
  • Time reductions are correlational; deeper causality and quality outcomes (defect rates, maintainability) need controlled follow-up studies.
  • Prior work observed occasional irrelevant suggestions and longer closure times with automated review tools, underscoring the need for human oversight and context-aware integration.
  • Security, privacy, licensing, and IP concerns are critical in regulated sectors; on-prem deployments and strict audit trails may be necessary.
