Evolution of Code Review

Updated 4 July 2026

The evolution of code review is a historical progression from ad hoc, verbal inspections to structured, tool-mediated reviews that support teamwork and quality.
Modern code review serves as a quality gate, balancing defect detection, knowledge sharing, and process coordination through pull-request workflows and integrated tools.
Recent advancements leverage learning-based automation and LLMs to generate review comments and refine code, raising new challenges in accountability and trust.

Code review is the inspection of proposed source-code changes before integration into a shared code base. In the literature, it is characterized both as a white-box testing technique and, in its modern form, as a lightweight, change-based, tool-mediated, asynchronous practice whose goals extend beyond fault finding to include knowledge sharing, onboarding, and maintaining collective code ownership (Siddique, 2021; Badampudi et al., 2023). Its evolution spans informal peer checking, Fagan-style formal inspection, pull-request-centered workflows, and increasingly hybrid human–AI arrangements that reposition review within broader CI/CD and agentic software-engineering pipelines (Kamalı et al., 17 May 2026).

1. From informal inspection to pull-request workflows

A recurrent historical framing divides code review into five eras: an Ad Hoc Era in the late 1940s–1960s, a Formal Inspection Era in the 1970s–early 1990s, a Lightweight Peer Review Era in the mid-1990s–mid-2000s, an Integrated Code Review Era in the late 2000s–mid-2010s, and an Automation-Assisted Era from the early 2010s to the present (Kamalı et al., 17 May 2026). The transition point most often singled out is Michael Fagan’s 1976 inspection process at IBM, which introduced explicit roles and a six-step process—planning, overview, preparation, inspection meeting, rework, and follow-up—and could cost up to 3% of total project effort (Monperrus, 11 Jun 2026).

Era	Period	Dominant review form
Ad Hoc Era	late 1940s–1960s	verbal or hand-written marginal notes
Formal Inspection Era	1970s–early 1990s	synchronous, checklist-driven inspection
Lightweight Peer Review Era	mid-1990s–mid-2000s	asynchronous email-based patch review
Integrated Code Review Era	late 2000s–mid-2010s	web UIs, inline comments, merge gating
Automation-Assisted Era	early 2010s–present	static analysis and CI/CD as co-reviewers

Over time, formal inspection loosened into lighter-weight, asynchronous review of diffs in e-mail threads or web interfaces, culminating in pull-request-style workflows documented in industrial settings such as Microsoft and Google (Monperrus, 11 Jun 2026). Systematic mapping work on 177 papers from 2005 to 2018 identifies this transition as the emergence of “Modern Code Review” (MCR), distinguished from traditional inspection by being lightweight, change-based, tool-mediated, and asynchronous (Badampudi et al., 2023).

The significance of this transition is not merely procedural. Traditional inspections emphasized formal roles, synchronous meetings, and exhaustive logging, whereas MCR relocated review into everyday development flow through platforms such as Gerrit, GitHub Pull Requests, Phabricator, and Bitbucket (Kamalı et al., 17 May 2026). This suggests that the evolution of code review is inseparable from the evolution of software process itself: review moved from a periodic auditing mechanism to a continuous coordination layer embedded in collaborative development.

2. Code review as a quality gate and socio-technical process

Modern code review is commonly described as a multi-step negotiation. In a typical Gerrit-style workflow, an author pushes a patchset, reviewers inspect the change and supply comments, the author submits revised patchsets, and the request is eventually either merged or abandoned (Islam et al., 2019). Siddique characterizes the Code Review Process (CRP) as a white-box testing technique in which reviewers inspect source code to find defects, improve design, and share knowledge, while also noting both desired effects—better code quality, defect finding, learning, mutual responsibility, better solutions, and compliance with QA guidelines—and undesired effects such as extra staff effort, increased cycle time, and the potential to offend code authors (Siddique, 2021).

Two empirical strands dominate descriptions of the factors shaping review effectiveness. One emphasizes organizational embedding: long-term success depends on embedding review into standard workflows, and the process is shaped by existing culture, tooling, and team structure (Siddique, 2021). The other emphasizes reviewer- and patch-level determinants: reviewer experience is reported as the strongest positive predictor of review quality, while patch size and the number of code chunks are the next most influential factors on review time and thoroughness; author experience, reviewer workload, participation in discussion, complexity, tool support, and codebase familiarity also matter (Siddique, 2021).

The same process can be analyzed from the perspective of outcomes. Early prediction work on 146,612 review requests from LibreOffice, Eclipse, and GerritHub reports that several iterations often take place before a change is accepted and that around 12% of changes are abandoned, wasting inspection and rework effort (Islam et al., 2019). PredCR, a LightGBM-based classifier using 25 features across reviewer, author, project, text, and code dimensions, achieves an average AUC of around 85% and reaches approximately 99% ER@20% on average, while reviewer-dimension features are identified as the most informative (Islam et al., 2019).
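For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen merged change receives a higher score than a randomly chosen abandoned one. The sketch below computes it from first principles in plain Python. The labels and scores are made-up illustration data, not PredCR outputs, and `auc` is a hypothetical helper name rather than anything from the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half).

    labels: 1 for a merged change, 0 for an abandoned one.
    scores: the classifier's predicted probability of merging.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one merged and one abandoned change")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative data only: 1 = merged, 0 = abandoned.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
print(auc(labels, scores))  # 5 of the 6 merged/abandoned pairs are ordered correctly
```

The quadratic pairwise loop is chosen for readability; a rank-based computation gives the same value more efficiently on large datasets.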

These findings counter a durable misconception: code review is not reducible to a defect-detection microtask. It is simultaneously a quality gate, a coordination mechanism, a training channel, and a workload-allocation problem. A plausible implication is that changes in review practice are driven as much by organizational throughput constraints as by defect-detection efficacy.

3. Conformance, cognition, and communication

Repository-mining studies show that review changes code, not only decisions about code. An OpenStack study of 27,736 reviewed patches mined via the Gerrit API reports that post-review patches have consistently lower cross-entropy than their pre-review counterparts across all three programming languages and every n-gram order, indicating greater conformance to prior accepted coding patterns (Sri-iesaranusorn et al., 2021). In that work, conformance is measured by cross-entropy against an n-gram language model trained on prior accepted patches,

$H = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P_{\text{model}}(w_i \mid w_{i-n+1} \ldots w_{i-1}),$

so lower cross-entropy means that a patch is more predictable under the model of prior accepted code (Sri-iesaranusorn et al., 2021). Accepted patches also exhibit significantly lower cross-entropy than abandoned ones (Sri-iesaranusorn et al., 2021).
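The cross-entropy measure can be illustrated with a minimal sketch. It assumes a plain n-gram model with add-one smoothing over whitespace-style tokens; the study's actual tokenizer, smoothing scheme, and training corpus are not specified here, and the function names and toy patches are hypothetical.

```python
import math
from collections import Counter


def train_ngram(corpus, n):
    """Count n-grams and their (n-1)-token contexts over tokenized patches."""
    ngrams, contexts, vocab = Counter(), Counter(), set()
    for tokens in corpus:
        padded = ["<s>"] * (n - 1) + list(tokens)
        vocab.update(tokens)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts, vocab


def cross_entropy(tokens, model, n):
    """H = -(1/N) * sum_i log2 P(w_i | previous n-1 tokens), add-one smoothed."""
    if not tokens:
        raise ValueError("cannot score an empty token sequence")
    ngrams, contexts, vocab = model
    v = len(vocab) + 1  # +1 reserves probability mass for unseen tokens
    padded = ["<s>"] * (n - 1) + list(tokens)
    total = 0.0
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])
        p = (ngrams[context + (padded[i],)] + 1) / (contexts[context] + v)
        total += math.log2(p)
    return -total / len(tokens)


# Toy "previously accepted" patches (illustrative only).
accepted = [["if", "x", "is", "None", ":"], ["if", "y", "is", "None", ":"]]
model = train_ngram(accepted, n=2)

conforming = ["if", "x", "is", "None", ":"]
unusual = ["if", "None", "==", "x", ":"]
print(cross_entropy(conforming, model, n=2) < cross_entropy(unusual, model, n=2))
```

In this toy setting the patch that follows the established idiom scores lower than the one that does not, which is the sense in which lower cross-entropy is read as greater conformance.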

Recent ethnographic and think-aloud studies extend this picture from artifact change to reviewer cognition. The Code Review as Decision-Making (CRDM) model, derived from 10 participants and 34 review sessions, describes review as a two-phase process: an orientation phase used to establish context and rationale, followed by an analytical phase in which reviewers understand, assess, and plan the rest of the review (Heander et al., 13 Jul 2025). During this process reviewers make decisions about writing comments, finding more information, voting, running the code locally, and verifying CI results (Heander et al., 13 Jul 2025). A related theory-driven study of 25 real PR reviews proposes the Code Review Comprehension Model (CRCM), in which reviewers first perform context-building and then proceed through code inspection using discussion management, code reading, and testing before reaching a decision; reading strategies are opportunistic rather than fixed and include Linear, Difficulty-Based, and Chunking tactics (Gonçalves et al., 27 Mar 2025).

A further extension treats code review as a communication network rather than merely a sequence of pairwise judgments. In that formalization, review discussions form an undirected, time-varying hypergraph $\mathcal{H} = (V,\mathcal{E},\rho,\xi,\psi)$, where each review acts as a reciprocal, concurrent, time-limited communication channel among all participants (Dorner et al., 20 May 2025). In-silico diffusion experiments on Android, Visual Studio Code, React, Microsoft, Spotify, and Trivago show that code review can spread information both widely and quickly, but with an asymmetry: open-source systems spread information faster, whereas closed-source systems reach more participants (Dorner et al., 20 May 2025).

Taken together, these studies reframe review as more than “spot-the-bug.” Code review reshapes patches toward project norms, supports opportunistic code comprehension, and acts as a time-dependent infrastructure for information diffusion. This suggests that any historical account focused only on defect detection misses three central developments: conformance to local patterns, recognition-based decision-making, and communication at organizational scale.

4. From heuristic tooling to generation-based automation

Automation in code review predates LLMs. Earlier tool support included static-analysis systems such as ReviewBot, FindBugs, and linters, which were fast and precise for well-defined patterns but rigid and limited for higher-level suggestions (Siow et al., 2019). Learning-based systems then attempted to infer review behavior directly from historical data. CORE, for example, automates review recommendation using only code changes and corresponding reviews by combining word-level and character-level embeddings with an attentional deep learning model; on 57,260 $\langle \text{code change}, \text{review} \rangle$ pairs from 19 Java projects, it improves Recall@10 from 0.208 to 0.482 and MRR from 0.093 to 0.234 relative to DeepMem (Siow et al., 2019).
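Recall@10 and MRR are standard ranking metrics, and a short sketch may help in reading the figures above. It assumes one correct review per code change and a ranked candidate list per query; CORE's exact evaluation protocol may differ, and the function names and sample data are hypothetical.

```python
def recall_at_k(ranked_lists, relevant_items, k):
    """Fraction of queries whose relevant item appears among the top-k candidates."""
    hits = sum(
        1 for ranked, relevant in zip(ranked_lists, relevant_items)
        if relevant in ranked[:k]
    )
    return hits / len(ranked_lists)


def mean_reciprocal_rank(ranked_lists, relevant_items):
    """Average of 1/rank of the relevant item; a query that misses it contributes 0."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_items):
        if relevant in ranked:
            total += 1.0 / (ranked.index(relevant) + 1)
    return total / len(ranked_lists)


# Three code changes, each with a ranked list of candidate reviews (illustrative).
ranked = [["r1", "r2", "r3"], ["r4", "r5", "r6"], ["r7", "r8", "r9"]]
relevant = ["r1", "r5", "rX"]  # the third query's correct review is never retrieved

print(recall_at_k(ranked, relevant, k=1))
print(recall_at_k(ranked, relevant, k=2))
print(mean_reciprocal_rank(ranked, relevant))
```

Read this way, the reported move from 0.208 to 0.482 in Recall@10 means the correct review appears in the top ten for a bit under half of the changes, roughly 2.3 times as often as with the baseline.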

A parallel line of work targeted review triage and outcome prediction. PredCR forecasts, immediately after the first patchset, whether a change will be merged or abandoned, using features such as avg_reviewer_experience, avg_reviewer_review_count, num_of_reviewers, author_merge_ratio, project_merge_ratio, description_length, lines_added, files_modified, and modify_entropy (Islam et al., 2019). Beyond its average AUC of around 85%, the study reports that performance improves as a software system evolves with new data and that predictions can be updated across multiple revisions, with a relative AUC improvement of up to 15% from the first to the final revision (Islam et al., 2019).

Generation-based automation recast review as a sequence-to-sequence problem. “Towards Automating Code Review Activities” trains two transformer-based models: a contributor-side model that learns $m_s \rightarrow m_r$ and a reviewer-side model that learns $\langle m_s, r_{nl}\rangle \rightarrow m_r$ from 17,194 abstracted triplets (Tufano et al., 2021). On the test set, the contributor-side model yields 271 perfect patches out of approximately 1,720 test methods, about 15.8%, while the reviewer-side model reaches 528 perfect patches, about 30.7% (Tufano et al., 2021). A broader comparative study defines three generation-based tasks—Code Revision Before Review (CRB), Code Revision After Review (CRA), and Review Comment Generation (RCG)—and shows that CodeT5 outperforms the prior state of the art by 13.4%–38.9% in two code revision generation tasks; it also introduces Edit Progress (EP) to capture partial improvement rather than Exact Match alone (Zhou et al., 2023).
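The difference between Exact Match and Edit Progress can be sketched as follows. The sketch assumes that EP is the relative reduction in token-level edit distance to the ground truth achieved by the prediction compared with the original code; this is a reading of the metric's intent, and the paper's exact tokenization and formulation may differ. All function names and the example tokens are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (tokens or characters)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def exact_match(prediction, target):
    """True only when the prediction reproduces the ground truth exactly."""
    return list(prediction) == list(target)


def edit_progress(source, prediction, target):
    """Share of the required edits that the prediction has already made.

    1.0 is a perfect prediction, 0.0 is no progress, and a negative value
    means the prediction moved further away from the ground truth.
    """
    base = edit_distance(source, target)
    if base == 0:
        raise ValueError("source already equals target; progress is undefined")
    return (base - edit_distance(prediction, target)) / base


source = ["x", "=", "a", "+", "b"]      # code submitted for review
target = ["x", "=", "a", "-", "c"]      # code after review (ground truth)
prediction = ["x", "=", "a", "-", "b"]  # model output: one of two edits applied

print(exact_match(prediction, target))
print(edit_progress(source, prediction, target))
```

Under Exact Match this prediction counts as a plain failure, while the progress-style measure credits it with half of the required change, which is the kind of partial improvement the metric is meant to capture.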

The historical pattern here is cumulative rather than abrupt. Early automation focused on retrieval, ranking, static checks, and reviewer recommendation; later work moved to code refinement and natural-language comment generation. A plausible implication is that automation advanced first on tasks with clearer supervision signals and only later on tasks requiring richer semantic and conversational modeling.

5. LLM assistance, agentic review, and competing futures

Field evidence from industrial deployment shows that LLMs are changing review workflows, but not in a single uniform direction. At WirelessCar Sweden AB, an empirical study combined a field study with a field experiment involving two LLM-assisted workflows: Mode A, an AI-Led “Co-Reviewer” that automatically generates a structured review when a PR is loaded, and Mode B, an On-Demand “Interactive Assistant” that answers only when asked targeted questions (Aðalsteinsson et al., 22 May 2025). Both modes use a retrieval-augmented generation pipeline built on LlamaIndex and OpenAI’s o4-mini model, with semantic tools search_pr, search_code, and search_requirements; Mode A adds start_review to guarantee coverage of every file change (Aðalsteinsson et al., 22 May 2025). Developers preferred AI-led reviews overall, but preferences depended on familiarity with the codebase and the severity of the pull request (Aðalsteinsson et al., 22 May 2025).

Practitioner studies indicate that review remains central even as AI enters the process. Among ninety-one practitioners across four organizations, the reported median time spent on reviews is about three hours per week, and 47.3% expect to spend more time on reviewing in five years, compared with 22% who expect it to decline (Dorner et al., 9 Aug 2025). The same study anticipates a broader range of artifacts being reviewed: production code rises from 84.6% to 87.9%, test code from 63.7% to 68.1%, parameter and configuration files from 60.4% to 69.2%, documentation from 53.9% to 63.7%, and GUI-based end-to-end tests from 17.6% to 26.4% (Dorner et al., 9 Aug 2025). Dorner et al. organize anticipated futures along two continua, Code Author: Human $\longleftrightarrow$ LLM and Code Reviewer: Human $\longleftrightarrow$ LLM, yielding four archetypes: human-led software engineering, automated code review, automated code generation, and unsupervised software engineering (Dorner et al., 9 Aug 2025).

Vision work on agentic review pushes this trend from isolated assistants to end-to-end orchestration. A proposed AI-powered workflow spans five stages—PR Creation, PR Augmentation, Reviewer Selection, AI-Assisted Code Review, and PR Retrospective—with specialized agents for issue linking, fix suggestion, change impact analysis, runtime analysis, toxicity measurement, usefulness measurement, and retrospective summarization, while humans remain at key quality gates (Kamalı et al., 17 May 2026). The stated risks include reliability and hallucinations, bias and security, limited generalization, accumulated error in multi-agent pipelines, transparency, accountability, privacy, automation bias and knowledge deterioration, evaluation and metrics, and economic costs (Kamalı et al., 17 May 2026).

The main controversy concerns whether AI should augment human review or supersede it. Monperrus argues that coding agents have crossed a threshold at which traditional human code review is no longer a necessary component of a software quality pipeline, that agents meet every code-review goal at lower cost and higher throughput, and that the hybrid model in which agents write code and humans remain mandatory reviewers is a dead end (Monperrus, 11 Jun 2026). By contrast, practitioner and theory-oriented studies warn of possible erosions of understanding, accountability, and trust if LLMs author and review too much of the software lifecycle (Dorner et al., 9 Aug 2025). The disagreement is substantive rather than terminological: it concerns the future locus of assurance, responsibility, and organizational learning.

6. Research maturation, benchmarks, and unresolved questions

Research on modern code review has expanded rapidly. A systematic mapping of 177 papers classifies work from 2005 to 2018 into six top-level categories—MCR Process, Contributor/Reviewer, Tool-based Solutions, Source-Code Artifacts, Review Comments, and Other—and shows publication counts rising from near zero in 2005–2007 to roughly 40 papers in 2018 (Badampudi et al., 2023). The most-studied aspects are tool-supported solutions, impact/outcome of MCR, and reviewer selection, while efficiency studies, perception studies, and comment-level assessment remain under-studied (Badampudi et al., 2023).

Benchmark-oriented surveys show a further methodological shift in the LLM era. A survey of 99 papers spanning the Pre-LLM era (58 studies from January 2015 to late 2021) and the LLM era (41 studies from early 2022 to December 2025) organizes code-review automation into five domains and 18 fine-grained tasks (Khan et al., 13 Feb 2026). It reports a clear shift toward end-to-end generative peer review, increasing multilingual coverage, and a decline in standalone change-understanding tasks; single-language datasets fall from 59% in the Pre-LLM era to 24% in the LLM era, while 76% of LLM-era datasets are multilingual (Khan et al., 13 Feb 2026).

A complementary survey of automation research, based on 691 candidate publications and 24 quantitatively relevant studies from May 2015 to April 2024, highlights fragmentation rather than consolidation (Heumüller et al., 25 Aug 2025). Across those studies, the survey counts 48 task–metric combinations, 22 of them unique to their original paper, and reports limited dataset reuse, frequent omission of simple interpretable baselines, and widespread threats from temporal bias and target leakage; only one paper explicitly addresses time-aware splitting (Heumüller et al., 25 Aug 2025). The same survey notes that practical applicability remains limited by low exact-match rates, fragmented datasets, and inconsistent evaluation practices (Heumüller et al., 25 Aug 2025).
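Time-aware splitting, the safeguard mentioned above, can be sketched in a few lines. The idea is that every training example should predate every test example, so that a model is never evaluated on reviews older than the ones it learned from. The record layout, the `created` field, and both function names are assumptions made for illustration and are not taken from any surveyed study.

```python
def temporal_split(records, test_fraction=0.2, time_key="created"):
    """Hold out the most recent records as the test set."""
    ordered = sorted(records, key=lambda r: r[time_key])
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


def has_temporal_leak(train, test, time_key="created"):
    """True if any training record is newer than the oldest test record."""
    if not train or not test:
        return False
    return max(r[time_key] for r in train) > min(r[time_key] for r in test)


# Illustrative review records in arbitrary order; ISO dates sort chronologically.
days = [9, 2, 7, 1, 5, 10, 3, 8, 4, 6]
records = [{"id": i, "created": f"2024-01-{d:02d}"} for i, d in enumerate(days)]

train, test = temporal_split(records, test_fraction=0.2)
print([r["created"] for r in test])
print(has_temporal_leak(train, test))

# Splitting by position in the unordered list mixes past and future.
print(has_temporal_leak(records[:8], records[8:]))
```

A random or positional split of the same records would typically place some later reviews in the training set, which is the temporal bias the survey warns about.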

Several frontier directions are now visible. Synthetic-data generation for low-resource review recommendation is one: translating labeled Java diffs into synthetic C++ diffs with GPT-4o and fine-tuning CodeBERT on the synthetic data yields 0.65 accuracy, 0.65 precision, 0.68 recall, and 0.66 F1 on a held-out real C++ test set, matching or slightly exceeding a real-data baseline (Cohen et al., 5 Sep 2025). Another is review-comment usefulness, where results across studies span roughly 63%–87% accuracy but remain constrained by closed industrial datasets, cold-start problems for experience-based signals, under-explored code-centric comment features, and uncertain cross-project generalizability (Ahmed et al., 2023).
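As a consistency check on the synthetic-data figures, F1 is the harmonic mean of precision $P$ and recall $R$, and the reported values agree with one another:

$F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.65 \times 0.68}{0.65 + 0.68} = \frac{0.884}{1.33} \approx 0.66.$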

The historical trajectory of code review research therefore has two layers. At the practice layer, review has moved from informal and synchronous human inspection to asynchronous, tool-mediated, and increasingly AI-mediated workflows. At the research layer, the field has moved from process description and human factors toward prediction, recommendation, generation, and benchmark design. This suggests that the next phase will depend less on whether automation is possible than on how review systems preserve correctness, context fidelity, accountability, and human understanding while scaling to AI-accelerated software production.