Review Comment Generation

Updated 17 May 2026

Review Comment Generation is a technique that automatically produces natural language feedback for code changes and scholarly articles.
Recent advances employ LLM-based data cleaning, retrieval-augmented generation, and RL-driven optimization to enhance comment quality and relevance.
Agent-based frameworks and domain-specific evaluation metrics further improve the actionability and applicability of generated review comments.

Review Comment Generation refers to the automatic production of evaluative feedback in natural language for either code changes or scholarly articles. In software engineering, the focus is on translating code diffs or patches into actionable code review comments; in academic peer review, it centers on generating domain-relevant critique for submitted manuscripts. Over the last several years, Review Comment Generation has become established as a canonical sequence-to-sequence (seq2seq) task, and research has moved rapidly from heuristic and traditional ML models to deep-learning approaches, including LLMs, retrieval-augmented generation, and RL-based frameworks.

1. Task Definitions, Core Challenges, and Data Quality

Review Comment Generation (RCG) is fundamentally a mapping: for code, from a given code delta $\Delta$ (typically a diff or patch) to a natural-language comment $C$ that pinpoints issues and suggests improvements; for peer-review, from document content and context to reviewer reports or summary feedback.
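
For the code-review case, this mapping is commonly written as conditional generation trained by maximum likelihood. The formulation below is a generic seq2seq sketch rather than one taken from a specific cited paper; the model parameters $\theta$, the training set $\mathcal{D}$, and the token index $t$ are notation introduced here for illustration.

$$
\hat{C} = \arg\max_{C} P_\theta(C \mid \Delta), \qquad
\mathcal{L}(\theta) = -\sum_{(\Delta, C) \in \mathcal{D}} \sum_{t=1}^{|C|} \log P_\theta(c_t \mid c_{<t}, \Delta)
$$

Here $c_t$ is the $t$-th token of the reference comment $C$. Because many different comments can be valid for the same $\Delta$, a single reference per example is only one sample of the acceptable outputs.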

Key Technical Challenges

Noisy data: Large open-source datasets (e.g., CodeReviewer (Liu et al., 4 Feb 2025)) contain many vague, non-actionable, or meta-level comments that dilute the training signal and lower generation quality. Simple rule-based or surface-feature filtering still leaves a significant fraction of low-utility data.
Semantic alignment: Generating high-value, actionable comments requires semantic understanding that links code changes (or manuscript sections) to the feedback provided.
Output diversity and non-uniqueness: Many valid comments can be authored for the same input, complicating metric-based evaluation and learning objectives.
Reference reliability: Empirical analyses show that fewer than 10% of benchmark comments in major datasets are suitable for automation (e.g., actionable, specific, relevant) (Lu et al., 2024).

Data Cleaning Advances

Recent work uses instruction-tuned LLMs (e.g., GPT-3.5, Llama-3) as semantic filters, achieving 66–85% precision in detecting valid code review comments (Liu et al., 4 Feb 2025). The benefit comes from cleaning itself rather than from the smaller dataset size: training on cleaned data yields consistent and statistically significant gains, with BLEU-4 on valid comments improving by up to 13% and informativeness and relevance increasing by 24% and 11%, respectively.

Dataset Version	BLEU-4 (Valid)	Informativeness	Relevance
Original	6.17	~3.4	~2.5
LLM-Cleaned	6.97	~4.3	~2.8

LLM-based cleaning offers favorable cost/quality trade-offs and points to future integration of ensemble and retrieval-augmented filters.
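
The filtering step can be sketched as a loop that asks a judge model whether each comment is valid and keeps only the accepted examples. This is a minimal illustration, not the cited pipeline: `ReviewExample`, `PROMPT_TEMPLATE`, `clean_dataset`, and `toy_judge` are hypothetical names, the prompt wording is invented, and `ask_llm` stands for whatever client function sends a prompt to an instruction-tuned model and returns its text reply.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ReviewExample:
    diff: str
    comment: str


# Hypothetical prompt; the instructions used in the cited work are not reproduced here.
PROMPT_TEMPLATE = (
    "You are auditing code review data.\n"
    "Code change:\n{diff}\n\n"
    "Review comment:\n{comment}\n\n"
    "Is the comment valid, i.e. specific, actionable, and about this change? "
    "Answer 'valid' or 'noisy'."
)


def clean_dataset(
    examples: Iterable[ReviewExample],
    ask_llm: Callable[[str], str],
) -> List[ReviewExample]:
    """Keep only the examples that the judge labels as valid."""
    kept = []
    for ex in examples:
        verdict = ask_llm(PROMPT_TEMPLATE.format(diff=ex.diff, comment=ex.comment))
        if verdict.strip().lower().startswith("valid"):
            kept.append(ex)
    return kept


def toy_judge(prompt: str) -> str:
    """Stand-in for a real LLM call: treats very short comments as noisy."""
    comment = prompt.split("Review comment:\n", 1)[1].split("\n\n", 1)[0]
    return "valid" if len(comment.split()) >= 5 else "noisy"


examples = [
    ReviewExample("- x = 1\n+ x = 2", "LGTM"),
    ReviewExample("- x = 1\n+ x = 2", "Please rename x to retry_count for clarity"),
]
kept = clean_dataset(examples, toy_judge)
```

Passing the judge in as a callable keeps the filter independent of any particular model or vendor, so an ensemble or retrieval-augmented judge could be substituted without changing the loop.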

2. Model Architectures and Training Paradigms

Sequence-to-Sequence Base Models

The standard baseline remains the encoder–decoder transformer (e.g., CodeT5, T5, CodeReviewer), trained on pairs of code changes and review comments (Lin et al., 2024, Liu et al., 4 Feb 2025).

Experience-Aware and Agent-Based Specialization

Experience-Aware Training: Reviewer expertise, measured via Authoring Code Ownership (ACO) and Review-Specific Ownership (RSO), is encoded as per-example loss weights (Experience-Aware Loss Functions, or ELF), yielding substantial improvements in applicability, informativeness, and type coverage (functional, evolvability, discussion issues) (Lin et al., 2024).
Agent-Based Frameworks: RevAgent explicitly decomposes the task into parallel commentator agents (for Refactoring, Bugfix, Testing, Logging, Documentation), overseen by a critic agent that selects the most appropriate response. This specialization yields significant gains in BLEU (+12.9%), ROUGE-L (+10.9%), and semantic metrics (Li et al., 1 Nov 2025).
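
The idea of encoding reviewer experience as per-example loss weights can be illustrated with a weighted mean of per-example losses. This is a sketch under stated assumptions: the text does not give the exact weighting function, so the additive form `base_weight + ownership` and the normalization by the weight sum are illustrative choices, and `experience_weighted_loss` is a hypothetical name. The `ownership` values stand for a score such as ACO or RSO scaled to [0, 1].

```python
from typing import Sequence


def experience_weighted_loss(
    per_example_nll: Sequence[float],
    ownership: Sequence[float],
    base_weight: float = 1.0,
) -> float:
    """Weighted mean of per-example losses.

    Each weight is base_weight + ownership score (assumed form), so comments
    written by more experienced reviewers contribute more to the total.
    """
    if len(per_example_nll) != len(ownership):
        raise ValueError("losses and ownership scores must have the same length")
    if not per_example_nll:
        raise ValueError("at least one example is required")
    weights = [base_weight + o for o in ownership]
    total = sum(w * loss for w, loss in zip(weights, per_example_nll))
    return total / sum(weights)


# With equal ownership the result is the plain mean; raising the ownership
# of the second example pulls the loss toward that example's value.
uniform = experience_weighted_loss([2.0, 4.0], [0.0, 0.0])
skewed = experience_weighted_loss([2.0, 4.0], [0.0, 1.0])
```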

Model	BLEU-4	ROUGE-L	METEOR	SBERT-cosine
LLaMA-Reviewer (single-model)	7.60	17.84	12.07	43.78
RevAgent	8.61	19.73	12.97	48.35
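
The commentator/critic decomposition described for RevAgent can be sketched as follows. This is an assumed structure for illustration, not the published implementation: `review`, `make_stub_commentator`, and `keyword_critic` are hypothetical, and the stubs stand in for LLM-backed agents. The commentators run sequentially here for simplicity, although they are independent and could run in parallel.

```python
from typing import Callable, Dict, Tuple

# Categories named in the text; each gets its own commentator agent.
CATEGORIES = ("Refactoring", "Bugfix", "Testing", "Logging", "Documentation")

Commentator = Callable[[str], str]             # diff -> candidate comment
Critic = Callable[[str, Dict[str, str]], str]  # (diff, candidates) -> chosen category


def review(
    diff: str,
    commentators: Dict[str, Commentator],
    critic: Critic,
) -> Tuple[str, str]:
    """Collect one candidate comment per category, then let the critic pick one."""
    candidates = {name: agent(diff) for name, agent in commentators.items()}
    chosen = critic(diff, candidates)
    if chosen not in candidates:
        raise ValueError(f"critic returned unknown category: {chosen!r}")
    return chosen, candidates[chosen]


def make_stub_commentator(category: str) -> Commentator:
    """Hypothetical stand-in for an LLM-backed commentator."""
    return lambda diff: f"[{category}] Please revisit this change."


def keyword_critic(diff: str, candidates: Dict[str, str]) -> str:
    """Hypothetical stand-in critic: picks a category from keywords in the diff."""
    lowered = diff.lower()
    if "log" in lowered:
        return "Logging"
    if "assert" in lowered or "test" in lowered:
        return "Testing"
    return "Refactoring"


commentators = {c: make_stub_commentator(c) for c in CATEGORIES}
category, comment = review("+ logger.info('retrying')", commentators, keyword_critic)
```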

Low-Rank Adaptation and Prompt Engineering

Parameter-efficient approaches such as QLoRA (quantized LoRA) allow large models to be fine-tuned on consumer-grade hardware while integrating control-flow metadata such as function call graphs (Haider et al., 2024). Metadata-augmented prompting provides a further, modest boost: including the call graph raises BLEU-4 on GPT-3.5 Turbo by 0.5%.
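
Metadata-augmented prompting can be illustrated by a small prompt builder that places a call graph before the diff. The prompt wording, the `build_prompt` function, and the caller-to-callees dictionary format are all assumptions made for this sketch; the cited work may serialize the call graph differently.

```python
from typing import Dict, List


def build_prompt(diff: str, call_graph: Dict[str, List[str]]) -> str:
    """Assemble a review prompt that lists each caller and its callees before the diff."""
    lines = [
        "Review the following code change and write one actionable comment.",
        "",
        "Call graph (caller -> callees):",
    ]
    for caller in sorted(call_graph):
        callees = ", ".join(call_graph[caller]) or "(none)"
        lines.append(f"  {caller} -> {callees}")
    lines += ["", "Diff:", diff]
    return "\n".join(lines)


prompt = build_prompt(
    "- write(data)\n+ write(validate(data))",
    {"save": ["validate", "write"], "validate": []},
)
```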

3. Retrieval-Augmented and Distillation Approaches

Retrieval-Augmented Generation (RAG)

Hybrid architectures such as RAG-Reviewer (Hong et al., 13 Jun 2025) couple code information retrieval with generation, retrieving the $K$ nearest-neighbor (code, comment) pairs and supplying them as additional input. This method:

Improves BLEU-4 by up to 4.25 percentage points over generation-only models.
Significantly enhances low-frequency token recall (+24%), critical for semantically unique or rare identifiers.
Balances flexibility (generation) and exemplar precision (retrieval), and is robust across code/comment length scales.

Model	Exact Match	BLEU (%)
CodeReviewer (gen-only)	1.23	9.27
Pair CodeReviewer (RAG)	2.76	13.52
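
The retrieve-then-generate input construction can be sketched as below. Token-set Jaccard similarity is used only as a simple stand-in for the retriever, which the text does not specify; `tokens`, `jaccard`, `retrieve`, and `build_augmented_input`, as well as the layout of the augmented input, are hypothetical.

```python
import re
from typing import List, Set, Tuple


def tokens(code: str) -> Set[str]:
    """Identifier-like tokens of a code snippet."""
    return set(re.findall(r"[A-Za-z_]\w*", code))


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: str, corpus: List[Tuple[str, str]], k: int = 2) -> List[Tuple[str, str]]:
    """Return the k (code, comment) pairs whose code is most similar to the query."""
    q = tokens(query)
    ranked = sorted(corpus, key=lambda pair: jaccard(q, tokens(pair[0])), reverse=True)
    return ranked[:k]


def build_augmented_input(query: str, neighbors: List[Tuple[str, str]]) -> str:
    """Prepend the retrieved exemplars to the target code for the generator."""
    parts = [
        f"Example {i} code:\n{code}\nExample {i} comment:\n{comment}"
        for i, (code, comment) in enumerate(neighbors, start=1)
    ]
    parts.append(f"Target code:\n{query}")
    return "\n\n".join(parts)


corpus = [
    ("open(path).read()", "Use a context manager so the file is closed."),
    ("for i in range(len(xs)): print(xs[i])", "Iterate over xs directly."),
    ("x = compute_total(items)", "Consider a more descriptive name than x."),
]
query = "data = open(path).read()"
neighbors = retrieve(query, corpus, k=1)
model_input = build_augmented_input(query, neighbors)
```

In a real system the similarity function would typically be replaced by a learned or lexical code retriever, while the surrounding structure stays the same.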

Cross-Task Distillation

DISCOREV (Sghaier et al., 2023) jointly trains a comment generator (student) and a code-refinement model (teacher) via cross-task feedback:

Student loss combines classical cross-entropy on review comments and weighted refinement losses.
Optional embedding-alignment loss penalizes divergence of latent spaces for comments and code edits, further boosting performance.
BLEU-4 improvements reach ≈38% over CodeReviewer, with strongest per-language gains on PHP and Go.
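
The combined student objective described above can be written schematically as a weighted sum. This is a sketch, not the exact loss from the cited paper: the weights $\alpha$ and $\beta$, the subscript names, and the cosine-distance form of the alignment term are assumptions introduced for illustration.

$$
\mathcal{L}_{\text{student}} = \mathcal{L}_{\text{CE}} + \alpha\,\mathcal{L}_{\text{refine}} + \beta\,\mathcal{L}_{\text{align}},
\qquad
\mathcal{L}_{\text{align}} = 1 - \cos\!\left(h_{\text{comment}}, h_{\text{edit}}\right)
$$

Here $\mathcal{L}_{\text{CE}}$ is the cross-entropy on the review comment, $\mathcal{L}_{\text{refine}}$ is the loss fed back from the code-refinement model, and $h_{\text{comment}}$ and $h_{\text{edit}}$ denote embeddings of the comment and the code edit. Setting $\beta = 0$ corresponds to omitting the optional alignment term.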

4. Reward-Driven and RL Optimization

CoRAL (Sghaier et al., 4 Jun 2025) refines review comment generation by incorporating reward models in a reinforcement learning (RL) setup:

Reward signals include SBERT-based semantic similarity and a downstream code-refinement correctness reward (CrystalBLEU).
RL with Proximal Policy Optimization (PPO), regularized by a KL penalty, ensures policy updates do not excessively depart from the base model.
The highest BLEU-4 scores and the best human judgments are achieved when optimizing for the CrystalBLEU-based reward (+1.62 vs. strong RL/embedding methods).

This paradigm aligns the model’s objectives more closely with practical utility—generating feedback that leads to effective subsequent changes.
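
The KL-regularized objective that underlies this kind of training is usually written as follows. This is the generic form used in RL fine-tuning of language models, not necessarily the exact formulation in CoRAL; the reward function $r$, the coefficient $\beta$, and the reference policy $\pi_{\text{ref}}$ (the base model) are notation assumed here.

$$
\max_{\theta}\; \mathbb{E}_{C \sim \pi_\theta(\cdot \mid \Delta)}\big[\, r(\Delta, C) \,\big]
\;-\; \beta\, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid \Delta) \,\big\|\, \pi_{\text{ref}}(\cdot \mid \Delta) \right)
$$

The first term rewards comments that score well under the chosen signal (semantic similarity or downstream refinement quality), and the second term penalizes drift from the base model; PPO optimizes this objective through clipped policy-gradient updates.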

5. Evaluation Methodologies and Metrics

Limitations of Traditional Metrics

Text-similarity metrics (BLEU, ROUGE-L) are only weakly correlated with review comment quality (Lu et al., 2024). Fewer than 10% of “gold” comments in common datasets are considered actionable and specific, so overlap with these references is an unreliable target. Surrogates such as n-gram overlap also fail to capture the several dimensions of good feedback: actionability, explanation, and targeting.

DeepCRCEval and Fine-Grained Criteria

DeepCRCEval (Lu et al., 2024) introduces a multidimensional evaluation framework with nine criteria: readability, relevance, explanation clarity, problem identification, actionability, completeness, specificity, contextual adequacy, and brevity. LLM-based evaluators, leveraging models such as GPT-4, yield high scoring agreement (ICC ≥0.75), greatly increase scalability (88.8% faster, 90.3% cheaper vs. humans), and more effectively distinguish high from low-quality outputs.

Model	Mean score S (1–10)
LLM-Reviewer	9.7
Tufano et al.	4.0
CCT5	4.2
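
A criteria-based score of this kind can be sketched as an aggregation over the nine criteria listed above. The unweighted mean is an assumption made for illustration, since the text does not state how the per-criterion ratings are combined, and `overall_score` is a hypothetical helper. The 1–10 range follows the scale shown in the table.

```python
from statistics import mean
from typing import Dict

# The nine criteria named in the text.
CRITERIA = (
    "readability",
    "relevance",
    "explanation clarity",
    "problem identification",
    "actionability",
    "completeness",
    "specificity",
    "contextual adequacy",
    "brevity",
)


def overall_score(scores: Dict[str, float]) -> float:
    """Unweighted mean over all nine criteria (assumed aggregation)."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for c in CRITERIA:
        if not 1 <= scores[c] <= 10:
            raise ValueError(f"score for {c!r} must be in [1, 10]")
    return mean(scores[c] for c in CRITERIA)


# Ratings as an LLM or human evaluator might return them for one comment.
ratings = {c: 8 for c in CRITERIA}
ratings["brevity"] = 5
ratings["actionability"] = 2
score = overall_score(ratings)
```

Keeping the per-criterion ratings alongside the aggregate makes it possible to see why a comment scores low, for example a readable comment that is not actionable.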

User Studies

Field deployments (e.g., RevMate (Olewicki et al., 2024)) in Mozilla and Ubisoft environments show that 7–8% of LLM-generated comments are accepted, with refactoring suggestions being ~4× more likely to be adopted than functional ones. Accepted LLM suggestions trigger patch revisions at the same rate as human comments (≈74%).

6. Review Comment Generation for Peer Review

The modular guided approach in MOPRD (Lin et al., 2022) segments a review into predefined aspects (e.g., Basic Reporting, Experimental Design, Validity), each generated by an aspect-specific module. It achieves higher ROUGE and BARTScore than extract-and-generate or non-segmented approaches, and human raters from multiple disciplines confirm gains in structure and usefulness.

RbtAct (Wu et al., 10 Mar 2026) targets segment-level, perspective-conditioned feedback optimized for actionability using author rebuttals as supervision. Training on the RMR-75K dataset (review→rebuttal mappings), combined with Direct Preference Optimization (DPO), yields the highest actionability and specificity scores in both human and LLM-judge assessments, outperforming strong proprietary and open baseline models.

System	Human (Actionability)	LLM-judge (Actionability)
RbtAct (full)	3.46	3.38
DeepSeek-V3.2	3.15	3.13

These results suggest that rebuttal-aligned, segment-focused training leads to more implementable and relevant peer review comments.
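
The Direct Preference Optimization objective mentioned above can be computed per preference pair as follows. The code implements the standard DPO loss from sequence log-probabilities; how RbtAct builds its chosen/rejected pairs from rebuttals is not detailed here, so the inputs, the function name `dpo_loss`, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin))."""
    margin = (policy_logp_chosen - ref_logp_chosen) - (
        policy_logp_rejected - ref_logp_rejected
    )
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy matches the reference, the margin is zero and the loss is log(2).
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen comment's likelihood relative to the reference lowers the loss.
improved = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```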

7. Future Directions and Open Challenges

Quality-aligned training and filtering: Continued advances in combining LLM-based data cleaning, expert weighting, and retrieval augmentation are likely to further raise output utility (Liu et al., 4 Feb 2025, Lin et al., 2024).
Downstream and preference-based rewards: Semantic and edit-based rewards, as well as preference learning based on actual author or developer response, are crucial to driving actionable feedback (Sghaier et al., 4 Jun 2025, Wu et al., 10 Mar 2026).
Multidimensional and realistic evaluation: Composite, criteria-driven metrics (e.g., DeepCRCEval) and large-scale practical deployments remain essential for meaningful benchmarking (Lu et al., 2024, Olewicki et al., 2024).
Agent decomposition and category-awareness: Agent-based specialization, as well as explicit conditioning on issue types or review perspectives, systematically elevates granularity and relevance (Li et al., 1 Nov 2025, Wu et al., 10 Mar 2026).
Domain transferability: Most work remains focused on open-source code and select peer-review venues; generalization to closed-source, multi-language, or new publication fields requires further exploration (Lin et al., 2024, Lin et al., 2022).
Integration in SE/peer review ecosystems: Seamless adoption in IDEs, CI/CD, and publisher platforms, along with real-time latency and adaptation to reviewer/author style, present ongoing system-level challenges (Hong et al., 13 Jun 2025, Olewicki et al., 2024).
Data scarcity and privacy: Efficient use of limited or proprietary review datasets while preserving privacy remains a practical concern.

Research consistently shows that advances in both data curation and training paradigms, not merely architectural scaling, are the principal drivers of progress in Review Comment Generation, with actionable, context-aware, and user-aligned feedback as the central target.