
Critique-Guided Improvement (CGI)

Updated 1 April 2026
  • Critique-Guided Improvement (CGI) is an iterative approach where models critique and correct their outputs, enhancing accuracy in tasks like reasoning and code generation.
  • It employs methods such as supervised critique modeling, actor–critic loops, and reinforcement learning to refine candidate solutions using detailed error feedback.
  • Empirical results demonstrate significant gains in mathematical reasoning, safety evaluations, and general language tasks, making CGI a robust strategy for LLM improvement.

Critique-Guided Improvement (CGI) is a paradigm in machine learning, and especially in LLM development, in which iterative critique generation and resolution is explicitly positioned as the driver of capability gains. Rather than relying solely on imitation of reference outputs or scalar rewards, CGI leverages models that can diagnose, articulate, and correct the shortcomings of candidate solutions, yielding performance gains in reasoning, generation, and alignment across diverse domains.

1. Formalization of Critique-Guided Improvement

At the core of CGI is the notion that a model, or pair of models, can best improve by learning to analyze, critique, and revise candidate outputs. This is distinct from conventional supervised fine-tuning (SFT), which minimizes the negative log-likelihood of gold responses $y^*$ given input $x$:

$$L_{\rm SFT}(\theta) = -\mathbb{E}_{(x,y^*)}\left[\log P_\theta(y^* \mid x)\right].$$

CGI-based methods introduce an additional modeling objective. In the case of Critique Fine-Tuning (CFT), the model receives noisy or imperfect candidate outputs $y$ and is trained to generate a detailed, often stepwise, critique $c$:

$$L_{\rm CFT}(\theta) = -\mathbb{E}_{(x,y,c)}\left[\log P_\theta(c \mid [x; y])\right].$$

The distinguishing characteristic of CGI is the centrality of error analysis and critical feedback as the primary training signal. The resulting internal representations emphasize detection of flawed reasoning and proposal of corrections, supporting transfer to generative tasks and forming the foundation for data-efficient and robust improvements in LLM reasoning (Wang et al., 29 Jan 2025).
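
To make the two objectives concrete, here is a minimal sketch of $L_{\rm SFT}$ and $L_{\rm CFT}$ as masked token-level negative log-likelihoods using Hugging Face transformers. The model checkpoint and the toy example are illustrative assumptions, not drawn from the cited papers.

```python
# Minimal sketch of the SFT and CFT objectives as masked cross-entropy.
# Checkpoint and example are placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

def nll_of_target(prefix: str, target: str) -> torch.Tensor:
    """Negative log-likelihood of `target` given `prefix`; prefix tokens are
    masked out of the loss with label -100. (Token alignment at the
    prefix/target seam is approximate, which is acceptable for a sketch.)"""
    prefix_len = tok(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prefix + target, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prefix_len] = -100  # score only the target span
    return model(full_ids, labels=labels).loss

x = "Question: What is 17 * 24?\n"
y_star = "Answer: 408."   # gold response y* (SFT target)
y_noisy = "Answer: 398."  # imperfect candidate y
c = "Critique: 17 * 24 = 408, not 398; the final addition dropped a carry."

l_sft = nll_of_target(x, y_star)              # L_SFT: imitate y* given x
l_cft = nll_of_target(x + y_noisy + "\n", c)  # L_CFT: critique y given [x; y]
```

The only structural difference between the two losses is the conditioning context: SFT conditions on the prompt alone, while CFT conditions on the prompt concatenated with the flawed candidate and scores the critique.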

2. Methods and Workflow: Architectures and Learning Algorithms

CGI is realized in several algorithmic forms, the most prominent of which are:

  • Supervised Critique Modeling: A model is trained to critique candidate responses using datasets of (prompt, candidate, critique) triples, typically obtained via stronger LLMs as teachers or via human annotation (Wang et al., 29 Jan 2025).
  • Two-Player Actor–Critic Loops: An “actor” model (generator) proposes a candidate solution; a “critic” model generates feedback; the actor refines its solution conditioned on this feedback. This can be performed during inference (test-time iterative refinement) or during self-training to collect synthetic supervision (Xi et al., 2024, Yang et al., 20 Mar 2025). A minimal inference-time loop is sketched after this list.
  • Critique-Conditioned Distillation: A student model is trained to map (prompt, candidate, critique) to high-quality answers distilled from a strong teacher, often integrating critiques into the conditional context (Kapusuzoglu et al., 16 May 2025). The CGD loss for a student $S_\theta$ is $\mathcal{L}_{\rm CGD}(\theta) = -\mathbb{E}_{(x, y', c, \hat{y})}\left[\log S_\theta(\hat{y} \mid x, y', c)\right]$.
  • Reinforcement Learning for Critics: A critic model is optimized via RL to provide feedback that maximally increases downstream performance of a fixed generator, with reward tied to the probability that critique-guided revisions pass all test cases (Xie et al., 5 Feb 2025, Ruan et al., 26 Sep 2025, Xi et al., 28 Oct 2025).
  • Refinement-Oriented Critique Optimization: The critic is directly rewarded for generating critiques that lead to refinements preferred over the original output by an automated judge, optimizing the “Critique Utility” (CU) (Yu et al., 27 Jun 2025): $\mathrm{CU}(c_i \mid y_0, x) = \frac{1}{M} \sum_{j=1}^{M} \mathrm{PS}(y_{ij}, y_0)$, where $\mathrm{PS}$ is a learned or LLM-computed pairwise preference score and $y_{ij}$ is the $j$-th refinement sampled conditioned on critique $c_i$ (see the CU sketch below).
  • Self-Evolutionary Critique: Models generate, validate, and learn from their own critiques and corrections—filtered to maintain quality without any stronger supervisor (Tang et al., 10 Jan 2025).
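
The two-player loop above is easiest to see as code. Below is a minimal inference-time sketch; the prompts, stopping heuristic, and the plain prompt-to-completion `LLM` call signature are assumptions for illustration, not a recipe from the cited papers.

```python
# Illustrative actor-critic refinement loop at inference time.
# Each model is treated as a plain prompt -> completion callable.
from typing import Callable

LLM = Callable[[str], str]

def cgi_refine(actor: LLM, critic: LLM, question: str, max_rounds: int = 3) -> str:
    answer = actor(f"Problem: {question}\nSolution:")
    for _ in range(max_rounds):
        critique = critic(
            f"Problem: {question}\nCandidate solution: {answer}\n"
            "Identify any errors step by step, or reply 'No errors.'"
        )
        if "no errors" in critique.lower():  # critic is satisfied; stop early
            break
        answer = actor(  # actor revises conditioned on the critique
            f"Problem: {question}\nPrevious attempt: {answer}\n"
            f"Critique: {critique}\nRevised solution:"
        )
    return answer
```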

Most CGI frameworks support both single-pass refinement (when the generator can be trained to utilize critiques directly) and multi-turn iterative refinement (when repeated critique–edit cycles are desired).
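
As a second concrete instance, here is a hedged sketch of the Critique Utility score from the refinement-oriented critique optimization bullet above; the judge and sampling interfaces are placeholders, not the cited paper’s API.

```python
# Sketch of Critique Utility: the mean pairwise preference of the M
# critique-guided refinements y_i1..y_iM over the original output y_0.
# `pairwise_pref` stands in for a learned or LLM judge PS(., .) in [0, 1].
from typing import Callable, List

def critique_utility(
    refinements: List[str],                      # refinements sampled under c_i
    original: str,                               # y_0, the pre-critique output
    pairwise_pref: Callable[[str, str], float],  # PS(y_ij, y_0)
) -> float:
    return sum(pairwise_pref(y, original) for y in refinements) / len(refinements)
```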

3. Dataset Construction and Critique Signal Quality

Central to CGI is the construction of datasets in which critiques are both informative and actionable. Techniques include:

  • Teacher-Annotated Critiques: Datasets such as WebInstruct-CFT, MetaMath-CFT, and NuminaMath-CFT use teacher models (e.g., GPT-4o) to generate critiques by prompting on (question, noisy response) pairs. Critiques identify errors, suggest corrections, and sometimes assign a correctness label (Wang et al., 29 Jan 2025). A minimal builder of this kind is sketched after this list.
  • Synthetic Critique Generation: In fully self-supervised settings, as in SCRIT, the model uses reference solutions to perform contrastive, step-wise error identification and generates corrections, which are filtered via a self-validation step (Tang et al., 10 Jan 2025).
  • Step-Level Feedback via Automated Pipelines: Datasets such as MathCritique-76k pair every intermediate step in a chain-of-thought derivation with targeted feedback, supporting step-level supervision (Xi et al., 2024).
  • Multi-Aspect Critiques: In tasks such as prompt optimization, LLMs are prompted to autonomously discover aspects (e.g., correctness, style, factuality) and to provide critique-suggestion tuples per aspect (He et al., 2024).
  • Automated or Meta-Evaluation of Critiques: Systems such as Safety-J validate critique quality through AIU (Atomic Information Unit) precision, recall, and F1, enabling preference optimization and iterative refinement (Liu et al., 2024).
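
The teacher-annotated route in the first bullet reduces to a small scripting loop. Here is a minimal sketch, assuming a `teacher` callable wrapping an API such as GPT-4o; the prompt wording and JSONL schema are illustrative, not the published dataset format.

```python
# Sketch of a teacher-annotated critique dataset builder in the style of
# WebInstruct-CFT: prompt a teacher model on (question, noisy response) pairs.
import json
from typing import Callable, Iterable, Tuple

def build_cft_dataset(
    pairs: Iterable[Tuple[str, str]],  # (question, noisy candidate) pairs
    teacher: Callable[[str], str],     # e.g., a GPT-4o API wrapper
    out_path: str = "cft_dataset.jsonl",
) -> None:
    with open(out_path, "w") as f:
        for question, candidate in pairs:
            critique = teacher(
                f"Question: {question}\nCandidate answer: {candidate}\n"
                "Critique the answer step by step: identify each error, "
                "suggest a correction, and end with 'Correct' or 'Incorrect'."
            )
            f.write(json.dumps({"input": f"{question}\n{candidate}",
                                "critique": critique}) + "\n")
```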

A persistent limitation is the prevalence of “noisy” critiques, especially when teacher LLMs generate critiquing signals. Error rates on raw, unfiltered critiques are typically 20–25%, motivating the use of multi-annotator or self-validation schemes (Wang et al., 29 Jan 2025, Tang et al., 10 Jan 2025).
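
One common mitigation is a simple multi-annotator vote over independently sampled critiques. Below is a minimal sketch, assuming each critique ends with a parseable Correct/Incorrect verdict; that convention is an assumption, not taken from the cited papers.

```python
# Minimal majority-vote filter for noisy critiques: keep only critiques whose
# final verdict agrees with the majority verdict across annotators.
from collections import Counter
from typing import List

def majority_filter(critiques: List[str]) -> List[str]:
    # Crude verdict parsing; assumes an explicit Correct/Incorrect label.
    verdicts = ["incorrect" if "incorrect" in c.lower() else "correct"
                for c in critiques]
    majority, _ = Counter(verdicts).most_common(1)[0]
    return [c for c, v in zip(critiques, verdicts) if v == majority]
```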

4. Empirical Impact and Comparative Evaluations

CGI has yielded extensive empirical improvements across domains:

  • Mathematical Reasoning: CFT produces +4–11 point average gains over SFT on MATH, Minerva, GSM8K, and OlympiadBench. In AMC23, absolute gains reach +17.5 points with the CGD variant (Wang et al., 29 Jan 2025, Kapusuzoglu et al., 16 May 2025).
  • Code Generation: Critique-trained critics in CTRL yield up to +106.1% relative improvement in pass@1 on CodeContests, transferring gains even to stronger generators (Xie et al., 5 Feb 2025). Critique-Coder establishes consistent +2–4 point average lifts over RL-only baselines across HumanEval, MBPP, and LiveCodeBench (Ruan et al., 26 Sep 2025).
  • General Reasoning: Application to instruction following and language understanding (e.g., MMLU-Pro) yields +6.3 point gains with CGD (Kapusuzoglu et al., 16 May 2025). CGI is compatible with domains such as summarization, QA, and STEM problem solving.
  • Safety and Alignment: Safety-J shows that iterative CGI with critique-based meta-evaluation produces more robust and nuanced safety assessments than baselines, improving fine-grained predictive reliability (Liu et al., 2024).
  • Agent Decision-Making: In embodied or interactive environments (WebShop, ScienceWorld, TextCraft), CGI-based agents surpass LLM verifiers, reward models, and iterative self-improvement baselines by 10–29 points, especially when using a dedicated critic generating granular multi-dimensional feedback (Yang et al., 20 Mar 2025).
  • Scalability and Data Efficiency: CFT, CGD, and RCO achieve state-of-the-art or near-optimal results while using 1–2 orders of magnitude less data and compute than traditional imitation or reward-based strategies (Wang et al., 29 Jan 2025, Kapusuzoglu et al., 16 May 2025, Yu et al., 27 Jun 2025).

Table: CGI vs. Non-CGI Supervision in Math Reasoning (Qwen2.5-7B; six tasks and their average) (Wang et al., 29 Jan 2025):

Method   MATH   Minerva   GSM8K   Olympiad   AIME24   AMC23   AVG
SFT      61.5   16.2      70.8    30.1       13.3     37.5    38.2
CFT      71.1   27.9      88.8    35.7       13.3     55.0    48.6
Δ        +9.6   +11.7     +18.0   +5.6       0.0      +17.5   +10.4

5. Ablation Studies and Mechanistic Analysis

Ablations consistently indicate that:

  • Quality and Source of Critiques: Higher-quality critiques, and candidate outputs drawn from reference solutions, yield larger downstream improvements. Models are robust to some critique noise but benefit most from critiques derived via stronger LLMs (Wang et al., 29 Jan 2025).
  • Teacher Model Strength: Weaker teacher critics (e.g., GPT-4o-mini) still offer non-trivial CGI gains, but peak performance scales with teacher strength (Wang et al., 29 Jan 2025).
  • Mixing Ratio in RL: In RL settings, moderate interleaving of critique-level and solution-level reward (e.g., 20% CRL in Critique-Coder) maximizes performance; excessive focus on either degrades result quality or diversity (Ruan et al., 26 Sep 2025).
  • Test-Time Scaling: Increasing the number of sampled generations and corresponding critiques at inference time raises the performance ceiling, especially for hard problem bins (Xi et al., 2024); a best-of-N sketch follows this list.
  • Self-Critique Limitation: In most empirical studies, naive self-critique at inference does not outperform direct decoding unless combined with explicit, dedicated critic models or multi-critic loops (Wang et al., 29 Jan 2025, Ruan et al., 26 Sep 2025).
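
A minimal form of the test-time scaling idea is best-of-N selection under a critic. The sketch below assumes `actor` and `critic_score` are placeholder callables, not an API from the cited work.

```python
# Sketch of critique-based test-time scaling: sample several candidates and
# return the one the critic scores highest.
from typing import Callable

def best_of_n(
    actor: Callable[[str], str],                # question -> candidate solution
    critic_score: Callable[[str, str], float],  # (question, solution) -> score
    question: str,
    n: int = 8,
) -> str:
    candidates = [actor(f"Solve: {question}") for _ in range(n)]
    return max(candidates, key=lambda y: critic_score(question, y))
```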

6. Limitations, Extensions, and Broader Significance

Known limitations of current CGI workflows include:

  • Critique Quality Noise: Up to 25% of automated critiques may contain misanalyses, requiring future work on filtering, voting, or multi-teacher schemes (Wang et al., 29 Jan 2025).
  • Self-Reference and Drift: Fully self-evolving critics (e.g., SCRIT) depend on rigorous self-validation to prevent degenerate or drifted supervision (Tang et al., 10 Jan 2025).
  • Generalization: Most gains are measured in domains with readily evaluable responses (math, code, summarization); performance in open-ended or ethical contexts is less studied.
  • Actor–Critic Coupling: Most frameworks (except RCO) keep the actor fixed during critic training, potentially limiting joint optimization of critique generation and utilization (Yu et al., 27 Jun 2025).

Despite these limitations, CGI is now established as a principal methodology for closing the gap between rote imitation and robust, context-aware reasoning in LLMs, delivering substantial empirical gains across tasks while reducing data and compute requirements. The approach’s emphasis on actionable feedback, error localization, and iterative correction marks a distinct shift toward explanatory, theory-driven, and scalable self-improvement in large-scale machine learning (Wang et al., 29 Jan 2025, Xie et al., 5 Feb 2025, Ruan et al., 26 Sep 2025, Xi et al., 2024, Zhang et al., 3 Jun 2025, Kapusuzoglu et al., 16 May 2025, Xi et al., 28 Oct 2025, Yu et al., 27 Jun 2025, Yang et al., 20 Mar 2025).
