Policy Refinement by Critique

Updated 15 April 2026

Policy refinement by critique is a paradigm where models use explicit natural-language feedback to iteratively improve responses and overall alignment.
It unifies advances in RLHF, supervised distillation, and self-improving systems through multi-stage actor-critic interactions and composite learning objectives.
Empirical evaluations show that this approach consistently outperforms direct reward modeling, delivering measurable gains across reasoning, coding, and summarization benchmarks.

Policy refinement by critique is a paradigm in LLM training in which model policies are improved not only by gradient signals derived from a scalar reward, but also by explicit critiques, usually in natural language, that serve as intermediate artifacts. These critiques can be generated by the model itself or by a secondary module (a "critic"). They are then used to guide response refinement, to supervise reward models, or to structure learning objectives for both policy and critic components. The approach unifies several recent advances in RLHF, supervised distillation, and self-improving systems by adding explicit feedback loops that let models "think out loud" and self-correct, which the cited work reports yields more robust performance and alignment than direct preference modeling alone (Ankner et al., 2024, Xi et al., 28 Oct 2025, Tang et al., 20 Jul 2025).

1. Core Mechanisms and Architectural Patterns

Policy refinement by critique typically operates via a multi-stage interaction between a "policy" (actor) and a "critic" (which may itself be an LLM or a specialized module).

Critique generation: Following a candidate response $y \sim \pi(\cdot|x)$ to a prompt $x$, a critic model produces a free-form, structured, or chain-of-thought critique $c$, which may include both discriminative evaluations and actionable suggestions.
Refinement: The policy model conditions on $(x, y, c)$ to generate an improved response $y' \sim \pi(\cdot|x,y,c)$.
Feedback integration: These $(x, y, c, y')$ tuples feed into supervised, preference-based, or reinforcement learning objectives, with the ultimate aim of shaping the policy to be more aligned, accurate, or helpful on future rollouts.

This loop can be instantiated in both supervised (SFT or distillation) and RL settings, and supports both one-off and iterative improvement protocols (Kapusuzoglu et al., 16 May 2025, Hu et al., 5 Dec 2025, Zhou et al., 13 Feb 2025).
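
The three steps above can be sketched as a small loop. This is a minimal illustration, not any cited paper's implementation: `critique_refine_loop`, `RefinementRecord`, and the toy `policy`, `critic`, and `refine` callables are hypothetical names standing in for model calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RefinementRecord:
    """One (x, y, c, y') tuple produced by the loop."""
    prompt: str
    response: str
    critique: str
    refined: str


def critique_refine_loop(
    prompt: str,
    policy: Callable[[str], str],
    critic: Callable[[str, str], str],
    refine: Callable[[str, str, str], str],
    rounds: int = 1,
) -> List[RefinementRecord]:
    """Run `rounds` critique/refine steps and collect the resulting tuples."""
    records: List[RefinementRecord] = []
    response = policy(prompt)                         # y ~ pi(.|x)
    for _ in range(rounds):
        critique = critic(prompt, response)           # c from the critic
        refined = refine(prompt, response, critique)  # y' ~ pi(.|x, y, c)
        records.append(RefinementRecord(prompt, response, critique, refined))
        response = refined                            # next round critiques y'
    return records


# Toy stand-ins for model calls (hypothetical, for illustration only).
def toy_policy(prompt: str) -> str:
    return "2 + 2 = 5"


def toy_critic(prompt: str, response: str) -> str:
    return "The sum is wrong; 2 + 2 equals 4." if "5" in response else "Looks correct."


def toy_refine(prompt: str, response: str, critique: str) -> str:
    return "2 + 2 = 4" if "wrong" in critique else response


records = critique_refine_loop("What is 2 + 2?", toy_policy, toy_critic, toy_refine, rounds=2)
```

The collected records are the kind of tuples that a downstream supervised, preference-based, or RL objective would consume; setting `rounds=1` gives the one-off protocol and larger values the iterative one.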

2. Formal Training Objectives and Algorithms

Several formalizations underlie contemporary critique-based methods:

Critique-out-Loud (CLoud): Models combine a language modeling head that produces $c \sim p(c|x,y;\theta)$ with a reward head that predicts $r_\theta(x,y,c)$, and are trained on the loss

$\mathcal{L}_{\rm CLoud} = \mathcal{L}_{\rm RM} + \lambda \mathcal{L}_{\rm SFT}$

where $\mathcal{L}_{\rm SFT}$ is the cross-entropy on critiques and $\mathcal{L}_{\rm RM}$ is the pairwise Bradley–Terry preference loss over the predicted rewards $r_\theta(x,y,c)$ (Ankner et al., 2024).
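
As a rough numerical illustration of how the two terms combine, the sketch below evaluates the composite loss for a single preference pair using plain Python floats. The function names, the per-token averaging of the critique loss, and the default `lam` are assumptions made for this example, not details taken from Ankner et al.

```python
import math
from typing import Sequence


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def critique_sft_loss(token_logprobs: Sequence[float]) -> float:
    """Cross-entropy on a critique: mean negative log-probability per token."""
    return -sum(token_logprobs) / len(token_logprobs)


def cloud_loss(
    r_chosen: float,
    r_rejected: float,
    critique_token_logprobs: Sequence[float],
    lam: float = 1.0,
) -> float:
    """L_CLoud = L_RM + lambda * L_SFT for one preference pair."""
    rm_term = bradley_terry_loss(r_chosen, r_rejected)
    sft_term = critique_sft_loss(critique_token_logprobs)
    return rm_term + lam * sft_term


# Equal rewards give L_RM = log 2; the critique term adds lam * 2.0 here.
example = cloud_loss(0.0, 0.0, [-1.0, -3.0], lam=0.5)
```

In a real training setup both terms would be computed on batched tensors with automatic differentiation; the scalar version only shows the structure of the objective.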

Critique-RL: A two-stage RL objective. First, the critic is optimized for discriminability, with a reward $r_{\rm dis}$ for judgments that match a verifier. Then the critic is tuned for helpfulness via actor refinement rewards $r_{\rm refine}$, regularized to preserve discrimination; schematically, with $\beta$ a regularization weight,

$r_{\rm stage\,2} = r_{\rm refine} + \beta\, r_{\rm dis}$

The actor is supervised to refine $y$ into $y'$ given $(x, y, c)$, but is not updated during the RL training of the critic (Xi et al., 28 Oct 2025).
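
One plausible reading of the two-stage reward can be sketched as follows. The binary reward values, the additive combination, and the weight `beta` are assumptions chosen for illustration; the exact form used by Xi et al. may differ.

```python
def discriminability_reward(critic_says_correct: bool, verifier_says_correct: bool) -> float:
    """Stage 1 (assumed form): reward 1.0 when the critic's verdict matches the verifier."""
    return 1.0 if critic_says_correct == verifier_says_correct else 0.0


def stage_two_reward(
    refined_is_correct: bool,
    critic_says_correct: bool,
    verifier_says_correct: bool,
    beta: float = 0.5,
) -> float:
    """Stage 2 (assumed form): refinement reward plus a weighted discriminability term."""
    r_refine = 1.0 if refined_is_correct else 0.0
    r_dis = discriminability_reward(critic_says_correct, verifier_says_correct)
    return r_refine + beta * r_dis


# The critic correctly flags a wrong answer and the actor's refinement is correct.
helpful_and_accurate = stage_two_reward(True, False, False, beta=0.5)
# The critic wrongly approves a wrong answer, so the refinement stays wrong.
unhelpful = stage_two_reward(False, True, False, beta=0.5)
```

Keeping the discriminability term in the second stage is what prevents the critic from trading judgment accuracy for refinement gains in this sketch.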

RefCritic: The critic is RL-trained with both an instance-level correctness reward and a refinement-accuracy reward that favors actionable feedback. Only the critic is updated; the policy is refined online via injected critiques at inference time (Tang et al., 20 Jul 2025).
Critique-guided methods in SFT/distillation: CGD uses teacher-generated explanatory critiques as conditioning signals, training the student to map $(x, y, c) \mapsto y'$, where $y$ is its pre-finetuning response and $c$ the teacher's critique (Kapusuzoglu et al., 16 May 2025). SCRPO collects $(x, y', y)$ preference triples, where $y'$ is the refined response after critique/refinement and $y$ is the original, and optimizes a DPO-style ranking loss (Hu et al., 5 Dec 2025).
RCO (Refinement-oriented Critique Optimization): Critic models are updated such that the critiques they generate maximize the probability that the actor's refinement (conditioned on the critique) will be strictly preferred over the original, as determined by an external judge. The utility ${\rm CU}(c)$ of a critique $c$ quantifies the expected quality improvement downstream, and forms the reward for critic learning (Yu et al., 27 Jun 2025).
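
Two of the quantities above lend themselves to a short sketch: a DPO-style loss on a (refined, original) pair, and a Monte Carlo estimate of a critique's utility as the fraction of sampled refinements that a judge strictly prefers over the original. The function names, the `beta` default, and the `judge_prefers_first` callable are hypothetical; neither function is claimed to reproduce SCRPO or RCO exactly.

```python
import math
from typing import Callable, Sequence


def dpo_loss(
    logp_refined: float,
    logp_original: float,
    ref_logp_refined: float,
    ref_logp_original: float,
    beta: float = 0.1,
) -> float:
    """DPO loss with the refined response as chosen and the original as rejected."""
    margin = beta * (
        (logp_refined - ref_logp_refined) - (logp_original - ref_logp_original)
    )
    # Numerically stable -log sigmoid(margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def critique_utility(
    original: str,
    refinements: Sequence[str],
    judge_prefers_first: Callable[[str, str], bool],
) -> float:
    """Fraction of refinements the judge strictly prefers over the original."""
    wins = sum(1 for refined in refinements if judge_prefers_first(refined, original))
    return wins / len(refinements)


# Toy judge (illustration only): prefers the strictly longer answer.
def toy_judge(a: str, b: str) -> bool:
    return len(a) > len(b)


utility = critique_utility("aa", ["aaa", "a", "aaaa", "aa"], toy_judge)
```

In the critic-training setting described above, a value like `utility` would serve as the reward for the critique that produced those refinements.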

3. Empirical Gains, Benchmark Outcomes, and Ablations

In the results reported by the cited papers, policy refinement by critique outperforms direct reward modeling or imitation learning on established reasoning, coding, and summarization benchmarks:

Benchmark (metric, model)	Baseline	Critique-Refinement	Gain	Source
RewardBench (pairwise acc, 8B)	Classic RM 72.5%	CLoud 77.2%	+4.65 pp	(Ankner et al., 2024)
AIME25 (pass@1, Qwen-14B)	14.4%	RefCritic 21.2%	+6.8 pp	(Tang et al., 20 Jul 2025)
PersonaFeedback (Qwen2.5-7B/PPO)	53.1%	Critique-Post-Edit 64.1%	+11.0 pp	(Zhu et al., 21 Oct 2025)
MATH (Acc@Refine, Qwen2.5-7B)	45.74% (no critic)	Critique-RL 58.40%	+12.66 pp	(Xi et al., 28 Oct 2025)
BBH (accuracy, LLaMA-2-7B)	39.67%	RCO 48.06%	+8.39 pp	(Yu et al., 27 Jun 2025)
HumanEval (RefineCoder-DS-6.7B)	73.8% (iter 0)	75.0% (iter 3)	+1.2 pp	(Zhou et al., 13 Feb 2025)
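
The Gain column is a difference in percentage points between the two score columns. The short check below recomputes it from the table entries; the variable and function names are illustrative. From the rounded entries the RewardBench row computes to 4.7 rather than the reported 4.65, which presumably reflects unrounded source scores (an assumption, since the unrounded values are not given here).

```python
# (benchmark, baseline %, critique-refinement %, reported gain in pp), copied from the table.
rows = [
    ("RewardBench", 72.5, 77.2, 4.65),
    ("AIME25", 14.4, 21.2, 6.8),
    ("PersonaFeedback", 53.1, 64.1, 11.0),
    ("MATH", 45.74, 58.40, 12.66),
    ("BBH", 39.67, 48.06, 8.39),
    ("HumanEval", 73.8, 75.0, 1.2),
]


def gain_pp(baseline: float, refined: float) -> float:
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(refined - baseline, 2)


computed = {name: gain_pp(base, refined) for name, base, refined, _ in rows}

# Rows where the recomputed gain differs from the reported one by more than 0.01 pp.
mismatches = [
    name for name, base, refined, reported in rows
    if abs(gain_pp(base, refined) - reported) > 0.01
]
```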

Ablations show that omitting the critique-refine step (e.g., D_refine in (Yang et al., 20 Mar 2025), or the selective critic module in (Zhou et al., 13 Feb 2025)) significantly degrades performance, which supports a causal role for actionable critiques. Self-consistency and critique sampling at inference can further boost accuracy on short-horizon tasks (Ankner et al., 2024); refinement-based supervision is reported to be robust to the choice of judge model and to hold across model sizes (Yu et al., 27 Jun 2025).
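
Self-consistency over critiques can be sketched as sampling several critiques for the same response and averaging the reward predicted under each one. The callables `sample_critique` and `reward_given_critique` are hypothetical stand-ins for a critic's generation and reward heads, and the toy versions below are deterministic only to keep the example reproducible.

```python
from itertools import cycle
from statistics import mean
from typing import Callable


def self_consistent_reward(
    prompt: str,
    response: str,
    sample_critique: Callable[[str, str], str],
    reward_given_critique: Callable[[str, str, str], float],
    num_samples: int = 8,
) -> float:
    """Average the predicted reward over several sampled critiques."""
    rewards = []
    for _ in range(num_samples):
        critique = sample_critique(prompt, response)
        rewards.append(reward_given_critique(prompt, response, critique))
    return mean(rewards)


# Toy critic: cycles through fixed critiques instead of sampling from a model.
_toy_critiques = cycle(["good", "bad", "good", "good"])


def toy_sample_critique(prompt: str, response: str) -> str:
    return next(_toy_critiques)


def toy_reward(prompt: str, response: str, critique: str) -> float:
    return 1.0 if critique == "good" else 0.0


score = self_consistent_reward(
    "Summarize the report.", "A short summary.", toy_sample_critique, toy_reward, num_samples=4
)
```

Averaging trades extra decoding cost for lower variance in the reward estimate, which is consistent with the efficiency caveat that critique decoding adds overhead.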

4. Critique Types, Utility Estimation, and Training Strategies

Types of critiques: Approaches variously use binary judgments, fine-grained or fact-level explanations (Hu et al., 5 Dec 2025), multi-dimensional critiques (helpfulness, personalization, naturalness) (Zhu et al., 21 Oct 2025), or long chain-of-thought traces (Tang et al., 20 Jul 2025).
Utility of critiques: The primary criterion is observed downstream improvement of actor outputs; utility can be explicitly quantified via Critique Utility (CU) or other paired-preference metrics (Yu et al., 27 Jun 2025).
Supervision signals: Both direct (correctness matching, explicit refinement success) and indirect (DPO/ranking objectives, offline preference data) signals are used.
Critique generation: Training may rely on manually curated traces, distillation from stronger models, self-generated critiques, or online RL-based optimization (Hu et al., 5 Dec 2025, Kapusuzoglu et al., 16 May 2025, Tang et al., 20 Jul 2025).
Policy updating/coupling: In some frameworks, only the critic is updated and the actor is refined at inference (RefCritic, RCO); in others, actor and critic are updated jointly, or the actor is iteratively SFTed to incorporate feedback (Yang et al., 20 Mar 2025, Hu et al., 5 Dec 2025).

5. Extensions, Limitations, and Open Challenges

Extensions: Proposed directions include learning multi-objective reward heads to decompose aspects of critique, integrating human-written rubrics, using an explicit mixture-of-experts for different sub-tasks, and extending to multi-turn dialogue and agent chains (Ankner et al., 2024), as well as interactive online RLHF loops with critique-guided PPO updates (Xi et al., 28 Oct 2025).
Limitations: Effectiveness of on-policy self-generated critiques depends on the initial critic's quality; off-policy or externally sourced critiques can degrade learning. Critique sampling and self-consistency are effective only for short-horizon reasoning or summarization tasks. Additional computational overhead from decoding critiques may impact efficiency (Ankner et al., 2024). Single-round refinement may limit performance versus multi-turn iterative refinement; actor models may remain sub-optimal if not co-trained with critiques (Yu et al., 27 Jun 2025).
Generalization: The critique-refinement loop has been validated in code (RefineCoder), math (Critique-GRPO, RefCritic), instruction following, dialogue, and summarization (SCRPO). A plausible implication is that any domain where outputs are amenable to actionable feedback and subsequent revision can benefit from this paradigm (Zhou et al., 13 Feb 2025, Hu et al., 5 Dec 2025).

6. Comparison with Classical RLHF and Distillation Protocols

The critique-refinement paradigm systematically addresses key failure modes of reward hacking and imitation learning found in direct RLHF and vanilla SFT/distillation:

Superficial alignment: Scalar reward models can be gamed (reward hacking) and do not provide actionable error signals; critiques articulate missing reasoning and style flaws explicitly (Zhu et al., 21 Oct 2025).
Imitation without understanding: SFT on correct responses alone leads to high entropy but shallow understanding (Kapusuzoglu et al., 16 May 2025); critique-guided learning reduces uncertainty and sharpens the model's posterior over correct outputs.
Exploration/exploitation: Critique-driven updates favor rare but high-quality refinements over random exploration; longer or higher-entropy outputs alone do not yield better improvements (Zhang et al., 3 Jun 2025).
Format preservation: Critique-based distillation mitigates format drift and overfitting compared to critique-only fine-tuning (Kapusuzoglu et al., 16 May 2025).

In summary, policy refinement by critique treats explanatory feedback as a central learning signal. By letting LLMs internalize both "what to do" and "why," it improves on direct reward modeling and imitation in the cited evaluations, with measurable gains in alignment, faithfulness, and reasoning. The approach is characterized by explicit feedback loops, composite supervision, and empirical validation across tasks (Ankner et al., 2024, Yang et al., 20 Mar 2025, Tang et al., 20 Jul 2025).