
Critique-RL: Enhancing RL via Model Critique

Updated 4 November 2025
  • Critique-RL is a reinforcement learning framework that trains models as critics to evaluate outputs by providing binary judgments and detailed feedback.
  • It employs hybrid methodologies combining direct feedback, two-stage RL, and natural language critique to optimize both generation and evaluation.
  • Empirical results demonstrate improved accuracy, sample efficiency, and transferability across tasks like mathematical reasoning and code generation.

Critique-RL refers to a class of reinforcement learning (RL) methodologies and frameworks that train LLMs or agents to function as critics—entities that assess, provide feedback on, and verify the quality of outputs generated by themselves or other models. The core aim is to develop models that possess robust capabilities for evaluating, reflecting, and improving upon generated outputs, especially in complex reasoning domains. Critique-RL diverges from standard reinforcement learning focused solely on generation, instead optimizing explicit critique ability as a first-class objective. Over the past two years, Critique-RL and related frameworks have seen significant developments, notably in mathematical reasoning, code generation, feedback/oversight for LLMs, and broader alignment applications.

1. Foundations and Rationale

Traditional RL fine-tuning for LLMs centers on optimizing the generation policy to maximize average reward across outputs—often using scalar reward models, direct metric evaluation (e.g., test cases in code), or preference feedback. However, this approach does not directly incentivize models to acquire or exercise critique abilities: skills in detecting flaws, providing actionable feedback, or verifying correctness in generated solutions. Several limitations of standard RL fine-tuning motivate Critique-RL:

  • Sparse Rewards and Poor Credit Assignment: Binary or scalar rewards give little information for reasoning improvement, leading to plateaued performance and persistent failure on hard tasks (Zhang et al., 3 Jun 2025, Cao et al., 2024).
  • Lack of Explicit Reflection: RL optimization over solutions alone does not inherently train the model to explain, diagnose, and suggest improvements.
  • Oversight and Alignment: As models are deployed in ever more complex reasoning and decision-making tasks, scalable, automated oversight (beyond a basic reward signal) is required for both alignment and capability robustness.

Critique-RL frameworks are constructed to incentivize LLMs to develop meta-cognitive capabilities—providing high-quality judgments and actionable critiques that drive further improvement and support scalable self-improvement loops.

2. Methodological Paradigms

Critique-RL is implemented in multiple forms, which may overlap—key axes of distinction include the training objective, the nature of the reward signal, and the division of roles between generation (actor) and critique (critic).

2.1. Critique Reinforcement Learning with Binary Judgments

As instantiated in Critique-Coder (Ruan et al., 26 Sep 2025), Critique Reinforcement Learning (CRL) trains models to output a critique c for a given (question, solution) pair, with the explicit task of assigning a binary judgment (correct/incorrect). The reward is 1 if the model's final judgment matches the ground-truth judgment c^* and 0 otherwise:

R_{\text{crl}}(c, c^*) = \begin{cases} 1 & c = c^* \\ 0 & \text{otherwise} \end{cases}

CRL is mixed into a hybrid RL setting: typically, 80% of the training data uses standard RL (generating solutions, with rewards from test cases) and 20% uses CRL critique data. The GRPO (Group Relative Policy Optimization) algorithm is used for stable joint optimization.
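
The binary reward and the 80/20 hybrid mixing can be sketched as follows (the function names and the per-episode sampling scheme are illustrative assumptions, not the paper's implementation):

```python
import random

def crl_reward(judgment: bool, ground_truth: bool) -> float:
    """Binary CRL reward: 1 if the critic's correct/incorrect verdict
    matches the oracle label, 0 otherwise."""
    return 1.0 if judgment == ground_truth else 0.0

def sample_task_type(rng: random.Random, crl_fraction: float = 0.2) -> str:
    """Hybrid mixing: ~80% standard RL episodes (generate a solution,
    reward from test cases) and ~20% CRL episodes (critique a given
    solution, reward from crl_reward)."""
    return "critique" if rng.random() < crl_fraction else "generate"

rng = random.Random(0)
tasks = [sample_task_type(rng) for _ in range(1000)]
print(tasks.count("critique"))  # roughly 200 critique episodes per 1000
```

In practice the mixture is usually fixed at the dataset level rather than sampled per episode, but the effect on the training distribution is the same.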

2.2. Two-Stage Reinforcement Learning for Critic Training

The Critique-RL framework (Xi et al., 28 Oct 2025) adopts a two-stage RL process for developing language critics:

  • Stage I (Direct Discriminability Optimization): The critic is directly rewarded for correct binary judgments using a rule-based reward aligned to an oracle (ground truth), with KL regularization to prevent catastrophic drift from the SFT prior.
  • Stage II (Helpfulness Optimization with Discriminability Regularization): The critic is further trained to maximize a reward given by whether the actor, after using the critic's feedback, reaches a correct solution. Crucially, the discriminability reward is maintained with a regularization term, ensuring that feedback-improving helpfulness does not come at the expense of accurate judgment.
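
The two stage rewards can be sketched as a simple scalarization (the coefficient names `beta` and `lam` and their values are illustrative assumptions, not the paper's):

```python
def stage1_reward(judgment_correct: bool, kl_to_sft: float, beta: float = 0.1) -> float:
    """Stage I: rule-based discriminability reward, with a KL penalty that
    keeps the critic close to its SFT initialization."""
    discriminability = 1.0 if judgment_correct else 0.0
    return discriminability - beta * kl_to_sft

def stage2_reward(actor_solves_after_feedback: bool,
                  judgment_correct: bool,
                  lam: float = 0.5) -> float:
    """Stage II: helpfulness reward (did the actor reach a correct solution
    after using the critique?) plus a discriminability regularizer, so that
    more helpful feedback does not erode judgment accuracy."""
    helpfulness = 1.0 if actor_solves_after_feedback else 0.0
    discriminability = 1.0 if judgment_correct else 0.0
    return helpfulness + lam * discriminability
```

The regularizer in Stage II is what prevents the critic from drifting toward vague but persuasive feedback that helps the actor while losing its own judgment accuracy.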

2.3. RL with Both Numeric and Natural Language Critique

Critique-GRPO (Zhang et al., 3 Jun 2025) and similar frameworks recognize the limits of pure scalar feedback and introduce joint learning from both numerical and natural language (critique) rewards. Models are trained using online RL to not only optimize for the correctness of initial generations (scalar reward), but also to learn efficiently from critique-guided refinements, with explicit policy shaping to prioritize learning from rare, informative corrections.
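
The two signal types can be sketched with GRPO-style group-relative advantages for the scalar reward plus a simple shaping term for critique-guided refinements (the `boost` weighting is an illustrative assumption, not Critique-GRPO's exact shaping function):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: each sampled output's reward is normalized
    against the group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mu) / sigma for r in rewards]

def shape_advantage(advantage: float, is_refinement: bool, boost: float = 2.0) -> float:
    """Policy-shaping sketch: upweight critique-guided refinements, which
    supply the rare but highly informative corrections."""
    return boost * advantage if is_refinement else advantage

advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advs)  # [1.0, -1.0, -1.0, 1.0]
```

The group normalization is what makes the scalar signal usable without a learned value function; the shaping term then steers gradient mass toward samples where a natural-language critique actually changed the outcome.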

2.4. RL for Critique Generation

Methods such as CTRL (Xie et al., 5 Feb 2025), DeepCritic (Yang et al., 1 May 2025), and MultiCritique (Lan et al., 2024) separate the critic from the actor/generator. The critic model is trained via RL (often using proxy supervision, step labels, or preference pairs) to output natural language feedback that maximizes downstream correction (as measured by a fixed or improved generator) without direct human supervision.
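
The critic's reward signal in this family can be sketched as: generate feedback, let a fixed generator revise its draft under that feedback, and score the critique by whether the revision is correct. The `toy_actor` and `toy_check` stubs below are hypothetical placeholders for an LLM generator and a verifier (e.g., unit tests or an oracle answer):

```python
def critic_reward(actor, problem, draft, critique, check_correct) -> float:
    """Proxy reward for a critique: 1 if the fixed actor, revising its draft
    under the critique, produces a correct solution. No human labels on the
    critique text itself are needed."""
    revised = actor(problem, draft, critique)
    return 1.0 if check_correct(problem, revised) else 0.0

# Toy illustration; real systems use an LLM actor and executable checks.
def toy_actor(problem, draft, critique):
    return "4" if "arithmetic" in critique else draft

def toy_check(problem, solution):
    return solution == "4"

print(critic_reward(toy_actor, "2+2?", "5", "fix the arithmetic", toy_check))  # 1.0
print(critic_reward(toy_actor, "2+2?", "5", "looks fine", toy_check))          # 0.0
```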

3. Experimental Results and Benchmarks

Critique-RL methodologies have been rigorously benchmarked across mathematical reasoning, code generation, and general reasoning tasks. Experimental findings show:

  • Consistent Performance Improvements: Hybrid Critique-RL and RL-trained critics yield significant gains over pure RL or SFT, both in pass rates (accuracy) and in robustness to compounding errors and persistent failure modes (Ruan et al., 26 Sep 2025, Xi et al., 28 Oct 2025, Zhang et al., 3 Jun 2025).
  • Transferability: Critique-optimized models (trained on code, for instance) transfer their gains to logical, math, or unrelated reasoning tasks (e.g., improved BBEH results).
  • Sample Efficiency: Critique supervision can drastically outperform SFT and RL in compute efficiency, sometimes matching RLVR with an order of magnitude less computation (Wang et al., 3 Jun 2025).
  • Qualitative Gains: Models trained with Critique-RL produce longer, richer reasoning chains and more actionable/explanatory feedback.

Representative Numerical Results (Qwen3-8B, LiveCodeBench):

Model                           Pass@1 (%)
RL-only                         58.4
Critique-Coder (CRL + RL, 20%)  60.8
DeepCoder-14B                   60.6
GPT-o1                          59.2

For general reasoning (BBEH, Qwen3-4B):

  • Critique-Coder: averaged 52.8% (+6.1 points over the baseline, +4 points over RL-only).

Ablation studies consistently find that mixing critique and RL signals (e.g., roughly 20% critique data) provides the best results, with either CRL-only or RL-only training performing worse.

4. Mechanistic Insights and Design Principles

4.1 Explicit Critical Thinking Incentivization

Critique-RL directly incentivizes critical thinking and error analysis by training models to simulate the act of evaluating solutions, not just producing them. The separation of discriminability (correct identification of solutions' correctness) from helpfulness (ability of critique to inform improvement) enables explicit optimization along both axes (Xi et al., 28 Oct 2025).

4.2 Complementarity to RL

CRL alone leads to models with poor direct solution generation; RL alone leads to models with poor critique. Only the hybrid optimally balances these outcomes. Furthermore, Critique-RL can serve as “weak-to-strong” scalable supervision: weaker critics trained with RL can be used to efficiently guide more powerful actors (Xie et al., 5 Feb 2025).

4.3 Improved Exploration and Learning Signal

Integrating critique with online RL, especially when using natural language CoT feedback, enables models to escape performance plateaus and resolve persistent failures, providing targeted improvement pathways not accessible via scalar rewards (Zhang et al., 3 Jun 2025).

4.4 Modular and Generalizable Architecture

Once trained, a Critique-RL critic can be reused with different actors or for different tasks, supporting modular system design and broad generalization (Xi et al., 28 Oct 2025).

5. Practical Limitations and Outlook

Despite strong empirical validation, several limitations and open issues remain:

  • Full Self-Critique Not Emergent: Simple test-time usage of critique outputs for solution selection (self-critique) does not fully close the performance gap with true ensemble selection or majority voting, indicating the need for further research in test-time scaling (Ruan et al., 26 Sep 2025).
  • Limited by Oracle Signal Availability: Current methods require access to ground-truth or a reliable reward indicator at training time; extending to domains lacking such oracles will require new techniques.
  • Extensibility to Non-Code Domains: While code and math domains (with verifiable execution) demonstrate strong results, generalization to open-ended, less structured reasoning or dialogue domains demands robust adaptation.
  • RL Stability and Tuning: Multi-stage RL training with discriminability regularization must be managed to avoid collapse or loss of judgment ability (Xi et al., 28 Oct 2025).

6. Comparative Table: Critique-RL vs Prior Approaches

Method              Helpfulness via Actor Refinement    Direct Judgment Optimization
SFT                 Weak                                —
Retroformer/CTRL    Moderate                            Moderate
Prompt/Reflection   Poor                                —
Critique-RL         Strong                              High

7. Conclusions and Emerging Directions

Critique-RL has established itself as a principled and effective framework for training critic-capable LLMs—enabling robust, actionable evaluation, correction, and improvement of outputs for complex reasoning tasks. Its core advances lie in explicitly incentivizing both discriminability and helpfulness, leveraging feedback-rich learning signals, and providing strong empirical evidence for transferability and sample efficiency. As LLMs are increasingly deployed in safety-critical and high-stakes reasoning contexts, Critique-RL provides a scalable path to automated, trustworthy oversight and continual self-improvement.

Recent extensions include multi-agent critique aggregation for enhanced label quality (Lan et al., 2024), explicit integration of verifiable tool feedback in RL (Kapadnis et al., 30 May 2025), and compositional reward designs for tasks like peer review (Zeng et al., 14 Aug 2025). Open challenges include full self-critique emergence, reward model generalization, and application in domains with weak or absent ground-truth metrics. As Critique-RL frameworks diversify and mature, they are expected to underlie robust, modular, and autonomous reasoning systems across a range of scientific, engineering, and societal applications.
