
Teaching Language Models to Critique via Reinforcement Learning (2502.03492v1)

Published 5 Feb 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Teaching LLMs to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we study LLM critics for code generation and propose $\texttt{CTRL}$, a framework for $\texttt{C}$ritic $\texttt{T}$raining via $\texttt{R}$einforcement $\texttt{L}$earning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with $\texttt{CTRL}$ significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving up to 106.1% relative improvements across challenging code generation benchmarks.

Authors (6)
  1. Zhihui Xie (17 papers)
  2. Jie Chen (602 papers)
  3. Liyu Chen (22 papers)
  4. Weichao Mao (11 papers)
  5. Jingjing Xu (80 papers)
  6. Lingpeng Kong (134 papers)

Summary

Teaching LLMs to Critique via Reinforcement Learning

This paper addresses the challenge of enabling LLMs to critique and refine their outputs effectively, focusing on the domain of code generation. The authors introduce a framework named Critic Training via Reinforcement Learning (CTRL), which is designed to train critic LLMs to generate feedback that improves the performance of a fixed generator model. The standout feature of CTRL is its ability to operate without human supervision, a significant advancement in the iterative improvement of AI-generated content.

The CTRL framework is structured around two core components: a generator model that proposes solutions and a critic model trained to provide feedback that improves those solutions. This decoupled architecture not only boosts the generator's performance but also allows for test-time scaling through repeated critique-revision cycles, sketched below. Notably, CTRL achieves up to a 106.1% relative improvement on challenging code generation benchmarks, underscoring its efficacy.
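To make the loop concrete, here is a minimal sketch of iterative critique-revision under assumed interfaces: `generator` and `critic` are hypothetical callables wrapping the fixed generator LLM and the trained critic, and the critique is assumed to come back as a dict with a `verdict` and free-form `text`. The paper's actual prompting and stopping logic may differ.

```python
def critique_revision_loop(problem, generator, critic, max_rounds=3):
    """Iteratively refine a candidate solution using critic feedback.

    `generator` and `critic` are assumed wrappers around the fixed generator
    model and the CTRL-trained critic; their signatures are illustrative only.
    """
    solution = generator(problem)  # initial attempt from the fixed generator
    for _ in range(max_rounds):
        feedback = critic(problem, solution)          # natural-language critique
        if feedback.get("verdict") == "correct":      # critic doubling as a reward model
            break
        # Condition the generator on the critique to produce a revision.
        solution = generator(problem,
                             prior_solution=solution,
                             critique=feedback["text"])
    return solution
```

Because the critic is the only trained component, the same loop can be reused around stronger generator models without retraining them.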

CTRL's training regimen proceeds in two stages: supervised fine-tuning (SFT) followed by reinforcement learning (RL) with Group Relative Policy Optimization (GRPO). During SFT, high-quality critiques are synthesized by reasoning over execution feedback, allowing the model to internalize structured feedback strategies. RL then refines these strategies further, with GRPO's group-based relative advantages mitigating the high variance of critic training.
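The group-relative advantage at the core of GRPO can be illustrated with a short sketch. This shows only the advantage computation, not the full policy-gradient update, and the reward values in the usage example are placeholders (e.g. whether the revision produced under each sampled critique passes the problem's tests).

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each sampled critique's reward by the
    mean and standard deviation of its own group, avoiding a learned value
    baseline and reducing gradient variance."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Rewards for a group of critiques sampled for the same problem
# (placeholder values: 1.0 if the resulting revision passed, else 0.0).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Critiques whose revisions outperform the group average receive positive advantages, so the critic is pushed toward feedback that actually improves correction performance rather than feedback that merely sounds plausible.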

The results of implementing CTRL are impressive, showcasing significant improvements over traditional self-critique methods and even outperforming approaches leveraging stronger critic models. Notably, the framework generalizes: a weaker critic model successfully guides a stronger generator toward correct outputs. This capability reflects the potential for scalable oversight in AI systems, where less capable models are trained to supervise more powerful counterparts.

The empirical analysis highlights CTRL's ability to reduce error compounding during iterative improvements, a persistent challenge in self-improving systems. The framework markedly reduces regressions from correct solutions to incorrect ones during revisions, illustrating its robustness in maintaining solution quality. Furthermore, the critic enables efficient test-time scaling, achieving higher success rates with lower token consumption, which emphasizes the practical benefits of the approach.
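The regression metric discussed above can be made concrete with a short evaluation sketch. `passes` is a hypothetical helper that runs a problem's unit tests against a solution; the metric itself is just the fraction of initially-correct solutions that become incorrect after revision.

```python
def regression_rate(problems, initial_solutions, revised_solutions, passes):
    """Fraction of initially-correct solutions that regress after revision.

    `passes(problem, solution)` is an assumed helper that executes the
    problem's unit tests and returns True if they all pass.
    """
    correct_before = [
        (p, s1)
        for p, s0, s1 in zip(problems, initial_solutions, revised_solutions)
        if passes(p, s0)
    ]
    if not correct_before:
        return 0.0
    regressed = sum(1 for p, s1 in correct_before if not passes(p, s1))
    return regressed / len(correct_before)
```

A lower regression rate under repeated critique-revision is what distinguishes a critic that preserves solution quality from one that compounds errors.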

In terms of implications, CTRL represents a step forward in the quest for more autonomous and reliable AI systems. Its application is not limited to code generation; the underlying principles and reinforcement learning techniques can potentially extend to other domains requiring iterative refinement. This research sets the stage for future developments in AI that prioritize feedback-driven improvements, merging discrimination and critiquing capabilities efficiently within LLMs.

Overall, this paper offers valuable insights into leveraging reinforcement learning for critic model training in AI systems. By disentangling critiquing from solution generation, it opens up possibilities for scalable and autonomous model enhancements, laying a foundation for further exploration in self-improving AI technologies. Future research could build upon these findings, exploring broader applications and refining the integration between critic and generator models to enhance versatility and performance across various tasks.