Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback (2506.03106v2)

Published 3 Jun 2025 in cs.CL and cs.AI

Abstract: Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of LLMs. Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided refinements simultaneously while maintaining exploration. Extensive experiments using Qwen2.5-7B-Base and Qwen3-8B-Base show that Critique-GRPO consistently outperforms supervised learning-based and RL-based fine-tuning approaches across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.5% and 5%, respectively. Notably, Critique-GRPO surpasses a strong baseline that incorporates expert demonstrations within online RL. Further analysis reveals two critical insights about policy exploration: (1) higher entropy does not always guarantee efficient learning from exploration, and (2) longer responses do not necessarily lead to more effective exploration.

The paper "Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback" (Zhang et al., 3 Jun 2025 ) introduces Critique-GRPO, an online reinforcement learning framework designed to improve the reasoning capabilities of LLMs by integrating both natural language feedback (in the form of critiques) and traditional numerical feedback (scalar rewards).

The authors identify three key limitations of existing RL-based LLM finetuning methods that rely solely on numerical feedback: performance plateaus, limited effectiveness of self-reflection, and persistent failures on certain problems. They hypothesize that the sparse information in scalar rewards, which doesn't explain why a response is correct or incorrect, hinders effective learning, especially for complex multi-step reasoning tasks. They observe that even RL-finetuned models that have plateaued can generate correct refinements for previously failed problems when provided with natural language critiques. This insight motivates the development of a framework that leverages richer feedback.

Critique-GRPO is proposed as an online framework structured in three main steps:

  1. Initial Response Sampling: For a given question, the LLM samples a set of initial responses using its current policy. A reasoning-based reward model (like GPT-4o in their experiments) then evaluates these responses and generates Chain-of-Thought (CoT) critiques that explain the reasoning process and identify errors. Scalar rewards (e.g., binary correctness) are extracted from these critiques.
  2. Refinement with Critique: The LLM is prompted to generate refined responses for the initially sampled responses, specifically conditioned on the original question, the initial response, and the generated CoT critique. These refined responses are also scored using the reward model or a rule-based evaluator. A subset of these refinements is selected (prioritizing correct ones) to be combined with the initial responses for training.
  3. Online Policy Optimization: The LLM's policy is updated using a modified Group Relative Policy Optimization (GRPO) objective that combines the learning signals from the initial responses and the refined responses. A crucial component is a shaping function applied to the token-level probability ratios of the refined responses: it emphasizes learning from tokens in correct refinements that were previously low-probability under the current policy, while strongly penalizing tokens in incorrect refinements. The standard clipping and KL-divergence penalty terms from GRPO are removed to allow larger policy updates guided by the critique-based refinements. A notational sketch of this combined objective follows the list.
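
To make step 3 concrete, the combined objective can be written schematically as below. This is a notational sketch following common GRPO conventions rather than the paper's exact formulation; the group size $G$, the number of selected refinements $M$, the group-relative advantages $\hat{A}$, and the shaping function $f$ are assumed symbols for illustration.

$$
\mathcal{J}(\theta) \;=\; \mathbb{E}_{q}\!\left[
\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|} r_{i,t}(\theta)\,\hat{A}_i
\;+\;
\frac{1}{M}\sum_{j=1}^{M}\frac{1}{|\tilde{y}_j|}\sum_{t=1}^{|\tilde{y}_j|} f\!\big(\tilde{r}_{j,t}(\theta)\big)\,\hat{A}_j
\right],
\qquad
r_{i,t}(\theta) \;=\; \frac{\pi_\theta\!\left(y_{i,t}\mid q,\, y_{i,<t}\right)}{\pi_{\theta_{\mathrm{old}}}\!\left(y_{i,t}\mid q,\, y_{i,<t}\right)},
$$

where the first term averages over the $G$ initial responses $y_i$, the second over the $M$ selected refinements $\tilde{y}_j$, $\hat{A}$ denotes group-relative (reward-normalized) advantages, and $f$ up-weights previously low-probability tokens in correct refinements while penalizing tokens in incorrect ones. The usual GRPO clipping and KL penalty terms are omitted, as described above.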

The core idea behind Critique-GRPO is to enable the model to learn not just from whether an attempt was successful (the numerical reward) but also from how to correct errors and improve its reasoning (the natural language critique and the resulting refinement). This online process lets the model generate its own improved examples to learn from, rather than relying solely on expert demonstrations (as in methods like LUFFY) or static refined data (as in offline supervised approaches).

The framework was evaluated with Qwen2.5-7B-Base and Qwen3-8B-Base across eight challenging mathematical, scientific, and general reasoning benchmarks (including MATH, Minerva-Math, OlympiadBench, TheoremQA, GPQA-Diamond, MMLU-Pro, AIME 2024, and AMC 2023). Critique-GRPO consistently and significantly outperformed supervised learning-based and RL-based finetuning methods, improving average pass@1 scores by approximately 4.5% (Qwen2.5-7B-Base) and 5% (Qwen3-8B-Base), and it also surpassed LUFFY, a strong baseline that incorporates expert demonstrations within online RL. Using richer CoT critiques was shown to be more effective than simpler critiques containing only the ground truth.

Practical implementation involves integrating a critique generation model (like GPT-4o) into the RL loop, which adds computational cost compared to methods using only rule-based or simpler reward models. The efficiency of the critique model and the ability to generate high-quality critiques are critical considerations for scaling this approach. The paper suggests that the model's ability to learn from its own critique-guided refinements is a powerful mechanism for improving reasoning and generalization.
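
For readers estimating this integration effort, the sketch below shows what one critique-in-the-loop rollout step might look like in Python. The callable signatures, the refinement prompt template, and the keep-the-best selection heuristic are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of one critique-in-the-loop rollout step. `policy`, `critic`,
# and `score` are illustrative stand-ins (e.g. wrappers around an LLM serving API
# and a rule-based answer checker), not the paper's actual components.
from typing import Callable, List, Tuple


def critique_grpo_rollout(
    policy: Callable[[str], str],        # prompt -> sampled response
    critic: Callable[[str, str], str],   # (question, response) -> CoT critique text
    score: Callable[[str, str], float],  # (question, response) -> scalar reward, e.g. 0/1
    question: str,
    num_samples: int = 8,
    num_refinements: int = 4,
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    # Step 1: sample initial responses with the current policy and critique them.
    initial = [policy(question) for _ in range(num_samples)]
    critiques = [critic(question, y) for y in initial]
    rewards = [score(question, y) for y in initial]

    # Step 2: refine each response conditioned on (question, response, critique),
    # score the refinements, and keep a subset, prioritizing higher-reward ones.
    refine_template = (
        "Question: {q}\n\nPrevious attempt: {y}\n\nCritique: {c}\n\n"
        "Revise the attempt, fixing the issues identified in the critique."
    )
    refined = [
        policy(refine_template.format(q=question, y=y, c=c))
        for y, c in zip(initial, critiques)
    ]
    refined_rewards = [score(question, r) for r in refined]
    keep = sorted(range(len(refined)), key=lambda j: -refined_rewards[j])[:num_refinements]

    # Step 3 consumes both groups: initial responses enter the GRPO term as usual,
    # and the selected refinements enter the shaped term of the objective.
    return (
        list(zip(initial, rewards)),
        [(refined[j], refined_rewards[j]) for j in keep],
    )
```

Each rollout therefore costs one critique call and one extra generation per sampled response, which is where most of the added overhead relative to purely rule-based reward pipelines comes from.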

Further analysis yielded two insights into policy exploration. First, higher policy entropy alone does not guarantee effective learning; what matters more is the quality and relevance of the explored states, such as those reached through critique-guided refinements. Second, longer responses do not necessarily lead to better performance: Critique-GRPO achieved superior results while producing more concise reasoning than baselines that generated longer outputs by imitating expert demonstrations.
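
One practical way to probe the first observation is to log average token-level policy entropy alongside accuracy during training. Below is a minimal PyTorch sketch of such a diagnostic, assuming access to the policy's logits and a mask over response tokens; the function name and tensor layout are assumptions for illustration.

```python
import torch


def mean_token_entropy(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Average per-token entropy of the policy over response tokens.

    logits:        [batch, seq_len, vocab] raw policy logits for sampled rollouts.
    response_mask: [batch, seq_len] with 1.0 on response tokens, 0.0 on prompt/padding.
    """
    log_probs = torch.log_softmax(logits.float(), dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # [batch, seq_len]
    return (token_entropy * response_mask).sum() / response_mask.sum()
```

Tracking this quantity against pass@1 over training makes the distinction visible: entropy can stay high or even rise while accuracy plateaus if the additional exploration does not reach useful states.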

In summary, Critique-GRPO offers a practical method to leverage the complementary strengths of numerical and natural language feedback within an online RL framework to effectively enhance LLM reasoning capabilities, particularly addressing the limitations of pure numerical reward signals.

Authors (7)
  1. Xiaoying Zhang (32 papers)
  2. Hao Sun (383 papers)
  3. Yipeng Zhang (42 papers)
  4. Kaituo Feng (14 papers)
  5. Chao Yang (333 papers)
  6. Helen Meng (204 papers)
  7. Chaochao Lu (39 papers)