Self-rewarding correction for mathematical reasoning (2502.19613v1)

Published 26 Feb 2025 in cs.AI and cs.LG

Abstract: We study self-rewarding reasoning LLMs, which can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs during the inference time, without external feedback. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment. We particularly focus on the representative task of self-correction, where models autonomously detect errors in their responses, revise outputs, and decide when to terminate iterative refinement loops. To enable this, we propose a two-staged algorithmic framework for constructing self-rewarding reasoning models using only self-generated data. In the first stage, we employ sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms. Fine-tuning models on these curated data allows them to learn the patterns of self-rewarding and self-correction. In the second stage, we further enhance the models' ability to assess response accuracy and refine outputs through reinforcement learning with rule-based signals. Experiments with Llama-3 and Qwen-2.5 demonstrate that our approach surpasses intrinsic self-correction capabilities and achieves performance comparable to systems that rely on external reward models.

The paper introduces a novel self-rewarding reasoning framework for LLMs that integrates generation and evaluation into a single model. This framework aims to enhance the self-correction capabilities of LLMs in mathematical reasoning tasks, reducing computational overhead compared to approaches relying on external reward models.

The key contributions of the paper are:

  • A self-rewarding reasoning framework integrating the generator and reward model into a single LLM, enabling autonomous reasoning, evaluation, and correction.
  • A two-stage algorithmic framework for self-correction in mathematical reasoning, relying only on self-generated data. The first stage uses sequential rejection sampling to construct long chain-of-thought (CoT) trajectories encoding self-rewarding and self-correction behaviors. The second stage enhances these behaviors through reinforcement learning with rule-based signals.
  • Empirical validation demonstrating that self-rewarding correction significantly outperforms intrinsic self-correction.

The self-rewarding reasoning process is formulated as a multi-turn Markov Decision Process (MDP). Given a prompt $s_1 = x \in \mathcal{X}$ drawn from a distribution $\mathcal{D}_0$, the LLM $\pi$ generates an initial reasoning attempt $a_1 \sim \pi_1(\cdot \mid s_1)$. It then self-rewards its response by generating an evaluation $y_1 \sim \pi_1(\cdot \mid s_1, a_1)$. If the model assesses its answer as correct ($y_1 =$ [VERIFY] correct), the generation stops. Otherwise, the LLM generates a refined response and evaluation $(a_2, y_2) \sim \pi_2(\cdot \mid s_2)$, conditioned on the updated state $s_2 = (s_1, a_1, y_1)$. The self-refinement continues until the model produces a self-evaluation $y_h$ assessing the answer as correct.

y_1 \sim \pi_1(\cdot \mid s_1, a_1)

s_2 = (s_1, a_1, y_1)

where:

  • $s_1$ is the initial prompt.
  • $a_1$ is the initial reasoning attempt.
  • $\pi_1$ is the LLM.
  • $y_1$ is the self-rewarding evaluation.
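
To make the loop concrete, here is a minimal Python sketch of this inference-time procedure; the `generate` helper, the string matching on the [VERIFY] token, and the turn cap are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of the self-rewarding inference loop described above.
# `generate(context)` is a hypothetical helper that calls the fine-tuned LLM
# and returns the next segment (attempt or evaluation) as text.

def self_rewarding_inference(generate, prompt: str, max_turns: int = 3) -> str:
    context = prompt                      # s_1 = x
    answer = generate(context)            # a_1 ~ pi_1(. | s_1)
    for _ in range(max_turns):
        context += answer                 # append the latest attempt
        verdict = generate(context)       # y_h ~ pi(. | s_h, a_h)
        context += verdict
        if "[VERIFY] correct" in verdict:
            break                         # model judges its own answer correct
        answer = generate(context)        # refined attempt from the updated state
    return answer
```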

The two-stage training framework consists of:

  1. Self-rewarding instruction-following fine-tuning (IFT): An initial LLM $\pi_0$ is fine-tuned using demonstration data collected via sequential rejection sampling, resulting in an improved model $\pi_{\text{ref}}$ integrating self-rewarding reasoning abilities.
  2. Reinforcement learning (RL) optimization: $\pi_{\text{ref}}$ is further refined using RL, leveraging it as the reference model. This stage enhances the model's ability to assess correctness and refine responses.

The self-rewarding signal is learned via token prediction: models include reasoning in their evaluations and output specific tokens, such as "[VERIFY] correct" and "[VERIFY] wrong", to indicate the evaluation result. Data collection uses a rejection sampling approach, generating self-correction trajectories and keeping only the desired ones. The process includes generating initial reasoning responses, sampling self-rewarding signals, and sampling corrections. The LLMs are fine-tuned using a standard SFT pipeline to maximize:

\mathbb{E}_{\mathcal{D}_{\text{IFT}}} \big[ \log P(y_1 \mid x, a_1) + \log P(a_2 \mid x, a_1, y_1) \big] + \mathbb{E}_{\mathcal{D}_{\text{IFT}}} \big[ \log P(a_2 \mid x, a_1, y_1) \big] + \mathbb{E}_{\mathcal{D}_{\text{IFT}}} \big[ \log P(y_1 \mid x, a_1) \big]

where:

  • $x$ is the initial prompt.
  • $a_1$ is the initial reasoning attempt.
  • $y_1$ is the self-rewarding evaluation of the first turn.
  • $a_2$ is the revised reasoning attempt.
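
As a rough illustration of the sequential rejection sampling used to build these IFT trajectories, the sketch below keeps only trajectories matching the desired patterns (a correct first attempt, or a wrong first attempt followed by a correct revision); `sample_response` and `is_correct` are hypothetical stand-ins for the model sampler and a ground-truth answer checker.

```python
# Sketch of sequential rejection sampling for self-rewarding IFT data.
# `sample_response(context)` and `is_correct(prompt, answer)` are assumed
# helpers; the paper's actual collection pipeline may differ in details.

def collect_ift_trajectory(sample_response, is_correct, prompt, max_tries=8):
    for _ in range(max_tries):
        a1 = sample_response(prompt)                       # initial attempt
        if is_correct(prompt, a1):
            # Correct first attempt: label it and terminate the trajectory.
            return {"x": prompt, "a1": a1, "y1": "[VERIFY] correct"}
        # Wrong first attempt: keep it only if a correct revision is found.
        for _ in range(max_tries):
            a2 = sample_response(prompt + a1 + "[VERIFY] wrong")
            if is_correct(prompt, a2):
                return {"x": prompt, "a1": a1,
                        "y1": "[VERIFY] wrong", "a2": a2}
    return None  # rejected: no trajectory with the desired pattern was found
```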

For the RL stage, the paper considers both deep RL methods and direct alignment algorithms. A trajectory-wise reward function $u^*(\tau)$ is used for a trajectory $\tau = (x, a_1, y_1, \dots, a_H, y_H)$, where $H$ is the horizon. The oracle reward $u^*(\tau) = r^*(x, a_H)$ is used, where $r^*$ is the ground-truth verifier. The KL-regularized objective is:

\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D}_0,\, a_1 \sim \pi_0(\cdot \mid x)}\; \mathbb{E}_{\tau \sim \pi(\cdot \mid x, a_1)} \left[ r^*(\tau) - \eta \sum_{h=1}^{H} D_{KL}\big(\pi_h(\cdot \mid s_h),\, \pi_{\text{ref}}(\cdot \mid s_h)\big) \right]

where:

  • $\pi$ is the policy being optimized.
  • $\pi_0$ is the initial LLM.
  • $\pi_{\text{ref}}$ is the reference model.
  • $r^*(\tau)$ is the trajectory-wise reward.
  • $D_{KL}$ is the Kullback-Leibler divergence.
  • $\eta$ is a regularization coefficient.
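
As a simple illustration, a trajectory-level reward combining the rule-based verifier with the KL penalty above could be computed as follows; the exact-match check and the sampled-token KL estimate are simplifying assumptions, not the paper's implementation.

```python
# Sketch of a KL-regularized, rule-based trajectory reward.
# `policy_logps` and `ref_logps` are per-token log-probabilities of the
# sampled trajectory under the current policy and the reference model.

def kl_regularized_reward(final_answer, gold_answer,
                          policy_logps, ref_logps, eta=0.05):
    # Rule-based verifier r*: 1 if the final answer matches the ground truth.
    rule_reward = 1.0 if final_answer.strip() == gold_answer.strip() else 0.0
    # Monte-Carlo estimate of the sequence-level KL from the sampled tokens.
    kl_estimate = sum(p - q for p, q in zip(policy_logps, ref_logps))
    return rule_reward - eta * kl_estimate
```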

The paper also adopts Direct Preference Optimization (DPO) to optimize this objective, using the multi-turn DPO (M-DPO) framework. The loss function $C_{\text{M-DPO}}(\theta)$ is:

- \mathbb{E}_{(\tau^w, \tau^l) \sim \mathcal{D}} \log \sigma \Big( \eta \big[ \log \frac{T_\theta(y_1^w \mid x, a_1)}{T_{\text{ref}}(y_1^w \mid x, a_1)} - \log \frac{T_\theta(y_1^l \mid x, a_1)}{T_{\text{ref}}(y_1^l \mid x, a_1)} + \sum_{h=1}^{H} \log \frac{T_\theta(a_h^w, y_h^w \mid s_h)}{T_{\text{ref}}(a_h^w, y_h^w \mid s_h)} - \log \frac{T_\theta(a_h^l, y_h^l \mid s_h)}{T_{\text{ref}}(a_h^l, y_h^l \mid s_h)} \big] \Big)

where:

  • $\tau^w$ is the winning trajectory.
  • $\tau^l$ is the losing trajectory.
  • $T_\theta$ is the policy being optimized.
  • $T_{\text{ref}}$ is the reference policy.
  • $\sigma$ is the sigmoid function.
  • $\eta$ is a regularization coefficient.
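
The following PyTorch sketch shows one way this loss can be evaluated, assuming the log-probabilities of each trajectory (already summed over its turn-level terms) have been precomputed for both the policy and the reference model; the tensor names and shapes are assumptions for illustration.

```python
import torch.nn.functional as F

# Sketch of the M-DPO loss above. Each argument is a [batch]-shaped tensor of
# trajectory log-probabilities, summed over the turn-level terms in the loss,
# for the winning (w) and losing (l) trajectories.

def m_dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, eta=0.1):
    # Bradley-Terry style margin between winner and loser log-ratios.
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(eta * margin).mean()
```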

The models are evaluated on mathematical reasoning abilities using benchmarks including MATH500, OlympiadBench, and Minerva Math. Evaluation metrics include turn 1 accuracy, final accuracy, improvement in accuracy from the first attempt to the final answer ($\Delta(t_1, t_2)$), the fraction of problems changed from incorrect to correct ($\Delta_{i \rightarrow c}(t_1, t_2)$), and the fraction of problems changed from correct to incorrect ($\Delta_{c \rightarrow i}(t_1, t_2)$).
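
A small sketch of how these metrics can be computed from per-problem correctness flags, assuming two boolean lists marking whether the first attempt and the final answer are correct:

```python
# Sketch of the self-correction metrics, given per-problem correctness flags
# for the first attempt (t1) and the final answer (t2).

def correction_metrics(correct_t1, correct_t2):
    n = len(correct_t1)
    acc_t1 = sum(correct_t1) / n
    acc_t2 = sum(correct_t2) / n
    pairs = list(zip(correct_t1, correct_t2))
    delta_i_to_c = sum(1 for c1, c2 in pairs if not c1 and c2) / n
    delta_c_to_i = sum(1 for c1, c2 in pairs if c1 and not c2) / n
    return {"turn1_acc": acc_t1, "final_acc": acc_t2,
            "delta": acc_t2 - acc_t1,
            "delta_i_to_c": delta_i_to_c, "delta_c_to_i": delta_c_to_i}
```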

The main results demonstrate that intrinsic self-correction with prompting generally fails, while self-rewarding reasoning models significantly outperform existing baselines. For instance, on MATH500, self-rewarding IFT achieves $\Delta_{i \rightarrow c} = 5.0\%$ and $\Delta_{c \rightarrow i} = 0.4\%$. Self-rewarding reasoning models also improve final accuracy compared to single-turn baselines. The paper finds that deep RL algorithms outperform direct alignment algorithms.

Further experiments were performed using a simplified two-turn conversation framework and Llama models, confirming the generality of the proposed framework. Ablation studies on data distribution show that the data composition in self-rewarding IFT influences the accuracy of the outcome-supervised reward model (ORM). The paper also investigates additional rule designs in RL training.

The paper concludes by highlighting the effectiveness of the self-rewarding reasoning framework in enhancing self-correction capabilities and computational efficiency. Future research directions include addressing the lower reward model accuracy compared to external ORMs, incorporating multi-turn RL methods, and extending the framework to step-wise correction.

Authors (6)
  1. Wei Xiong
  2. Hanning Zhang
  3. Chenlu Ye
  4. Lichang Chen
  5. Nan Jiang
  6. Tong Zhang