- The paper proposes a self-rewarding reasoning framework integrating LLM generation and evaluation for autonomous mathematical self-correction.
- A two-stage algorithm combines self-rewarding instruction-following fine-tuning and reinforcement learning, using only self-generated data.
- Empirical results demonstrate the framework significantly outperforms intrinsic self-correction, improving accuracy on mathematical reasoning benchmarks.
The paper introduces a novel self-rewarding reasoning framework for LLMs that integrates generation and evaluation into a single model. This framework aims to enhance the self-correction capabilities of LLMs in mathematical reasoning tasks, reducing computational overhead compared to approaches relying on external reward models.
The key contributions of the paper are:
- A self-rewarding reasoning framework integrating the generator and reward model into a single LLM, enabling autonomous reasoning, evaluation, and correction.
- A two-stage algorithmic framework for self-correction in mathematical reasoning, relying only on self-generated data. The first stage uses sequential rejection sampling to construct long chain-of-thought (CoT) trajectories encoding self-rewarding and self-correction behaviors. The second stage enhances these behaviors through reinforcement learning with rule-based signals.
- Empirical validation demonstrating that self-rewarding correction significantly outperforms intrinsic self-correction.
The self-rewarding reasoning process is formulated as a multi-turn Markov Decision Process (MDP). Given a prompt s1 = x ∈ X drawn from a distribution D0, the LLM policy π generates an initial reasoning attempt a1 ∼ π(⋅∣s1). It then self-rewards its response by generating an evaluation y1 ∼ π(⋅∣s1, a1). If the model assesses its answer as correct (y1 = [VERIFY] correct), generation stops. Otherwise, the LLM generates a refined response and evaluation (a2, y2) ∼ π(⋅∣s2), conditioned on the updated state s2 = (s1, a1, y1). Self-refinement continues until the model produces a self-evaluation yh assessing the answer as correct.
y1 ∼ π(⋅∣s1, a1)
s2 = (s1, a1, y1)
where:
- s1 is the initial prompt.
- a1 is the initial reasoning attempt.
- π is the LLM policy.
- y1 is the self-rewarding evaluation.
The two-stage training framework consists of:
- Self-rewarding instruction-following fine-tuning (IFT): An initial LLM π0 is fine-tuned using demonstration data collected via sequential rejection sampling, resulting in an improved model πref integrating self-rewarding reasoning abilities.
- Reinforcement learning (RL) optimization: πref is further refined with RL, serving as the reference model for KL regularization. This stage enhances the model's ability to assess correctness and refine responses.
The self-rewarding signal is learned via next-token prediction: models include reasoning in their evaluations and output designated tokens, such as "[VERIFY] correct" and "[VERIFY] wrong", to indicate the evaluation result. Data collection uses rejection sampling: self-correction trajectories are generated and only those exhibiting the desired behaviors are retained. The process comprises generating initial reasoning responses, sampling self-rewarding signals, and sampling corrections. The LLMs are then fine-tuned with a standard SFT pipeline to maximize:
$$\mathbb{E}_{\mathcal{D}_{\mathrm{IFT}}}\big[\log P(y_1 \mid x, a_1) + \log P(a_2 \mid x, a_1, y_1)\big]$$
where:
- x is the initial prompt.
- a1 is the initial reasoning attempt.
- y1 is the self-rewarding evaluation of the first turn.
- a2 is the revised reasoning attempt.
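In practice this objective is standard SFT with a loss mask: only the self-evaluation (y1) and revision (a2) tokens contribute, while prompt and first-attempt tokens are masked out. A toy sketch, assuming per-token log-probs are already available (the values and mask layout are illustrative, not from the paper):

```python
# Toy sketch of the masked IFT objective: maximize
# log P(y1 | x, a1) + log P(a2 | x, a1, y1), i.e. minimize the negative
# log-likelihood over evaluation/revision tokens only.

def ift_loss(token_logprobs, loss_mask):
    """Negative mean log-likelihood over unmasked (trainable) positions."""
    assert len(token_logprobs) == len(loss_mask)
    kept = [lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return -sum(kept) / len(kept)

# sequence layout: [x tokens..., a1 tokens..., y1 tokens..., a2 tokens...]
logps = [-0.1, -0.2, -0.5, -0.3, -0.4, -0.6]
mask  = [0,     0,    1,    0,    1,    1]   # train on y1 and a2 positions only
loss = ift_loss(logps, mask)  # -(-0.5 - 0.4 - 0.6) / 3 = 0.5
```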
For the RL stage, the paper considers both deep RL methods and direct alignment algorithms. A trajectory-wise reward function u∗(τ) is used for trajectory τ=(x,a1,y1,…,aH,yH), where H is the horizon. The oracle reward u∗(τ)=r∗(x,aH) is used, where r∗ is the ground-truth verifier. The KL-regularized objective is:
$$\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D}_0,\, a_1 \sim \pi_0(\cdot \mid x)}\, \mathbb{E}_{\tau \sim \pi(\cdot \mid x, a_1)}\Big[ u^*(\tau) - \eta \sum_{h=1}^{H} D_{\mathrm{KL}}\big(\pi_h(\cdot \mid s_h)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid s_h)\big) \Big]$$
where:
- π is the policy being optimized.
- π0 is the initial LLM.
- πref is the reference model.
- u∗(τ) is the trajectory-wise oracle reward.
- DKL is the Kullback-Leibler divergence.
- η is a regularization coefficient.
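The objective above trades the oracle reward against per-step divergence from the reference policy. A minimal numeric sketch, assuming discrete per-step action distributions for illustration (eta and the distributions are toy values, not from the paper):

```python
import math

# Sketch of the trajectory-level KL-regularized objective:
# u*(tau) minus eta times the summed per-step KL to the reference policy.

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as dicts over actions."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)

def regularized_return(oracle_reward, policy_steps, ref_steps, eta=0.1):
    """Regularized return of one trajectory; each element of policy_steps /
    ref_steps is the action distribution pi_h(.|s_h) / pi_ref(.|s_h)."""
    kl_total = sum(kl_divergence(p, q) for p, q in zip(policy_steps, ref_steps))
    return oracle_reward - eta * kl_total
```

When the policy matches the reference at every step, the KL term vanishes and the return equals the oracle reward; drifting from πref is penalized in proportion to η.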
The paper also adopts Direct Preference Optimization (DPO) to optimize this objective, using the multi-turn DPO (M-DPO) framework. The loss function L_M-DPO(θ) is:

$$-\mathbb{E}_{(\tau^w, \tau^l) \sim \mathcal{D}}\, \log \sigma\!\left(\eta\left[\log \frac{\pi_\theta(y_1^w \mid x, a_1)}{\pi_{\mathrm{ref}}(y_1^w \mid x, a_1)} - \log \frac{\pi_\theta(y_1^l \mid x, a_1)}{\pi_{\mathrm{ref}}(y_1^l \mid x, a_1)} + \sum_{h=2}^{H}\left(\log \frac{\pi_\theta(a_h^w, y_h^w \mid s_h^w)}{\pi_{\mathrm{ref}}(a_h^w, y_h^w \mid s_h^w)} - \log \frac{\pi_\theta(a_h^l, y_h^l \mid s_h^l)}{\pi_{\mathrm{ref}}(a_h^l, y_h^l \mid s_h^l)}\right)\right]\right)$$

where:
- τw is the winning trajectory.
- τl is the losing trajectory.
- πθ is the policy being optimized.
- πref is the reference policy.
- σ is the sigmoid function.
- η is a regularization coefficient.
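For a single preference pair, the loss reduces to a logistic loss on the difference of policy-to-reference log-ratios between the winning and losing trajectories. A toy sketch, assuming the summed log-probs of each trajectory's model-generated tokens are already computed (values are illustrative):

```python
import math

# Sketch of the M-DPO loss for one (winning, losing) trajectory pair:
# -log sigma(eta * [(logp_theta - logp_ref)_win - (logp_theta - logp_ref)_lose]).

def m_dpo_loss(logp_w_theta, logp_w_ref, logp_l_theta, logp_l_ref, eta=0.1):
    """Per-pair M-DPO loss given summed trajectory log-probs under the
    policy (theta) and the reference model."""
    margin = (logp_w_theta - logp_w_ref) - (logp_l_theta - logp_l_ref)
    return -math.log(1.0 / (1.0 + math.exp(-eta * margin)))
```

When policy and reference agree, the margin is zero and the loss is log 2; raising the winning trajectory's log-ratio relative to the losing one drives the loss toward zero.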
The models are evaluated on mathematical reasoning abilities using benchmarks including MATH500, OlympiadBench, and Minerva Math. Evaluation metrics include turn 1 accuracy, final accuracy, improvement in accuracy from the first attempt to the final answer (Δ(t1,t2)), fraction of problems changed from incorrect to correct (Δi→c(t1,t2)), and fraction of problems changed from correct to incorrect (Δc→i(t1,t2)).
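Given per-problem correctness flags after turn 1 and after the final turn, these metrics are straightforward to compute. A minimal sketch (names are illustrative, not from the paper):

```python
# Self-correction metrics: turn 1 accuracy, final accuracy, net improvement
# delta(t1, t2), and the incorrect->correct / correct->incorrect fractions.

def correction_metrics(first_correct, final_correct):
    """first_correct / final_correct: per-problem booleans for the first
    attempt and the final answer."""
    n = len(first_correct)
    t1 = sum(first_correct) / n
    t2 = sum(final_correct) / n
    i2c = sum(1 for f, l in zip(first_correct, final_correct) if not f and l) / n
    c2i = sum(1 for f, l in zip(first_correct, final_correct) if f and not l) / n
    return {"turn1_acc": t1, "final_acc": t2,
            "delta": t2 - t1, "delta_i_to_c": i2c, "delta_c_to_i": c2i}
```

A desirable self-correcting model has a large delta_i_to_c and a near-zero delta_c_to_i, i.e. it fixes wrong answers without breaking right ones.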
The main results demonstrate that intrinsic self-correction with prompting generally fails, while self-rewarding reasoning models significantly outperform existing baselines. For instance, on MATH500, self-rewarding IFT achieves Δi→c = 5.0% and Δc→i = 0.4%. Self-rewarding reasoning models also improve final accuracy compared to single-turn baselines. The paper finds that deep RL algorithms outperform direct alignment algorithms.
Further experiments were performed using a simplified two-turn conversation framework and Llama models, confirming the generality of the proposed framework. Ablation studies on data distribution show that the data composition in self-rewarding IFT influences the outcome supervised reward model (ORM) accuracy. The paper also investigates additional rule designs in RL training.
The paper concludes by highlighting the effectiveness of the self-rewarding reasoning framework in enhancing self-correction capabilities and computational efficiency. Future research directions include addressing the lower reward model accuracy compared to external ORMs, incorporating multi-turn RL methods, and extending the framework to step-wise correction.