- The paper proposes a self-rewarding reasoning framework integrating LLM generation and evaluation for autonomous mathematical self-correction.
- A two-stage algorithm combines self-rewarding instruction-following fine-tuning and reinforcement learning, using only self-generated data.
- Empirical results demonstrate the framework significantly outperforms intrinsic self-correction, improving accuracy on mathematical reasoning benchmarks.
The paper introduces a novel self-rewarding reasoning framework for LLMs that integrates generation and evaluation into a single model. This framework aims to enhance the self-correction capabilities of LLMs in mathematical reasoning tasks, reducing computational overhead compared to approaches relying on external reward models.
The key contributions of the paper are:
- A self-rewarding reasoning framework integrating the generator and reward model into a single LLM, enabling autonomous reasoning, evaluation, and correction.
- A two-stage algorithmic framework for self-correction in mathematical reasoning, relying only on self-generated data. The first stage uses sequential rejection sampling to construct long chain-of-thought (CoT) trajectories encoding self-rewarding and self-correction behaviors. The second stage enhances these behaviors through reinforcement learning with rule-based signals.
- Empirical validation demonstrating that self-rewarding correction significantly outperforms intrinsic self-correction.
The self-rewarding reasoning process is formulated as a multi-turn Markov Decision Process (MDP). Given a prompt s1 = x ∈ X drawn from a distribution D0, the LLM policy π generates an initial reasoning attempt a1 ∼ π(⋅∣s1). It then self-rewards its response by generating an evaluation y1 ∼ π(⋅∣s1, a1). If the model assesses its answer as correct (y1 = [VERIFY] correct), generation stops. Otherwise, the LLM generates a refined response and evaluation (a2, y2) ∼ π(⋅∣s2), conditioned on the updated state s2 = (s1, a1, y1). Self-refinement continues until the model produces a self-evaluation yh assessing the answer as correct.
y1 ∼ π(⋅∣s1, a1)
s2 = (s1, a1, y1)
where:
- s1 is the initial prompt.
- a1 is the initial reasoning attempt.
- π is the LLM policy.
- y1 is the self-rewarding evaluation.
The two-stage training framework consists of:
- Self-rewarding instruction-following fine-tuning (IFT): An initial LLM π0 is fine-tuned using demonstration data collected via sequential rejection sampling, resulting in an improved model πref integrating self-rewarding reasoning abilities.
- Reinforcement learning (RL) optimization: πref is further refined with RL, serving as the reference model for KL regularization. This stage enhances the model's ability to assess correctness and refine responses.
The self-rewarding signal is learned via next-token prediction: models include reasoning in their evaluations and output designated tokens, such as "[VERIFY] correct" and "[VERIFY] wrong", to indicate the evaluation result. Data collection uses rejection sampling: self-correction trajectories are generated and only those exhibiting the desired behaviors are retained. The process comprises generating initial reasoning responses, sampling self-rewarding signals, and sampling corrections. The LLMs are then fine-tuned with a standard SFT pipeline to maximize:
$$\mathbb{E}_{\mathcal{D}_{\mathrm{IFT}}}\big[\log P(y_1 \mid x, a_1) + \log P(a_2 \mid x, a_1, y_1)\big]$$
where:
- x is the initial prompt.
- a1 is the initial reasoning attempt.
- y1 is the self-rewarding evaluation of the first turn.
- a2 is the revised reasoning attempt.
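In practice this objective is standard SFT with a loss mask: only the self-evaluation (y1) and revision (a2) tokens contribute, while prompt and first-attempt tokens are masked out. A toy sketch, assuming per-token log-probs are already available (the values and mask layout are illustrative, not from the paper):

```python
# Toy sketch of the masked IFT objective: maximize
# log P(y1 | x, a1) + log P(a2 | x, a1, y1), i.e. minimize the negative
# log-likelihood over evaluation/revision tokens only.

def ift_loss(token_logprobs, loss_mask):
    """Negative mean log-likelihood over unmasked (trainable) positions."""
    assert len(token_logprobs) == len(loss_mask)
    kept = [lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return -sum(kept) / len(kept)

# sequence layout: [x tokens..., a1 tokens..., y1 tokens..., a2 tokens...]
logps = [-0.1, -0.2, -0.5, -0.3, -0.4, -0.6]
mask  = [0,     0,    1,    0,    1,    1]   # train on y1 and a2 positions only
loss = ift_loss(logps, mask)  # -(-0.5 - 0.4 - 0.6) / 3 = 0.5
```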
For the RL stage, the paper considers both deep RL methods and direct alignment algorithms. A trajectory-wise reward function u∗(τ) is used for trajectory τ=(x,a1,y1,…,aH,yH), where H is the horizon. The oracle reward u∗(τ)=r∗(x,aH) is used, where r∗ is the ground-truth verifier. The KL-regularized objective is:
$$\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D}_0,\, a_1 \sim \pi_0(\cdot \mid x)}\, \mathbb{E}_{\tau \sim \pi(\cdot \mid x, a_1)}\Big[ u^*(\tau) - \eta \sum_{h=1}^{H} D_{\mathrm{KL}}\big(\pi_h(\cdot \mid s_h)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid s_h)\big) \Big]$$
where:
- π is the policy being optimized.
- π0 is the initial LLM.
- πref is the reference model.
- u∗(τ) is the trajectory-wise oracle reward.
- DKL is the Kullback-Leibler divergence.
- η is a regularization coefficient.
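The objective above trades the oracle reward against per-step divergence from the reference policy. A minimal numeric sketch, assuming discrete per-step action distributions for illustration (eta and the distributions are toy values, not from the paper):

```python
import math

# Sketch of the trajectory-level KL-regularized objective:
# u*(tau) minus eta times the summed per-step KL to the reference policy.

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as dicts over actions."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)

def regularized_return(oracle_reward, policy_steps, ref_steps, eta=0.1):
    """Regularized return of one trajectory; each element of policy_steps /
    ref_steps is the action distribution pi_h(.|s_h) / pi_ref(.|s_h)."""
    kl_total = sum(kl_divergence(p, q) for p, q in zip(policy_steps, ref_steps))
    return oracle_reward - eta * kl_total
```

When the policy matches the reference at every step, the KL term vanishes and the return equals the oracle reward; drifting from πref is penalized in proportion to η.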
The paper also adopts Direct Preference Optimization (DPO) to optimize this objective, using the multi-turn DPO (M-DPO) framework. The loss function L_M-DPO(θ) is:

$$-\mathbb{E}_{(\tau^w, \tau^l) \sim \mathcal{D}}\, \log \sigma\!\left(\eta\left[\log \frac{\pi_\theta(y_1^w \mid x, a_1)}{\pi_{\mathrm{ref}}(y_1^w \mid x, a_1)} - \log \frac{\pi_\theta(y_1^l \mid x, a_1)}{\pi_{\mathrm{ref}}(y_1^l \mid x, a_1)} + \sum_{h=2}^{H}\left(\log \frac{\pi_\theta(a_h^w, y_h^w \mid s_h^w)}{\pi_{\mathrm{ref}}(a_h^w, y_h^w \mid s_h^w)} - \log \frac{\pi_\theta(a_h^l, y_h^l \mid s_h^l)}{\pi_{\mathrm{ref}}(a_h^l, y_h^l \mid s_h^l)}\right)\right]\right)$$

where:
- τw is the winning trajectory.
- τl is the losing trajectory.
- πθ is the policy being optimized.
- πref is the reference policy.
- σ is the sigmoid function.
- η is a regularization coefficient.
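For a single preference pair, the loss reduces to a logistic loss on the difference of policy-to-reference log-ratios between the winning and losing trajectories. A toy sketch, assuming the summed log-probs of each trajectory's model-generated tokens are already computed (values are illustrative):

```python
import math

# Sketch of the M-DPO loss for one (winning, losing) trajectory pair:
# -log sigma(eta * [(logp_theta - logp_ref)_win - (logp_theta - logp_ref)_lose]).

def m_dpo_loss(logp_w_theta, logp_w_ref, logp_l_theta, logp_l_ref, eta=0.1):
    """Per-pair M-DPO loss given summed trajectory log-probs under the
    policy (theta) and the reference model."""
    margin = (logp_w_theta - logp_w_ref) - (logp_l_theta - logp_l_ref)
    return -math.log(1.0 / (1.0 + math.exp(-eta * margin)))
```

When policy and reference agree, the margin is zero and the loss is log 2; raising the winning trajectory's log-ratio relative to the losing one drives the loss toward zero.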
The models are evaluated on mathematical reasoning abilities using benchmarks including MATH500, OlympiadBench, and Minerva Math. Evaluation metrics include turn 1 accuracy, final accuracy, improvement in accuracy from the first attempt to the final answer (Δ(t1,t2)), fraction of problems changed from incorrect to correct (Δi→c(t1,t2)), and fraction of problems changed from correct to incorrect (Δc→i(t1,t2)).
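Given per-problem correctness flags after turn 1 and after the final turn, these metrics are straightforward to compute. A minimal sketch (names are illustrative, not from the paper):

```python
# Self-correction metrics: turn 1 accuracy, final accuracy, net improvement
# delta(t1, t2), and the incorrect->correct / correct->incorrect fractions.

def correction_metrics(first_correct, final_correct):
    """first_correct / final_correct: per-problem booleans for the first
    attempt and the final answer."""
    n = len(first_correct)
    t1 = sum(first_correct) / n
    t2 = sum(final_correct) / n
    i2c = sum(1 for f, l in zip(first_correct, final_correct) if not f and l) / n
    c2i = sum(1 for f, l in zip(first_correct, final_correct) if f and not l) / n
    return {"turn1_acc": t1, "final_acc": t2,
            "delta": t2 - t1, "delta_i_to_c": i2c, "delta_c_to_i": c2i}
```

A desirable self-correcting model has a large delta_i_to_c and a near-zero delta_c_to_i, i.e. it fixes wrong answers without breaking right ones.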
The main results demonstrate that intrinsic self-correction with prompting generally fails, while self-rewarding reasoning models significantly outperform existing baselines. For instance, on MATH500, self-rewarding IFT achieves Δi→c = 5.0% and Δc→i = 0.4%. Self-rewarding reasoning models also improve final accuracy compared to single-turn baselines. The paper finds that deep RL algorithms outperform direct alignment algorithms.
Further experiments were performed using a simplified two-turn conversation framework and Llama models, confirming the generality of the proposed framework. Ablation studies on data distribution show that the data composition in self-rewarding IFT influences the outcome supervised reward model (ORM) accuracy. The paper also investigates additional rule designs in RL training.
The paper concludes by highlighting the effectiveness of the self-rewarding reasoning framework in enhancing self-correction capabilities and computational efficiency. Future research directions include addressing the lower reward model accuracy compared to external ORMs, incorporating multi-turn RL methods, and extending the framework to step-wise correction.