The paper introduces a novel self-rewarding reasoning framework for LLMs that integrates generation and evaluation into a single model. This framework aims to enhance the self-correction capabilities of LLMs in mathematical reasoning tasks, reducing computational overhead compared to approaches relying on external reward models.
The key contributions of the paper are:
- A self-rewarding reasoning framework integrating the generator and reward model into a single LLM, enabling autonomous reasoning, evaluation, and correction.
- A two-stage algorithmic framework for self-correction in mathematical reasoning, relying only on self-generated data. The first stage uses sequential rejection sampling to construct long chain-of-thought (CoT) trajectories encoding self-rewarding and self-correction behaviors. The second stage enhances these behaviors through reinforcement learning with rule-based signals.
- Empirical validation demonstrating that self-rewarding correction significantly outperforms intrinsic self-correction.
The self-rewarding reasoning process is formulated as a multi-turn Markov Decision Process (MDP). Given a prompt $x \sim d_0$, the LLM $\pi_\theta$ generates an initial reasoning attempt $a^1 \sim \pi_\theta(\cdot \mid s^1)$ with initial state $s^1 = x$. It then self-rewards its response by generating an evaluation $y^1 \sim \pi_\theta(\cdot \mid s^1, a^1)$. If the model assesses its answer as correct ($y^1 =$ [VERIFY] correct), the generation stops. Otherwise, the LLM generates a refined response $a^2$ and evaluation $y^2$, conditioned on the updated state $s^2 = (s^1, a^1, y^1)$. The self-refinement continues until the model produces a self-evaluation assessing the answer as correct (a sketch of this loop follows the list below), yielding a trajectory

$$\tau = (x, a^1, y^1, a^2, y^2, \ldots), \qquad a^t \sim \pi_\theta(\cdot \mid s^t), \quad y^t \sim \pi_\theta(\cdot \mid s^t, a^t), \quad s^{t+1} = (s^t, a^t, y^t),$$

where:
- $x$ is the initial prompt, drawn from the prompt distribution $d_0$.
- $a^1$ is the initial reasoning attempt.
- $\pi_\theta$ is the LLM.
- $y^1$ is the self-rewarding evaluation.
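To make this loop concrete, here is a minimal Python sketch of the inference-time procedure described above. The "[VERIFY]" strings follow the paper's description; `generate` is a hypothetical stand-in for sampling from $\pi_\theta$, and the turn cap is an assumption for illustration.

```python
from typing import Callable, List, Tuple


def self_rewarding_loop(
    generate: Callable[[str], str],  # hypothetical stand-in: maps a context string to a sampled completion
    prompt: str,
    max_turns: int = 3,              # assumed cap on refinement turns
) -> Tuple[List[str], List[str]]:
    """Run the self-rewarding MDP: attempt -> self-evaluate -> (optionally) refine."""
    attempts, evaluations = [], []
    state = prompt                               # s^1 = x
    for _ in range(max_turns):
        attempt = generate(state)                # a^t ~ pi_theta(. | s^t)
        evaluation = generate(state + attempt)   # y^t ~ pi_theta(. | s^t, a^t)
        attempts.append(attempt)
        evaluations.append(evaluation)
        if "[VERIFY] correct" in evaluation:     # model judges its own answer as correct: stop
            break
        # Otherwise extend the state with the failed attempt and its evaluation: s^{t+1} = (s^t, a^t, y^t).
        state = state + attempt + evaluation
    return attempts, evaluations
```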
The two-stage training framework consists of:
- Self-rewarding instruction-following fine-tuning (IFT): An initial LLM $\pi_0$ is fine-tuned on demonstration data collected via sequential rejection sampling, resulting in an improved model $\pi_{\mathrm{ref}}$ that integrates self-rewarding reasoning abilities.
- Reinforcement learning (RL) optimization: $\pi_{\mathrm{ref}}$ is further refined with RL, serving as the reference model for this stage. This stage enhances the model's ability to assess correctness and refine responses.
The self-rewarding signal is trained via next-token prediction: models include reasoning in their evaluations and emit specific tokens, such as "[VERIFY] correct" and "[VERIFY] wrong", to indicate their evaluation results. Data collection uses a rejection sampling approach that generates self-correction trajectories and keeps only the desired ones. The process consists of generating initial reasoning responses, sampling self-rewarding signals, and sampling corrections. The LLMs are then fine-tuned with a standard SFT pipeline to maximize:
$$\mathcal{L}_{\mathrm{SFT}}(\theta) = \mathbb{E}_{(x,\, a^1,\, y^1,\, a^2) \sim \mathcal{D}}\big[\log \pi_\theta(a^1, y^1, a^2 \mid x)\big],$$

where:
- $x$ is the initial prompt.
- $a^1$ is the initial reasoning attempt.
- $y^1$ is the self-rewarding evaluation of the first turn.
- $a^2$ is the revised reasoning attempt.
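The sequential rejection sampling step for a self-correction trajectory can be pictured as follows. This is a sketch under assumed helpers (`sample_response`, `sample_evaluation`, and the ground-truth check `is_correct` are hypothetical stand-ins), not the paper's exact pipeline: it keeps only trajectories whose first attempt is wrong, whose self-evaluation agrees with the verifier, and whose correction is verified correct.

```python
import random
from typing import Callable, Optional


def collect_correction_trajectory(
    sample_response: Callable[[str], str],    # hypothetical: draws an attempt a^t from the current model
    sample_evaluation: Callable[[str], str],  # hypothetical: draws an evaluation y^t ending in a [VERIFY] token
    is_correct: Callable[[str], bool],        # ground-truth verifier r*(x, a)
    prompt: str,
    num_samples: int = 8,                     # assumed sampling budget per step
) -> Optional[str]:
    """Sequential rejection sampling for one self-correction trajectory; returns None if sampling fails."""
    # Step 1: keep sampling first attempts until we find an incorrect one to correct.
    wrong = next((a for a in (sample_response(prompt) for _ in range(num_samples)) if not is_correct(a)), None)
    if wrong is None:
        return None
    # Step 2: keep only self-evaluations whose verdict matches the verifier (here: "[VERIFY] wrong").
    honest_evals = [y for y in (sample_evaluation(prompt + wrong) for _ in range(num_samples)) if "[VERIFY] wrong" in y]
    if not honest_evals:
        return None
    evaluation = random.choice(honest_evals)
    # Step 3: sample corrections conditioned on the failed attempt and keep one the verifier accepts.
    fixed = next(
        (a for a in (sample_response(prompt + wrong + evaluation) for _ in range(num_samples)) if is_correct(a)),
        None,
    )
    if fixed is None:
        return None
    # Concatenate into one long CoT training example; SFT then maximizes its log-likelihood given the prompt.
    return prompt + wrong + evaluation + fixed + " [VERIFY] correct"
```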
For the RL stage, the paper considers both deep RL methods and direct alignment algorithms. A trajectory-wise reward function $u(\tau)$ is used for a trajectory $\tau = (x, a^1, y^1, \ldots, a^H, y^H)$, where $H$ is the horizon. The oracle reward $u^*(\tau) = r^*(x, a^H)$ is used, where $r^*$ is the ground-truth verifier. The KL-regularized objective is:

$$\max_{\pi} \; \mathbb{E}_{x \sim d_0,\, \tau \sim \pi(\cdot \mid x)}\big[u(\tau)\big] \;-\; \eta\, \mathbb{E}_{x \sim d_0}\, \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big),$$

where:
- $\pi$ is the policy being optimized.
- $\pi_0$ is the initial LLM.
- $\pi_{\mathrm{ref}}$ is the reference model (the self-rewarding IFT model).
- $u(\tau)$ is the trajectory-wise reward.
- $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler divergence.
- $\eta$ is a regularization coefficient.
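A rule-based oracle reward of this form reduces to checking the final attempt against the ground truth. A minimal sketch, assuming a hypothetical answer parser (`extract_final_answer`) in place of the paper's verifier:

```python
from typing import List, Tuple


def trajectory_reward(
    trajectory: List[Tuple[str, str]],                # [(a^1, y^1), ..., (a^H, y^H)] attempt/evaluation pairs
    gold_answer: str,
    extract_final_answer=lambda text: text.strip(),   # hypothetical parser standing in for the verifier's answer extraction
) -> float:
    """Oracle trajectory-wise reward u*(tau) = r*(x, a^H): 1 if the final attempt is correct, else 0."""
    final_attempt, _ = trajectory[-1]
    return 1.0 if extract_final_answer(final_attempt) == gold_answer else 0.0
```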
The paper also adopts Direct Preference Optimization (DPO) to solve this objective, using the multi-turn DPO (M-DPO) framework. The loss function is:

$$\mathcal{L}_{\mathrm{M\text{-}DPO}}(\theta) = -\,\mathbb{E}\left[\log \sigma\!\left(\eta \log \frac{\pi_\theta(\tau^w \mid x)}{\pi_{\mathrm{ref}}(\tau^w \mid x)} - \eta \log \frac{\pi_\theta(\tau^l \mid x)}{\pi_{\mathrm{ref}}(\tau^l \mid x)}\right)\right],$$

where:
- $\tau^w$ is the winning trajectory.
- $\tau^l$ is the losing trajectory.
- $\pi_\theta$ is the policy being optimized.
- $\pi_{\mathrm{ref}}$ is the reference policy.
- $\sigma$ is the sigmoid function.
- $\eta$ is a regularization coefficient.
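Assuming the per-trajectory log-probabilities have already been summed over the model-generated tokens of each trajectory, the loss above can be sketched in PyTorch as follows; the function and argument names are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F


def m_dpo_loss(
    policy_logp_w: torch.Tensor,  # sum of log pi_theta over tokens of the winning trajectory tau^w
    policy_logp_l: torch.Tensor,  # sum of log pi_theta over tokens of the losing trajectory tau^l
    ref_logp_w: torch.Tensor,     # same sums under the frozen reference policy pi_ref
    ref_logp_l: torch.Tensor,
    eta: float = 0.1,             # regularization coefficient
) -> torch.Tensor:
    """Trajectory-level DPO loss: -log sigmoid(eta * (winner log-ratio - loser log-ratio))."""
    ratio_w = policy_logp_w - ref_logp_w
    ratio_l = policy_logp_l - ref_logp_l
    return -F.logsigmoid(eta * (ratio_w - ratio_l)).mean()
```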
The models are evaluated on mathematical reasoning benchmarks including MATH500, OlympiadBench, and Minerva Math. Evaluation metrics include turn-1 accuracy, final accuracy, the improvement in accuracy from the first attempt to the final answer ($\Delta(t_1, t_2)$), the fraction of problems changed from incorrect to correct ($\Delta^{i \to c}(t_1, t_2)$), and the fraction of problems changed from correct to incorrect ($\Delta^{c \to i}(t_1, t_2)$).
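These metrics can be computed from per-problem correctness flags at turn 1 and at the final answer; the sketch below is an illustration of the definitions, not the paper's evaluation script.

```python
from typing import Dict, Sequence


def correction_metrics(turn1_correct: Sequence[bool], final_correct: Sequence[bool]) -> Dict[str, float]:
    """Compute turn-1/final accuracy and the correction deltas from per-problem correctness flags."""
    n = len(turn1_correct)
    acc_t1 = sum(turn1_correct) / n
    acc_final = sum(final_correct) / n
    i2c = sum((not c1) and c2 for c1, c2 in zip(turn1_correct, final_correct)) / n  # Delta^{i->c}(t1, t2)
    c2i = sum(c1 and (not c2) for c1, c2 in zip(turn1_correct, final_correct)) / n  # Delta^{c->i}(t1, t2)
    return {
        "turn1_accuracy": acc_t1,
        "final_accuracy": acc_final,
        "delta": acc_final - acc_t1,   # Delta(t1, t2) = Delta^{i->c} - Delta^{c->i}
        "delta_i_to_c": i2c,
        "delta_c_to_i": c2i,
    }
```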
The main results demonstrate that intrinsic self-correction with prompting generally fails, while self-rewarding reasoning models significantly outperform existing baselines. For instance, on MATH500, self-rewarding IFT converts 5.0% of problems from incorrect to correct ($\Delta^{i \to c}$) while flipping only 0.4% from correct to incorrect ($\Delta^{c \to i}$). Self-rewarding reasoning models also improve final accuracy compared to single-turn baselines. The paper finds that deep RL algorithms outperform direct alignment algorithms.
Further experiments using a simplified two-turn conversation framework and Llama models confirm the generality of the proposed framework. Ablation studies on data distribution show that the data composition in self-rewarding IFT influences the outcome-supervised reward model (ORM) accuracy. The paper also investigates additional rule designs in RL training.
The paper concludes by highlighting the effectiveness of the self-rewarding reasoning framework in enhancing self-correction capabilities and computational efficiency. Future research directions include addressing the lower reward model accuracy compared to external ORMs, incorporating multi-turn RL methods, and extending the framework to step-wise correction.