
Process-based Self-Rewarding Language Models (2503.03746v1)

Published 5 Mar 2025 in cs.CL and cs.AI

Abstract: LLMs have demonstrated outstanding performance across various downstream tasks and have been widely applied in multiple scenarios. Human-annotated preference data is used for training to further improve LLMs' performance, which is constrained by the upper limit of human performance. Therefore, Self-Rewarding method has been proposed, where LLMs generate training data by rewarding their own outputs. However, the existing self-rewarding paradigm is not effective in mathematical reasoning scenarios and may even lead to a decline in performance. In this work, we propose the Process-based Self-Rewarding pipeline for LLMs, which introduces long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference optimization within the self-rewarding paradigm. Our new paradigm successfully enhances the performance of LLMs on multiple mathematical reasoning benchmarks through iterative Process-based Self-Rewarding, demonstrating the immense potential of self-rewarding to achieve LLM reasoning that may surpass human capabilities.

Summary

Process-based Self-Rewarding LLMs

The paper "Process-based Self-Rewarding LLMs" introduces a paradigm designed to address the limitations of current Self-Rewarding LLMs (SRLMs) in handling mathematical reasoning tasks. While SRLMs self-generate training data by rewarding their outputs, they struggle to achieve satisfactory performance in complex domains, such as mathematical reasoning, where existing paradigms may inadvertently degrade performance due to inadequacy in generating fine-grained reward signals. This paper proposes a novel Process-based Self-Rewarding (PSR) framework that incorporates step-wise reasoning and preference optimization methods into traditional SRLMs, significantly enhancing LLMs' (LLMs) performance on multiple mathematical reasoning benchmarks.

The core idea of PSR is to integrate step-wise LLM-as-a-Judge functionality and iterative preference optimization within the self-rewarding framework. The model evaluates its own intermediate reasoning steps as they are generated, so learning focuses on the correctness of each step rather than only on the final output. Unlike traditional self-rewarding, where judgments are made over complete solutions, PSR compares candidate next steps directly, avoiding the difficulty of assigning a single score to a long, multi-step solution. This finer granularity matters because reaching the correct final answer does not guarantee that every intermediate step is sound.
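To make the pipeline concrete, the following is a minimal Python sketch of one Process-based Self-Rewarding iteration. It is written under assumptions about the interface: the helpers sample_candidate_steps, judge_step_pair, and step_dpo_update are hypothetical placeholders standing in for the model's step generation, its step-wise LLM-as-a-Judge role, and step-wise preference optimization, and do not reproduce the authors' implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class StepPreference:
    problem: str
    prefix: list    # reasoning steps accepted so far
    chosen: str     # step preferred by the step-wise judge
    rejected: str   # step the judge ranked lower

def sample_candidate_steps(model, problem, prefix, n=2):
    """Placeholder: sample n candidate next reasoning steps from the model."""
    return [f"candidate step {i} for: {problem}" for i in range(n)]

def judge_step_pair(model, problem, prefix, step_a, step_b):
    """Placeholder: the same model acts as a step-wise judge and returns
    the index (0 or 1) of the preferred candidate step."""
    return random.randint(0, 1)

def step_dpo_update(model, preferences):
    """Placeholder: step-wise preference optimization (e.g. a DPO-style
    update applied at the step level) on the collected pairs."""
    return model

def psr_iteration(model, problems, max_steps=8):
    """One Process-based Self-Rewarding cycle: generate solutions step by
    step, judge candidate steps pairwise, collect step-level preference
    pairs, then update the model on them."""
    preferences = []
    for problem in problems:
        prefix = []
        for _ in range(max_steps):
            a, b = sample_candidate_steps(model, problem, prefix, n=2)
            winner = judge_step_pair(model, problem, prefix, a, b)
            chosen, rejected = (a, b) if winner == 0 else (b, a)
            preferences.append(StepPreference(problem, list(prefix), chosen, rejected))
            prefix.append(chosen)  # continue the solution with the preferred step
    return step_dpo_update(model, preferences)

if __name__ == "__main__":
    model = object()  # stand-in for an actual LLM
    model = psr_iteration(model, ["What is 12 * 13?"])
```

In the paper the same model plays both roles, reasoner and step-wise judge, which is what the shared model argument above is meant to suggest; a real loop would also stop once the model emits a final answer, which this sketch omits for brevity.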

Key Findings and Implications

  1. Enhanced Mathematical Reasoning Capabilities: Experiments conducted on models of varying sizes (7B and 72B parameters) demonstrate that the PSR approach improves LLMs' performance on several challenging mathematical benchmarks, including GSM8k, MATH, and specialized competition datasets like AIME and AMC. The paper shows that LLMs iteratively trained using PSR display increased capabilities in both mathematical reasoning and the LLM-as-a-Judge role.
  2. Iterative Self-rewarding Process: The iterative nature of the PSR paradigm is essential. By repeatedly performing the reasoning and preference optimization cycle, LLMs refine their ability to discern and produce preferred reasoning sequences, suggesting their potential to achieve intelligence surpassing human performance in certain tasks.
  3. Step-wise LLM-as-a-Judge: Judging individual steps rather than entire solutions yields more accurate and reliable feedback during training. Testing showed that step-level judging improves consistency and alignment with human evaluations compared to scoring complete solutions, which is notably challenging for complex, long-chain reasoning (a sketch of such a step-level comparison follows this list).
  4. Future Implications: The paper opens pathways for developing more sophisticated models capable of independent reasoning and judgment, suggesting far-reaching implications for AI development in problem-solving domains outside purely mathematical contexts.

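As a companion to the pipeline sketch above, here is a minimal illustration of what a step-level comparison might look like in practice. The prompt wording and the single-letter verdict format are assumptions made for illustration, not the prompt template used in the paper.

```python
def build_step_judge_prompt(problem: str, prefix: list, step_a: str, step_b: str) -> str:
    """Assemble a judging prompt showing the problem, the steps accepted so
    far, and two candidate next steps, asking for a single-letter verdict.
    The wording is an illustrative assumption, not the paper's template."""
    accepted = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(prefix)) or "(no steps yet)"
    return (
        "You are judging one step of a mathematical solution.\n"
        f"Problem: {problem}\n"
        f"Accepted steps so far:\n{accepted}\n\n"
        f"Candidate A: {step_a}\n"
        f"Candidate B: {step_b}\n\n"
        "Which candidate is the better next step? Answer with exactly one letter, A or B."
    )

def parse_step_verdict(judge_output: str) -> int:
    """Map the judge's reply to an index: 0 for A, 1 for B.
    Falls back to A if the reply is malformed."""
    reply = judge_output.strip().upper()
    return 1 if reply.startswith("B") else 0
```

A judge_step_pair helper like the one in the earlier sketch could call the model with this prompt and feed its reply to parse_step_verdict to obtain the preferred step.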
Future Directions

The paper anticipates further exploration of the PSR paradigm by leveraging high-quality data to better initialize models and enhance the fine-grained preference optimization process. Expanding beyond mathematical reasoning, the approach could be adapted for other complex reasoning tasks in AI, providing a framework for continual improvement beyond current human-comparable benchmarks. Exploring the effects of additional iterative cycles and refining the step-wise judgment process will be crucial in realizing the full potential of self-rewarding systems across diverse domains.

In summary, the Process-based Self-Rewarding paradigm offers a significant advance in AI-driven reasoning, providing a framework for sustained improvement through self-generated rewards and step-level judgment. This approach highlights the potential of fine-grained self-optimization mechanisms within LLMs, paving the way for future gains in AI performance and autonomy.