S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning

Published 18 Feb 2025 in cs.CL and cs.LG | (2502.12853v1)

Abstract: Recent studies have demonstrated the effectiveness of LLM test-time scaling. However, existing approaches to incentivize LLMs' deep thinking abilities generally require large-scale data or significant training efforts. Meanwhile, it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. The self-verification and self-correction skills are then further strengthened by both outcome-level and process-level reinforcement learning, with minimized resource requirements, enabling the model to adaptively refine its reasoning process during inference. Our results demonstrate that, with only 3.1k self-verifying and self-correcting behavior initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from 51.0\% to 81.6\%, outperforming models trained on an equivalent amount of long-CoT distilled data. Extensive experiments and analysis based on three base models across both in-domain and out-of-domain benchmarks validate the effectiveness of S$^2$R. Our code and data are available at https://github.com/NineAbyss/S2R.


Authors (9)

Summary

The paper demonstrates how integrating supervised fine-tuning with reinforcement learning empowers LLMs to self-verify and self-correct during inference.
It reports significant accuracy gains, notably from 51.0% to 81.6% for Qwen2.5-Math-7B on the MATH500 benchmark with only 3.1k behavior initialization samples; reinforcement learning with rule-based rewards, including an offline variant, then strengthens these skills further.
The framework also transfers to out-of-domain tasks such as logical reasoning and strategy question answering.

S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning

Introduction

The paper "S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning" addresses a central challenge for LLMs: improving their reasoning capabilities. Existing approaches to eliciting deep thinking generally require large-scale data or significant training effort, and it remains unclear how to improve less powerful base models. The proposed framework, S$^2$R, offers an efficient alternative by teaching models to self-verify and self-correct during inference. It combines supervised fine-tuning with reinforcement learning to improve reasoning accuracy without extensive data or compute.

Methodology

Behavior Initialization

The methodology involves a two-stage training process. In the first stage, supervised fine-tuning (SFT) on a carefully curated dataset of 3.1k samples initializes the model's self-verifying and self-correcting behaviors. This dataset is constructed by prompting and sampling from existing model responses, with the aim of capturing diverse reasoning trajectories that use both "problem-solving" and "confirmative" verification methods.
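To make the idea of a self-verifying, self-correcting trajectory concrete, here is a minimal sketch of how sampled attempts might be chained into one SFT example. The `Attempt` type, the `build_trajectory` function, and the text templates are hypothetical and not taken from the paper; deciding the verification verdict from the gold answer is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attempt:
    solution: str  # sampled reasoning text
    answer: str    # final answer extracted from that reasoning


def build_trajectory(problem: str, attempts: List[Attempt], gold: str,
                     max_trials: int = 3) -> str:
    """Chain attempts as solve -> verify -> (correct -> verify)*.

    Stops at the first correct answer or after max_trials attempts.
    Assumption: the verdict is derived from the gold answer, whereas a
    real pipeline would sample the verification text from a model.
    """
    parts = [f"Problem: {problem}"]
    for i, att in enumerate(attempts[:max_trials]):
        label = "Solution" if i == 0 else "Corrected solution"
        parts.append(f"{label}: {att.solution} Answer: {att.answer}")
        if att.answer == gold:
            parts.append("Verification: the answer is correct.")
            break
        parts.append("Verification: the answer is incorrect. Let me try again.")
    return "\n".join(parts)


attempts = [Attempt("2+2=5", "5"), Attempt("2+2=4", "4"), Attempt("unused", "7")]
trajectory = build_trajectory("What is 2+2?", attempts, gold="4")
print(trajectory)
```

The third attempt is never used because the trajectory ends once a verified-correct answer appears, which mirrors the stop-when-confident behavior the summary describes.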

Reinforcement Learning

In the second stage, these skills are strengthened through reinforcement learning (RL). The RL strategy comprises both outcome-level and process-level RL, so that training rewards the correctness of the final solution as well as of the intermediate solving and verification steps. The framework also explores offline RL as a scalable alternative to traditional online training, using rule-based rewards and baseline estimates computed over accuracy bins.
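The reward scheme can be illustrated with a small sketch. The reward values of +1 and -1, the function names, and the equal-width binning are assumptions made for illustration; the summary only states that rewards are rule-based and that baselines are informed by accuracy bins.

```python
from collections import defaultdict
from typing import List, Tuple


def outcome_reward(final_answer: str, gold: str) -> float:
    """Outcome-level reward: judge only the final answer (assumed +1 / -1)."""
    return 1.0 if final_answer == gold else -1.0


def process_rewards(answers: List[str], verdicts: List[bool],
                    gold: str) -> List[Tuple[float, float]]:
    """Process-level rewards as (solve_reward, verify_reward) per trial.

    A verification step is rewarded when its verdict ("is this answer
    correct?") matches the actual correctness of the answer it judged.
    """
    rewards = []
    for answer, verdict in zip(answers, verdicts):
        is_correct = answer == gold
        solve_r = 1.0 if is_correct else -1.0
        verify_r = 1.0 if verdict == is_correct else -1.0
        rewards.append((solve_r, verify_r))
    return rewards


def binned_advantages(samples: List[Tuple[float, float]],
                      num_bins: int = 5) -> List[float]:
    """Offline baseline: subtract the mean reward of each accuracy bin.

    samples holds (problem_accuracy in [0, 1], reward) pairs, where the
    accuracy is a precomputed estimate of how often the problem is solved.
    """
    def bin_of(acc: float) -> int:
        return min(int(acc * num_bins), num_bins - 1)

    sums, counts = defaultdict(float), defaultdict(int)
    for acc, reward in samples:
        sums[bin_of(acc)] += reward
        counts[bin_of(acc)] += 1
    return [r - sums[bin_of(a)] / counts[bin_of(a)] for a, r in samples]


step_rewards = process_rewards(["3", "4"], [False, True], gold="4")
advantages = binned_advantages([(0.1, 1.0), (0.15, -1.0), (0.9, 1.0)])
print(step_rewards, advantages)
```

Binning by accuracy means a correct answer on a hard problem earns a larger advantage than a correct answer on a problem that is almost always solved, without needing a learned value model.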

Figure 1: Overview of S$^2$R.

Experiments and Results

The effectiveness of S$^2$R was validated across seven mathematical benchmarks, with significant improvements over three base models: Llama-3.1-8B-Instruct, Qwen2-7B-Instruct, and Qwen2.5-Math-7B. Notably, Qwen2.5-Math-7B improved from 51.0% to 81.6% accuracy on the MATH500 test set. This gain was achieved with only 3.1k behavior initialization samples, which illustrates the data efficiency of S$^2$R.

S$^2$R was also compared with alternative methods, including models trained on long-CoT distilled data and other competitive baselines. It outperformed models trained on an equivalent amount of long-CoT distilled data, particularly on complex reasoning tasks where test-time computation was scaled effectively.

Figure 2: The data efficiency of S$^2$R compared to competitive methods, with all models initialized from Qwen2.5-Math-7B.

Analysis of Self-Verification and Self-Correction

Analytical experiments examined how the choice of verification method affects overall performance. The paper finds that "problem-solving" verification is biased during LLM reasoning and favors the more balanced "confirmative" verification, even though the latter generally has lower raw accuracy. Both RL variants significantly increased the rate at which incorrect answers were corrected while reducing the rate at which correct answers were changed to incorrect ones, yielding more stable outputs on challenging problems.
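The two correction rates discussed above can be computed from paired first and final answers. This is a minimal sketch; the function name and the exact metric definitions are assumptions and may differ from those used in the paper.

```python
from typing import List, Tuple


def correction_rates(records: List[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (incorrect_to_correct, correct_to_incorrect) rates.

    Each record is (first_attempt_correct, final_answer_correct).
    Each rate is conditioned on the first-attempt outcome, so the two
    rates do not need to sum to one.
    """
    finals_after_wrong = [final for first, final in records if not first]
    finals_after_right = [final for first, final in records if first]

    i2c = (sum(finals_after_wrong) / len(finals_after_wrong)
           if finals_after_wrong else 0.0)
    c2i = (sum(not f for f in finals_after_right) / len(finals_after_right)
           if finals_after_right else 0.0)
    return i2c, c2i


records = [
    (False, True), (False, False),                           # started wrong
    (True, True), (True, False), (True, True), (True, True),  # started right
]
i2c, c2i = correction_rates(records)
print(f"incorrect -> correct: {i2c:.2f}, correct -> incorrect: {c2i:.2f}")
```

A training method that helps self-correction should raise the first rate and lower the second; raising both would indicate that the model is merely changing its answers more often.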

Figure 3: Evaluation on verification and correction.

Generalizability and Cross-domain Application

Beyond mathematical reasoning, experiments demonstrated the framework's applicability to out-of-domain tasks such as logical reasoning and strategy question answering (StrategyQA). This transfer suggests that the self-verifying and self-correcting paradigm could be useful in other domains that require multi-step reasoning and decision-making.

Figure 4: StrategyQA Case.

Conclusion

S$^2$R is an efficient framework for enhancing LLM reasoning through structured self-verification and self-correction. Its training methods, including offline RL, help make advanced reasoning accessible to smaller or less powerful models. Future work may apply the approach to a broader set of tasks and domains.