Training LLMs to Self-Correct via Reinforcement Learning
The paper, "Training LLMs to Self-Correct via Reinforcement Learning" by researchers at Google DeepMind, addresses the challenge of endowing LLMs with the ability to perform intrinsic self-correction. The authors identify significant shortcomings in modern LLMs' ability to self-correct without external supervision and propose a multi-turn online reinforcement learning (RL) framework, termed SCoRe (Self-Correction via Reinforcement Learning), to instill self-correcting behaviors using only model-generated data.
Key Contributions
- Empirical Analysis of Supervised Fine-Tuning (SFT) Limitations:
  - The paper articulates the limitations of existing SFT approaches, such as STaR and Pair-SFT, demonstrating that these methods either bias the model toward making only minimal edits or suffer from distribution shift between the training data and the model's own responses, leading to ineffective self-correction.
- Multi-Turn RL Framework:
  - The authors develop a two-stage RL approach that addresses these shortcomings. The first stage trains the model to improve second-attempt correction performance while keeping its first-attempt responses closely aligned with those of the base model; the second stage runs multi-turn RL with reward shaping so that the model learns a genuine intrinsic self-correction strategy.
- Strong Empirical Performance:
  - Empirical results demonstrate substantial improvements in intrinsic self-correction. When applied to Gemini 1.0 Pro and Gemini 1.5 Flash, SCoRe achieves notable gains on the MATH and HumanEval benchmarks, outperforming the base models by significant margins.
Detailed Methodology
Stage I: Preventing Collapse through Constrained Optimization
The first stage fine-tunes the model to optimize for high-reward corrections while ensuring the initial responses remain close to those of the base model. This is achieved via a KL-divergence penalty on the first-attempt distribution:
$$\max_{\theta}\; \mathbb{E}_{x_1,\; y_1 \sim \pi_\theta(\cdot \mid x_1),\; y_2 \sim \pi_\theta(\cdot \mid [x_1, p_1])}\Big[\, r(y_2, y^*) \;-\; \beta_2\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x_1)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x_1)\big) \Big]$$
Here, $y^*$ denotes the oracle (correct) response used by the reward function, $p_1$ is the self-correction instruction appended after the first attempt, and $\pi_{\mathrm{ref}}$ is the frozen base model. The training process aims to produce high-quality second-attempt corrections without letting the first-attempt distribution drift far from the base model's original responses, mitigating the risk of mode collapse in which both attempts converge to near-identical outputs.
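To make the structure of this objective concrete, below is a minimal sketch of a per-example Stage I loss, assuming sequence-level log-probabilities from the trained policy and the frozen reference model have already been computed. The function name `stage_one_loss`, the tensor layout, and the default `beta2` value are illustrative choices, not the paper's implementation.

```python
import torch

def stage_one_loss(
    logp_y2: torch.Tensor,      # log pi_theta(y2 | [x1, p1]), summed over tokens, shape [B]
    reward_y2: torch.Tensor,    # r(y2, y*), e.g. 0/1 correctness of the revised answer, shape [B]
    logp_y1: torch.Tensor,      # log pi_theta(y1 | x1) for the first attempt, shape [B]
    logp_y1_ref: torch.Tensor,  # log pi_ref(y1 | x1) under the frozen base model, shape [B]
    beta2: float = 0.1,         # illustrative strength of the first-turn KL penalty
) -> torch.Tensor:
    """Illustrative Stage I objective: maximize reward of the second attempt
    while keeping the first-attempt distribution close to the base model."""
    # REINFORCE-style policy-gradient term for the correction attempt.
    pg_term = reward_y2.detach() * logp_y2
    # Single-sample estimate of KL(pi_theta || pi_ref) on the first attempt.
    kl_est = logp_y1 - logp_y1_ref
    # Negate because optimizers minimize; the paper's objective is a maximization.
    return -(pg_term - beta2 * kl_est).mean()
```

The single-sample KL estimate is used here purely for brevity; any standard estimator over the first-turn token distribution would serve the same role in this sketch.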
Stage II: Multi-Turn RL with Reward Shaping
In the second stage, multi-turn RL is conducted to jointly optimize rewards for both the initial and corrected responses:
$$\max_{\theta}\; \mathbb{E}_{x_1,\; y_1 \sim \pi_\theta(\cdot \mid x_1),\; y_2 \sim \pi_\theta(\cdot \mid [x_1, p_1])}\Big[ \big(r(y_1, y^*) + r(y_2, y^*)\big) \;-\; \beta_1\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x_1)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x_1)\big) \Big]$$
Furthermore, reward shaping biases training towards self-correction by amplifying the reward for corrections that flip the response from incorrect to correct and penalizing those that degrade a correct first attempt. The second attempt is scored as:
$$r(y_2, y^*) \;+\; \alpha\,\big(r(y_2, y^*) - r(y_1, y^*)\big)$$
This encourages the model to make substantive corrections rather than trivial edits, instilling a robust self-correction strategy.
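The effect of this shaping is easiest to see with binary correctness rewards. The helper below is a hypothetical illustration (the name `shaped_second_turn_reward` and the default `alpha` are not from the paper) of how the bonus amplifies incorrect-to-correct flips and penalizes regressions:

```python
def shaped_second_turn_reward(r_y1: float, r_y2: float, alpha: float = 1.0) -> float:
    """Shaped reward for the second attempt: the raw reward plus a bonus
    proportional to the change relative to the first attempt."""
    return r_y2 + alpha * (r_y2 - r_y1)

# With alpha = 1.0 and binary rewards:
#   incorrect -> correct   : 1 + 1*(1 - 0) =  2.0  (flip strongly rewarded)
#   correct   -> correct   : 1 + 1*(1 - 1) =  1.0
#   correct   -> incorrect : 0 + 1*(0 - 1) = -1.0  (regression penalized)
#   incorrect -> incorrect : 0 + 1*(0 - 0) =  0.0
```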
Experimental Results
The authors evaluate SCoRe on both the MATH and HumanEval benchmarks:
- MATH: SCoRe improves the base model's intrinsic self-correction by 15.6% (absolute), with corresponding gains in metrics such as Accuracy@t2 and Δ(t1, t2) (both defined in the sketch after this list).
- HumanEval: SCoRe improves self-correction by 9.1% over the base model, demonstrating the method's effectiveness in a coding context as well.
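For reference, these metrics can be computed from per-problem correctness of the two attempts as sketched below; the function name and dictionary keys are illustrative, but the definitions (first- and second-attempt accuracy, their difference Δ(t1, t2), and the incorrect-to-correct / correct-to-incorrect flip rates) follow the paper's descriptions.

```python
from typing import Sequence

def self_correction_metrics(
    first_correct: Sequence[bool],   # correctness of attempt t1, one entry per problem
    second_correct: Sequence[bool],  # correctness of attempt t2, same order
) -> dict:
    """Illustrative computation of the self-correction evaluation metrics."""
    n = len(first_correct)
    acc_t1 = sum(first_correct) / n
    acc_t2 = sum(second_correct) / n
    flips_i2c = sum((not a) and b for a, b in zip(first_correct, second_correct)) / n
    flips_c2i = sum(a and (not b) for a, b in zip(first_correct, second_correct)) / n
    return {
        "Accuracy@t1": acc_t1,
        "Accuracy@t2": acc_t2,
        "Delta(t1,t2)": acc_t2 - acc_t1,   # net self-correction gain
        "Delta_i->c(t1,t2)": flips_i2c,    # problems fixed on the second attempt
        "Delta_c->i(t1,t2)": flips_c2i,    # problems broken on the second attempt
    }
```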
Implications
Practical Applications
- Enhanced Model Robustness: The demonstrated ability to self-correct can significantly enhance practical deployment scenarios where model reliability is critical.
- Efficiency: The improved self-correction ability allows models to make better use of inference-time compute budgets, as evidenced by the efficacy of sequential self-correction over parallel sampling at a fixed sample budget (a schematic comparison follows this list).
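As a rough illustration of this trade-off, the sketch below contrasts two ways of spending a fixed sampling budget; `generate` and `self_correct` are hypothetical callables (e.g., wrappers around a model API), and any downstream selection rule such as majority voting would be applied to the returned candidates.

```python
from typing import Callable, List

def spend_budget_parallel(generate: Callable[[str], str],
                          prompt: str, budget: int) -> List[str]:
    """Baseline: spend the whole budget on independent parallel samples."""
    return [generate(prompt) for _ in range(budget)]

def spend_budget_sequential(generate: Callable[[str], str],
                            self_correct: Callable[[str, str], str],
                            prompt: str, budget: int) -> List[str]:
    """Alternative: spend half the budget on first attempts and half on
    revising each of them with a self-correction prompt."""
    firsts = [generate(prompt) for _ in range(budget // 2)]
    return [self_correct(prompt, y1) for y1 in firsts]
```

The paper's observation is that, for a model trained with SCoRe, the sequential allocation can outperform purely parallel sampling at an equal budget.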
Theoretical Insights
- Multi-Turn Dynamics: The results highlight the importance of multi-turn RL frameworks in learning complex behaviors that single-turn frameworks may fail to capture.
- Distributional Alignment: The paper underscores how critical it is to train on self-generated data so that training and inference distributions match, mitigating issues of distribution shift.
Future Directions
Possible extensions of this work include exploring multi-turn RL frameworks for iterative self-correction beyond two attempts, and unifying the two-stage approach into a cohesive, single-phase learning algorithm. Additionally, integrating richer forms of feedback such as intermediate or fine-grained supervision could further enhance the model’s capabilities.
Conclusion
The research provides a robust framework for training LLMs to self-correct, demonstrating significant performance improvements across various benchmarks. SCoRe's multi-stage RL approach effectively addresses the limitations of traditional SFT methods and paves the way for future advancements in self-improving AI models.