- The paper shows that reinforcement learning on incorrect (negative) synthetic data scales the sample efficiency of LLM math reasoning by eight-fold.
- It employs per-step credit assignment via an advantage-weighted RL framework to mitigate spurious correlations in the model's reasoning.
- The study reveals that combining self-generated positive responses with verified negative data yields a robust strategy for enhancing LLM training.
Empirical Study on Synthetic Data Utilization in LLM Math Reasoning
The paper "RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold" by Setlur et al. provides an empirical analysis of synthetic data's role in fine-tuning LLMs for enhanced math reasoning capabilities. The authors investigate the effects of different types of synthetic data—specifically, positive and negative model-generated responses—on the overall performance of LLMs using supervised fine-tuning (SFT) and reinforcement learning (RL) techniques.
Study Overview
The core of this work is an extensive empirical evaluation of how synthetic data augments LLM performance on mathematical reasoning. The researchers use problem-solution pairs generated by highly capable models such as GPT-4 and Gemini 1.5 Pro. These pairs include both positive responses (solutions that reach the correct final answer) and negative responses (solutions that do not), a distinction that is central to the training dynamics and to the paper's analysis of spurious correlations.
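To ground the setup, here is a minimal sketch of the labeling step, assuming each solution marks its result as `\boxed{...}` (as in common math benchmarks); the helper names and record format are illustrative, not the authors' pipeline.

```python
import re

# Illustrative sketch: split generated solutions into positive and negative
# sets by checking the final answer. Assumes each solution writes its result
# as \boxed{...}; nested braces are not handled (this is only a sketch).

def extract_final_answer(solution: str):
    match = re.search(r"\\boxed\{([^{}]*)\}", solution)
    return match.group(1).strip() if match else None

def split_by_correctness(records):
    """records: dicts with 'solution' and 'gold_answer' keys (assumed format)."""
    positives, negatives = [], []
    for rec in records:
        is_correct = extract_final_answer(rec["solution"]) == rec["gold_answer"]
        (positives if is_correct else negatives).append(rec)
    return positives, negatives
```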
Key Findings
Positive Synthetic Data
- Performance Gains: Fine-tuning on synthetic data generated by models like GPT-4 yields tangible performance improvements, albeit with diminishing returns as the dataset grows: performance scales with data size, but at a noticeably slower rate than standard empirical scaling laws would predict.
- Self-Generated Solutions: The paper highlights that self-generated positive responses, those sampled from an SFT model fine-tuned on the initial synthetic dataset and verified for correctness, are roughly twice as sample-efficient as data from stronger external models (see the sketch after this list). This is attributed to the "easier-to-fit" nature of these responses, suggesting reduced memorization and better generalization.
- Spurious Correlations: A critical caveat is that training solely on positive data can amplify spurious correlations: incorrect or irrelevant intermediate steps that happen to lead to a correct final answer, for example a miscomputed intermediate quantity that still lands on the right result. Models imitate these step patterns, which degrades test performance as positive data is scaled further.
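A minimal sketch of the self-generation step referenced above, assuming a `sample` callable that draws completions from the SFT model and reusing `extract_final_answer` from the earlier sketch; none of this is the authors' code.

```python
# Hypothetical rejection-sampling loop: draw k completions per problem from
# the SFT model and keep only those verified correct by their final answer.
# `sample` is an assumed model interface; exact duplicates are dropped.

def self_generate_positives(problems, sample, k=8):
    kept = []
    for prob in problems:
        completions = sample(prob["question"], num_samples=k, temperature=0.7)
        for comp in set(completions):
            if extract_final_answer(comp) == prob["gold_answer"]:
                kept.append({"question": prob["question"], "solution": comp})
    return kept
```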
Negative Synthetic Data
- Advantages of Negative Data: Incorporating negative data, responses that do not reach the correct final answer, addresses the blind spots of training on positive data alone. Rather than contrasting whole responses, the authors emphasize per-step verification, so the model learns which individual steps go wrong and how to avoid those missteps specifically.
- Per-Step Credit Assignment: The authors frame this as an advantage-weighted RL objective, which can be instantiated with a per-step variant of Direct Preference Optimization (DPO). Step-level advantages, derived from Q-value estimates of partial solutions, identify critical steps and down-weight spurious ones in positive solutions, enhancing overall robustness (see the sketch after this list).
- Scaling Efficiency: Training with per-step verified negative data delivers an effective eight-fold gain in sample efficiency, markedly improving performance over models fine-tuned on positive data alone.
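To make per-step credit assignment concrete, here is a sketch of Monte Carlo advantage estimation under the assumption, consistent with the paper's setup, that the value of a partial solution is the probability that rollouts from it reach the correct answer. `sample_completions` is an assumed model interface, and `extract_final_answer` is reused from the first sketch.

```python
# Sketch: estimate a per-step advantage A(step_i) = Q(prefix + step_i) - Q(prefix),
# where Q of a prefix is approximated by the fraction of m Monte Carlo rollouts
# from that prefix that end in the gold answer. Steps with a large negative
# advantage are candidates for spurious or harmful reasoning steps.

def estimate_step_advantages(question, steps, gold_answer, sample_completions, m=4):
    """steps: list of solution-step strings (assumed to carry their own separators)."""
    def q_value(prefix):
        rollouts = sample_completions(question, prefix, num_samples=m)
        return sum(extract_final_answer(r) == gold_answer for r in rollouts) / m

    advantages, prefix = [], ""
    value_before = q_value(prefix)
    for step in steps:
        prefix += step
        value_after = q_value(prefix)
        advantages.append(value_after - value_before)
        value_before = value_after
    return advantages
```

These step-level advantages are what a per-step preference objective would weight; in practice m would be much larger and the rollouts batched.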
Theoretical and Practical Implications
- Theoretical Model: The authors construct a theoretical model showing that using negative data for per-step verification under an advantage-weighted RL framework reduces reliance on spurious steps: accurate advantage estimation at critical reasoning steps translates into better generalization and performance (a compact statement of the objective follows this list).
- Practical Deployment: The paper offers actionable guidance for deploying LLMs in mathematical reasoning settings. Leveraging negative synthetic data for per-step credit assignment improves training procedures without requiring large increases in dataset size.
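For reference, the advantage-weighted objective underlying this analysis can be written in standard RL notation as below; this is a paraphrase of the framework in common notation, and the paper's exact weighting and normalization may differ.

```latex
% Advantage-weighted RL over reasoning steps y_1, ..., y_H of a solution y
% (standard notation; A^pi is the per-step advantage, Q^pi - V^pi).
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}
\left[\, \sum_{h=1}^{H} A^{\pi}\!\left(x, y_{1:h-1}; y_h\right)
\log \pi\!\left(y_h \mid x, y_{1:h-1}\right) \right],
\qquad A^{\pi} = Q^{\pi} - V^{\pi}.
```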
Future Directions
Future research avenues might include exploring diverse problem domains beyond mathematical reasoning to validate the generalizability of these findings. Additionally, investigating methods to improve the fidelity of synthetic data generation and refining per-step evaluation techniques could further enhance LLM robustness and reduce bias.
Conclusion
Setlur et al. make significant strides in understanding the dynamics of synthetic data use in training LLMs for math reasoning. Their thorough analysis demonstrates the critical role of systematically incorporating self-generated and negative data to mitigate spurious correlations and enhance model performance efficiently. This work has substantial implications for future LLM training methodologies, particularly in scenarios where high-quality real data is scarce, paving the way for more robust and accurate AI systems.