Bridging Supervised Learning and Reinforcement Learning in Math Reasoning
The paper "Bridging Supervised Learning and Reinforcement Learning in Math Reasoning" presents a novel approach to enable self-improvement in LLMs tasked with mathematical reasoning, a domain where Reinforcement Learning (RL) has demonstrated considerable advantages. The prevailing belief that only RL is suitable for verification-driven training is challenged through the introduction of Negative-aware Fine-Tuning (NFT), a supervised learning technique leveraging negative feedback.
Context and Design
LLMs have recently exhibited improved proficiency in mathematics, largely attributed to a shift from imitation learning, which depends on reference answers, to self-reflective learning paradigms that use binary verifier signals to evaluate model-generated solutions. While RL algorithms such as Proximal Policy Optimization (PPO), GRPO, and DAPO are specifically engineered to maximize such verifier feedback, they often necessitate complex reward structures and are computationally expensive.
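For concreteness, the verifier signal in such pipelines is simply a correct/incorrect judgment on the model's final answer. A toy Python version is sketched below; real systems typically compare expressions symbolically rather than as strings, so the function name and normalization here are illustrative only.

```python
def binary_verifier(model_answer: str, reference_answer: str) -> float:
    """Toy verifier: 1.0 if the model's final answer matches the reference, else 0.0.

    Production pipelines usually parse and compare math expressions symbolically;
    plain string normalization is used here purely for illustration.
    """
    def normalize(s: str) -> str:
        return s.strip().lower().replace(" ", "").rstrip(".")

    return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0
```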
NFT addresses a key limitation of supervised learning approaches such as Rejection sampling Fine-Tuning (RFT), which simply discard negative feedback (i.e., incorrect answers). NFT bridges this gap by constructing an implicit negative policy over the incorrect answers, which lets the model update its parameters on negative data as well, thereby enabling self-reflective learning without external supervision.
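One way to make this construction concrete is to view the data-generating (old) policy as a mixture of a positive and a negative answer distribution, weighted by the per-question accuracy r_q of the sampled answers. The notation below is a sketch of that reading rather than the paper's exact formulation:

```latex
\pi_{\mathrm{old}}(a \mid q) \;=\; r_q \, \pi^{+}_{\theta}(a \mid q) \;+\; (1 - r_q)\, \pi^{-}(a \mid q)
\quad\Longrightarrow\quad
\pi^{-}(a \mid q) \;=\; \frac{\pi_{\mathrm{old}}(a \mid q) - r_q \, \pi^{+}_{\theta}(a \mid q)}{1 - r_q}
```

Maximizing the likelihood of incorrect answers under this implied negative policy pushes probability mass in the trained model away from them, without requiring any explicit policy-gradient machinery.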
Core Methodology
The key innovation of NFT lies in how it incorporates negative feedback into LLM training. Instead of discarding incorrect answers, NFT models them with an implicit negative policy that is parameterized by the very model being optimized on positive data. In experiments with 7B and 32B parameter models, NFT shows significant improvements over supervised learning baselines and matches or surpasses RL algorithms such as GRPO and DAPO.
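A minimal, sequence-level sketch of such a loss is given below. It assumes the mixture decomposition shown earlier and treats r_q as the per-question accuracy of the sampled group; the paper's actual implementation operates at the token level with additional clipping and normalization details that are simplified away here.

```python
import torch

def nft_loss(logp_theta, logp_old, is_correct, r_q, eps=1e-6):
    """Sequence-level sketch of a negative-aware fine-tuning loss (illustrative).

    logp_theta : log-probability of each sampled answer under the current model
    logp_old   : log-probability under the frozen policy that generated the answers
    is_correct : 1.0 for verifier-approved answers, 0.0 otherwise
    r_q        : fraction of correct answers sampled for the corresponding question
    """
    # Correct answers: plain maximum likelihood, as in rejection-sampling fine-tuning.
    pos_loss = -logp_theta

    # Incorrect answers: maximize likelihood under the implicit negative policy
    #   pi_neg = (pi_old - r_q * pi_theta) / (1 - r_q).
    # Dividing by pi_old (constant w.r.t. theta) gives the ratio form below.
    ratio = torch.exp(logp_theta - logp_old)             # pi_theta / pi_old
    neg_ratio = (1.0 - r_q * ratio) / (1.0 - r_q + eps)  # pi_neg / pi_old
    neg_loss = -torch.log(neg_ratio.clamp(min=eps))      # clamp keeps the log finite

    return torch.where(is_correct.bool(), pos_loss, neg_loss).mean()
```

Note that when every sampled answer for a question is correct (r_q = 1), the negative branch never fires and the objective collapses to ordinary rejection-sampling fine-tuning on that question.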
The paper also draws an intriguing theoretical parallel between NFT and GRPO. Despite originating from different foundations (maximum likelihood estimation for NFT, policy gradients for GRPO), the two methods exhibit strikingly similar behavior in strict on-policy settings. This finding not only encourages integrating negative feedback into supervised paradigms but also hints at shared mechanisms underlying effective optimization dynamics in LLMs.
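As a point of reference for this comparison, the group-normalized advantage used by GRPO-style methods can be sketched as follows; the standardization formula is standard, while the link to NFT's effective per-answer weighting is the paper's on-policy analysis, not something the code itself demonstrates.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize binary verifier rewards within a group of answers to one question.

    GRPO-style methods weight each sampled answer's policy-gradient term by this
    group-normalized advantage; the paper compares NFT's effective per-answer
    weighting to this scheme in the strict on-policy setting.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```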
Implications and Future Directions
The implications of NFT are twofold. Practically, it offers a path to reducing oversight costs and the dependence on high-quality external labels by leveraging self-generated feedback. Theoretically, it invites further exploration into unifying supervised and reinforcement learning strategies, particularly in domains governed by binary verifier feedback.
Moving forward, the NFT approach opens avenues for refining LLM training protocols, especially for reasoning tasks. This could involve extending the technique to more complex multi-step reasoning problems or to training schemes that adapt more dynamically to evolving feedback. Future research may also explore how NFT generalizes beyond math reasoning to other domains where LLMs can benefit from feedback-driven improvement without explicit external intervention.
In summary, "Bridging Supervised Learning and Reinforcement Learning in Math Reasoning" represents a significant step in understanding how supervised learning can be enhanced through self-improvement techniques typically reserved for RL applications. By incorporating negative feedback, NFT provides a compelling alternative for advancing LLM capabilities, potentially influencing a wide range of applications within artificial intelligence.