Bridging Supervised Learning and Reinforcement Learning in Math Reasoning
The paper "Bridging Supervised Learning and Reinforcement Learning in Math Reasoning" presents a novel approach to enable self-improvement in LLMs tasked with mathematical reasoning, a domain where Reinforcement Learning (RL) has demonstrated considerable advantages. The prevailing belief that only RL is suitable for verification-driven training is challenged through the introduction of Negative-aware Fine-Tuning (NFT), a supervised learning technique leveraging negative feedback.
Context and Design
LLMs have recently exhibited improved proficiency in mathematics, largely attributed to a shift from imitation learning, which depends on reference answers, to self-reflective learning paradigms that use binary verifier signals to evaluate model-generated solutions. While RL algorithms such as Proximal Policy Optimization (PPO), GRPO, and DAPO are specifically engineered to maximize such verifier feedback, they often necessitate complex reward structures and are computationally expensive.
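For concreteness, the verifier signal in such pipelines is simply a correct/incorrect judgment on the model's final answer. A toy Python version is sketched below; real systems typically compare expressions symbolically rather than as strings, so the function name and normalization here are illustrative only.

```python
def binary_verifier(model_answer: str, reference_answer: str) -> float:
    """Toy verifier: 1.0 if the model's final answer matches the reference, else 0.0.

    Production pipelines usually parse and compare math expressions symbolically;
    plain string normalization is used here purely for illustration.
    """
    def normalize(s: str) -> str:
        return s.strip().lower().replace(" ", "").rstrip(".")

    return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0
```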
NFT addresses a key limitation of supervised learning approaches such as Rejection sampling Fine-Tuning (RFT), which simply discard negative feedback (i.e., incorrect answers). NFT bridges this gap by constructing an implicit negative policy over the incorrect answers, which lets the model update its parameters on negative data as well, thereby enabling self-reflective learning without external supervision.
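One way to make this construction concrete is to view the data-generating (old) policy as a mixture of a positive and a negative answer distribution, weighted by the per-question accuracy r_q of the sampled answers. The notation below is a sketch of that reading rather than the paper's exact formulation:

```latex
\pi_{\mathrm{old}}(a \mid q) \;=\; r_q \, \pi^{+}_{\theta}(a \mid q) \;+\; (1 - r_q)\, \pi^{-}(a \mid q)
\quad\Longrightarrow\quad
\pi^{-}(a \mid q) \;=\; \frac{\pi_{\mathrm{old}}(a \mid q) - r_q \, \pi^{+}_{\theta}(a \mid q)}{1 - r_q}
```

Maximizing the likelihood of incorrect answers under this implied negative policy pushes probability mass in the trained model away from them, without requiring any explicit policy-gradient machinery.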
Core Methodology
The key innovation of NFT lies in how it incorporates negative feedback into LLM training. Instead of discarding incorrect answers, NFT models them with an implicit negative policy that is parameterized by the very model being optimized on positive data. In experiments with 7B and 32B parameter models, NFT shows significant improvements over supervised learning baselines and matches or surpasses RL algorithms such as GRPO and DAPO.
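A minimal, sequence-level sketch of such a loss is given below. It assumes the mixture decomposition shown earlier and treats r_q as the per-question accuracy of the sampled group; the paper's actual implementation operates at the token level with additional clipping and normalization details that are simplified away here.

```python
import torch

def nft_loss(logp_theta, logp_old, is_correct, r_q, eps=1e-6):
    """Sequence-level sketch of a negative-aware fine-tuning loss (illustrative).

    logp_theta : log-probability of each sampled answer under the current model
    logp_old   : log-probability under the frozen policy that generated the answers
    is_correct : 1.0 for verifier-approved answers, 0.0 otherwise
    r_q        : fraction of correct answers sampled for the corresponding question
    """
    # Correct answers: plain maximum likelihood, as in rejection-sampling fine-tuning.
    pos_loss = -logp_theta

    # Incorrect answers: maximize likelihood under the implicit negative policy
    #   pi_neg = (pi_old - r_q * pi_theta) / (1 - r_q).
    # Dividing by pi_old (constant w.r.t. theta) gives the ratio form below.
    ratio = torch.exp(logp_theta - logp_old)             # pi_theta / pi_old
    neg_ratio = (1.0 - r_q * ratio) / (1.0 - r_q + eps)  # pi_neg / pi_old
    neg_loss = -torch.log(neg_ratio.clamp(min=eps))      # clamp keeps the log finite

    return torch.where(is_correct.bool(), pos_loss, neg_loss).mean()
```

Note that when every sampled answer for a question is correct (r_q = 1), the negative branch never fires and the objective collapses to ordinary rejection-sampling fine-tuning on that question.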
The paper also draws an intriguing theoretical parallel between NFT and GRPO. Despite originating from different foundations (maximum likelihood estimation for NFT, policy gradients for GRPO), the two methods exhibit strikingly similar behavior in strict on-policy settings. This finding not only encourages integrating negative feedback into supervised paradigms but also hints at shared mechanisms underlying effective optimization dynamics in LLMs.
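As a point of reference for this comparison, the group-normalized advantage used by GRPO-style methods can be sketched as follows; the standardization formula is standard, while the link to NFT's effective per-answer weighting is the paper's on-policy analysis, not something the code itself demonstrates.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize binary verifier rewards within a group of answers to one question.

    GRPO-style methods weight each sampled answer's policy-gradient term by this
    group-normalized advantage; the paper compares NFT's effective per-answer
    weighting to this scheme in the strict on-policy setting.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```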
Implications and Future Directions
The implications of NFT are twofold. Practically, it offers a path to reducing oversight costs and the dependence on high-quality external labels by leveraging self-generated feedback. Theoretically, it invites further exploration into unifying supervised and reinforcement learning strategies, particularly in domains governed by binary verifier feedback.
Moving forward, the NFT approach opens avenues for refining LLM training protocols, especially for reasoning tasks. This could involve extending the technique to more complex multi-step reasoning problems or to training schemes that adapt more dynamically to evolving feedback. Future research may also explore how NFT generalizes beyond math reasoning to other domains where LLMs can benefit from feedback-driven improvement without explicit external intervention.
In summary, "Bridging Supervised Learning and Reinforcement Learning in Math Reasoning" represents a significant step in understanding how supervised learning can be enhanced through self-improvement techniques typically reserved for RL applications. By incorporating negative feedback, NFT provides a compelling alternative for advancing LLM capabilities, potentially influencing a wide range of applications within artificial intelligence.