On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification (2508.05629v1)

Published 7 Aug 2025 in cs.LG

Abstract: We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for LLMs, addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), stabilizing gradient updates for each token by dynamically rescaling the objective function with the probability of that token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.

Summary

  • The paper demonstrates that the SFT gradient is equivalent to a policy gradient update with an inverse-probability reward, resulting in high variance.
  • It introduces Dynamic Fine-Tuning (DFT) which rescales token-level loss to stabilize optimization and significantly improve generalization, especially on math reasoning benchmarks.
  • Empirical results show DFT consistently outperforms standard SFT and RL-based methods, with average accuracy gains several times larger than those of SFT on key mathematical benchmarks.

Dynamic Fine-Tuning: Rectifying SFT for Improved Generalization via RL Perspective

Introduction

This paper presents a rigorous theoretical and empirical analysis of Supervised Fine-Tuning (SFT) for LLMs, identifying its generalization limitations through the lens of reinforcement learning (RL). The authors demonstrate that the SFT gradient is mathematically equivalent to a policy gradient update with an implicit, ill-posed reward structure—specifically, a reward inversely proportional to the model's probability of generating expert actions. This leads to high-variance, unstable optimization and poor generalization, especially when the model assigns low probability to expert tokens. To address this, the paper introduces Dynamic Fine-Tuning (DFT), a simple yet principled modification that rescales the SFT objective by the token probability, thereby stabilizing updates and improving generalization. Extensive experiments on mathematical reasoning benchmarks show that DFT consistently and substantially outperforms standard SFT and even surpasses state-of-the-art RL-based fine-tuning methods in offline settings.

Theoretical Analysis: SFT as Policy Gradient with Implicit Reward

The core theoretical contribution is the formal equivalence between the SFT gradient and a policy gradient update in RL. By expressing the SFT gradient as an on-policy expectation with importance sampling, the authors show that SFT is a special case of RL with a reward function $r(x, y) = \mathbf{1}[y = y^\star]$ and an importance weight $1/\pi_\theta(y \mid x)$. This structure is problematic: when $\pi_\theta(y^\star \mid x)$ is small, the gradient magnitude becomes large, resulting in unbounded variance and a tendency to overfit rare expert trajectories. The optimization landscape is thus dominated by high-variance updates, undermining generalization.
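
Written out at the sequence level, the identity behind this claim can be restated as follows (same notation as above, with $y^\star$ the expert response and the inner expectation taken over responses sampled from the current policy):

$$
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta)
= -\,\mathbb{E}_{x,\,y^\star}\!\left[\nabla_\theta \log \pi_\theta(y^\star \mid x)\right]
= -\,\mathbb{E}_{x}\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[\frac{\mathbf{1}[y = y^\star]}{\pi_\theta(y \mid x)}\,\nabla_\theta \log \pi_\theta(y \mid x)\right].
$$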

Dynamic Fine-Tuning: Reward Rectification

To neutralize the ill-posed reward structure, DFT introduces a dynamic reweighting of the SFT loss by multiplying with the token probability, i.e., the token-level loss becomes $-\operatorname{sg}\!\big(\pi_\theta(y^\star_t \mid y^\star_{<t}, x)\big)\log \pi_\theta(y^\star_t \mid y^\star_{<t}, x)$, where $\operatorname{sg}(\cdot)$ denotes the stop-gradient operator. This modification cancels the inverse-probability weighting, yielding a stable, uniformly weighted update. Because gradients do not flow through the probability scaling term, it acts as a fixed per-token coefficient rather than an additional quantity to optimize. The result is a simple, one-line change to the standard SFT objective that fundamentally alters the optimization dynamics.
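
To make the one-line nature of the change concrete, here is a minimal PyTorch-style sketch of the token-level objective (an illustration under standard causal-LM shape assumptions, not the authors' reference implementation; see the linked repository for the official code):

```python
import torch
import torch.nn.functional as F


def dft_loss(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """Token-level cross-entropy rescaled by the detached probability of the target token.

    logits: (batch, seq_len, vocab) from a causal LM; labels: (batch, seq_len).
    """
    # Standard next-token shift: position t predicts token t + 1.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]

    log_probs = F.log_softmax(logits, dim=-1)
    # Log-probability of each target token (clamp so padding indices are gatherable).
    token_logp = log_probs.gather(-1, labels.clamp_min(0).unsqueeze(-1)).squeeze(-1)

    mask = (labels != ignore_index).float()
    # The one-line change relative to SFT: weight each token's loss by its own
    # probability, detached so no gradient flows through the weight itself.
    weight = token_logp.detach().exp()
    loss = -(weight * token_logp * mask).sum() / mask.sum().clamp_min(1.0)
    return loss
```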

Empirical Results: Mathematical Reasoning Benchmarks

DFT is evaluated on multiple mathematical reasoning benchmarks (Math500, Minerva Math, OlympiadBench, AIME 2024, AMC 2023) using several state-of-the-art LLMs (Qwen2.5-Math-1.5B/7B, LLaMA-3.2-3B/8B, DeepSeekMath-7B). Across all models and benchmarks, DFT yields substantial improvements over both base models and standard SFT. For example, on Qwen2.5-Math-1.5B, DFT achieves an average accuracy gain of +15.66 points over the base model, compared to +2.09 for SFT, a severalfold larger improvement. DFT also demonstrates robustness on challenging benchmarks where SFT degrades performance due to overfitting (Figure 1).

Figure 1: Accuracy progression for Qwen2.5-MATH-1.5B across mathematical benchmarks, illustrating faster convergence and better performance achieved by DFT relative to SFT.

DFT exhibits faster convergence and higher sample efficiency, reaching optimal performance within the first 120 training steps and outperforming SFT even in early training. This indicates that DFT provides more informative gradient updates and avoids optimization plateaus.

Comparison with RL and Concurrent Methods

The paper benchmarks DFT against offline RL methods (DPO, RFT) and online RL algorithms (PPO, GRPO), as well as the concurrent Importance-Weighted SFT (iw-SFT). In the offline RL setting, DFT achieves the highest average score, outperforming RFT by +11.46 points and GRPO by +3.43 points on Qwen2.5-Math-1.5B. DFT also surpasses iw-SFT in most settings, with higher average accuracy and greater robustness across datasets. Unlike iw-SFT, which requires a reference model for importance weights, DFT derives its weighting directly from the model's own token probabilities, resulting in lower computational overhead.

Token Probability Distribution Analysis

The authors analyze the token probability distributions before and after fine-tuning with DFT, SFT, and RL methods. SFT uniformly increases token probabilities, tightly fitting the training data. In contrast, DFT produces a bimodal distribution, boosting probabilities for some tokens while suppressing others, particularly grammatical or connective words. This suggests that robust generalization may require deprioritizing tokens with low semantic content (Figure 2).

Figure 2: Token probability distributions on the training set before training and after fine-tuning with DFT, SFT, and various RL methods including DPO, PPO, and GRPO. A logarithmic scale is used on the y-axis to improve visualization clarity.
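
For readers who want to reproduce this kind of analysis, the per-token probabilities can be collected with a short routine like the sketch below (assuming a Hugging Face-style causal LM whose forward pass returns `.logits`; the function name and argument layout are illustrative):

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def target_token_probs(model, input_ids, labels, ignore_index=-100):
    """Probabilities the model assigns to each ground-truth token,
    suitable for histogramming before vs. after fine-tuning."""
    logits = model(input_ids=input_ids).logits[:, :-1, :]
    labels = labels[:, 1:]
    probs = F.softmax(logits, dim=-1)
    token_p = probs.gather(-1, labels.clamp_min(0).unsqueeze(-1)).squeeze(-1)
    return token_p[labels != ignore_index]  # flat tensor of per-token probabilities
```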

Hyperparameter Ablation

Ablation studies on learning rate and batch size confirm that DFT consistently outperforms SFT across all configurations. Both methods are sensitive to the learning rate, with intermediate values yielding the best results, while batch size has minimal impact on final accuracy (Figure 3).

Figure 3: Ablation study of training hyperparameters (learning rate and batch size) for DFT and SFT on the Qwen2.5-Math-1.5B model.

Implementation and Practical Considerations

DFT is straightforward to implement: simply multiply the token-level cross-entropy loss by the model's token probability, with a stop-gradient applied to the probability factor so it does not contribute to the gradient. This modification is computationally efficient and does not require additional sampling, reward models, or reference policies. DFT is robust to hyperparameter choices and scales well across model and dataset sizes. The method is particularly advantageous in settings where only positive expert demonstrations are available and RL is impractical due to resource constraints or the lack of reward signals.
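
As one possible integration path, if training already uses Hugging Face's `Trainer`, the objective can be swapped in by overriding `compute_loss`. The sketch below is hypothetical and builds on the `dft_loss` helper sketched earlier; it is not the authors' training code:

```python
from transformers import Trainer


class DFTTrainer(Trainer):
    """Trainer variant that replaces standard cross-entropy with the DFT objective."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        outputs = model(
            input_ids=inputs["input_ids"],
            attention_mask=inputs.get("attention_mask"),
        )
        # dft_loss: the probability-rescaled token-level loss sketched earlier.
        loss = dft_loss(outputs.logits, inputs["labels"])
        return (loss, outputs) if return_outputs else loss
```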

Implications and Future Directions

Theoretically, this work clarifies the connection between SFT and RL, pinpointing the source of SFT's generalization gap and providing a principled solution. Practically, DFT offers a scalable, efficient alternative to RL-based fine-tuning, with strong empirical performance on mathematical reasoning tasks. The findings suggest that dynamic reweighting can mitigate overfitting and improve generalization in LLMs, potentially extending to other domains such as code generation, commonsense QA, and multimodal tasks. Future work should explore DFT's applicability to larger models and diverse datasets, as well as its integration with hybrid SFT-RL pipelines.

Conclusion

This paper provides a rigorous theoretical foundation for understanding the limitations of SFT in LLMs and introduces Dynamic Fine-Tuning as a simple, effective remedy. By rectifying the implicit reward structure of SFT, DFT stabilizes optimization and substantially improves generalization, outperforming both standard SFT and RL-based methods in challenging mathematical reasoning tasks. The approach is easy to implement, computationally efficient, and robust across models and datasets, making it a practical solution for LLM alignment in resource-constrained or feedback-limited scenarios. The insights and methodology presented here have significant implications for the future development of scalable, generalizable LLMs.