- The paper shows that RL's use of on-policy data dramatically reduces catastrophic forgetting compared to SFT during language model adaptation.
- It uses a comparative analysis of the forward KL (mode-covering) and reverse KL (mode-seeking) objectives to explain how the choice of training data shapes forgetting dynamics.
- The study highlights practical strategies, including Iterative-SFT, for approximating on-policy conditions, suggesting avenues for continual learning frameworks.
Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting
Introduction
The paper "Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting" (2510.18874) addresses the issue of catastrophic forgetting in LMs during post-training. Catastrophic forgetting refers to the degradation of previously acquired capabilities when a model is adapted to new tasks. This phenomenon is prevalent in techniques like supervised fine-tuning (SFT) and reinforcement learning (RL), which are frequently used for task adaptation. The paper compares these two methods, highlighting a consistent trend where RL results in less forgetting than SFT. Through empirical studies and theoretical analysis, the authors identify that the mode-seeking nature of RL, due to its use of on-policy data, plays a critical role in maintaining prior knowledge.
Forgetting in LM Post-Training
Preliminaries
In the context of LM post-training, catastrophic forgetting is measured with two quantities: target task gain (the improvement on the task being trained) and non-target task drop (the decline in performance on held-out tasks). The LM is evaluated before and after post-training to compute both.
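Concretely, with t* the target task, a set of non-target evaluation tasks, and an accuracy-style score acc(t), the two quantities are simple before/after differences (the notation below is ours, not necessarily the paper's):

```latex
\text{target gain} \;=\; \mathrm{acc}_{\text{after}}(t^{*}) - \mathrm{acc}_{\text{before}}(t^{*}),
\qquad
\text{non-target drop} \;=\; \frac{1}{|\mathcal{T}_{\neg}|}\sum_{t \in \mathcal{T}_{\neg}}
  \left[\mathrm{acc}_{\text{before}}(t) - \mathrm{acc}_{\text{after}}(t)\right]
```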
Comparison Between SFT and RL
Empirical results show that RL consistently reaches higher target-task performance with far less forgetting than SFT. The paper attributes RL's robustness mainly to its use of on-policy data: training on the model's own samples lets it adapt to the new task while preserving existing knowledge. SFT, in contrast, trains on a fixed off-policy dataset and suffers significantly more forgetting, exhibiting a tradeoff between target-task gains and retained capabilities.
Figure 1: Illustration of the forgetting dynamics for the forward KL objective, corresponding to SFT, and the reverse KL objective, corresponding to RL.
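The correspondence in Figure 1 can be written down explicitly. The following is a standard formulation in our notation, not quoted from the paper: minimizing the SFT cross-entropy on a fixed dataset drawn from a target distribution p* is equivalent to minimizing the forward KL, while maximizing an on-policy RL objective with reward log p*(x) and an entropy bonus is equivalent to minimizing the reverse KL.

```latex
% SFT: cross-entropy on off-policy samples x ~ p^* (forward KL, mode-covering)
\min_{\theta}\; \mathbb{E}_{x \sim p^{*}}\!\left[-\log \pi_{\theta}(x)\right]
  \;=\; \min_{\theta}\; \mathrm{KL}\!\left(p^{*} \,\|\, \pi_{\theta}\right) + \mathrm{const.}

% RL: reward log p^*(x) plus entropy bonus, on-policy samples x ~ pi_theta
% (reverse KL, mode-seeking)
\max_{\theta}\; \mathbb{E}_{x \sim \pi_{\theta}}\!\left[\log p^{*}(x)\right] + \mathcal{H}(\pi_{\theta})
  \;=\; \min_{\theta}\; -\,\mathrm{KL}\!\left(\pi_{\theta} \,\|\, p^{*}\right)
```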
Understanding Forgetting Dynamics Through KL Divergence
Forward KL vs. Reverse KL
SFT can be viewed as minimizing the forward KL divergence, which is mode-covering, whereas RL minimizes the reverse KL divergence, which is mode-seeking. At first glance, intuition suggests that reverse KL should cause more forgetting because it moves probability mass quickly toward the target. Results show the opposite in practical LM settings, where the policy is multi-modal: there, the mode-seeking reverse KL preserves the policy's existing modes far better than the mass-spreading updates of forward KL. Only in uni-modal settings does forward KL exhibit less forgetting.
Figure 2: Reverse KL (RL) with multi-modal training policy forgets less than forward KL (SFT).
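The mode-covering versus mode-seeking distinction can be made concrete with a toy categorical policy (our illustration, not the paper's experiment): the forward KL gradient pushes the policy toward every mode of the target, while the reverse KL gradient is weighted by the policy's own probabilities, so it mainly reinforces modes the policy already reaches.

```python
import numpy as np

# Toy categorical policy over 4 "behaviors" (hypothetical setup, not the paper's):
# states 1 and 2 are the two modes of the target distribution; the policy already
# favors mode 1 but has almost no mass on mode 2.
pi = np.array([0.60, 0.30, 0.02, 0.08])          # current policy (softmax output)
p_star = np.array([0.005, 0.495, 0.495, 0.005])  # multi-modal target distribution

# Gradients w.r.t. the policy's logits z (pi = softmax(z)).
# Forward KL, KL(p* || pi): the usual cross-entropy gradient, pi - p*.
grad_forward = pi - p_star

# Reverse KL, KL(pi || p*): pi_k * [(log pi_k - log p*_k) - KL(pi || p*)].
kl_reverse = np.sum(pi * (np.log(pi) - np.log(p_star)))
grad_reverse = pi * ((np.log(pi) - np.log(p_star)) - kl_reverse)

print("forward-KL logit gradient:", np.round(grad_forward, 3))
print("reverse-KL logit gradient:", np.round(grad_reverse, 3))
# Forward KL pushes hard toward *both* target modes (large negative gradients on
# states 1 and 2), forcing a large redistribution of probability mass.
# Reverse KL's update is scaled by pi_k, so it mostly reinforces the mode the
# policy already has (state 1) and barely touches the mode it lacks (state 2).
```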
The Role of On-Policy Data
On-Policy Data Mitigates Forgetting
Through systematic ablations, the paper demonstrates that the robustness of RL to forgetting primarily stems from its use of on-policy data, rather than other algorithmic components such as KL regularization or advantage estimation. Experiments with variations like non-regularized GRPO and classical REINFORCE reaffirm the central role of on-policy data.
Figure 3: KL regularization is not a major contributor to RL's lesser degree of forgetting.
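To make the ablated axis concrete, here is a minimal sketch (a toy bandit-style softmax policy, not the paper's setup or exact algorithms): the only structural difference between the two update rules below is where the training data comes from, the model's own samples versus a fixed external dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy single-prompt policy over 4 candidate responses. Response 1 solves the
# new task; reward is 1 for it, 0 otherwise.
reward = np.array([0.0, 1.0, 0.0, 0.0])
z_init = np.array([1.0, 0.0, -1.0, 0.0])   # logits of the "pretrained" policy
lr, batch = 0.5, 256

def reinforce_step(z):
    """On-policy update: responses are sampled from the *current* policy."""
    pi = softmax(z)
    actions = rng.choice(len(pi), size=batch, p=pi)
    adv = reward[actions] - reward[actions].mean()       # simple mean baseline
    onehots = np.eye(len(pi))[actions]
    grad = (adv[:, None] * (onehots - pi)).mean(axis=0)  # E[(r - b) * grad log pi(a)]
    return z + lr * grad

def sft_step(z, expert_actions):
    """Off-policy update: cross-entropy on a fixed external dataset."""
    pi = softmax(z)
    onehots = np.eye(len(pi))[expert_actions]
    grad = (onehots - pi).mean(axis=0)                   # E[grad log pi(a_expert)]
    return z + lr * grad

expert_actions = np.full(batch, 1)   # demonstrations always give the correct response
z_rl, z_sft = z_init.copy(), z_init.copy()
for _ in range(50):
    z_rl = reinforce_step(z_rl)
    z_sft = sft_step(z_sft, expert_actions)
print("on-policy  (REINFORCE):", np.round(softmax(z_rl), 3))
print("off-policy (SFT):      ", np.round(softmax(z_sft), 3))
```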
Approximately On-Policy Data
The study further explores whether partially on-policy setups can mitigate forgetting, and finds success with methods like Iterative-SFT, which fine-tune on approximately on-policy data regenerated from the current model at the start of each epoch. This approach substantially reduces forgetting while maintaining target-task performance, making it a practical and efficient alternative to fully on-policy RL; a sketch of the recipe follows the figure below.
Figure 4: Approximately on-policy data can suffice for mitigating forgetting in SFT.
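One plausible instantiation of this recipe, in the spirit of rejection-sampling fine-tuning, is sketched below. The helpers (`generate`, `is_correct`, `sft_train`) are hypothetical interfaces, and the correctness filter is our assumption, not necessarily the paper's exact procedure.

```python
def iterative_sft(model, prompts, n_epochs, generate, is_correct, sft_train):
    """Approximately on-policy SFT: refresh the training data from the current
    model at the start of every epoch (hypothetical helpers, see note above)."""
    for epoch in range(n_epochs):
        # 1) Regenerate the dataset from the *current* model so the training
        #    data stays approximately on-policy.
        dataset = []
        for prompt in prompts:
            completion = generate(model, prompt)   # sample from the current model
            if is_correct(prompt, completion):     # keep successful samples only
                dataset.append((prompt, completion))
        # 2) One epoch of standard SFT (cross-entropy) on the refreshed data.
        model = sft_train(model, dataset, epochs=1)
    return model
```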
Implications and Future Directions
The findings underscore the importance of on-policy data in post-training, advocating strategies that incorporate such data to reduce forgetting efficiently. The work opens avenues for further study of the theoretical underpinnings of on-policy data's role in adaptive learning and points toward continual learning frameworks that avoid catastrophic forgetting.
Conclusion
The paper provides a comprehensive exploration of forgetting mechanisms in LMs, highlighting RL's resilience due to on-policy data usage. These insights are integral for advancing LM adaptation techniques, emphasizing efficient data strategies to maintain model capabilities over time.
Figure 5: SFT exhibits a tradeoff between target task performance and forgetting.