RL's Razor: Minimizing KL in Continual Learning
- RL’s Razor is the inductive principle that online reinforcement learning selects, among policies solving a new task, those with minimal KL divergence from the base policy, which reduces catastrophic forgetting.
- It explains why online RL fine-tuning preserves original capabilities better than supervised fine-tuning: the adapted policy stays closely aligned with the base distribution.
- Empirical studies in language modeling and robotics validate that minimal KL shifts achieve high new-task performance while mitigating loss of prior skills.
RL’s Razor is the inductive principle stating that, among all candidate solutions to a new reinforcement learning (RL) task, online RL is naturally biased toward those solutions closest—by Kullback–Leibler (KL) divergence—to the original policy. This principle underlies the observation that RL adaptation preserves prior knowledge and capabilities significantly better than supervised fine-tuning (SFT), even when both achieve similar performance on the new task. RL’s Razor quantitatively links the degree of post-adaptation forgetting to the distributional shift measured as KL divergence between the fine-tuned and base policy, and provides both theoretical and empirical justification for preferring KL-minimal policy updates in continual and lifelong learning.
1. Principle and Formal Statement
RL’s Razor posits that, given multiple policies achieving optimal or acceptable reward on a new task, online RL selects solutions that are minimally shifted from the original policy $\pi_0$, as measured by the KL divergence $\mathrm{KL}(\pi \,\|\, \pi_0)$. Formally,

$$\pi_{\text{new}} = \arg\min_{\pi \in \Pi_\tau \cap \Pi_\Theta} \mathrm{KL}\!\left(\pi \,\|\, \pi_0\right),$$

where $\Pi_\tau$ is the set of task-solving policies and $\Pi_\Theta$ is the representable policy family (Shenfeld et al., 4 Sep 2025).
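As a concrete illustration, consider a minimal NumPy sketch on a single prompt with a made-up five-way output distribution, under the assumption that "task-solving" means placing all probability mass on correct outputs. The KL-minimal task-solving policy is then the base policy renormalized over the correct outputs, and it sits far closer to $\pi_0$ than, for example, a one-hot policy on a single correct answer; all numbers and names below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Toy illustration of RL's Razor on one prompt with 5 candidate outputs.
# pi_0 is the base policy; outputs 1 and 3 are "correct" for the new task.
pi_0 = np.array([0.40, 0.25, 0.20, 0.10, 0.05])
correct = np.array([0, 1, 0, 1, 0], dtype=bool)

def kl(p, q, eps=1e-12):
    """KL(p || q) over a discrete support, skipping zero-probability terms of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log((p[mask] + eps) / (q[mask] + eps))))

# KL-minimal task-solving policy: pi_0 restricted to correct outputs, renormalized.
pi_razor = np.where(correct, pi_0, 0.0)
pi_razor /= pi_razor.sum()

# Another task-solving policy: all mass on one arbitrary correct output,
# e.g. what cross-entropy training on a single labeled answer could produce.
pi_onehot = np.zeros_like(pi_0)
pi_onehot[3] = 1.0

print("KL(pi_razor  || pi_0):", kl(pi_razor, pi_0))   # small shift (~1.05 nats)
print("KL(pi_onehot || pi_0):", kl(pi_onehot, pi_0))  # much larger shift (~2.30 nats)
```

Both toy policies solve the task, but only the renormalized one is the minimizer picked out by the razor.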
This selection mechanism is implicit in standard RL algorithms and is a direct consequence of on-policy gradient updates: because the trajectory distribution is induced by the current policy $\pi_\theta$ (initialized at $\pi_0$), the updates reweight outputs that are already common or probable under the policy, and avoid trajectories that would radically shift the policy distribution.
KL divergence thus serves as the key quantitative indicator for catastrophic forgetting: a larger KL shift on the new task correlates with a greater loss of previously acquired capabilities.
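Because the principle treats the KL shift on the new task as the forgetting indicator, it is natural to estimate it during adaptation. The sketch below is an illustrative Monte Carlo estimator of $\mathrm{KL}(\pi_{\text{new}} \,\|\, \pi_0)$ from completions sampled from the fine-tuned policy; the synthetic discrete check and all variable names are assumptions for demonstration, not part of the original study.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_kl_estimate(logp_new, logp_base):
    """Monte Carlo estimate of KL(pi_new || pi_0) from completions sampled from
    the fine-tuned policy: mean of log pi_new(y|x) - log pi_0(y|x)."""
    return float(np.mean(np.asarray(logp_new) - np.asarray(logp_base)))

# Synthetic check on a known discrete case: sample from pi_new, score under both.
pi_0   = np.array([0.40, 0.25, 0.20, 0.10, 0.05])
pi_new = np.array([0.10, 0.55, 0.20, 0.10, 0.05])   # hypothetical fine-tuned policy

samples = rng.choice(len(pi_new), size=50_000, p=pi_new)
est = mc_kl_estimate(np.log(pi_new[samples]), np.log(pi_0[samples]))
exact = float(np.sum(pi_new * np.log(pi_new / pi_0)))
print(f"MC estimate: {est:.4f}   exact: {exact:.4f}")
```

With a language model, the same estimator would be fed sequence-level log-probabilities of sampled completions under the fine-tuned and base checkpoints.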
2. Contrast with Supervised Fine-Tuning (SFT)
While both RL and SFT achieve comparable accuracy on new tasks, RL fine-tuning leads to substantially lower forgetting of prior capabilities. SFT relies on a cross-entropy loss with labels drawn from the target task's distribution, which may lie arbitrarily far from $\pi_0$:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim \mathcal{D}_{\text{new}}}\left[\log \pi_\theta(y \mid x)\right].$$

This can drive the model toward solutions that are far from the base distribution and, if the supervision signal is unrepresentative, cause catastrophic forgetting.
In contrast, RL with policy gradients employs trajectories sampled from the current policy and advantage-weighted updates:

$$\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \mathcal{D}_{\text{new}},\; y \sim \pi_\theta(\cdot \mid x)}\left[A(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x)\right].$$

This moves the policy within regions of policy space where $\pi_\theta$ already assigns mass, thus reducing the KL shift and, by extension, the degree of forgetting.
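To make the contrast concrete, the following minimal PyTorch sketch computes both objectives for a toy categorical policy; the toy reward, sample size, and names are illustrative assumptions, not the setup used in the paper. The structural difference is visible directly: the SFT loss scores an externally supplied label, while the policy-gradient loss scores only completions drawn from the current policy.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, hidden = 8, 16

# A tiny categorical "policy": logits over `vocab` candidate outputs for one prompt.
policy = torch.nn.Linear(hidden, vocab)
x = torch.randn(1, hidden)  # stand-in for a prompt representation

# SFT-style objective: cross-entropy toward an external label y*,
# regardless of how improbable y* is under the current policy.
y_star = torch.tensor([5])
sft_loss = F.cross_entropy(policy(x), y_star)

# On-policy RL-style objective: sample completions from the *current* policy
# and weight their log-probabilities by a reward-derived advantage.
with torch.no_grad():
    sampler = torch.distributions.Categorical(logits=policy(x))
    y = sampler.sample((64,)).squeeze(-1)   # completions drawn from pi_theta itself
    reward = (y % 2 == 0).float()           # toy binary reward
    advantage = reward - reward.mean()      # simple mean baseline
logp = torch.distributions.Categorical(logits=policy(x)).log_prob(y)
pg_loss = -(advantage * logp).mean()

print(f"SFT loss: {sft_loss.item():.3f}   policy-gradient loss: {pg_loss.item():.3f}")
```

Gradients of `pg_loss` only ever flow through outputs the current policy actually sampled, which is the mechanism behind the reduced KL shift described above.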
Empirically, Pareto-frontier analyses in both language modeling and robotics show that RL achieves high new-task performance while retaining prior abilities far better than SFT, which typically exhibits severe degradation when optimized for the new task.
3. Theoretical Mechanism: KL-Minimal Policy Projection
The optimal fine-tuned policy under RL can be interpreted as an $I$-projection:

$$\pi_{\text{RL}} = \arg\min_{\pi \in \Pi_\tau \cap \Pi_\Theta} \mathrm{KL}\!\left(\pi \,\|\, \pi_0\right).$$

RL algorithms iteratively project toward policies that change minimally with respect to $\pi_0$ while enforcing the new-task constraint $\pi \in \Pi_\tau$, converging to KL-minimal solutions within the set of policies consistent with high reward. This is distinct from SFT, which solves

$$\pi_{\text{SFT}} = \arg\min_{\pi \in \Pi_\Theta} \mathbb{E}_{(x,y)\sim \mathcal{D}_{\text{new}}}\left[-\log \pi(y \mid x)\right],$$

with potentially arbitrary alignment to $\pi_0$.
In binary-reward settings, policy gradient updates naturally minimize the KL divergence to the base policy, $\arg\min_{\pi} \mathrm{KL}(\pi \,\|\, \pi_0)$, subject to new-task performance, and this iterative projection can be interpreted as a form of expectation-maximization (EM).
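For intuition, assume a binary reward $r(x,y) \in \{0,1\}$ and write the task constraint as zero probability on failing outputs. The KL-minimal task-solving policy then has a closed form: it is the base policy conditioned on success,

$$
\begin{aligned}
\pi^{*} &= \arg\min_{\pi}\ \mathrm{KL}\!\left(\pi \,\|\, \pi_0\right)
\quad \text{s.t.}\quad \pi(y \mid x) = 0 \ \text{whenever}\ r(x,y) = 0, \\
\pi^{*}(y \mid x) &= \frac{\pi_0(y \mid x)\,\mathbf{1}[r(x,y)=1]}{\sum_{y'} \pi_0(y' \mid x)\,\mathbf{1}[r(x,y')=1]} \;=\; \pi_0\!\left(y \mid x,\ r=1\right),
\end{aligned}
$$

so the razor's selected policy only reweights mass among outputs the base model already produces.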
4. Empirical Validation
Extensive experiments in LLMs (across math reasoning, scientific QA, tool-use), robotic manipulation (pick-and-place), and synthetic datasets demonstrate:
- RL fine-tuning achieves high accuracy on the new task with minimal loss on unrelated prior tasks, as evidenced by task-agnostic benchmarks (e.g., HellaSwag, TruthfulQA).
- SFT, even when driven to similar new-task accuracy, incurs dramatic forgetting on these benchmarks due to a larger KL divergence from $\pi_0$.
- Controlled studies (ParityMNIST) confirm that catastrophic forgetting is predicted by KL divergence rather than the optimization method per se: SFT guided to minimize KL achieves similar retention to RL.
- Retention after RL fine-tuning is robust across tasks and model scales, confirming that KL-minimal adaptation is the underlying mechanism.
5. Implications for Continual Learning and Post-Training Adaptation
RL’s Razor suggests a design axis for algorithms in continual learning: minimizing distributional shift (KL divergence) from the base policy is paramount for preserving prior skills and knowledge. Rather than optimizing solely for new-task returns, post-training updates should target KL-minimal solutions within the admissible policy space. Hybrid methods, combining supervised sample efficiency with RL-style KL constraints, may achieve preferable trade-offs.
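One way such a hybrid could be operationalized is to add an explicit KL penalty toward the base policy on top of the standard cross-entropy objective. The sketch below is a hypothetical illustration (the function name, the coefficient `beta`, and the toy tensors are assumptions, not the paper's method):

```python
import torch
import torch.nn.functional as F

def kl_regularized_sft_loss(policy_logits, base_logits, labels, beta=0.1):
    """Hypothetical hybrid objective: cross-entropy on new-task labels plus a
    penalty on KL(pi_theta || pi_0), computed per prompt from the two logit sets."""
    ce = F.cross_entropy(policy_logits, labels)
    logp_new = F.log_softmax(policy_logits, dim=-1)
    logp_base = F.log_softmax(base_logits, dim=-1)
    # KL(pi_theta || pi_0) = sum_y pi_theta(y) * (log pi_theta(y) - log pi_0(y))
    kl = (logp_new.exp() * (logp_new - logp_base)).sum(dim=-1).mean()
    return ce + beta * kl

# Toy usage with random logits standing in for model outputs.
torch.manual_seed(0)
policy_logits = torch.randn(4, 10, requires_grad=True)  # batch of 4, vocab of 10
base_logits = torch.randn(4, 10)                        # frozen base-model logits
labels = torch.randint(0, 10, (4,))
loss = kl_regularized_sft_loss(policy_logits, base_logits, labels)
loss.backward()
print(float(loss))
```

The coefficient `beta` trades off new-task fit against distributional shift, directly exposing the KL axis that RL's Razor identifies as governing forgetting.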
For long-lived agents and foundation models, adhering to RL’s Razor ensures resilience against catastrophic forgetting when continuously adapting to new objectives or data distributions. This is especially critical in safety-critical or multi-task settings where retention of generalized capabilities is decisive.
6. Practical Considerations and Future Directions
The quantification of KL divergence as the predictor for forgetting supports monitoring and controlling distributional shift during adaptation. RL’s Razor motivates new metrics, training objectives, and regularization schemes that foreground KL as the primary constraint in evolving foundation models. A plausible implication is that scalable continual learning systems should explicitly regularize policy updates to remain as close as possible to the base distribution, thus operationalizing RL’s Razor for lifelong learning regimes.
RL’s Razor thereby reframes the trade-off in adaptive learning: among all ways to solve a new task, reinforcement learning naturally prefers solutions closest in KL divergence to the original model, leading to maximal retention of prior knowledge.