A Quantitative Characterization of Forgetting in Post-Training

Published 12 Mar 2026 in cs.LG, cs.AI, math.ST, and stat.ML | (2603.12163v1)

Abstract: Continual post-training of generative models is widely used, yet a principled understanding of when and why forgetting occurs remains limited. We develop theoretical results under a two-mode mixture abstraction (representing old and new tasks), proposed by Chen et al. (2025) (arXiv:2510.18874), and formalize forgetting in two forms: (i) mass forgetting, where the old mixture weight collapses to zero, and (ii) old-component drift, where an already-correct old component shifts during training. For equal-covariance Gaussian modes, we prove that forward-KL objectives trained on data from the new distribution drive the old weight to zero, while reverse-KL objectives converge to the true target (thereby avoiding mass forgetting) and perturb the old mean only through overlap-gated misassignment probabilities controlled by the Bhattacharyya coefficient, yielding drift that decays exponentially with mode separation and a locally well-conditioned geometry with exponential convergence. We further quantify how replay interacts with these objectives. For forward-KL, replay must modify the training distribution to change the population optimum; for reverse-KL, replay leaves the population objective unchanged but prevents finite-batch old-mode starvation through bounded importance weighting. Finally, we analyze three recently proposed near-on-policy post-training methods, SDFT (arXiv:2601.19897), TTT-Discover (arXiv:2601.16175), and OAPL (arXiv:2602.19362), via the same lens and derive explicit conditions under which each retains old mass and exhibits overlap-controlled drift. Overall, our results show that forgetting can be precisely quantified based on the interaction between divergence direction, geometric behavioral overlap, sampling regime, and the visibility of past behavior during training.

Authors (2)

Summary

The paper rigorously quantifies catastrophic forgetting by distinguishing mass forgetting from old-component drift in generative models.
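The study contrasts forward- and reverse-KL objectives and shows that replay plays a different role under each: under forward-KL it must change the training distribution to prevent old-mode collapse, while under reverse-KL it guards against finite-batch old-mode starvation.
Applying the same analysis to SDFT, TTT-Discover, and OAPL yields explicit conditions under which each method retains old mass and keeps drift controlled by mode overlap.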

Introduction and Motivation

The study rigorously investigates catastrophic forgetting in post-training phases of generative models using a mixture model abstraction. Contending that prior work lacks a unified, quantitative theoretical account for when and why forgetting occurs, the authors formalize two distinct but often conflated phenomena: mass forgetting (collapse of prior-mode mixture weights) and old-component drift (parameter shift within the retained component). The analytic lens is grounded in the dichotomy between forward and reverse Kullback–Leibler (KL) divergence objectives and their population-level optima, especially in the context of continued post-training with data from distributions representing new behavior.

Formal Model and Definitions

The learning target is modeled as a two-mode mixture $p_\alpha(y) = \alpha\,p_{\mathrm{o}}(y) + (1-\alpha)\,p_{\mathrm{n}}(y)$ with $\alpha \in [0,1]$, and the learner fits a model mixture $q_\beta(y) = \beta\,q_{\mathrm{o}}(y) + (1-\beta)\,q_{\mathrm{n}}(y)$ with explicit mixture coefficient $\beta$ and component distributions $q_{\mathrm{o}}$ and $q_{\mathrm{n}}$. The central definitions are:

Mass Forgetting: Occurs if the population-optimal mixture weight is $\beta^\star = 0$ under the training objective, so that $q_\beta$ places no mixture weight on the old component even when $q_{\mathrm{o}}(y) = p_{\mathrm{o}}(y)$.
Old-Component Drift: Refers to the deviation of the old-component parameters from those of $p_{\mathrm{o}}$, even when mixture mass is retained.

The minimality of the setup (only two well-modeled Gaussian modes with shared covariance) ensures the separability of loss-induced forgetting from representational limitations.
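
As a minimal sketch of the setup above, the following Python defines the two-mode mixture in one dimension with unit variance and draws samples from it. The function names, the means at -3 and +3, and the weight 0.3 are illustrative assumptions, not values from the paper.

```python
import numpy as np

def gauss_pdf(y, mean, sigma=1.0):
    """Density of N(mean, sigma^2) at y."""
    z = (np.asarray(y, dtype=float) - mean) / sigma
    return np.exp(-0.5 * z * z) / (sigma * np.sqrt(2.0 * np.pi))

def mixture_pdf(y, weight_old, mu_old, mu_new, sigma=1.0):
    """weight_old * N(mu_old, sigma^2) + (1 - weight_old) * N(mu_new, sigma^2)."""
    return (weight_old * gauss_pdf(y, mu_old, sigma)
            + (1.0 - weight_old) * gauss_pdf(y, mu_new, sigma))

def sample_mixture(rng, n, weight_old, mu_old, mu_new, sigma=1.0):
    """Draw n samples: pick a component, then add Gaussian noise."""
    from_old = rng.random(n) < weight_old
    means = np.where(from_old, mu_old, mu_new)
    return means + sigma * rng.standard_normal(n)

# Illustrative (assumed) values: alpha = 0.3, modes at -3 and +3.
ALPHA, MU_OLD, MU_NEW = 0.3, -3.0, 3.0
rng = np.random.default_rng(0)
samples = sample_mixture(rng, 20_000, ALPHA, MU_OLD, MU_NEW)

# With well-separated modes, the fraction of samples on the old side
# of the midpoint is close to the old mixture weight alpha.
old_fraction = float(np.mean(samples < 0.0))

# Mass forgetting in this picture: a model weight beta = 0 leaves only
# the new component, however accurate the old component is.
grid = np.linspace(-10.0, 10.0, 2001)
collapsed = mixture_pdf(grid, 0.0, MU_OLD, MU_NEW)

print("old-side sample fraction:", old_fraction)
```

Because the old component never enters the density once its weight is zero, its parameters can be exactly right and the old behavior is still not produced; this is why the definitions treat mass and drift separately.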

Forward-KL versus Reverse-KL: Analytical Results

Forward-KL (SFT, New-Data-Only):

The forward-KL objective is equivalent to maximum likelihood with respect to the data distribution, which in new-data-only SFT is $p = p_{\mathrm{n}}$.
The authors prove that $L_{\mathrm{SFT}}(\beta) = \mathrm{KL}(p_{\mathrm{n}}\,\|\,q_\beta)$ is strictly increasing in $\beta$; its unique minimizer is $\beta^\star = 0$, regardless of component accuracy. The model therefore loses all old mixture mass. Gradient flow in logit space ($\beta = \sigma(\phi)$) confirms exponential decay in $\beta$, with update dynamics governed by exponentially small overlap-dependent terms for well-separated modes.
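
The monotonicity claim can be checked numerically in a toy case. The sketch below is not the authors' code: it assumes unit-variance 1-D Gaussians at -2 and +2, assumes the new component is already correct, and integrates by a Riemann sum. The logit-space gradient it uses, $dL/d\phi = \beta - \mathbb{E}_{p_{\mathrm{n}}}[r_{\mathrm{o}}(Y)]$ with $r_{\mathrm{o}}$ the old-component responsibility, is a standard mixture identity worked out here rather than quoted from the paper.

```python
import numpy as np

def gauss_pdf(y, mean, sigma=1.0):
    """Density of N(mean, sigma^2) at y."""
    z = (y - mean) / sigma
    return np.exp(-0.5 * z * z) / (sigma * np.sqrt(2.0 * np.pi))

# Integration grid and illustrative (assumed) modes at -2 and +2.
Y = np.linspace(-12.0, 12.0, 4801)
DY = Y[1] - Y[0]
Q_OLD = gauss_pdf(Y, -2.0)
Q_NEW = gauss_pdf(Y, 2.0)
P_NEW = Q_NEW  # assumption: the new component is already correct

def sft_loss(beta):
    """L_SFT(beta) = KL(p_n || q_beta), by a Riemann sum on the grid."""
    q_beta = beta * Q_OLD + (1.0 - beta) * Q_NEW
    return float(np.sum(P_NEW * np.log(P_NEW / q_beta)) * DY)

def logit_gradient(phi):
    """dL/dphi = beta - E_{p_n}[r_o(Y)], where beta = sigmoid(phi)."""
    beta = 1.0 / (1.0 + np.exp(-phi))
    q_beta = beta * Q_OLD + (1.0 - beta) * Q_NEW
    r_old = beta * Q_OLD / q_beta  # responsibility of the old component
    return beta - float(np.sum(P_NEW * r_old) * DY)

def run_logit_flow(phi0=0.0, step=0.05, n_steps=4000):
    """Euler discretization of gradient flow on phi; returns the beta path."""
    phi, path = phi0, []
    for _ in range(n_steps):
        path.append(1.0 / (1.0 + np.exp(-phi)))
        phi -= step * logit_gradient(phi)
    path.append(1.0 / (1.0 + np.exp(-phi)))
    return np.array(path)

betas = np.linspace(0.0, 1.0, 21)
losses = np.array([sft_loss(b) for b in betas])
beta_path = run_logit_flow()

print("grid minimizer of L_SFT:", betas[np.argmin(losses)])
print("beta at start and end of flow:", beta_path[0], beta_path[-1])
```

In this toy setting the loss grows with every step of the $\beta$ grid and the simulated old weight only ever decreases, which matches the direction of the stated result; the sketch makes no claim about the rate.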

Reverse-KL (KL-Regularized RL or On-Policy):

The reverse-KL population objective, $\min_{\theta}\mathrm{KL}(q_\theta\,\|\,p_\alpha)$, has a unique optimum that is aligned with the target: $(\beta^\star, m^\star) = (\alpha, \mu_{\mathrm{n}})$ achieves zero KL. At this stationary point, component drift is exponentially suppressed in the Mahalanobis separation $\delta$, with all update signals on $m_{\mathrm{o}}$ (the old mean) gated by overlap (Bhattacharyya coefficient).
The local geometry near the optimum is strongly convex with explicit lower bounds, ensuring exponential convergence of simple gradient flow methods.
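
For contrast with the forward-KL case, the sketch below evaluates the reverse-KL loss on a grid of $\beta$ values for a toy target. The target weight 0.3, the unit-variance modes at -2 and +2, and the helper names are assumptions made for illustration only.

```python
import numpy as np

def gauss_pdf(y, mean, sigma=1.0):
    """Density of N(mean, sigma^2) at y."""
    z = (y - mean) / sigma
    return np.exp(-0.5 * z * z) / (sigma * np.sqrt(2.0 * np.pi))

Y = np.linspace(-12.0, 12.0, 4801)
DY = Y[1] - Y[0]
ALPHA = 0.3  # assumed target old weight
P_OLD = gauss_pdf(Y, -2.0)
P_NEW = gauss_pdf(Y, 2.0)
P_ALPHA = ALPHA * P_OLD + (1.0 - ALPHA) * P_NEW

def reverse_kl(beta, old_mean=-2.0):
    """KL(q || p_alpha) for q = beta * N(old_mean, 1) + (1 - beta) * N(2, 1)."""
    q = beta * gauss_pdf(Y, old_mean) + (1.0 - beta) * P_NEW
    return float(np.sum(q * np.log(q / P_ALPHA)) * DY)

betas = np.linspace(0.0, 1.0, 101)
losses = np.array([reverse_kl(b) for b in betas])
best_beta = float(betas[np.argmin(losses)])

# Dropping the old mode entirely is penalized under reverse KL,
# and so is moving the old mean away from its true location.
loss_if_collapsed = reverse_kl(0.0)
loss_if_drifted = reverse_kl(ALPHA, old_mean=-2.5)

print("grid minimizer of reverse KL:", best_beta)
print("loss at beta = 0:", loss_if_collapsed)
print("loss with drifted old mean:", loss_if_drifted)
```

The grid minimizer sits at the target weight rather than at zero, so in this toy case the objective itself has no incentive to discard the old mode.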

The analysis is extended to $f$-divergences, finite-mixture, and log-concave component families, where analogous forgetting and retention regimes are observed.

Replay and Data Exposure: Disparate Roles under Different Objectives

Significant theoretical distinctions are drawn regarding the effect of replay:

For forward-KL, only replay that modifies the data distribution (numerator) alters the population optimum and can prevent $\beta$-collapse. Mixing a fraction $\lambda$ of old samples into the data moves the optimum to $\beta^\star = \lambda$. By contrast, denominator (model-side) replay serves only as a hard floor, not driven by optimization.
For reverse-KL, replay via mixture of old samples in batch construction does not alter the population objective but prevents “old-mode starvation” in finite-batch settings. Bounded importance weights ensure low-variance, unbiased gradient estimation and persistent sampling from the old mode.
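
The forward-KL statement $\beta^\star = \lambda$ can be illustrated with a grid search. This is a sketch under assumptions (unit-variance modes at -2 and +2, both components already correct, Riemann-sum integration); the function name is hypothetical.

```python
import numpy as np

def gauss_pdf(y, mean, sigma=1.0):
    """Density of N(mean, sigma^2) at y."""
    z = (y - mean) / sigma
    return np.exp(-0.5 * z * z) / (sigma * np.sqrt(2.0 * np.pi))

Y = np.linspace(-12.0, 12.0, 4801)
DY = Y[1] - Y[0]
P_OLD = gauss_pdf(Y, -2.0)
P_NEW = gauss_pdf(Y, 2.0)
BETAS = np.linspace(0.0, 1.0, 101)

def optimal_beta_with_replay(lam):
    """Grid minimizer of forward KL when a fraction lam of the data is old."""
    data = lam * P_OLD + (1.0 - lam) * P_NEW
    losses = []
    for beta in BETAS:
        q_beta = beta * P_OLD + (1.0 - beta) * P_NEW
        # Cross-entropy; it differs from KL(data || q_beta) only by a
        # term that does not depend on beta, so the minimizer is the same.
        losses.append(-np.sum(data * np.log(q_beta)) * DY)
    return float(BETAS[int(np.argmin(losses))])

replay_fractions = (0.0, 0.2, 0.5)
optima = [optimal_beta_with_replay(lam) for lam in replay_fractions]
for lam, beta_star in zip(replay_fractions, optima):
    print(f"replay fraction {lam:.1f} -> optimal old weight {beta_star:.2f}")
```

With no replay the minimizer is zero, and each replay fraction is matched by the optimal old weight, which is the sense in which numerator replay changes the population optimum.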

Application to Recent Near-On-Policy Post-Training Methods

The authors apply the mixture analysis to recently proposed algorithms:

SDFT (Self-Distillation Fine-Tuning): Behaves like a reverse-KL iteration towards a demonstration-anchored teacher with EMA smoothing. Prevents mass forgetting if the anchor exerts persistent influence, and drift remains bounded (and exponentially small with mode separation).
TTT-Discover: Uses an entropic, reward-tilted objective with a KL anchor. If the anchor coefficient is insufficient, old-mode collapse can still occur. However, when mass is retained, old mean drift is again overlap-gated and exponentially small.
OAPL: Constructs improvement targets via an exponential tilt of a frozen reference policy. Only modes present in this reference can be reweighted; cross-mode influence is localized and suppressed by small geometric overlap.
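
To make the reference-coverage point concrete, the sketch below applies a generic exponential tilt, target $\propto$ reference $\times \exp(\text{reward}/\tau)$, to a two-mode reference on a grid. It is not OAPL itself: the step-shaped reward, the temperature, the modes at -3 and +3, and the function name are all assumptions chosen for illustration.

```python
import numpy as np

def gauss_pdf(y, mean, sigma=1.0):
    """Density of N(mean, sigma^2) at y."""
    z = (y - mean) / sigma
    return np.exp(-0.5 * z * z) / (sigma * np.sqrt(2.0 * np.pi))

Y = np.linspace(-12.0, 12.0, 4801)
DY = Y[1] - Y[0]
Q_OLD = gauss_pdf(Y, -3.0)
Q_NEW = gauss_pdf(Y, 3.0)

def tilted_old_mass(ref_beta, reward_old, reward_new, temperature=1.0):
    """Mass that the tilted target places on the old side (y < 0)."""
    reference = ref_beta * Q_OLD + (1.0 - ref_beta) * Q_NEW
    # Hypothetical reward: one constant per side of the midpoint.
    reward = np.where(Y < 0.0, reward_old, reward_new)
    tilted = reference * np.exp(reward / temperature)
    tilted = tilted / (np.sum(tilted) * DY)  # normalize on the grid
    return float(np.sum(tilted[Y < 0.0]) * DY)

# Equal rewards leave the reference weights unchanged.
mass_equal = tilted_old_mass(0.5, 0.0, 0.0)
# A higher reward on the new side shrinks old mass but keeps it positive.
mass_new_favored = tilted_old_mass(0.5, 0.0, 2.0)
# A reference with no old component gains almost no old mass,
# even when the reward favors the old side.
mass_missing_in_ref = tilted_old_mass(0.0, 2.0, 0.0)

print(mass_equal, mass_new_favored, mass_missing_in_ref)
```

In this toy case the tilt can only rescale mass that the reference already has, so a reference that lacks the old mode leaves only the new component's tail on the old side.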

The quantitative bounds and necessary retention conditions for each algorithm are articulated with explicit dependence on mixture parameters, separation $\delta$, and overlap statistics.

Theoretical and Practical Implications

This analysis establishes that catastrophic forgetting in post-training is not merely a limitation of representation but, often, a direct population-level consequence of the chosen loss function and data regime. Off-policy objectives (e.g., forward-KL on new-only data) intrinsically induce forgetting by mass collapse, while on-policy or KL-regularized RL objectives can robustly preserve old behaviors—with any drift controlled by the exponentially decaying tail interaction of components.

Replay serves fundamentally different purposes depending on the underlying objective. For forward-KL, it is an essential ingredient for achieving any form of population-level retention; for reverse-KL and related on-policy methods, stochastic replay is a practical mechanism for addressing pathologies arising from poor coverage in finite-sample optimization.
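
The finite-sample role of replay under reverse-KL can be sketched with ordinary importance sampling. The construction below is an assumption about one way to realize bounded importance weighting, not the paper's estimator: batches are drawn from a proposal that mixes the model with a fraction `RHO` of old-mode replay, and samples are reweighted by model density over proposal density. All numeric values are illustrative.

```python
import numpy as np

def gauss_pdf(y, mean, sigma=1.0):
    """Density of N(mean, sigma^2) at y."""
    z = (y - mean) / sigma
    return np.exp(-0.5 * z * z) / (sigma * np.sqrt(2.0 * np.pi))

# Illustrative (assumed) values: small old weight, 25% replay, batch of 32.
BETA, RHO, BATCH = 0.02, 0.25, 32
MU_OLD, MU_NEW = -3.0, 3.0

def starvation_probability(old_fraction, batch_size):
    """Chance that a batch contains no sample drawn from the old component."""
    return (1.0 - old_fraction) ** batch_size

def sample_with_replay(rng, n):
    """Draw from the proposal g = (1 - RHO) * q_beta + RHO * q_old."""
    old_fraction = (1.0 - RHO) * BETA + RHO
    from_old = rng.random(n) < old_fraction
    return np.where(from_old, MU_OLD, MU_NEW) + rng.standard_normal(n)

def importance_weights(y):
    """w = q_beta / g; at most 1 / (1 - RHO) since g >= (1 - RHO) * q_beta."""
    q_old, q_new = gauss_pdf(y, MU_OLD), gauss_pdf(y, MU_NEW)
    q_beta = BETA * q_old + (1.0 - BETA) * q_new
    return q_beta / ((1.0 - RHO) * q_beta + RHO * q_old)

p_starve_on_policy = starvation_probability(BETA, BATCH)
p_starve_replay = starvation_probability((1.0 - RHO) * BETA + RHO, BATCH)

rng = np.random.default_rng(0)
y = sample_with_replay(rng, 200_000)
w = importance_weights(y)
# A weighted average under g estimates an expectation under q_beta,
# here the probability of landing on the old side (y < 0).
old_side_estimate = float(np.mean(w * (y < 0.0)))

print("P(no old sample), on-policy batch:", p_starve_on_policy)
print("P(no old sample), batch with replay:", p_starve_replay)
print("largest weight:", float(w.max()), "bound:", 1.0 / (1.0 - RHO))
print("old-side probability under q_beta (estimate):", old_side_estimate)
```

The weights correct for the extra old samples, so the estimated expectation still refers to the model distribution, while the chance of a batch with no old sample drops sharply.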

By quantifying exactly when forgetting is unavoidable and when it is negligible (or summable), this work provides an analytic platform for designing and evaluating continual learning methods in generative modeling scenarios.

Conclusion

The paper delivers a mathematically precise account of forgetting dynamics in post-training, dissecting the roles of objective choice, mixture geometry, and sampling regime. Forward-KL–based SFT exhibits unavoidable mass forgetting on new-only data; reverse-KL–based methods are naturally aligned to retention, with locality and exponential convergence. Replay’s function is context-dependent: it is requisite for retention in forward-KL, and an optimization stabilizer for reverse-KL. Analysis of modern post-training algorithms via this lens clarifies which mechanisms guarantee retention and under what quantitative thresholds. These insights elucidate which methodological attributes are necessary for robustness to catastrophic forgetting, and suggest principled directions for algorithmic development in future work on continual post-training of large generative models.

Reference:

"A Quantitative Characterization of Forgetting in Post-Training" (2603.12163)

Explain it Like I'm 14

Overview: What this paper is about

This paper studies why AI models sometimes “forget” old skills when they learn new ones, and how to stop that from happening. The authors look at post-training for generative models (like big text or image generators) and break the problem down into a simple picture with just two “modes” (think two behaviors): an old one and a new one. They show, with math, exactly when training makes a model:

drop the old behavior entirely (mass forgetting), or
keep some of the old behavior but accidentally shift it (old-component drift).

Then they explain which training goals avoid these problems and why.

The key questions, in plain language

The paper asks:

If a model knows an old skill and is learning a new one, when will training make it forget the old skill?
What kinds of training objectives (the “goal” the model tries to optimize) protect the old skill?
How does reusing old examples (replay) help for different objectives?
Do recent post-training methods behave more like the safe ones or the forgetful ones?

How they studied it (simple analogy + approach)

Imagine your model as a machine that makes two kinds of outputs:

“Old” outputs (what it already does well),
“New” outputs (what you want it to learn).

The model combines these with a mixing knob called a “weight” (how much attention is paid to old vs. new). The paper analyzes what happens to this mixing knob and to each output’s “center” (like the average pattern) during training.

To keep things clear and solvable, they model each behavior as a Gaussian—think two bell-shaped blobs on a map. The distance between blob centers measures how separate the old and new skills are. If the blobs barely overlap, the behaviors are easy to tell apart. If they overlap, they can get mixed up.

They compare two common training objectives:

Forward KL (think: “fit to the data I’m given”): This is like standard supervised fine-tuning (SFT) on whatever data you feed it. If you only give new examples, the model just tries to match the new behavior.
Reverse KL (think: “make my model samples look like a target mix”): This is like on-policy reinforcement learning (RL) with KL regularization, where you sample from the model and push it to match a target that includes both old and new behaviors.

They also study “replay” (feeding some old examples during training) and three recent near–on-policy methods (SDFT, TTT-Discover, OAPL) through the same lens.

A helpful picture they use:

“Responsibilities” are soft assignments of a sample to either old or new—like asking, “for this output, how likely is it that it came from the old blob?” Overlap between the blobs can cause “misassignment” (the new blob claims an old sample or vice versa).
Overlap is measured by the Bhattacharyya coefficient, which shrinks exponentially as the blobs move farther apart. Less overlap = fewer mistakes = less drift.
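
The two quantities in this picture can be computed directly for 1-D unit-variance blobs. In the sketch below, the closed form $\exp(-\delta^2/8)$ is the standard Bhattacharyya coefficient for equal-covariance Gaussians with Mahalanobis separation $\delta$; the comparison of misassignment against the coefficient uses a generic inequality, not the paper's specific bound, and the function names are illustrative.

```python
import numpy as np

def gauss_pdf(y, mean, sigma=1.0):
    """Density of N(mean, sigma^2) at y."""
    z = (y - mean) / sigma
    return np.exp(-0.5 * z * z) / (sigma * np.sqrt(2.0 * np.pi))

def old_responsibility(y, beta, mu_old, mu_new):
    """r_o(y): posterior probability that y came from the old blob."""
    num = beta * gauss_pdf(y, mu_old)
    return num / (num + (1.0 - beta) * gauss_pdf(y, mu_new))

def bhattacharyya_numeric(mu_old, mu_new, grid):
    """BC = integral of sqrt(p_o * p_n), by a Riemann sum."""
    dy = grid[1] - grid[0]
    overlap = np.sqrt(gauss_pdf(grid, mu_old) * gauss_pdf(grid, mu_new))
    return float(np.sum(overlap) * dy)

def bhattacharyya_closed_form(delta):
    """Equal-covariance Gaussians: BC = exp(-delta^2 / 8)."""
    return float(np.exp(-delta ** 2 / 8.0))

def misassignment(beta, mu_old, mu_new, grid):
    """Expected responsibility the new blob claims for old samples."""
    dy = grid[1] - grid[0]
    r_new = 1.0 - old_responsibility(grid, beta, mu_old, mu_new)
    return float(np.sum(gauss_pdf(grid, mu_old) * r_new) * dy)

GRID = np.linspace(-15.0, 15.0, 6001)
rows = []
for delta in (1.0, 2.0, 4.0, 6.0):
    mu_old, mu_new = -delta / 2.0, delta / 2.0
    rows.append((delta,
                 bhattacharyya_numeric(mu_old, mu_new, GRID),
                 bhattacharyya_closed_form(delta),
                 misassignment(0.5, mu_old, mu_new, GRID)))

for delta, bc_num, bc_exact, miss in rows:
    print(f"delta={delta:.0f}  BC={bc_num:.4f}  exp(-d^2/8)={bc_exact:.4f}  "
          f"misassignment={miss:.5f}")
```

As the separation grows, both the overlap and the misassignment shrink quickly, and with equal weights the misassignment stays below the Bhattacharyya coefficient at every separation tried here.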

Main findings (what they discovered and why it matters)

Here are the core results, explained simply:

Forward KL on new-only data causes mass forgetting.
- If you train only on new examples with a “fit-the-data” objective, the best solution (mathematically) is to set the old weight to zero. In other words, the model throws away the old behavior—even if it still “knows” how to produce it.
- Why? Because the training data never reward keeping the old behavior. The gradient keeps nudging the old weight down.
Reverse KL toward a mixed target avoids mass forgetting and limits drift.
- If your target explicitly says “keep a fraction α old and a fraction 1−α new,” reverse KL prefers exactly that mixture. It does not collapse the old weight.
- Drift (accidental shifting of the old behavior) comes only from misassignments caused by overlap. If the old and new behaviors are well separated, this misassignment is tiny and decreases exponentially with separation. That means the old behavior stays very stable.
Replay helps differently depending on the objective.
- With Forward KL (SFT):
  - Adding old samples to the model side only (like mixing them into the predictions) doesn’t fix the optimization target; it just imposes a minimum old weight from the outside. The model still “wants” to put weight on the new side because the data are new-only.
  - Adding old samples into the training data itself (the numerator) changes what’s being matched. Then the best solution includes that same fraction of old behavior, so it truly prevents collapse.
- With Reverse KL (RL-style):
  - Adding old samples doesn’t change the population objective because the target already includes old behavior. But in practice, replay helps avoid “old-mode starvation” in minibatches (when you rarely see old outputs during training), so the stochastic updates stay well-behaved.
Local training is well-conditioned for Reverse KL.
- Near the correct solution, the reverse-KL loss has nice geometry (like a well-shaped bowl), so gradient-based training converges exponentially fast and stably.
How three recent methods behave (very briefly):
- SDFT: Acts like reverse KL toward a teacher made from the model + a demonstrator. If the demonstrator is strong enough, it keeps old mass and keeps drift small.
- TTT-Discover: Has a mode-seeking tendency. Without a strong KL “anchor,” it can still collapse onto the higher-reward mode. But if anchored, it can retain mass; drift remains limited by overlap.
- OAPL: Improves relative to a frozen reference policy. It can’t create old mass that wasn’t present in the reference, but cross-mode influence is still small when overlap is small.

Why this matters (implications and impact)

Clear guidance for practice:
- If you only fine-tune on new data with a forward-KL-style objective (standard SFT), expect mass forgetting unless you include old data in the training set. Simply mixing old behavior on the model side won’t change the underlying incentive to forget.
- Reverse-KL-style, on-policy methods—common in modern RL-based post-training—naturally preserve old behavior when the target mixture includes it, and they only nudge the old behavior by a tiny amount when old and new are well separated.
Simple rules of thumb:
- Want to keep old behavior with SFT? Put old examples in the training data.
- Using reverse-KL/RL? Keep a target that explicitly includes old behavior, and use replay to stabilize minibatches.
- The more separated the behaviors are, the less the model will accidentally change the old one.
Broader lesson:
- Forgetting isn’t mysterious here—it’s a predictable outcome of the training objective, the data you show, and how much old behavior is visible during training. By choosing objectives and replay strategies carefully, you can precisely control retention versus forgetting.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of concrete gaps and open questions the paper leaves unresolved, aimed at guiding future research.

Beyond two modes: How do the conclusions (mass collapse under forward-KL; overlap-gated drift under reverse-KL) scale to K-mode mixtures, hierarchical behaviors, or a continuum of behaviors, including mode-splitting and merging events?
Unequal and unknown covariances: The core results assume equal-covariance Gaussians; what changes under unequal covariances, covariance misspecification, or non-Gaussian (e.g., heavy-tailed or multimodal) components, and can analogous overlap-controlled drift bounds be derived?
Identifiability under overlap: When modes overlap substantially, mixture parameters are not well-identified; what are sharp conditions under which reverse-KL still avoids mass forgetting and controls drift, and how do bounds degrade with identifiability?
Old component not perfectly “frozen”: The theory often assumes the old component is exactly correct and/or parameter-separated; how do results change when the old component is only approximately correct, shares parameters with the new component, or is entangled in a single neural network without an explicit mixture structure?
Estimating and tracking α in practice: Reverse-KL assumes a known target mixture weight α; how can α be robustly estimated and adapted online without reliable access to old data, and what are the retention guarantees under α misspecification?
Finite-sample guarantees for RL: The paper sketches “old-mode starvation” under minibatch sampling but does not give finite-sample, high-probability guarantees; what replay rates, batch sizes, and importance weight bounds prevent starvation with provable confidence?
Finite-sample SFT guarantees: Forward-KL results are population-level; what finite-sample conditions (batch sizes, number of steps, early stopping) ensure non-collapse or quantify rate of collapse under new-only SFT?
Basin of attraction for reverse-KL: The local Polyak-Łojasiewicz (PL) analysis gives exponential convergence near the optimum; what global conditions, initialization strategies, or algorithmic modifications guarantee reaching this basin in practice?
Sensitivity to covariance/model misspecification: If Σ is estimated or misspecified, how do misassignment probabilities, drift bounds, and convergence rates change, and can adaptive estimators control drift?
Small separation regime: For small δ (large overlap), how large can reverse-KL–induced drift be, and what thresholds delineate “acceptable” drift versus instability?
Extensions to other f-divergences: The appendices claim extensions; which divergences (e.g., α-divergences, JS) preserve the key dichotomy (forward-like mass collapse vs reverse-like retention), and what are the precise constants and regimes?
Regularization to prevent SFT collapse: Can explicit constraints or priors on β (e.g., entropy/Dirichlet priors, β-floor constraints, β-L2 penalties) provably prevent forward-KL mass collapse on new-only data without replay?
Sequential tasks (T > 2): How do mass and drift accumulate over many tasks with varying α_t, and what schedules (replay, anchors, demonstrators) control long-horizon retention?
Multi-modal “new” task: If p_n itself is multimodal, do on-policy reverse-KL updates still confine old-mode drift via overlap, and how does misassignment compound across multiple new sub-modes?
Mapping theory to LMs: How can δ or Bhattacharyya overlap be estimated in discrete, high-dimensional sequence spaces, and what are practical proxies for responsibilities and overlap in LLMs?
Practical targets for reverse-KL: Reverse-KL is aligned with p_α, but in RLHF/feedback settings the target is only indirectly specified; how do reward misspecification, proxy rewards, or shaping affect mass retention and drift?
Replay design: The paper distinguishes numerator vs denominator replay for SFT; what are optimal replay schedules (mixing ratios, adaptive sampling) under compute/memory constraints, and what are minimal memory requirements to achieve a target β?
Model-based replay: If replay uses samples from the current or a frozen model approximation of p_o, how do approximation errors affect mass retention and drift bounds?
Stronger guarantees for SDFT/TTT/OAPL: The mixture analysis yields conditions like “sufficiently strong anchor/demonstrator”; can these be translated into measurable algorithmic hyperparameters with tight, testable thresholds in practical settings?
Anchors and references with partial support: For OAPL-like methods, what happens when the frozen reference underweights or misses parts of the old mode; can additional mechanisms recover missing support without reintroducing forgetting?
High-dimensional scaling: As d grows, typical separations δ and overlap statistics change; can the exponential-in-δ² bounds be reframed in terms of dimension-dependent quantities, and when do they become vacuous/tight?
Optimization discretization effects: The analysis uses gradient flow; how do discrete steps, adaptivity (Adam), and large learning rates alter mass collapse or drift, and can step-size schedules guarantee retention?
Nonstationary settings: If p_o drifts slowly (e.g., evolving safety constraints) while learning p_n, how should α and objectives be adapted to track moving old behavior without catastrophic interference?
Identifiability and label switching: In realistic models, component identities may swap; what procedures (constraints, priors, architectural separation) guarantee consistent component assignment over time?
Early stopping and curriculum: Can time-limited SFT on new-only data retain nonzero old mass in practice; what training-time bounds quantify the trade-off between new-task learning and old-task mass collapse?
Alternative distances: Do Wasserstein/MMD-based objectives avoid mass collapse while controlling drift without requiring explicit replay or on-policy sampling?
Evaluation protocols: What empirical metrics operationalize “mass retention” and “old-component drift” in real LMs (beyond mixture models), and how should benchmarks be constructed to test overlap-gated predictions?
Adversarial/new-task design: What worst-case constructions maximize misassignment/drift under reverse-KL, and can robust objectives bound drift under adversarially chosen p_n?
Architectural mechanisms: How do modular architectures (adapters, LoRA, routing) change responsibilities and overlap, and can they be designed to provably reduce old-component drift?

Practical Applications

Overview

This paper provides a principled, quantitative account of when and why catastrophic forgetting occurs in post-training of generative models under a two-mode “old vs. new” abstraction. Key findings:

Forward-KL (e.g., new-only SFT) drives mass forgetting: the optimal old-mode weight collapses to zero unless old data are included in the training distribution.
Reverse-KL (e.g., KL-regularized RL or near-on-policy policy improvement) avoids mass forgetting and limits drift of the old component via overlap-gated misassignment probabilities that decay exponentially with mode separation (via the Bhattacharyya coefficient).
Replay plays fundamentally different roles: for forward-KL, it must alter the training distribution (numerator) to change the population optimum; for reverse-KL, it mitigates finite-batch “old-mode starvation” while keeping the population objective unchanged.
Near-on-policy methods (SDFT, TTT-Discover, OAPL) inherit reverse-KL-style stability under explicit conditions (e.g., demonstrator/anchor strength; coverage by frozen reference policy).

Below are concrete applications grouped by deployment horizon.

Immediate Applications

Use the following items to directly improve current post-training workflows across sectors. Each bullet includes sector tags and feasibility considerations.

Retention-aware choice of post-training objective
- Use reverse-KL-aligned updates (e.g., KL-regularized RL, on-policy distillation) when retaining prior capabilities is critical, since they provably keep old mass and bound drift by overlap.
- Sectors: software (LLMs, code models), healthcare (clinical assistants), education (tutors), robotics (policy updates), finance (risk-aware agents).
- Assumptions/dependencies: on-policy sampling infrastructure; a well-specified target mixture weight α (desired retention level); stable reward/teacher signals.
Correct replay strategy for SFT to prevent mass collapse
- If using forward-KL/SFT, mix old data directly into the training distribution at the desired fraction λ≈α (numerator replay). Model-side mixing alone only imposes an external floor and does not change the optimum.
- Sectors: LLM fine-tuning pipelines (customer adaptation), code completion systems, domain adaptation in enterprise NLP.
- Assumptions: access/rights to old data or high-fidelity synthetic replays; data engineering capacity to control batch composition.
Batch scheduling to prevent old-mode starvation in reverse-KL pipelines
- Maintain minibatch coverage of old behaviors and use bounded importance weights so stochastic gradients match the population reverse-KL gradient while avoiding starvation.
- Sectors: RLHF-style post-training, on-policy distillation (chat assistants, recommender systems), robotics.
- Assumptions: sampler that mixes on-policy and replayed samples; robust importance-weight clipping policies.
Hyperparameter rules for near-on-policy methods to retain old mass
- SDFT: ensure nonzero demonstrator strength; stronger demonstrators increase old-mass retention and bound drift.
- TTT-Discover: set KL anchor sufficiently strong; weak anchors can cause mode-seeking collapse.
- OAPL: choose a frozen reference policy that covers old modes with adequate weight to bound forgetting.
- Sectors: LLM post-training and self-training pipelines across industries.
- Assumptions: availability of demonstrator/teacher signals (SDFT), tuneable KL anchors (TTT), and an appropriate reference policy (OAPL).
Practical retention dashboards and alerts
- Track: estimated old-mode weight β, expected old responsibility on new data E[r_o(Y)|new], and overlap proxies (e.g., approximated Bhattacharyya coefficient).
- Trigger safeguards when β ≫ E[r_o] during SFT (indicates pressure toward collapse) or when minibatches lack old samples in reverse-KL.
- Sectors: MLOps for foundation models (platform teams), regulated deployments (healthcare, finance).
- Assumptions: tooling to estimate responsibilities (e.g., latent mixture probes, classifier approximations), logging governance.
Mixture-aware retention policies in deployment
- For safety-critical models, freeze the old component and learn the new component and β under reverse-KL-style objectives; expose α as a policy knob set by product or compliance teams.
- Sectors: healthcare decision support, financial advisory/chatbots, legal assistants, autonomous systems.
- Assumptions: mixture- or MoE-capable architectures; governance process to set α.
Data governance and budget allocation for replay
- Allocate replay storage/compute to numerator replay when using SFT; prioritize selective old samples representing critical behaviors to achieve desired α with minimal memory.
- Sectors: enterprise AI platforms, edge/on-device models (memory-limited keyboards, personalization).
- Assumptions: clear prioritization of “must-retain” behaviors; privacy constraints on storing old data.
Auditable retention for compliance
- Provide evidence (via β/α estimates and drift bounds) that updates did not erase validated behaviors; establish pre/post skill tests aligned with mixture targets.
- Sectors: healthcare (FDA/CE audits), finance (model risk management), education (curriculum coverage).
- Assumptions: documented target α; test suites spanning old/new distributions.
Safer skill-accumulation in robotics and embodied AI
- Use reverse-KL policy improvement anchored to past policies to learn new skills without degrading old ones; schedule replay for rare but safety-critical prior states.
- Sectors: industrial automation, warehouse robotics, autonomous driving (components).
- Assumptions: stable environment for on-policy sampling; safety constraints around exploration.
Code and language expansion without regression
- When adding new languages/frameworks to code/LLMs, apply reverse-KL or SFT with numerator replay at the desired retention ratio; monitor drift on canonical old tasks.
- Sectors: developer tools, multilingual assistants.
- Assumptions: curated old task corpora for replay; evaluation harnesses.
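
The dashboard quantities listed under "Practical retention dashboards and alerts" can be prototyped on the Gaussian toy model. The sketch below assumes the component densities are known, which is not the case for a real language model, and the alert threshold is a hypothetical rule standing in for the "β ≫ E[r_o]" condition.

```python
import numpy as np

def gauss_pdf(y, mean, sigma=1.0):
    """Density of N(mean, sigma^2) at y."""
    z = (y - mean) / sigma
    return np.exp(-0.5 * z * z) / (sigma * np.sqrt(2.0 * np.pi))

def collapse_pressure(new_samples, beta, mu_old, mu_new):
    """Return (estimate of E[r_o(Y) | new], beta minus that estimate)."""
    num = beta * gauss_pdf(new_samples, mu_old)
    r_old = num / (num + (1.0 - beta) * gauss_pdf(new_samples, mu_new))
    mean_r_old = float(np.mean(r_old))
    return mean_r_old, beta - mean_r_old

def should_alert(pressure, beta, ratio=0.5):
    """Hypothetical rule: alert when pressure exceeds ratio * beta."""
    return bool(pressure > ratio * beta)

rng = np.random.default_rng(0)
BETA = 0.4  # assumed current estimate of the old-mode weight

# Well-separated modes: new data are almost never credited to the old mode.
far_samples = 3.0 + rng.standard_normal(10_000)
far_mean, far_pressure = collapse_pressure(far_samples, BETA, -3.0, 3.0)

# Heavily overlapping modes: new data are often credited to the old mode.
near_samples = 0.25 + rng.standard_normal(10_000)
near_mean, near_pressure = collapse_pressure(near_samples, BETA, -0.25, 0.25)

print("separated:   E[r_o|new] =", far_mean, " alert =",
      should_alert(far_pressure, BETA))
print("overlapping: E[r_o|new] =", near_mean, " alert =",
      should_alert(near_pressure, BETA))
```

In a real pipeline the responsibilities would have to come from a proxy such as a classifier or mixture probe, as the assumptions in the list note.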

Long-Term Applications

These items require further research, scaling, or infrastructure development to fully realize.

Automated retention controllers (β/α managers)
- Adaptive systems that learn and enforce α across tasks by monitoring performance, overlap statistics, and validation risk, switching between reverse-KL and SFT+replay as conditions change.
- Dependencies: reliable online estimation of overlap (e.g., proxies for the Bhattacharyya coefficient) and task-specific utility/risk trade-offs.
Generalization to many modes and complex distributions
- Extend the two-mode analysis to multi-task, multi-domain, or MoE models with non-Gaussian/heteroscedastic components; develop scalable estimators of per-mode responsibilities and overlap.
- Dependencies: identifiable mixture structure, robust gating/assignment in high dimensions.
Retention-aware curriculum and scheduling
- Design curricula that modulate training order and α based on estimated separation δ; when δ is small (high overlap), emphasize stronger anchors/replay to prevent drift.
- Dependencies: automated δ/overlap measurement and curriculum optimization.
Privacy-preserving replay alternatives
- Develop on-policy, teacher-forced, or synthetic replay mechanisms that approximate numerator replay benefits without storing sensitive data; audit their equivalence to data replay for SFT and their variance properties for reverse-KL.
- Dependencies: generative fidelity and bias controls; legal frameworks for synthetic data.
Standardized retention benchmarks and certifications
- Create sector-specific standards quantifying mass retention (β≈α) and old-component drift under updates; require reporting in regulated industries.
- Dependencies: public datasets/benchmarks spanning old/new behaviors; consensus on acceptable drift thresholds.
Productized “retention-aware trainer” stacks
- Tooling that implements: (a) SFT with numerator replay at target α, (b) reverse-KL with old-mode starvation guards, (c) anchor calibration for SDFT/TTT/OAPL, and (d) dashboards for β, responsibilities, and drift.
- Dependencies: integration into common training frameworks; vendor support for on-policy sampling.
Dynamic reference policy management (OAPL-style)
- Systems for selecting/updating frozen reference policies to maintain coverage of legacy capabilities while enabling improvement on new tasks.
- Dependencies: policy/version registries; monitoring for coverage gaps.
Robust estimators of forgetting in black-box models
- Methods to infer β, responsibilities, and drift from inputs/outputs when internal access is limited (e.g., closed-weight models), enabling third-party audits and safety checks.
- Dependencies: probe models, calibrated uncertainty estimates, task-specific test construction.
Cross-domain governance policies for continual learning
- Regulatory guidance tying model update approvals to measurable retention (mass and drift), replay practices, and transparency about objective direction (forward vs reverse KL).
- Dependencies: interdisciplinary consensus; impact assessments for different sectors.
Resource-aware scheduling at scale
- Cluster schedulers that jointly optimize on-policy sampling, replay quotas, and compute costs while meeting retention SLAs across many concurrent model updates.
- Dependencies: systems support for mixed data pipelines and priority queues.
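
As a rough illustration of the automated δ/overlap measurement that the curriculum item above depends on, the sketch below computes the Mahalanobis separation δ for two equal-covariance Gaussian modes and the corresponding Bhattacharyya coefficient, which for this family has the closed form exp(-δ²/8). The `replay_ratio` schedule, its `base` and `max_ratio` parameters, and the linear interpolation are hypothetical choices for illustration only; the paper does not prescribe a specific schedule.

```python
import numpy as np


def mahalanobis_separation(mu_old, mu_new, cov):
    """delta = sqrt((mu_old - mu_new)^T cov^{-1} (mu_old - mu_new))."""
    diff = np.asarray(mu_old, dtype=float) - np.asarray(mu_new, dtype=float)
    # Solve cov @ z = diff rather than forming the inverse explicitly.
    z = np.linalg.solve(np.asarray(cov, dtype=float), diff)
    return float(np.sqrt(diff @ z))


def bhattacharyya_coefficient(delta):
    """Overlap of two equal-covariance Gaussians: exp(-delta^2 / 8)."""
    return float(np.exp(-(delta ** 2) / 8.0))


def replay_ratio(delta, base=0.1, max_ratio=0.5):
    """Hypothetical schedule: interpolate toward max_ratio as overlap grows."""
    overlap = bhattacharyya_coefficient(delta)
    return base + (max_ratio - base) * overlap


if __name__ == "__main__":
    delta = mahalanobis_separation([0.0, 0.0], [3.0, 4.0], np.eye(2))
    print("delta:", delta)
    print("overlap:", bhattacharyya_coefficient(delta))
    print("replay ratio:", replay_ratio(delta))
```

Well-separated modes give an overlap near zero and the schedule falls back to the base ratio; strongly overlapping modes push it toward the maximum, matching the qualitative advice to strengthen anchors or replay when δ is small.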

Notes on Key Assumptions and Dependencies

The core theory uses a two-mode abstraction and equal-covariance Gaussians; qualitative results extend to finite mixtures and strongly log-concave families (as indicated), but real models may deviate.
Reverse-KL benefits depend on sizeable on-policy sampling and stable target specification; SFT retention depends on the ability to mix old data into the training distribution.
Overlap-controlled drift relies on meaningful mode separation; when separation is small, stronger anchors or higher replay ratios are required.
Estimating β and responsibilities in practice requires either mixture-capable architectures (e.g., MoE) or proxy estimators/classifiers.
Legal and privacy constraints may limit storing or replaying old data; synthetic or on-policy replays can mitigate but require validation.
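
The dependence of SFT retention on mixing old data into the training distribution can be seen in a small simulation. The sketch below is a toy setup under stated assumptions, not the paper's experiment: two fixed one-dimensional unit-variance Gaussian modes, only the old-mode weight β = σ(φ) is trained, and the loss is the sample negative log-likelihood (forward-KL up to a constant). The function name `fit_old_weight`, the means ±4, the learning rate, and the step count are all illustrative choices. It uses the identity that the gradient of the average negative log-likelihood with respect to the logit φ equals β minus the mean old-mode responsibility.

```python
import numpy as np


def gaussian_pdf(x, mean, std=1.0):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def fit_old_weight(replay_fraction, n=4000, steps=2000, lr=0.5, seed=0):
    """Gradient descent on the logit of the old-mode weight (toy setup)."""
    rng = np.random.default_rng(seed)
    mu_old, mu_new = -4.0, 4.0  # illustrative, well-separated modes
    n_old = int(round(replay_fraction * n))
    # Training data: a fixed share of replayed old samples plus new samples.
    x = np.concatenate([
        rng.normal(mu_old, 1.0, n_old),
        rng.normal(mu_new, 1.0, n - n_old),
    ])
    p_old, p_new = gaussian_pdf(x, mu_old), gaussian_pdf(x, mu_new)

    phi = 0.0  # start at beta = 0.5
    for _ in range(steps):
        beta = 1.0 / (1.0 + np.exp(-phi))
        q = beta * p_old + (1.0 - beta) * p_new
        resp_old = beta * p_old / q  # old-mode responsibilities
        # d/dphi of mean(-log q) = beta - mean(resp_old)
        phi -= lr * (beta - resp_old.mean())
    return 1.0 / (1.0 + np.exp(-phi))


if __name__ == "__main__":
    for frac in (0.0, 0.1, 0.3):
        print(f"replay={frac:.1f} -> beta={fit_old_weight(frac):.4f}")
```

With no replay the learned old weight decays toward zero, while with replay it settles near the replay fraction, which is the sense in which replay has to change the training distribution to move the forward-KL optimum.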


Glossary

affine reparameterization: A transformation of parameters by an affine (linear plus shift) mapping, often used to re-express a model without changing its family. "it is exactly the original mixture family with an affine reparameterization"
Bhattacharyya coefficient: A symmetric measure of overlap between two probability distributions, used to bound misassignment/overlap effects. "controlled by the Bhattacharyya coefficient,"
bounded importance weighting: Capping or constraining importance weights to stabilize estimators or gradients when correcting sampling bias. "through bounded importance weighting."
catastrophic forgetting: A phenomenon in continual learning where performance on previously learned tasks degrades sharply after training on new tasks. "catastrophic forgetting, where performance on earlier tasks rapidly degrades."
disjoint-support: A setting where different components of a mixture have non-overlapping supports in the sample space. "In this disjoint-support limit, both forward and reverse-KL admit exact decompositions into a mixture-weight term and within-mode terms."
divergence-minimization: Training viewed as minimizing a divergence between model and target distributions (e.g., KL), aligning the model with desired behavior. "by viewing training procedures as a divergence-minimization or distribution-matching step"
exponential convergence: A rate of convergence where error decays proportionally to an exponential function of iterations/time. "a locally well-conditioned geometry with exponential convergence."
exponential tilt: Reweighting a base distribution or policy by an exponential factor (often of rewards or utilities) to form a new target. "its target is an exponential tilt of a frozen reference policy:"
f-divergence: A class of divergences between probability distributions defined via a convex function f, generalizing KL and others. "extensions to a class of $f$-divergence is provided"
forward-KL: The Kullback–Leibler divergence KL(p‖q) used as an objective, aligning the model to the data distribution. "The forward-KL is the population analogue of maximum likelihood on a “data” distribution $p$."
gradient flow: Continuous-time limit of gradient descent dynamics described by an ODE following the negative gradient direction. "under logit-parameterized gradient flow, the trajectory $\beta(t)$ (corresponding to the population training objective) decreases monotonically to $0$."
Hessian-Lipschitz constant: A bound on how quickly the Hessian changes with parameters, used to control local curvature variation. "be the Hessian-Lipschitz constant"
KL anchor: An explicit KL penalty to a reference distribution that keeps updates from drifting too far, preventing mode collapse. "without a sufficiently strong KL anchor it can still collapse mass"
KL divergence: A measure of difference between two probability distributions, asymmetric and widely used in learning objectives. "forward and reverse KL divergence based training objectives"
KL-regularized: An objective augmented with a KL penalty to a reference/target distribution to stabilize or constrain policy updates. "KL-regularized on-policy RL updates toward a target distribution"
logit: The inverse-sigmoid parameterization mapping probabilities in (0,1) to real numbers, often used for mixture weights. "Let $\phi\in\mathbb{R}$ be a logit with $\beta=\sigma(\phi)$"
Mahalanobis separation: Distance between means scaled by covariance, measuring separation of Gaussian components. "Define the Mahalanobis separation as"
mass forgetting: Collapse of the old component’s mixture mass to zero during training, eliminating its contribution. "Mass Forgetting (Mass Collapse): This occurs when the optimal mixture weight satisfies"
misassignment probabilities: Probabilities that samples from one component are (soft-)assigned to another under the model/target responsibilities. "the only terms capable of moving the old mode arise from misassignment probabilities"
mixture weight: The scalar coefficient(s) in a mixture model that allocate probability mass to each component. "the parameters to be learned consist of the mixture weight ($\beta$)"
mode-seeking: An objective’s tendency to concentrate mass on high-density or high-reward modes rather than covering all modes. "TTT-Discover's entropic objective is intrinsically mode-seeking:"
OAPL: A near-on-policy post-training method that updates toward an exponentially tilted reference policy. "OAPL behaves differently because its target is an exponential tilt of a frozen reference policy:"
off-policy: Learning using data generated by a behavior different from the current model/policy, which can cause mismatch issues. "catastrophic forgetting as mass collapse driven by off-policy training."
old-component drift: Parameter shift of the previously learned component away from its original (correct) distribution during new training. "Old-Component Drift: This occurs when, during continual training, the parameters of the learned old component $q_\mathrm{o}$ drift away from the true old distribution $p_\mathrm{o}$."
old-mode starvation: A stochastic failure mode where minibatches rarely include samples from the old mode, impeding its maintenance. "prevent finite-batch “old-mode starvation” through bounded importance weighting."
on-policy: Learning using data sampled from the current model/policy, aligning training distribution with the learner. "on-policy approaches that continually sample from the current model and train on the resulting data are widely used."
operator norm: The induced norm of a linear operator (matrix), equal to its largest singular value. "with $\|\cdot\|_2$ denoting the operator norm"
PAC-Bayes bounds: Generalization bounds derived via PAC-Bayesian theory, characterizing performance with prior/posterior distributions. "PAC-Bayes bounds for continual learning were established"
PL (Polyak–Łojasiewicz) analysis: A condition/analysis ensuring error decays geometrically if the gradient norm lower-bounds suboptimality. "a local Polyak–Łojasiewicz (PL) analysis shows that"
population objective: The loss defined with respect to the true underlying data distribution (as opposed to finite samples). "replay leaves the population objective unchanged"
population optimum: The parameter setting that minimizes the population-level loss (expectation over the true distribution). "the population optimum satisfies $\beta^\star>0$"
responsibility (mixture models): The posterior probability that a sample belongs to a given mixture component under the model. "Define the model responsibilities (posterior component probabilities under $q$):"
replay: Reusing past data (or samples from past policies) during training to mitigate forgetting. "Effect of Replay on SFT and RL."
SDFT: A self-distillation post-training method that updates using an evolving teacher generated from the model plus a demonstrator. "SDFT behaves like a reverse-KL update toward an evolving teacher distribution"
SFT (Supervised Fine-Tuning): Post-training via supervised objectives, often maximizing likelihood on labeled or preference-derived data. "In the context of the model above, forward-KL correspond to SFT-based training with only new data"
strongly log-concave densities: Distributions whose negative log-densities are strongly convex, implying favorable concentration and optimization properties. "extensions to finite-mixture and strongly log-concave densities are provided"
TTT-Discover: A near-on-policy post-training method using an entropic objective with a KL anchor to balance exploration and retention. "TTT-Discover~\cite{yuksekgonul2026learning}"
two-mode mixture model: A simplified abstraction with one “old” and one “new” component used to analyze forgetting behavior. "We aim to answer this question by studying the two-mode mixture model"
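
To make the "exponential tilt" and "KL anchor" entries concrete, the sketch below tilts a small discrete reference distribution by exp(reward / temperature) and renormalizes. This is the standard closed-form maximizer of expected reward minus temperature times KL to the reference; whether OAPL or TTT-Discover parameterize their temperature exactly this way is an assumption here, and the two-outcome "old"/"new" setup, the rewards, and the function name `exponential_tilt` are purely illustrative.

```python
import numpy as np


def exponential_tilt(ref_probs, rewards, temperature):
    """Return probabilities proportional to ref_probs * exp(rewards / temperature)."""
    ref = np.asarray(ref_probs, dtype=float)
    r = np.asarray(rewards, dtype=float)
    # Work in log space; log(0) = -inf keeps zero-reference outcomes at zero.
    with np.errstate(divide="ignore"):
        logits = np.log(ref) + r / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


if __name__ == "__main__":
    ref = [0.5, 0.5]      # index 0: old behavior, index 1: new behavior
    rewards = [0.0, 1.0]  # only the new behavior is rewarded
    for temperature in (0.1, 1.0, 10.0):
        old_mass = exponential_tilt(ref, rewards, temperature)[0]
        print(f"temperature={temperature}: old mass={old_mass:.4f}")
```

A larger temperature (a stronger anchor) keeps the tilted distribution close to the reference and retains more old mass, while a small temperature concentrates mass on the rewarded outcome. Outcomes with zero reference probability stay at zero regardless of reward.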

