- The paper introduces FADE, a meta-gradient descent method that dynamically adjusts per-parameter weight decay in online continual learning.
- It demonstrates superior performance across regression, teacher-student, and label permutation benchmarks with lower MSE and higher accuracy than fixed decay baselines.
- The method decouples learning and decay rates, providing robust control over the stability-plasticity tradeoff and reducing sensitivity to initialization.
Learning to Forget in Continual Learning: FADE—Forgetting through Adaptive Decay
Introduction and Motivation
Continual learning systems are inherently constrained by finite model capacity and the stability-plasticity dilemma: agents must retain previously acquired, useful knowledge while remaining capable of rapidly integrating novel information. Existing approaches frequently address catastrophic forgetting through architectural interventions, replay buffers, or regularization-based penalties, but principled mechanisms for controlled, judicious forgetting remain underexplored at the weight level, particularly in non-stationary online learning. Traditionally, weight decay has served as a uniform inductive bias toward simplification, but its role as an explicit, parameter-selective forgetting mechanism in continual learning is poorly understood and rarely exploited.
"Learning to Forget: Continual Learning with Adaptive Weight Decay" (2604.27063) introduces FADE (Forgetting through Adaptive Decay), a framework that adaptively tunes per-parameter weight decay rates online through meta-gradient descent. The core methodology diverges from fixed, scalar decay schedules by enabling meta-learned, dynamic, parameterwise forgetting, yielding substantial improvements across sequential learning benchmarks.
FADE is derived for the online linear regression setting, with direct extensibility to the final layer (“head”) of general neural architectures. Each parameter $w_i$ is regularized with an exponentially parameterized decay rate $\lambda_i = \exp(y_i)$, where $y_i$ is a meta-learned parameter. Adaptation is realized by differentiating the prediction loss with respect to $y_i$ via a forward-mode meta-gradient trace $g_i$, using an IDBD-style approximation: each meta-parameter update influences only its corresponding weight. This trace is updated online alongside $w_i$.
The core update rules are:
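- Per-parameter decay: $\lambda_i = \exp(y_i)$
- Meta-parameter update: $y_{i,t+1} = y_{i,t} + \beta\,\delta_t\,x_{i,t}\,g_{i,t}$
- Sensitivity trace: $g_{i,t+1} = \left[\,g_{i,t}\left(1 - \lambda_{i,t+1} - \alpha x_{i,t}^2\right) - \lambda_{i,t+1}\,w_{i,t}\,\right]^{+}$
- Weight update: $w_{i,t+1} = (1 - \lambda_{i,t+1})\,w_{i,t} + \alpha\,\delta_t\,x_{i,t}$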
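The trace and meta-parameter rules can be recovered from the weight update by a short chain-rule argument. The derivation below is a sketch under assumptions not stated in the text: a squared loss $\tfrac{1}{2}\delta_t^2$, a target symbol $z_t$ (hypothetical notation), and the reading of $g_{i,t}$ as an approximation of $\partial w_{i,t} / \partial y_i$. It does not account for the bracketed "+" that appears in the trace rule.

```latex
\begin{align*}
\delta_t &= z_t - \sum_j w_{j,t}\,x_{j,t},
  \qquad \frac{\partial \lambda_i}{\partial y_i} = \lambda_i,
  \qquad \frac{\partial \delta_t}{\partial y_i} \approx -x_{i,t}\,g_{i,t}
  \quad \text{(IDBD-style: only } w_i \text{ depends on } y_i\text{)} \\[4pt]
g_{i,t+1} &\approx \frac{\partial}{\partial y_i}
  \Big[(1-\lambda_i)\,w_{i,t} + \alpha\,\delta_t\,x_{i,t}\Big] \\
  &= (1-\lambda_i)\,g_{i,t} - \lambda_i\,w_{i,t} - \alpha\,x_{i,t}^2\,g_{i,t} \\
  &= g_{i,t}\big(1 - \lambda_i - \alpha\,x_{i,t}^2\big) - \lambda_i\,w_{i,t} \\[4pt]
\frac{\partial}{\partial y_i}\,\tfrac{1}{2}\delta_t^2
  &\approx -\,\delta_t\,x_{i,t}\,g_{i,t}
  \;\;\Longrightarrow\;\;
  y_{i,t+1} = y_{i,t} + \beta\,\delta_t\,x_{i,t}\,g_{i,t}
\end{align*}
```

The $-\lambda_i w_{i,t}$ term is what distinguishes this trace from a step-size trace: raising the decay rate shrinks the weight directly, in proportion to its current value.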
FADE adds minimal per-step overhead and introduces only one new scalar hyperparameter, the meta-step-size $\beta$. The meta-gradients driving the decay rates are based on the observed total prediction loss across time, as in forward-mode meta-learning [Xu et al., 2018].
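The four update rules can be sketched in a few lines of plain Python for the linear case. This is a minimal illustration, not the authors' implementation: the function name `fade_step`, the default values of `alpha` and `beta`, and the order of operations within a step are assumptions, and the bracketed "+" on the trace rule is not applied because its scope is unclear from the text.

```python
import math


def fade_step(w, y, g, x, target, alpha=0.01, beta=0.001):
    """Apply one FADE-style update to a linear predictor.

    w, y, g, x are equal-length lists: weights, decay meta-parameters,
    sensitivity traces, and the current input. Returns (w, y, g, delta).
    Hypothetical helper; defaults are placeholders, not values from the paper.
    """
    # Prediction error delta_t, computed with the pre-update weights w_t.
    delta = target - sum(wi * xi for wi, xi in zip(w, x))

    new_w, new_y, new_g = [], [], []
    for wi, yi, gi, xi in zip(w, y, g, x):
        # Meta-parameter update: y_{t+1} = y_t + beta * delta_t * x_t * g_t
        yi_next = yi + beta * delta * xi * gi
        # Per-parameter decay: lambda_{t+1} = exp(y_{t+1})
        lam = math.exp(yi_next)
        # Sensitivity trace, using the pre-update weight w_t (no [.]^+ applied).
        gi_next = gi * (1.0 - lam - alpha * xi * xi) - lam * wi
        # Weight update: decay toward zero, then take the error-driven step.
        wi_next = (1.0 - lam) * wi + alpha * delta * xi
        new_w.append(wi_next)
        new_y.append(yi_next)
        new_g.append(gi_next)
    return new_w, new_y, new_g, delta


# Example state for a 3-feature problem with an initial decay rate of 0.01.
weights = [0.0, 0.0, 0.0]
meta = [math.log(0.01)] * 3
trace = [0.0, 0.0, 0.0]
weights, meta, trace, err = fade_step(weights, meta, trace, [1.0, 0.0, 2.0], 1.0)
```

Because the trace starts at zero, the first step leaves the decay rates unchanged; they begin to move only once the weights, and hence the trace, are non-zero.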
In deep non-linear networks, the meta-gradient-based adaptation is applied to the weights of the final (logit) layer, paired with any traditional optimizer (SGD/Adam) on the hidden representations. Decay rate adaptation coexists with per-parameter step size adaptation (e.g., IDBD), yielding a disentangled mechanism for both plasticity and forgetting.
Empirical Results
FADE is evaluated on a suite of online continual learning scenarios, showing robust, significant improvements over baselines:
1. Online Linear Tracking
On a 20-dimensional online regression problem with periodic non-stationarity, FADE learns to assign high decay to weights encoding irrelevant or volatile features and low decay to stably informative weights. Compared with SGD with fixed decay and with IDBD alone, FADE + IDBD achieves the lowest mean squared error (MSE) in the noiseless case, outperforming either adaptive decay or adaptive step size alone. Adaptive decay and adaptive step size are thus empirically complementary.
2. Nonlinear Teacher-Student Tracking
A neural teacher model with layered non-stationarity (stable, slow-changing, and fast-changing outputs) is used to produce streaming targets for a student MLP. Applying FADE to the final layer:
- Over the final 500,000 steps, FADE+SGD achieves an MSE nearly a factor of two lower than that of AdamW.
- Per-group MSE curves show that FADE dynamically adjusts decay rates across outputs with distinct temporal statistics, outperforming any fixed decay initialization.
FADE also demonstrates superior robustness to the initial decay rate and meta-step size, in marked contrast to the brittle performance of fixed head decay schedules.
3. Streaming Image Classification under Recurrent Label Permutations
The label-permuted EMNIST benchmark imposes abrupt, global non-stationarity by permuting class labels every 2500 steps:
- FADE+SGD achieves the highest average online accuracy over the full run, substantially exceeding weight clipping, SGD with head decay (at both its optimal and its worst-case initialization), and AdamW.
- FADE’s adaptivity to the decay parameter leads to strong performance even under extremely poor initializations, while fixed-decay SGD exhibits highly variable performance.
Partial label permutation, with some classes remaining stable, further highlights FADE’s adaptivity; FADE+SGD again outperforms all baselines in accuracy.
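The label-permutation protocol described above can be sketched as a wrapper around any stream of `(input, label)` pairs. This is an illustrative reconstruction rather than the benchmark's actual code: the function name `permuted_label_stream`, the `stable` argument for the partial-permutation variant, and the choice to start with the identity mapping are all assumptions, and loading EMNIST itself is left out.

```python
import random


def permuted_label_stream(samples, num_classes, period=2500, stable=(), seed=0):
    """Yield (x, remapped_label), redrawing the label mapping every `period` steps.

    Classes listed in `stable` keep their original label throughout, which
    mimics the partial-permutation setting. Hypothetical helper.
    """
    rng = random.Random(seed)
    stable_set = set(stable)
    movable = [c for c in range(num_classes) if c not in stable_set]
    perm = list(range(num_classes))  # assumed: identity mapping at the start

    for step, (x, label) in enumerate(samples):
        if step > 0 and step % period == 0:
            # Draw a fresh permutation over the non-stable classes only.
            shuffled = movable[:]
            rng.shuffle(shuffled)
            for src, dst in zip(movable, shuffled):
                perm[src] = dst
        yield x, perm[label]


# Toy usage: 4 classes, class 0 held stable, mapping redrawn every 5 steps.
toy = [(i, i % 4) for i in range(10)]
stream = list(permuted_label_stream(toy, num_classes=4, period=5, stable=(0,)))
```

Each redraw invalidates the input-to-label mapping stored in the head while leaving the inputs untouched, which is why forgetting in the final layer matters so much in this setting.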
Analysis of Scope and Limits
Applying FADE to hidden layers using the linear meta-gradient approximation provides only moderate gains, with clear performance gaps relative to head-only adaptation. This suggests that naively extending FADE to deep, non-linear stacks is insufficient, motivating future meta-gradient design that accounts for inter-layer non-linear interactions.
FADE stands out from prior adaptive decay approaches that focus on stationary batch training for regularization [Ishii & Sato, 2017; Nakamura & Hong, 2019; Xie et al., 2023]. It is architected specifically for online, non-stationary, task-ambiguous environments, and leverages meta-gradient descent at the per-parameter level rather than global or module-wise heuristics. The framework extends the principle of learned 'forget gates' in recurrent activations [Gers et al., 2000] to long-term weight memory. Previous meta-gradients for step size (IDBD) [Sutton, 1992] and contemporary online meta-learning [Xu et al., 2018] are direct methodological antecedents; FADE generalizes their mechanism to the control of forgetting.
A key empirical observation is that targeted fixed decay—applied exclusively to the head—forms an unexpectedly strong baseline in label permutation settings, although its sensitivity to initialization renders it impractical without adaptivity.
Theoretical and Practical Implications
FADE establishes that the stability-plasticity tradeoff in continual learning can be significantly improved by augmenting parameter dynamics with adaptive, judicious forgetting at the level of individual weights. The decoupling of learning rate (plasticity) and decay rate (forgetting horizon) introduces a differentiated, interpretable control mechanism, which is empirically robust and highly competitive.
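To make the "forgetting horizon" reading concrete, consider an idealized case that is an illustration rather than an analysis from the paper: a weight whose input is inactive ($x_{i,t} = 0$) for $k$ consecutive steps, with its decay rate held fixed at $\lambda_i$. The weight update then reduces to pure geometric decay:

```latex
\begin{align*}
w_{i,t+k} &= (1-\lambda_i)^k\,w_{i,t} \;\approx\; e^{-\lambda_i k}\,w_{i,t}
  \qquad (\lambda_i \ll 1), \\[4pt]
k_{1/2} &= \frac{\ln 2}{-\ln(1-\lambda_i)} \;\approx\; \frac{\ln 2}{\lambda_i}
\end{align*}
```

For example, $\lambda_i = 0.01$ gives a half-life of roughly 69 steps. The step size $\alpha$ does not appear in this expression, which is the sense in which the decay rate sets how long a weight remembers independently of how quickly it learns.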
Practically, the method incurs only minor additional overhead (two extra scalars per parameter: the decay meta-parameter and its sensitivity trace) and makes minimal assumptions about architecture or data distribution. Because it slots into existing optimizers, it is readily deployable in streaming and reinforcement learning environments where task boundaries and future distributions are unknown or undefined.
Directions for Future Research
- Meta-gradient mechanisms for deep, highly non-linear hidden representations: Generalizing FADE’s success from the final linear head to hidden layers remains a challenging open problem, requiring more sophisticated meta-approximations invariant to nonlinearity and depth.
- Continual learning in high-capacity architectures: Integrating FADE in transformers, deep convolutional nets, or attention modules could address capacity constraints and non-stationary regimes beyond simple regression/classification.
- Reinforcement learning and real-world non-stationarity: FADE is well-positioned for adoption in RL agents facing dynamic environments, or in autonomous systems where lifetime adaptability is crucial.
Conclusion
FADE provides a novel, effective approach for parameter-selective forgetting in continual learning by meta-learning per-parameter decay rates through forward-mode differentiation. It improves both aggregate accuracy and robustness to initialization, outperforming the fixed-decay and AdamW baselines on the benchmarks considered. More broadly, FADE recontextualizes weight decay, from an undifferentiated regularizer to a targeted, data-driven forgetting mechanism, bridging meta-learning and continual adaptation in non-stationary online learning (2604.27063).