Learning to Forget: Continual Learning with Adaptive Weight Decay

Published 29 Apr 2026 in cs.LG and cs.NE | (2604.27063v1)

Abstract: Continual learning agents with finite capacity must balance acquiring new knowledge with retaining the old. This requires controlled forgetting of knowledge that is no longer needed, freeing up capacity to learn. Weight decay, viewed as a mechanism for forgetting, can serve this role by gradually discarding information stored in the weights. However, a fixed scalar weight decay drives this forgetting uniformly over time and uniformly across all parameters, even when some encode stable knowledge while others track rapidly changing targets. We introduce Forgetting through Adaptive Decay (FADE), which adapts per-parameter weight decay rates online via approximate meta-gradient descent. We derive FADE for the online linear setting and apply it to the final layer of neural networks. Our empirical analysis shows that FADE automatically discovers distinct decay rates for different parameters, complements step-size adaptation, and consistently improves over fixed weight decay across online tracking and streaming classification problems.


Authors (3)

Summary

The paper introduces FADE, a meta-gradient descent method that dynamically adjusts per-parameter weight decay in online continual learning.
It demonstrates superior performance across regression, teacher-student, and label permutation benchmarks with lower MSE and higher accuracy than fixed decay baselines.
The method decouples learning and decay rates, providing robust control over the stability-plasticity tradeoff and reducing sensitivity to initialization.

Learning to Forget in Continual Learning: FADE—Forgetting through Adaptive Decay

Introduction and Motivation

Continual learning systems are inherently constrained by finite model capacity and the stability-plasticity dilemma: agents must retain previously acquired, useful knowledge while remaining capable of rapidly integrating novel information. Extant approaches frequently address catastrophic forgetting through architectural interventions, replay buffers, or regularization-based penalties, but principled mechanisms for controlled, judicious forgetting remain underexplored at the weight level, particularly with relevance to non-stationary online learning. Traditionally, weight decay has served as a uniform inductive bias toward simplification, but its role as an explicit, parameter-selective forgetting mechanism in continual learning is poorly understood and inadequately leveraged.

"Learning to Forget: Continual Learning with Adaptive Weight Decay" (2604.27063) introduces FADE (Forgetting through Adaptive Decay), a framework that adaptively tunes per-parameter weight decay rates online through meta-gradient descent. The core methodology diverges from fixed, scalar decay schedules by enabling meta-learned, dynamic, parameterwise forgetting, yielding substantial improvements across sequential learning benchmarks.

Methodology: Adaptive Weight Decay via Meta-Gradients

FADE is derived for the online linear regression setting and extends directly to the final layer (“head”) of general neural architectures. Each parameter $w_i$ has its own decay rate $\lambda_i = \exp(y_i)$, where $y_i$ is a meta-learned parameter; the exponential parameterization keeps $\lambda_i$ positive. Adaptation differentiates the prediction loss with respect to $y_i$ through a forward-mode meta-gradient trace $g_i$, which approximates $\partial w_i / \partial y_i$. Following an IDBD-style approximation, each meta-parameter is assumed to influence only its corresponding weight. The trace is updated online alongside $w_i$.

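The core update rules, where $\delta_t$ is the prediction error at step $t$, $x_{i,t}$ is the $i$-th input feature, $\alpha$ is the step size, and $\beta$ is the meta-step-size, are: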

Per-parameter decay: $\lambda_i = \exp(y_i)$
Meta-parameter update: $y_{i,t+1} = y_{i,t} + \beta \cdot \delta_t x_{i,t} g_{i,t}$
Sensitivity trace: $g_{i,t+1} = g_{i,t}\,[1 - \lambda_{i,t+1} - \alpha x_{i,t}^2]_+ - \lambda_{i,t+1} w_{i,t}$
Weight update: $w_{i, t+1} = (1-\lambda_{i,t+1}) w_{i,t} + \alpha \delta_t x_{i,t}$
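
The update rules above can be sketched in NumPy as a single online step. This is a minimal illustration rather than the authors' implementation: the function name `fade_step`, the argument names, and the toy stream are hypothetical, and the positive part is assumed to act on the multiplicative factor of the trace, as in IDBD.

```python
import numpy as np

def fade_step(w, y, g, x, target, alpha, beta):
    """One online FADE update for a linear predictor w @ x (illustrative sketch)."""
    delta = target - w @ x                   # prediction error delta_t, computed with w_t
    y = y + beta * delta * x * g             # meta-parameter update
    lam = np.exp(y)                          # per-parameter decay lambda_{t+1}
    # Assumption: the positive part is applied to the multiplicative factor (IDBD-style).
    factor = np.maximum(0.0, 1.0 - lam - alpha * x**2)
    g = g * factor - lam * w                 # sensitivity trace, still using w_t
    w = (1.0 - lam) * w + alpha * delta * x  # decayed weight update
    return w, y, g, delta

# Hypothetical usage: a random stream with a fixed target weight vector.
rng = np.random.default_rng(0)
d = 5
w_true = rng.normal(size=d)
w, g = np.zeros(d), np.zeros(d)
y = np.full(d, np.log(1e-3))                 # initial decay rate 1e-3 (arbitrary choice)
for _ in range(1000):
    x = rng.normal(size=d)
    w, y, g, delta = fade_step(w, y, g, x, w_true @ x, alpha=0.05, beta=1e-3)
```

The order of operations matters: the error and the trace both use the weights from before the step, and the decay applied to the weights uses the freshly updated $y_i$.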

FADE adds minimal per-step overhead and introduces only one new scalar hyperparameter, the meta-step-size $\beta$. The meta-gradients driving $y_i$ are based on the observed total prediction loss across time, as in forward-mode meta-learning [Xu et al., 2018].
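
The form of the trace recursion can be recovered, up to the positive-part safeguard, by differentiating the weight update with respect to $y_i$. The following is a sketch under the IDBD-style assumption that $y_i$ affects only $w_i$. The time index on $\lambda_i$ is dropped for readability, and $z_t$ (a symbol not used in the source) denotes the target, so that $\delta_t = z_t - \sum_j w_{j,t} x_{j,t}$.

$$
\begin{aligned}
w_{i,t+1} &= (1-\lambda_i)\, w_{i,t} + \alpha\, \delta_t\, x_{i,t}, \qquad \lambda_i = \exp(y_i), \quad \frac{\partial \lambda_i}{\partial y_i} = \lambda_i, \\
\frac{\partial \delta_t}{\partial y_i} &\approx -x_{i,t}\, \frac{\partial w_{i,t}}{\partial y_i} = -x_{i,t}\, g_{i,t}, \\
g_{i,t+1} = \frac{\partial w_{i,t+1}}{\partial y_i} &\approx (1-\lambda_i)\, g_{i,t} - \lambda_i\, w_{i,t} - \alpha\, x_{i,t}^2\, g_{i,t}
 = g_{i,t}\,(1 - \lambda_i - \alpha x_{i,t}^2) - \lambda_i\, w_{i,t}.
\end{aligned}
$$

The same approximation gives the meta-parameter update as a gradient step on the squared error: $-\partial(\tfrac{1}{2}\delta_t^2)/\partial y_i \approx \delta_t\, x_{i,t}\, g_{i,t}$.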

In deep non-linear networks, the meta-gradient-based adaptation is applied to the weights of the final layer, while a standard optimizer (SGD or Adam) trains the hidden layers. Decay rate adaptation coexists with per-parameter step-size adaptation (e.g., IDBD), giving separate controls for plasticity and forgetting.
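
One way to picture the head-only variant is to apply the linear update element-wise to a multi-output head that reads features from the hidden layers. The sketch below assumes a squared-error loss and, for brevity, uses a frozen random `tanh` layer in place of a trained body. The name `fade_head_step`, the shapes, and the toy targets are hypothetical and not taken from the paper.

```python
import numpy as np

def fade_head_step(W, Y, G, h, target, alpha, beta):
    """FADE-style update for a linear head (sketch).

    W, Y, G have shape (n_out, n_feat); h has shape (n_feat,); target has shape (n_out,).
    """
    delta = target - W @ h                   # per-output prediction error
    outer = np.outer(delta, h)               # delta_k * h_j for every head weight
    Y = Y + beta * outer * G                 # per-weight meta-parameter update
    lam = np.exp(Y)                          # per-weight decay rates
    # Assumption: positive part on the multiplicative factor, as in IDBD.
    factor = np.maximum(0.0, 1.0 - lam - alpha * h[None, :] ** 2)
    G = G * factor - lam * W                 # sensitivity traces, using the old W
    W = (1.0 - lam) * W + alpha * outer      # decayed head update
    return W, Y, G, delta

# Hypothetical usage: a frozen random hidden layer stands in for the body.
rng = np.random.default_rng(0)
V = rng.normal(size=(8, 4)) * 0.5
W, G = np.zeros((3, 8)), np.zeros((3, 8))
Y = np.full((3, 8), np.log(1e-3))
for _ in range(200):
    x = rng.normal(size=4)
    h = np.tanh(V @ x)                       # features fed to the head
    target = np.array([x[0], x[1], x[0] * x[1]])
    W, Y, G, delta = fade_head_step(W, Y, G, h, target, alpha=0.05, beta=1e-3)
```

In a full setup the hidden weights would be updated by their own optimizer at each step; only the head carries the extra `Y` and `G` arrays.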

Empirical Results

FADE is evaluated on a suite of online continual learning scenarios, showing robust, significant improvements over baselines:

1. Online Linear Tracking

On a 20-dimensional online regression problem with periodic non-stationarity, FADE learns to assign high decay rates to weights encoding irrelevant or volatile features and low decay rates to stably informative weights. Compared to SGD with fixed decay and to IDBD, FADE + IDBD achieves the lowest mean squared error (MSE) in the noiseless case, outperforming either adaptive decay or adaptive step size alone. Adaptive decay and adaptive step size are therefore empirically complementary.

2. Nonlinear Teacher-Student Tracking

A neural teacher model with layered non-stationarity (stable, slow-changing, and fast-changing outputs) is used to produce streaming targets for a student MLP. Applying FADE to the final layer:

FADE+SGD achieves an MSE over the final 500,000 steps that is nearly a factor of two lower than that of AdamW.
Per-group MSE curves show that FADE dynamically adjusts decay rates across outputs with distinct temporal statistics, outperforming any fixed decay initialization.

FADE also demonstrates superior robustness to the initial decay rate and meta-step size, in marked contrast to the brittle performance of fixed head decay schedules.

3. Streaming Image Classification under Recurrent Label Permutations

The label-permuted EMNIST benchmark imposes abrupt, global non-stationarity by permuting class labels every 2500 steps:

FADE+SGD achieves the highest average online accuracy over the multi-million-step stream, substantially exceeding weight clipping, SGD with head decay (at both its optimal and its worst-case initialization), and AdamW.
Because FADE adapts the decay rates online, it performs strongly even under extremely poor initializations, whereas fixed-decay SGD exhibits highly variable performance.
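
The label-permutation protocol can be sketched as a thin wrapper around any labeled stream. This is a hypothetical generator, not the benchmark's actual code: the function name, the seed handling, and the choice to start with the identity mapping are assumptions, while the 2500-step period follows the description above.

```python
import numpy as np

def permuted_label_stream(labels, num_classes, period=2500, seed=0):
    """Yield (step, permuted_label), drawing a new permutation every `period` steps."""
    rng = np.random.default_rng(seed)
    perm = np.arange(num_classes)            # assumption: identity mapping at the start
    for t, label in enumerate(labels):
        if t > 0 and t % period == 0:
            perm = rng.permutation(num_classes)  # abrupt, global label change
        yield t, int(perm[label])

# Hypothetical usage with synthetic labels in place of EMNIST.
raw = np.random.default_rng(1).integers(0, 10, size=10_000)
stream = list(permuted_label_stream(raw, num_classes=10))
```

Inputs are left untouched, so the hidden representation stays useful across permutations while the mapping learned by the head becomes stale at each boundary.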

Partial label permutation, in which some classes remain stable, further highlights FADE’s adaptivity: FADE+SGD again outperforms all baselines.

Analysis of Scope and Limits

Applying FADE to hidden layers using the linear meta-gradient approximation provides only moderate gains, with clear performance gaps relative to head-only adaptation. This suggests that naively extending FADE to deep, non-linear stacks is insufficient, motivating future meta-gradient design that accounts for inter-layer non-linear interactions.

FADE stands out from prior adaptive decay approaches that focus on stationary batch training for regularization [Ishii & Sato, 2017; Nakamura & Hong, 2019; Xie et al., 2023]. It is architected specifically for online, non-stationary, task-ambiguous environments, and leverages meta-gradient descent at the per-parameter level rather than global or module-wise heuristics. The framework extends the principle of learned 'forget gates' in recurrent activations [Gers et al., 2000] to long-term weight memory. Previous meta-gradients for step size (IDBD) [Sutton, 1992] and contemporary online meta-learning [Xu et al., 2018] are direct methodological antecedents; FADE generalizes their mechanism to the control of forgetting.

A key empirical observation is that targeted fixed decay—applied exclusively to the head—forms an unexpectedly strong baseline in label permutation settings, although its sensitivity to initialization renders it impractical without adaptivity.

Theoretical and Practical Implications

FADE establishes that the stability-plasticity tradeoff in continual learning can be significantly improved by augmenting parameter dynamics with adaptive, judicious forgetting at the level of individual weights. The decoupling of learning rate (plasticity) and decay rate (forgetting horizon) introduces a differentiated, interpretable control mechanism, which is empirically robust and highly competitive.
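
A short worked example makes the reading of the decay rate as a forgetting horizon concrete. It assumes an idealized case that is not taken from the paper: the weight receives no gradient signal ($\delta_t x_{i,t} = 0$) and $\lambda_i$ is held fixed, so only the decay term acts.

$$
w_{i,t+k} = (1-\lambda_i)^k\, w_{i,t}, \qquad
k_{1/2} = \frac{\ln 2}{-\ln(1-\lambda_i)} \approx \frac{\ln 2}{\lambda_i} \quad \text{for small } \lambda_i .
$$

With $\lambda_i = 10^{-2}$ the stored value halves in about 69 steps, while with $\lambda_i = 10^{-4}$ it takes about 6,900 steps. The step size $\alpha$ does not appear in this expression, which is why the two rates can be tuned separately.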

Practically, the method incurs only minor additional overhead (two scalars per parameter) and makes minimal assumptions about architecture or data distribution. Because it slots into extant optimizers, it is readily deployable in streaming and reinforcement learning environments where task boundaries and future distributions are unknown or undefined.

Directions for Future Research

Meta-gradient mechanisms for deep, highly non-linear hidden representations: Generalizing FADE’s success from the final linear head to hidden layers remains a challenging open problem, requiring more sophisticated meta-approximations invariant to nonlinearity and depth.
Continual learning in high-capacity architectures: Integrating FADE in transformers, deep convolutional nets, or attention modules could address capacity constraints and non-stationary regimes beyond simple regression/classification.
Reinforcement learning and real-world non-stationarity: FADE is well-positioned for adoption in RL agents facing dynamic environments, or in autonomous systems where lifetime adaptability is crucial.

Conclusion

FADE provides a novel, effective approach for parameter-selective forgetting in continual learning by meta-learning per-parameter decay rates through forward-mode differentiation. It offers compelling improvements both in terms of aggregate accuracy and robustness to initialization, establishing a new state-of-the-art in several continual learning benchmarks. More broadly, FADE recontextualizes weight decay—from an undifferentiated regularizer to a targeted, data-driven forgetting mechanism—bridging meta-learning and continual adaptation in non-stationary online learning (2604.27063).
