
CLAW: Adaptive Weights for Continual Learning

Updated 4 December 2025
  • CLAW is a framework that adaptively blends neural network weights to enable continual learning and reduce catastrophic forgetting.
  • It utilizes layer- and neuron-specific, data-driven weighting mechanisms to balance stability against task-specific plasticity.
  • Empirical studies show that CLAW methods improve accuracy by 5–20 percentage points and reduce forgetting by up to 70 percentage points across benchmarks.

Continual Learning with Adaptive Weights (CLAW) comprises a set of algorithmic frameworks that enable neural networks to sequentially acquire new tasks by adaptively interpolating or modifying their parameters, thereby balancing knowledge retention (stability) and task adaptation (plasticity). These methods operate across various architectures, learning scenarios (task-incremental, class-incremental, few-shot), and optimization paradigms (model averaging, meta-learning, Bayesian gating), but are unified in their use of data-driven, instance- or layer-specific weighting for parameter update or sharing. CLAW approaches significantly mitigate catastrophic forgetting and realize state-of-the-art continual learning performance across multiple benchmarks (Mao et al., 24 Sep 2025, Marouf et al., 2023, Kozal et al., 5 Apr 2024, Adel et al., 2019).

1. Formal Problem Statement and Motivation

Continual learning targets the learning of a sequence of $T$ tasks $\{\mathcal{D}_t\}_{t=1}^T$ in an online regime, where only data from the current task is available for training and previous-task data is not revisited (or only a small buffer is stored). The main obstacles are:

  • Catastrophic Forgetting: After updating on task $t$, performance on tasks $1,\dots,t-1$ can degrade substantially.
  • Stability–Plasticity Dilemma: Maintaining performance on old tasks (stability) and acquiring new knowledge (plasticity) are often at odds.

CLAW algorithms address this by adaptively determining where and how much to share, interpolate, or adapt model parameters across tasks and layers. Some employ learnable gates (binary/continuous) per neuron or per layer (Adel et al., 2019); others interpolate weights globally, per-layer, or per-parameter using learned or data-driven coefficients (Mao et al., 24 Sep 2025, Marouf et al., 2023, Kozal et al., 5 Apr 2024).

2. Adaptive Weight Fusion Strategies

CLAW methods instantiate adaptive weight updates under diverse mechanisms:

  • Meta-Weight-Ensembler (Mao et al., 24 Sep 2025): for task $t$, computes two sets of parameters, old ($\Theta_{t-1}$) and new ($\hat\Theta_t$), after training on $\mathcal{D}_t$.
  • Employs a small multilayer perceptron (MLP) “mixing coefficient generator” $g_{\Phi_t}$ that maps layerwise gradient summaries $G_t = \{\nabla_{\hat\theta_t^\ell} L_\text{task}\}_{\ell=1}^L$ to mixing coefficients $\alpha_t^\ell \in [0,1]$.
  • Fused weights per layer $\ell$:

$$\theta_t^\ell = \alpha_t^\ell\,\hat\theta_t^\ell + (1-\alpha_t^\ell)\,\theta_{t-1}^\ell$$

This enables task- and layer-level trade-offs: layers with minimal change retain old knowledge, while task-critical layers adapt (Mao et al., 24 Sep 2025); a runnable sketch appears after the pseudocode in Section 3.

  • CoMA: Linear interpolation at whole-network scale:

$$\theta^*_t = \lambda\,\theta_t + (1-\lambda)\,\theta^*_{t-1},\qquad \lambda \in [0,1]$$

  • CoFiMA: Per-parameter Fisher-weighted blending:

$$\theta^*_{t,i} = \frac{\lambda F_{t,i}\,\theta_{t,i} + (1-\lambda)\,F_{t-1,i}\,\theta^*_{t-1,i}}{\lambda F_{t,i} + (1-\lambda)\,F_{t-1,i}}$$

Here, $F_{t,i}$ is the task-likelihood Fisher information (importance) for parameter $i$.
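
A minimal NumPy sketch of both updates follows (the function names are ours, not from the papers). It illustrates the key difference: under CoFiMA, a parameter with high previous-task Fisher importance stays close to its old value, whereas CoMA blends all parameters uniformly.

import numpy as np

def coma_update(theta_new, theta_prev, lam=0.5):
    # CoMA: uniform linear interpolation of all parameters.
    return lam * theta_new + (1.0 - lam) * theta_prev

def cofima_update(theta_new, theta_prev, F_new, F_prev, lam=0.5, eps=1e-12):
    # CoFiMA: per-parameter blend weighted by Fisher importance.
    num = lam * F_new * theta_new + (1.0 - lam) * F_prev * theta_prev
    den = lam * F_new + (1.0 - lam) * F_prev + eps  # eps guards against zero Fisher
    return num / den

# Toy usage: the first parameter was important for the previous task.
theta_prev = np.array([1.0, 1.0])
theta_new  = np.array([0.0, 0.0])
F_prev     = np.array([10.0, 0.1])
F_new      = np.array([0.1, 10.0])
print(coma_update(theta_new, theta_prev))                              # [0.5 0.5]
print(cofima_update(theta_new, theta_prev, F_new=F_new, F_prev=F_prev))  # ~[0.99 0.01]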

  • Weight interpolation with permutation alignment (Kozal et al., 5 Apr 2024): after task $T$ is learned, finds a permutation $\pi$ aligning the new weights $\theta_T$ with the previous weights $\theta_P$, then interpolates:

$$\theta_T^{(+)} := (1-\alpha)\,\theta_T + \alpha\,\pi(\theta_P)$$

  • Interpolation coefficient $\alpha$ controls the stability–plasticity trade-off; typically tuned per method (Kozal et al., 5 Apr 2024).
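
A minimal NumPy sketch of this align-then-interpolate step for a two-layer MLP (function names are illustrative; the permutation itself would come from a separate unit-matching procedure, omitted here):

import numpy as np

def permute_hidden(W1, b1, W2, perm):
    # Re-index hidden units: rows of W1 and b1, columns of W2, so the
    # permuted network computes exactly the same function.
    return W1[perm], b1[perm], W2[:, perm]

def align_and_interpolate(theta_P, theta_T, perm, alpha=0.5):
    # θ_T^(+) = (1-α) θ_T + α π(θ_P): permute the previous weights into the
    # new network's unit ordering, then interpolate.
    W1p, b1p, W2p = permute_hidden(*theta_P, perm)
    W1n, b1n, W2n = theta_T
    mix = lambda new, old: (1 - alpha) * new + alpha * old
    return mix(W1n, W1p), mix(b1n, b1p), mix(W2n, W2p)

# Toy usage: 4 inputs, 3 hidden units, 2 outputs.
theta_P = (np.random.randn(3, 4), np.zeros(3), np.random.randn(2, 3))
theta_T = (np.random.randn(3, 4), np.zeros(3), np.random.randn(2, 3))
W1, b1, W2 = align_and_interpolate(theta_P, theta_T, perm=np.array([2, 0, 1]), alpha=0.3)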
  • Bayesian CLAW (Adel et al., 2019): each neuron is equipped with a binary adaptation gate $z_{i,j}^t$ (share/adapt) and a continuous adaptation strength $a_{i,j}^t$.
  • Effective weights for task $t$:

$$w_{i,j}^t = w_{i,j}\left[1 + z_{i,j}^t\,\frac{s_{i,j}}{1 + e^{-a_{i,j}^t}}\right]$$

  • Variational inference infers the gate and strength distributions, jointly maximizing per-task data likelihood and minimizing changes in the parameter distribution (via KL regularization) (Adel et al., 2019).
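
The effective-weight computation itself is simple; a minimal NumPy sketch follows (in the full method $z$, $s$, and $a$ are inferred variationally, whereas here they are fixed illustrative values):

import numpy as np

def effective_weights(w, z, s, a):
    # w: shared base weights; z: binary adapt gates (0 = share, 1 = adapt);
    # s: adaptation scale; a: continuous strength squashed through a sigmoid.
    return w * (1.0 + z * s / (1.0 + np.exp(-a)))

w = np.array([0.5, -1.0])
z = np.array([0, 1])          # only the second weight adapts for this task
s = np.array([1.0, 1.0])
a = np.array([0.0, 2.0])
print(effective_weights(w, z, s, a))  # shared weight unchanged; adapted one rescaled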

3. Meta-Learning, Optimization, and Algorithmic Structure

CLAW frameworks span a spectrum of meta-learning and adaptation protocols:

  • Bilevel Optimization: In Meta-Weight-Ensembler, MLP generator parameters $\Phi_t$ are meta-learned via feedback on a buffer of prior-task data. The base learner adapts to the new task by SGD; the meta-learner adjusts $\Phi_t$ so that fused models perform well on previous tasks (Mao et al., 24 Sep 2025).
  • Simple Model Averaging: CoMA and CoFiMA require only taskwise weight updates plus (optionally) a Fisher estimation step. No meta-learning loop is required (Marouf et al., 2023).
  • Batch-norm Re-estimation: CLAW with weight interpolation requires a batch-norm statistics update post-interpolation to stabilize activations (Kozal et al., 5 Apr 2024).
  • Variational Bayesian Updates: CLAW with neuron-wise gates optimizes an evidence lower bound (ELBO) using gradient-based variational updates for both global and per-task adaptation parameters (Adel et al., 2019).

Pseudocode for Meta-Weight-Ensembler:

for t = 1 to T:
    # Train new model on D_t
    init Θ̂_t ← Θ_{t-1}
    for s = 1 to S:
        Θ̂_t ← Θ̂_t - η_task ∇_{Θ̂_t} L_task(Θ̂_t; D_t)

    # Update buffer D̂ with samples from D_t

    # Meta-update mixing generator
    for m = 1 to M:
        G_t = ∇_{Θ̂_t} L_task(Θ̂_t; D_t)
        α_t = g_{Φ_t}(G_t)
        Θ_t = α_t ⊙ Θ̂_t + (1 - α_t) ⊙ Θ_{t-1}
        L_meta(Φ_t) = L_task(Θ_t; D̂)
        Φ_t ← Φ_t - η_meta ∇_{Φ_t} L_meta(Φ_t)
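
A minimal, runnable PyTorch sketch of the fusion/meta-update step follows. It is an illustrative interpretation under stated assumptions, not the authors' implementation: the gradient summary $G_t$ is reduced to per-tensor gradient norms, $g_\Phi$ is a small MLP emitting one coefficient per parameter tensor, and the buffer is stood in by random data.

import torch
import torch.nn as nn
from torch.func import functional_call

# Base model; its current parameters play the role of Θ̂_t after task training.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 5))
theta_prev = {k: v.detach().clone() for k, v in model.named_parameters()}  # Θ_{t-1}

# Mixing-coefficient generator g_Φ: gradient summaries -> one α per tensor.
n_tensors = len(theta_prev)
g_phi = nn.Sequential(nn.Linear(n_tensors, 16), nn.ReLU(), nn.Linear(16, n_tensors))
meta_opt = torch.optim.SGD(g_phi.parameters(), lr=1e-2)

x_t, y_t = torch.randn(64, 10), torch.randint(0, 5, (64,))      # current-task batch
x_buf, y_buf = torch.randn(64, 10), torch.randint(0, 5, (64,))  # replay-buffer stand-in

for _ in range(10):  # M meta-steps
    # Gradient summary G_t: norm of each tensor's task-loss gradient.
    task_loss = nn.functional.cross_entropy(model(x_t), y_t)
    grads = torch.autograd.grad(task_loss, list(model.parameters()))
    G_t = torch.stack([g.norm() for g in grads]).detach()

    alpha = torch.sigmoid(g_phi(G_t))  # α_t: one coefficient per tensor, in (0,1)
    fused = {k: alpha[i] * p + (1 - alpha[i]) * theta_prev[k]
             for i, (k, p) in enumerate(model.named_parameters())}

    # Meta-loss: the fused model should still fit buffered previous-task data.
    meta_loss = nn.functional.cross_entropy(functional_call(model, fused, (x_buf,)), y_buf)
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()

theta_t = {k: v.detach() for k, v in fused.items()}  # final fused weights Θ_t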

4. Empirical Results and Benchmark Performance

CLAW approaches have been evaluated on a host of task-incremental and class-incremental benchmarks, consistently demonstrating superior accuracy and reduced forgetting:

| Method | Testbed | ACC (↑) | BWT/Forgetting (↓) | Notes |
|---|---|---|---|---|
| InfluenceCL | CIFAR-100 Split | 21.15% | –73.24% | baseline (Mao et al., 24 Sep 2025) |
| InfluenceCL + CLAW | CIFAR-100 Split | 27.50% | –56.27% | CLAW plug-in |
| BFP | CIFAR-100 Split | 47.45% | –29.85% | |
| BFP + CLAW | CIFAR-100 Split | 61.19% | –26.91% | |
| MEAT | CIFAR-100 Split | 9.68% | –95.69% | |
| MEAT + CLAW | CIFAR-100 Split | 21.97% | –44.44% | |
| ER (replay only) | CIFAR-100 Split | 22.5% | 65.6% | (Kozal et al., 5 Apr 2024) |
| CLAW + ER | CIFAR-100 Split | 40.3% | 12.8% | |
| CoMA | CIFAR-100, ViT | 92.00% | – | (Marouf et al., 2023) |
| CoFiMA | CIFAR-100, ViT | 92.77% | – | Fisher-weighted |
| CLAW (Bayesian) | Permuted MNIST | 99.2±0.2% | – | (Adel et al., 2019) |
| CLAW (Bayesian) | CIFAR-100 Split | 95.6±0.3% | – | |

Across all settings, CLAW-based methods consistently improve test accuracy (by 5–20 percentage points) and reduce forgetting (measured by backward transfer or average accuracy drop) by 10–70 points when used as a plug-in to state-of-the-art continual learning architectures (Mao et al., 24 Sep 2025, Kozal et al., 5 Apr 2024, Marouf et al., 2023, Adel et al., 2019).

5. Theoretical Analysis and Interpretability

Theoretical perspectives offered by CLAW variants include:

  • Stability–Plasticity Trade-off: The interpolation/mixing coefficient ($\alpha$ or per-layer $\alpha^\ell$) embodies a direct, tunable control over how much stability (retaining old weights) vs. plasticity (adapting new weights) is enforced (Mao et al., 24 Sep 2025, Marouf et al., 2023, Kozal et al., 5 Apr 2024).
  • Meta-Regularization: Layerwise or neuronwise mixing can be cast as applying a learned regularization, enforcing minima in the parameter space that interpolate optimally between prior and current tasks in a non-uniform, data-driven manner (Mao et al., 24 Sep 2025, Adel et al., 2019).
  • Gradient-based Layerwise Relevance: By leveraging gradient statistics, the system infers which layers are critical for transfer or for task-specific adaptation, implementing per-layer specialization without rigid architectural constraints (Mao et al., 24 Sep 2025).
  • Variational Bayesian Gating: CLAW’s Bayesian instantiation implements soft (probabilistic) weight-sharing, meta-learned across tasks, yielding both interpretability (about which parts of the network are task-shared) and strong generalization (Adel et al., 2019).
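
To make the meta-regularization view concrete, note a standard identity (an illustrative observation, not a result from any single cited paper): the fused weights are the closed-form minimizer of a weighted proximity objective,

$$\theta_t^\ell = \arg\min_{\theta}\; \alpha_t^\ell\,\|\theta - \hat\theta_t^\ell\|^2 + (1-\alpha_t^\ell)\,\|\theta - \theta_{t-1}^\ell\|^2 = \alpha_t^\ell\,\hat\theta_t^\ell + (1-\alpha_t^\ell)\,\theta_{t-1}^\ell$$

so learning $\alpha_t^\ell$ is equivalent to learning the strength of a layer-specific quadratic regularizer that pulls the solution toward the previous-task weights.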

6. Relations to Other Continual Learning Strategies

CLAW mechanisms intersect with and extend several architectural and optimization strategies:

  • Soft Subnetworks and Binary Masking: Approaches like Soft-Winning SubNetworks and SoftNet leverage adaptive binary and soft masks to realize per-task subnetworks that reuse network weights efficiently, regularized by the Regularized Lottery Ticket Hypothesis (Kang et al., 2023).
  • Slot-based and Functional Regularization Methods: Bayesian CLAW generalizes fixed-architecture methods by allowing adaptive task-specific reweighting at the unit level (Adel et al., 2019).
  • Replay and Experience Rehearsal Methods: CLAW’s weight interpolation complements buffer-based replay, achieving dramatic gains in knowledge retention yet requiring only the addition of an extra model copy and minimal post-task computation (Kozal et al., 5 Apr 2024).
  • Adaptive SGD-style Methods: NCCL formulates an adaptive step size strategy for nonconvex continual learning, dynamically scaling updates to suppress catastrophic forgetting via closed-form control of cross-gradient interactions (Han et al., 8 Apr 2024).

7. Implementation, Memory Cost, and Hyperparameterization

  • Overhead: Most CLAW instantiations require either a copy of the model parameters (for blending/fusion) or an additional buffer for gradient summaries and mixing networks.
  • Computation: Permutation alignment (when needed) and the post-interpolation batch-norm statistics update add mild cost compared to full retraining (Kozal et al., 5 Apr 2024); a sketch of the batch-norm step follows this list.
  • Hyperparameters: The main tuning parameter is the mixing (or interpolation) coefficient, which directly controls the stability-plasticity compromise. Meta-learning variants add learning rates for generator optimization, but are robust across a broad range of values (Mao et al., 24 Sep 2025, Marouf et al., 2023, Adel et al., 2019).
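
The batch-norm re-estimation step is a standard recipe; a minimal PyTorch sketch, assuming a conventional model and a data loader over current-task inputs (both placeholders), looks like this:

import torch
import torch.nn as nn

@torch.no_grad()
def reestimate_bn(model, loader, n_batches=50):
    # Reset BN running statistics, then re-accumulate them from forward
    # passes over post-interpolation activations; no gradient updates occur.
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None  # None => cumulative moving average in PyTorch
    model.train()              # BN updates running stats only in train mode
    for i, (x, _) in enumerate(loader):
        if i >= n_batches:
            break
        model(x)
    model.eval()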

8. Limitations and Future Directions

  • Scalability: Algorithms requiring instance-wise variational updates or per-layer fusion may incur higher computational cost compared to uniform averaging (Adel et al., 2019).
  • Task Granularity: Layer-level or neuron-level coefficients may be needed to disentangle high-level features that demand strong sharing from localized, task-specific adaptations (Mao et al., 24 Sep 2025).
  • Integration: Incorporating CLAW with functional, replay, and architectural expansion approaches is an ongoing research direction, as is hyperparameter automation (e.g., via hierarchical priors in Bayesian settings) (Adel et al., 2019).

In summary, CLAW (Continual Learning with Adaptive Weights) encompasses a principled set of solutions for continual learning that realize data-driven parameter blending or gating at task, layer, or neuron granularity. These methods maintain robust knowledge retention and transfer across a sequence of tasks by learning optimal fusion coefficients, with theoretical justification and empirical validation across diverse benchmarks (Mao et al., 24 Sep 2025, Marouf et al., 2023, Kozal et al., 5 Apr 2024, Adel et al., 2019).
