CLAW: Adaptive Weights for Continual Learning
- CLAW is a framework that adaptively blends neural network weights to enable continual learning and reduce catastrophic forgetting.
- It uses layer- and neuron-specific, data-driven weighting mechanisms to balance stability against task-specific plasticity.
- Empirical studies show that CLAW methods improve accuracy by 5–20 percentage points and reduce forgetting by up to 70 percentage points across benchmarks.
Continual Learning with Adaptive Weights (CLAW) comprises a set of algorithmic frameworks that enable neural networks to sequentially acquire new tasks by adaptively interpolating or modifying their parameters, thereby balancing knowledge retention (stability) and task adaptation (plasticity). These methods operate across various architectures, learning scenarios (task-incremental, class-incremental, few-shot), and optimization paradigms (model averaging, meta-learning, Bayesian gating), but are unified in their use of data-driven, instance- or layer-specific weighting for parameter update or sharing. CLAW approaches significantly mitigate catastrophic forgetting and realize state-of-the-art continual learning performance across multiple benchmarks (Mao et al., 24 Sep 2025, Marouf et al., 2023, Kozal et al., 5 Apr 2024, Adel et al., 2019).
1. Formal Problem Statement and Motivation
Continual learning targets the learning of a sequence of tasks in an online regime, where only data from the current task is available for training and previous-task data is not revisited (or only a small buffer is stored). The main obstacles are:
- Catastrophic Forgetting: After updating on task $t$, performance on tasks $1, \dots, t-1$ can degrade substantially.
- Stability-Plasticity Dilemma: Maintaining performance on old (stability) and acquiring new knowledge (plasticity) are often at odds.
CLAW algorithms address this by adaptively determining where and how much to share, interpolate, or adapt model parameters across tasks and layers. Some employ learnable gates (binary/continuous) per neuron or per layer (Adel et al., 2019); others interpolate weights globally, per-layer, or per-parameter using learned or data-driven coefficients (Mao et al., 24 Sep 2025, Marouf et al., 2023, Kozal et al., 5 Apr 2024).
2. Adaptive Weight Fusion Strategies
CLAW methods instantiate adaptive weight updates under diverse mechanisms:
Meta-Weight-Ensembler (Layer-wise meta-learned mixing) (Mao et al., 24 Sep 2025)
- For task $t$, computes two sets of parameters: the old weights $\Theta_{t-1}$ and the new weights $\hat{\Theta}_t$ obtained after training on $D_t$.
- Employs a small multilayer perceptron (MLP) “mixing coefficient generator” $g_{\Phi_t}$ that maps layerwise gradient summaries $G_t$ to mixing coefficients $\alpha_t^{(l)} \in [0, 1]$.
- Fused weights per layer $l$: $\Theta_t^{(l)} = \alpha_t^{(l)} \hat{\Theta}_t^{(l)} + (1 - \alpha_t^{(l)})\, \Theta_{t-1}^{(l)}$.
This enables task- and layer-level trade-offs—layers with minimal change retain old knowledge, task-critical layers adapt (Mao et al., 24 Sep 2025).
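The per-layer fusion rule can be sketched in a few lines. The following minimal NumPy illustration (the `fuse_layerwise` helper and the toy layer shapes are ours, not from the paper) shows how distinct coefficients let one layer adapt while another stays stable:

```python
import numpy as np

def fuse_layerwise(old_layers, new_layers, alphas):
    """Blend old and new parameters per layer with per-layer coefficients.

    old_layers, new_layers: lists of np.ndarray (one per layer)
    alphas: mixing coefficients in [0, 1], one per layer
    (alpha = 1 keeps the newly trained weights, alpha = 0 keeps the old ones)
    """
    return [a * new + (1.0 - a) * old
            for a, old, new in zip(alphas, old_layers, new_layers)]

# Toy example: two layers, one plastic (alpha ~ 1), one stable (alpha ~ 0)
old = [np.zeros((2, 2)), np.ones(3)]
new = [np.ones((2, 2)), 3.0 * np.ones(3)]
fused = fuse_layerwise(old, new, alphas=[0.9, 0.1])
print(fused[0][0, 0])  # 0.9  -> mostly the new task's weights
print(fused[1][0])     # ~1.2 -> mostly the old weights
```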
Model Averaging and Fisher-weighted Averaging (Marouf et al., 2023)
- CoMA: Linear interpolation at whole-network scale: $\Theta_t = \lambda\, \hat{\Theta}_t + (1 - \lambda)\, \Theta_{t-1}$.
- CoFiMA: Per-parameter Fisher-weighted blending: $\theta_t^{(i)} = \dfrac{F_t^{(i)} \hat{\theta}_t^{(i)} + F_{t-1}^{(i)} \theta_{t-1}^{(i)}}{F_t^{(i)} + F_{t-1}^{(i)}}$.
Here, $F_t^{(i)}$ is the task-likelihood Fisher information (importance) for parameter $i$.
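A minimal sketch of the per-parameter blend, assuming diagonal Fisher estimates (the function name and the `eps` smoothing term are illustrative additions, not from the paper):

```python
import numpy as np

def fisher_weighted_average(theta_old, theta_new, fisher_old, fisher_new,
                            eps=1e-8):
    """Per-parameter blend weighted by (diagonal) Fisher information.

    Parameters with high Fisher value under the old tasks stay close to
    theta_old; parameters important mainly for the new task move toward
    theta_new. eps avoids division by zero where both Fishers vanish.
    """
    w_new = fisher_new / (fisher_new + fisher_old + eps)
    return w_new * theta_new + (1.0 - w_new) * theta_old

theta_old = np.array([1.0, 1.0])
theta_new = np.array([3.0, 3.0])
f_old = np.array([9.0, 1.0])   # first parameter is critical for old tasks
f_new = np.array([1.0, 9.0])   # second parameter is critical for the new task
print(fisher_weighted_average(theta_old, theta_new, f_old, f_new))
# first parameter stays near 1.0; second moves toward 3.0
```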
Weight Interpolation with Permutation Alignment (Kozal et al., 5 Apr 2024)
- After task $t$ is learned, finds a permutation $\pi$ aligning the new weights $\hat{\Theta}_t$ with the previous weights $\Theta_{t-1}$, then interpolates: $\Theta_t = \alpha\, \pi(\hat{\Theta}_t) + (1 - \alpha)\, \Theta_{t-1}$.
- Interpolation coefficient $\alpha$ controls the stability-plasticity trade-off; typically tuned per method (Kozal et al., 5 Apr 2024).
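The align-then-interpolate step can be illustrated on a toy two-unit layer. Brute-force permutation search stands in for the matching procedure used in practice (e.g., a Hungarian-algorithm assignment over unit similarities); the helper below is ours:

```python
import itertools
import numpy as np

def align_and_interpolate(w_old, w_new, alpha):
    """Permute the rows (units) of w_new to best match w_old, then
    interpolate. Brute force is fine for this toy size; real
    implementations solve an assignment problem instead.
    alpha controls plasticity: 1.0 keeps the aligned new weights.
    """
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(w_new.shape[0])):
        cost = np.sum((w_new[list(perm)] - w_old) ** 2)
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    aligned = w_new[list(best_perm)]
    return alpha * aligned + (1.0 - alpha) * w_old

# Two units whose order was swapped during the new task's training
w_old = np.array([[1.0, 0.0], [0.0, 1.0]])
w_new = np.array([[0.0, 2.0], [2.0, 0.0]])  # same units, permuted and scaled
w = align_and_interpolate(w_old, w_new, alpha=0.5)
print(w)  # permutation undone, then averaged: [[1.5, 0.], [0., 1.5]]
```

Naive interpolation without alignment would average mismatched units and destroy both solutions, which is why the permutation step comes first.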
Bayesian Adaptive Weight Gating (Adel et al., 2019)
- Each neuron $j$ is equipped with a binary adaptation gate $a_j^{(t)}$ (share/adapt) and a continuous adaptation strength $s_j^{(t)}$.
- Effective weights for task $t$, schematically: $w_j^{(t)} = w_j \, (1 + a_j^{(t)} s_j^{(t)})$, where $w_j$ are the globally shared weights.
- Variational inference infers the gate and strength distributions, jointly maximizing per-task data likelihood and minimizing changes in the parameter distribution (via KL regularization) (Adel et al., 2019).
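Schematically, the effective-weight computation reduces to a gated multiplicative rescaling. The concrete deterministic form below is an illustrative assumption: in the paper both the gate and the strength are random variables inferred variationally, not fixed values:

```python
import numpy as np

def gated_weights(w_shared, gate, strength):
    """Per-neuron adaptation sketch: shared weights are rescaled only for
    neurons whose binary gate fires, by a learned continuous strength.

    gate: binary vector (1 = adapt this neuron for the current task)
    strength: continuous per-neuron adaptation magnitude
    """
    return w_shared * (1.0 + gate * strength)

w = np.array([2.0, 2.0, 2.0])
gate = np.array([1.0, 0.0, 1.0])        # adapt neurons 0 and 2, share neuron 1
strength = np.array([0.5, 0.9, -0.25])  # neuron 1's strength is masked out
print(gated_weights(w, gate, strength))  # [3.0, 2.0, 1.5]
```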
3. Meta-Learning, Optimization, and Algorithmic Structure
CLAW frameworks span a spectrum of meta-learning and adaptation protocols:
- Bilevel Optimization: In Meta-Weight-Ensembler, MLP generator parameters are meta-learned via feedback on a buffer of prior-task data. Base-learner adapts to the new task by SGD; meta-learner adjusts so that fused models perform well on previous tasks (Mao et al., 24 Sep 2025).
- Simple Model Averaging: CoMA and CoFiMA require only taskwise weight updates plus (optionally) a Fisher estimation step. No meta-learning loop is required (Marouf et al., 2023).
- Batch-norm Re-estimation: CLAW with weight interpolation requires a batch-norm statistics update post-interpolation to stabilize activations (Kozal et al., 5 Apr 2024).
- Variational Bayesian Updates: CLAW with neuron-wise gates optimizes an evidence lower bound (ELBO) using gradient-based variational updates for both global and per-task adaptation parameters (Adel et al., 2019).
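The batch-norm re-estimation step amounts to streaming forward passes that refresh running statistics without any gradient updates, since interpolated weights shift the activation distribution. A minimal sketch (the `interpolated_forward` interface and the momentum value are assumptions for illustration):

```python
import numpy as np

def reestimate_bn_stats(interpolated_forward, data_batches, momentum=0.1):
    """Refresh batch-norm running statistics after weight interpolation.

    Stale running mean/variance would mis-normalize the shifted
    activations; a few gradient-free forward passes re-estimate them.
    `interpolated_forward` maps a batch to the pre-normalization
    activations of one BN layer (an assumed interface).
    """
    mean, var = 0.0, 1.0
    for batch in data_batches:
        acts = interpolated_forward(batch)
        mean = (1 - momentum) * mean + momentum * acts.mean()
        var = (1 - momentum) * var + momentum * acts.var()
    return mean, var

# Toy check: activations drawn around 5 pull the running mean toward 5
rng = np.random.default_rng(0)
batches = [rng.normal(5.0, 1.0, size=64) for _ in range(200)]
mean, var = reestimate_bn_stats(lambda b: b, batches)
print(round(float(mean)))  # 5
```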
Pseudocode for Meta-Weight-Ensembler:
```
for t = 1 to T:
    # Train new model on D_t
    init Θ̂_t ← Θ_{t-1}
    for s = 1…S:
        Θ̂_t ← Θ̂_t − η_task · ∇_{Θ̂_t} L_task(Θ̂_t; D_t)
    # Update buffer D̂ with samples from D_t
    # Meta-update mixing generator
    for m = 1…M:
        G_t = ∇_{Θ̂_t} L_task(Θ̂_t; D_t)
        α_t = g_{Φ_t}(G_t)
        Θ_t = α_t ⊙ Θ̂_t + (1 − α_t) ⊙ Θ_{t-1}
        L_meta(Φ_t) = L_task(Θ_t; D̂)
        Φ_t ← Φ_t − η_meta · ∇_{Φ_t} L_meta(Φ_t)
```
4. Empirical Results and Benchmark Performance
CLAW approaches have been evaluated on a host of task-incremental and class-incremental benchmarks, consistently demonstrating superior accuracy and reduced forgetting:
| Method | Testbed | ACC (↑) | BWT/Forgetting (↓) | Notes |
|---|---|---|---|---|
| InfluenceCL | CIFAR-100 Split | 21.15% | –73.24% | (Mao et al., 24 Sep 2025) baseline |
| InfluenceCL + CLAW | CIFAR-100 Split | 27.50% | –56.27% | CLAW plug-in |
| BFP | CIFAR-100 Split | 47.45% | –29.85% | |
| BFP + CLAW | CIFAR-100 Split | 61.19% | –26.91% | |
| MEAT | CIFAR-100 Split | 9.68% | –95.69% | |
| MEAT + CLAW | CIFAR-100 Split | 21.97% | –44.44% | |
| ER (Replay only) | CIFAR-100 Split | 22.5% | 65.6% | (Kozal et al., 5 Apr 2024) |
| CLAW+ER | CIFAR-100 Split | 40.3% | 12.8% | |
| CoMA | CIFAR-100, ViT | 92.00% | -- | (Marouf et al., 2023) |
| CoFiMA | CIFAR-100, ViT | 92.77% | -- | Fisher-weighted |
| CLAW (Bayesian) | PermutedMNIST | 99.2±0.2% | -- | (Adel et al., 2019) |
| CLAW (Bayesian) | CIFAR-100 Split | 95.6±0.3% | -- | |
Across all settings, CLAW-based methods consistently improve test accuracy (by 5–20 percentage points) and reduce forgetting (measured by backward transfer or average drop, by 10–70 points) when used as a plug-in to state-of-the-art continual learning architectures (Mao et al., 24 Sep 2025, Kozal et al., 5 Apr 2024, Marouf et al., 2023, Adel et al., 2019).
5. Theoretical Analysis and Interpretability
Theoretical perspectives offered by CLAW variants include:
- Stability–Plasticity Trade-off: The interpolation/mixing coefficient ( or per-layer ) embodies a direct, tunable control over how much stability (retaining old weights) vs. plasticity (adapting new weights) is enforced (Mao et al., 24 Sep 2025, Marouf et al., 2023, Kozal et al., 5 Apr 2024).
- Meta-Regularization: Layerwise or neuronwise mixing can be cast as applying a learned regularization, enforcing minima in the parameter space that interpolate optimally between prior and current tasks in a non-uniform, data-driven manner (Mao et al., 24 Sep 2025, Adel et al., 2019).
- Gradient-based Layerwise Relevance: By leveraging gradient statistics, the system infers which layers are critical for transfer or for task-specific adaptation, implementing per-layer specialization without rigid architectural constraints (Mao et al., 24 Sep 2025).
- Variational Bayesian Gating: CLAW’s Bayesian instantiation implements soft (probabilistic) weight-sharing, meta-learned across tasks, yielding both interpretability (about which parts of the network are task-shared) and strong generalization (Adel et al., 2019).
6. Connections to Related Architectures
CLAW mechanisms intersect with and extend several architectural and optimization strategies:
- Soft Subnetworks and Binary Masking: Approaches like Soft-Winning SubNetworks and SoftNet leverage adaptive binary and soft masks to realize per-task subnetworks that reuse network weights efficiently, regularized by the Regularized Lottery Ticket Hypothesis (Kang et al., 2023).
- Slot-based and Functional Regularization Methods: Bayesian CLAW generalizes fixed-architecture methods by allowing adaptive task-specific reweighting at the unit level (Adel et al., 2019).
- Replay and Experience Rehearsal Methods: CLAW’s weight interpolation complements buffer-based replay, achieving dramatic gains in knowledge retention yet requiring only the addition of an extra model copy and minimal post-task computation (Kozal et al., 5 Apr 2024).
- Adaptive SGD-style Methods: NCCL formulates an adaptive step size strategy for nonconvex continual learning, dynamically scaling updates to suppress catastrophic forgetting via closed-form control of cross-gradient interactions (Han et al., 8 Apr 2024).
7. Implementation, Memory Cost, and Hyperparameterization
- Overhead: Most CLAW instantiations require either a copy of the model parameters (for blending/fusion) or an additional buffer for gradient summaries and the mixing network.
- Computation: Permutation alignment (when needed) and batch-norm statistics update add mild cost compared to full retraining (Kozal et al., 5 Apr 2024).
- Hyperparameters: The main tuning parameter is the mixing (or interpolation) coefficient, which directly controls the stability-plasticity compromise. Meta-learning variants add learning rates for generator optimization, but are robust across a broad range of values (Mao et al., 24 Sep 2025, Marouf et al., 2023, Adel et al., 2019).
8. Limitations and Future Directions
- Scalability: Algorithms requiring instance-wise variational updates or per-layer fusion may incur higher computational cost compared to uniform averaging (Adel et al., 2019).
- Task Granularity: Finer-grained (layer- or neuron-level) coefficients may be needed to separate high-level features that require strong sharing from localized, task-specific adaptations (Mao et al., 24 Sep 2025).
- Integration: Incorporating CLAW with functional, replay, and architectural expansion approaches is an ongoing research direction, as is hyperparameter automation (e.g., via hierarchical priors in Bayesian settings) (Adel et al., 2019).
In summary, CLAW (Continual Learning with Adaptive Weights) encompasses a principled set of solutions for continual learning that realize data-driven parameter blending or gating at task, layer, or neuron granularity. These methods maintain robust knowledge retention and transfer across a sequence of tasks by learning optimal fusion coefficients, with theoretical justification and empirical validation across diverse benchmarks (Mao et al., 24 Sep 2025, Marouf et al., 2023, Kozal et al., 5 Apr 2024, Adel et al., 2019).