CLAW: Adaptive Weights for Continual Learning
- CLAW is a framework that adaptively blends neural network weights to enable continual learning and reduce catastrophic forgetting.
- It uses layer- and neuron-specific, data-driven weighting mechanisms to balance stability against task-specific plasticity.
- Empirical studies show that CLAW methods improve accuracy by 5–20 percentage points and reduce forgetting by up to 70 percentage points across benchmarks.
Continual Learning with Adaptive Weights (CLAW) comprises a set of algorithmic frameworks that enable neural networks to sequentially acquire new tasks by adaptively interpolating or modifying their parameters, thereby balancing knowledge retention (stability) and task adaptation (plasticity). These methods operate across various architectures, learning scenarios (task-incremental, class-incremental, few-shot), and optimization paradigms (model averaging, meta-learning, Bayesian gating), but are unified in their use of data-driven, instance- or layer-specific weighting for parameter update or sharing. CLAW approaches significantly mitigate catastrophic forgetting and realize state-of-the-art continual learning performance across multiple benchmarks (Mao et al., 24 Sep 2025, Marouf et al., 2023, Kozal et al., 5 Apr 2024, Adel et al., 2019).
1. Formal Problem Statement and Motivation
Continual learning targets the learning of a sequence of tasks in an online regime, where only data from the current task is available for training and previous-task data is not revisited (or only a small buffer is stored). The main obstacles are:
- Catastrophic Forgetting: After updating on task $t$, performance on tasks $1, \dots, t-1$ can degrade substantially.
- Stability-Plasticity Dilemma: Maintaining performance on old (stability) and acquiring new knowledge (plasticity) are often at odds.
CLAW algorithms address this by adaptively determining where and how much to share, interpolate, or adapt model parameters across tasks and layers. Some employ learnable gates (binary/continuous) per neuron or per layer (Adel et al., 2019); others interpolate weights globally, per-layer, or per-parameter using learned or data-driven coefficients (Mao et al., 24 Sep 2025, Marouf et al., 2023, Kozal et al., 5 Apr 2024).
2. Adaptive Weight Fusion Strategies
CLAW methods instantiate adaptive weight updates under diverse mechanisms:
Meta-Weight-Ensembler (Layer-wise meta-learned mixing) (Mao et al., 24 Sep 2025)
- For task $t$, computes two sets of parameters: the old weights $\Theta_{t-1}$ and the new weights $\hat{\Theta}_t$ obtained after training on $D_t$.
- Employs a small multilayer perceptron (MLP) “mixing coefficient generator” $g_{\Phi_t}$ that maps layerwise gradient summaries $G_t$ to mixing coefficients $\alpha_t^{(l)} \in [0, 1]$.
- Fused weights per layer $l$: $\Theta_t^{(l)} = \alpha_t^{(l)} \hat{\Theta}_t^{(l)} + (1 - \alpha_t^{(l)})\, \Theta_{t-1}^{(l)}$.
This enables task- and layer-level trade-offs—layers with minimal change retain old knowledge, task-critical layers adapt (Mao et al., 24 Sep 2025).
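The per-layer fusion rule can be sketched in a few lines. The following minimal NumPy illustration (the `fuse_layerwise` helper and the toy layer shapes are ours, not from the paper) shows how distinct coefficients let one layer adapt while another stays stable:

```python
import numpy as np

def fuse_layerwise(old_layers, new_layers, alphas):
    """Blend old and new parameters per layer with per-layer coefficients.

    old_layers, new_layers: lists of np.ndarray (one per layer)
    alphas: mixing coefficients in [0, 1], one per layer
    (alpha = 1 keeps the newly trained weights, alpha = 0 keeps the old ones)
    """
    return [a * new + (1.0 - a) * old
            for a, old, new in zip(alphas, old_layers, new_layers)]

# Toy example: two layers, one plastic (alpha ~ 1), one stable (alpha ~ 0)
old = [np.zeros((2, 2)), np.ones(3)]
new = [np.ones((2, 2)), 3.0 * np.ones(3)]
fused = fuse_layerwise(old, new, alphas=[0.9, 0.1])
print(fused[0][0, 0])  # 0.9  -> mostly the new task's weights
print(fused[1][0])     # ~1.2 -> mostly the old weights
```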
Model Averaging and Fisher-weighted Averaging (Marouf et al., 2023)
- CoMA: Linear interpolation at whole-network scale: $\Theta_t = \lambda\, \hat{\Theta}_t + (1 - \lambda)\, \Theta_{t-1}$.
- CoFiMA: Per-parameter Fisher-weighted blending: $\theta_t^{(i)} = \dfrac{F_t^{(i)} \hat{\theta}_t^{(i)} + F_{t-1}^{(i)} \theta_{t-1}^{(i)}}{F_t^{(i)} + F_{t-1}^{(i)}}$.
Here, $F_t^{(i)}$ is the task-likelihood Fisher information (importance) for parameter $i$.
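A minimal sketch of the per-parameter blend, assuming diagonal Fisher estimates (the function name and the `eps` smoothing term are illustrative additions, not from the paper):

```python
import numpy as np

def fisher_weighted_average(theta_old, theta_new, fisher_old, fisher_new,
                            eps=1e-8):
    """Per-parameter blend weighted by (diagonal) Fisher information.

    Parameters with high Fisher value under the old tasks stay close to
    theta_old; parameters important mainly for the new task move toward
    theta_new. eps avoids division by zero where both Fishers vanish.
    """
    w_new = fisher_new / (fisher_new + fisher_old + eps)
    return w_new * theta_new + (1.0 - w_new) * theta_old

theta_old = np.array([1.0, 1.0])
theta_new = np.array([3.0, 3.0])
f_old = np.array([9.0, 1.0])   # first parameter is critical for old tasks
f_new = np.array([1.0, 9.0])   # second parameter is critical for the new task
print(fisher_weighted_average(theta_old, theta_new, f_old, f_new))
# first parameter stays near 1.0; second moves toward 3.0
```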
Weight Interpolation with Permutation Alignment (Kozal et al., 5 Apr 2024)
- After task $t$ is learned, finds a permutation $\pi$ aligning the new weights $\hat{\Theta}_t$ with the previous weights $\Theta_{t-1}$, then interpolates: $\Theta_t = \alpha\, \pi(\hat{\Theta}_t) + (1 - \alpha)\, \Theta_{t-1}$.
- Interpolation coefficient $\alpha$ controls the stability-plasticity trade-off; typically tuned per method (Kozal et al., 5 Apr 2024).
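The align-then-interpolate step can be illustrated on a toy two-unit layer. Brute-force permutation search stands in for the matching procedure used in practice (e.g., a Hungarian-algorithm assignment over unit similarities); the helper below is ours:

```python
import itertools
import numpy as np

def align_and_interpolate(w_old, w_new, alpha):
    """Permute the rows (units) of w_new to best match w_old, then
    interpolate. Brute force is fine for this toy size; real
    implementations solve an assignment problem instead.
    alpha controls plasticity: 1.0 keeps the aligned new weights.
    """
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(w_new.shape[0])):
        cost = np.sum((w_new[list(perm)] - w_old) ** 2)
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    aligned = w_new[list(best_perm)]
    return alpha * aligned + (1.0 - alpha) * w_old

# Two units whose order was swapped during the new task's training
w_old = np.array([[1.0, 0.0], [0.0, 1.0]])
w_new = np.array([[0.0, 2.0], [2.0, 0.0]])  # same units, permuted and scaled
w = align_and_interpolate(w_old, w_new, alpha=0.5)
print(w)  # permutation undone, then averaged: [[1.5, 0.], [0., 1.5]]
```

Naive interpolation without alignment would average mismatched units and destroy both solutions, which is why the permutation step comes first.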
Bayesian Adaptive Weight Gating (Adel et al., 2019)
- Each neuron $j$ is equipped with a binary adaptation gate $a_j^{(t)}$ (share/adapt) and a continuous adaptation strength $s_j^{(t)}$.
- Effective weights for task $t$, schematically: $w_j^{(t)} = w_j \, (1 + a_j^{(t)} s_j^{(t)})$, where $w_j$ are the globally shared weights.
- Variational inference infers the gate and strength distributions, jointly maximizing per-task data likelihood and minimizing changes in the parameter distribution (via KL regularization) (Adel et al., 2019).
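Schematically, the effective-weight computation reduces to a gated multiplicative rescaling. The concrete deterministic form below is an illustrative assumption: in the paper both the gate and the strength are random variables inferred variationally, not fixed values:

```python
import numpy as np

def gated_weights(w_shared, gate, strength):
    """Per-neuron adaptation sketch: shared weights are rescaled only for
    neurons whose binary gate fires, by a learned continuous strength.

    gate: binary vector (1 = adapt this neuron for the current task)
    strength: continuous per-neuron adaptation magnitude
    """
    return w_shared * (1.0 + gate * strength)

w = np.array([2.0, 2.0, 2.0])
gate = np.array([1.0, 0.0, 1.0])        # adapt neurons 0 and 2, share neuron 1
strength = np.array([0.5, 0.9, -0.25])  # neuron 1's strength is masked out
print(gated_weights(w, gate, strength))  # [3.0, 2.0, 1.5]
```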
3. Meta-Learning, Optimization, and Algorithmic Structure
CLAW frameworks span a spectrum of meta-learning and adaptation protocols:
- Bilevel Optimization: In Meta-Weight-Ensembler, MLP generator parameters are meta-learned via feedback on a buffer of prior-task data. Base-learner adapts to the new task by SGD; meta-learner adjusts so that fused models perform well on previous tasks (Mao et al., 24 Sep 2025).
- Simple Model Averaging: CoMA and CoFiMA require only taskwise weight updates plus (optionally) a Fisher estimation step. No meta-learning loop is required (Marouf et al., 2023).
- Batch-norm Re-estimation: CLAW with weight interpolation requires a batch-norm statistics update post-interpolation to stabilize activations (Kozal et al., 5 Apr 2024).
- Variational Bayesian Updates: CLAW with neuron-wise gates optimizes an evidence lower bound (ELBO) using gradient-based variational updates for both global and per-task adaptation parameters (Adel et al., 2019).
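The batch-norm re-estimation step amounts to streaming forward passes that refresh running statistics without any gradient updates, since interpolated weights shift the activation distribution. A minimal sketch (the `interpolated_forward` interface and the momentum value are assumptions for illustration):

```python
import numpy as np

def reestimate_bn_stats(interpolated_forward, data_batches, momentum=0.1):
    """Refresh batch-norm running statistics after weight interpolation.

    Stale running mean/variance would mis-normalize the shifted
    activations; a few gradient-free forward passes re-estimate them.
    `interpolated_forward` maps a batch to the pre-normalization
    activations of one BN layer (an assumed interface).
    """
    mean, var = 0.0, 1.0
    for batch in data_batches:
        acts = interpolated_forward(batch)
        mean = (1 - momentum) * mean + momentum * acts.mean()
        var = (1 - momentum) * var + momentum * acts.var()
    return mean, var

# Toy check: activations drawn around 5 pull the running mean toward 5
rng = np.random.default_rng(0)
batches = [rng.normal(5.0, 1.0, size=64) for _ in range(200)]
mean, var = reestimate_bn_stats(lambda b: b, batches)
print(round(float(mean)))  # 5
```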
Pseudocode for Meta-Weight-Ensembler:
```
for t = 1 to T:
    # Train new model on D_t
    init Θ̂_t ← Θ_{t-1}
    for s = 1…S:
        Θ̂_t ← Θ̂_t − η_task · ∇_{Θ̂_t} L_task(Θ̂_t; D_t)
    # Update buffer D̂ with samples from D_t
    # Meta-update mixing generator
    for m = 1…M:
        G_t = ∇_{Θ̂_t} L_task(Θ̂_t; D_t)
        α_t = g_{Φ_t}(G_t)
        Θ_t = α_t ⊙ Θ̂_t + (1 − α_t) ⊙ Θ_{t-1}
        L_meta(Φ_t) = L_task(Θ_t; D̂)
        Φ_t ← Φ_t − η_meta · ∇_{Φ_t} L_meta(Φ_t)
```
4. Empirical Results and Benchmark Performance
CLAW approaches have been evaluated on a host of task-incremental and class-incremental benchmarks, consistently demonstrating superior accuracy and reduced forgetting:
| Method | Testbed | ACC (↑) | BWT/Forgetting (↓) | Notes |
|---|---|---|---|---|
| InfluenceCL | CIFAR-100 Split | 21.15% | –73.24% | (Mao et al., 24 Sep 2025) baseline |
| InfluenceCL + CLAW | CIFAR-100 Split | 27.50% | –56.27% | CLAW plug-in |
| BFP | CIFAR-100 Split | 47.45% | –29.85% | |
| BFP + CLAW | CIFAR-100 Split | 61.19% | –26.91% | |
| MEAT | CIFAR-100 Split | 9.68% | –95.69% | |
| MEAT + CLAW | CIFAR-100 Split | 21.97% | –44.44% | |
| ER (Replay only) | CIFAR-100 Split | 22.5% | 65.6% | (Kozal et al., 5 Apr 2024) |
| CLAW+ER | CIFAR-100 Split | 40.3% | 12.8% | |
| CoMA | CIFAR-100, ViT | 92.00% | -- | (Marouf et al., 2023) |
| CoFiMA | CIFAR-100, ViT | 92.77% | -- | Fisher-weighted |
| CLAW (Bayesian) | PermutedMNIST | 99.2±0.2% | -- | (Adel et al., 2019) |
| CLAW (Bayesian) | CIFAR-100 Split | 95.6±0.3% | -- | |
Across all settings, CLAW-based methods consistently improve test accuracy (by 5–20 percentage points) and reduce forgetting (measured by backward transfer or average drop, by 10–70 points) when used as a plug-in to state-of-the-art continual learning architectures (Mao et al., 24 Sep 2025, Kozal et al., 5 Apr 2024, Marouf et al., 2023, Adel et al., 2019).
5. Theoretical Analysis and Interpretability
Theoretical perspectives offered by CLAW variants include:
- Stability–Plasticity Trade-off: The interpolation/mixing coefficient ( or per-layer ) embodies a direct, tunable control over how much stability (retaining old weights) vs. plasticity (adapting new weights) is enforced (Mao et al., 24 Sep 2025, Marouf et al., 2023, Kozal et al., 5 Apr 2024).
- Meta-Regularization: Layerwise or neuronwise mixing can be cast as applying a learned regularization, enforcing minima in the parameter space that interpolate optimally between prior and current tasks in a non-uniform, data-driven manner (Mao et al., 24 Sep 2025, Adel et al., 2019).
- Gradient-based Layerwise Relevance: By leveraging gradient statistics, the system infers which layers are critical for transfer or for task-specific adaptation, implementing per-layer specialization without rigid architectural constraints (Mao et al., 24 Sep 2025).
- Variational Bayesian Gating: CLAW’s Bayesian instantiation implements soft (probabilistic) weight-sharing, meta-learned across tasks, yielding both interpretability (about which parts of the network are task-shared) and strong generalization (Adel et al., 2019).
6. Connections to Related Architectures
CLAW mechanisms intersect with and extend several architectural and optimization strategies:
- Soft Subnetworks and Binary Masking: Approaches like Soft-Winning SubNetworks and SoftNet leverage adaptive binary and soft masks to realize per-task subnetworks that reuse network weights efficiently, regularized by the Regularized Lottery Ticket Hypothesis (Kang et al., 2023).
- Slot-based and Functional Regularization Methods: Bayesian CLAW generalizes fixed-architecture methods by allowing adaptive task-specific reweighting at the unit level (Adel et al., 2019).
- Replay and Experience Rehearsal Methods: CLAW’s weight interpolation complements buffer-based replay, achieving dramatic gains in knowledge retention yet requiring only the addition of an extra model copy and minimal post-task computation (Kozal et al., 5 Apr 2024).
- Adaptive SGD-style Methods: NCCL formulates an adaptive step size strategy for nonconvex continual learning, dynamically scaling updates to suppress catastrophic forgetting via closed-form control of cross-gradient interactions (Han et al., 8 Apr 2024).
7. Implementation, Memory Cost, and Hyperparameterization
- Overhead: Most CLAW instantiations require either a copy of the model parameters (for blending/fusion) or an additional buffer for gradient summaries and the mixing network.
- Computation: Permutation alignment (when needed) and batch-norm statistics update add mild cost compared to full retraining (Kozal et al., 5 Apr 2024).
- Hyperparameters: The main tuning parameter is the mixing (or interpolation) coefficient, which directly controls the stability-plasticity compromise. Meta-learning variants add learning rates for generator optimization, but are robust across a broad range of values (Mao et al., 24 Sep 2025, Marouf et al., 2023, Adel et al., 2019).
8. Limitations and Future Directions
- Scalability: Algorithms requiring instance-wise variational updates or per-layer fusion may incur higher computational cost compared to uniform averaging (Adel et al., 2019).
- Task Granularity: Finer-grained (layer- or neuron-level) coefficients may be needed to separate high-level features that require strong sharing from localized, task-specific adaptations (Mao et al., 24 Sep 2025).
- Integration: Incorporating CLAW with functional, replay, and architectural expansion approaches is an ongoing research direction, as is hyperparameter automation (e.g., via hierarchical priors in Bayesian settings) (Adel et al., 2019).
In summary, CLAW (Continual Learning with Adaptive Weights) encompasses a principled set of solutions for continual learning that realize data-driven parameter blending or gating at task, layer, or neuron granularity. These methods maintain robust knowledge retention and transfer across a sequence of tasks by learning optimal fusion coefficients, with theoretical justification and empirical validation across diverse benchmarks (Mao et al., 24 Sep 2025, Marouf et al., 2023, Kozal et al., 5 Apr 2024, Adel et al., 2019).