
Scaling Laws for Forgetting

Updated 25 October 2025
  • Scaling laws for forgetting are mathematical and empirical laws that quantify memory decay via power-law and exponential dynamics across biological, cognitive, and computational models.
  • They describe universal tradeoffs between memory retention and controlled forgetting, illustrated through models ranging from synaptic metaplasticity to LSTM and Transformer architectures.
  • The insights enable optimization of fine-tuning regimes, continual learning strategies, and unlearning processes to balance stability, plasticity, and computational resources.

Scaling laws for forgetting refer to mathematical, empirical, and algorithmic regularities that characterize how memory traces, knowledge, or model performance deteriorate as a function of model architecture, parameterization, training regime, and intervention method. The topic spans biological synapses, cognitive models, statistical theories, logical formulas, and neural networks, including large language models (LLMs). These laws govern both the efficiency and the limitations of memory retention, describe power-law and exponential decay regimes, and establish fundamental tradeoffs between plasticity, stability, computational resources, and the capacity for adaptive or controlled forgetting.

1. Power-Law and Exponential Forgetting in Biological and Cognitive Systems

In metaplastic synaptic models, the cascade architecture posits a series of hidden states (levels) over which memories are stored (Mehta et al., 2011). Transition probabilities between states decrease exponentially with depth, governed by a dynamical length ξ_d, while the default occupancy profile is set by a static length ξ_s. The universal scaling law for long-term memory decay, derived for both local and non-local resetting architectures, is:

  • $D(t) \sim t^{-\theta}$, with $\theta = 1 + \xi_s/\xi_d$

The universality of this exponent implies that long-term forgetting is robust to architectural details, while transient dynamics differ according to specific synaptic mechanisms.
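
To make the exponent concrete, the following minimal sketch (with illustrative length ratios, not values from the cited work) evaluates the power-law decay for two settings of $\xi_s/\xi_d$ and contrasts it with an exponential trace, which falls off far more abruptly at long times.

```python
import numpy as np

def power_law_decay(t, xi_s, xi_d, d0=1.0):
    """Long-term memory signal D(t) ~ t^(-theta) with theta = 1 + xi_s / xi_d."""
    theta = 1.0 + xi_s / xi_d
    return d0 * np.power(t, -theta)

t = np.array([10.0, 1e2, 1e3, 1e4])
print("theta=1.5:", power_law_decay(t, xi_s=1.0, xi_d=2.0))   # shallow decay
print("theta=3.0:", power_law_decay(t, xi_s=2.0, xi_d=1.0))   # steep decay
print("exp ref  :", np.exp(-t / 100.0))                       # exponential comparison
```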

Retroactive interference theory extends these laws to cognitive phenomena, modeling memory as a competitive hierarchy where each new acquisition retroactively erases weaker memories (Georgiou et al., 2019). Analytical results yield retention curves:

  • For n-dimensional “valence” (complexity): $R_n(t) \sim \frac{(\log(t+1))^{n-1}}{t+1}$

Empirical recognition experiments with n = 5 closely match this scaling, indicating that higher memory complexity slows forgetting due to multidimensional selective competition.
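
A small numerical check of the retention curve (illustrative time points only) confirms that larger valence dimensionality $n$ slows the relative decay:

```python
import numpy as np

def retention(t, n):
    """Retroactive-interference retention R_n(t) ~ log(t+1)^(n-1) / (t+1)."""
    return np.log(t + 1.0) ** (n - 1) / (t + 1.0)

# Fraction of the memory signal remaining at t = 10^4 relative to t = 10^2:
# higher valence dimensionality n decays more slowly.
for n in (1, 3, 5):
    ratio = retention(1e4, n) / retention(1e2, n)
    print(f"n={n}: R(1e4)/R(1e2) = {ratio:.4f}")
```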

Alternative cognitive models use interference and search cost principles, with retention probability expressed as a regularized incomplete Gamma function (Yu et al., 2018):

  • $P_m(n_0, kt) = \Gamma(n_0+1, kt)/\Gamma(n_0+1)$
  • Exponential decay for single events ($n_0 = 0$), slower decay for repeated exposures ($n_0 > 0$), reproducing the Ebbinghaus forgetting curve.
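
The regularized incomplete Gamma function is available in SciPy as `gammaincc`, so the retention probability can be evaluated directly; the sketch below (with an arbitrary rate $k$) contrasts the purely exponential single-event case with the slower decay after repeated exposures.

```python
import numpy as np
from scipy.special import gammaincc  # regularized upper incomplete Gamma: Q(a, x) = Gamma(a, x)/Gamma(a)

def retention_prob(n0, k, t):
    """P_m(n0, kt) = Gamma(n0 + 1, k t) / Gamma(n0 + 1)."""
    return gammaincc(n0 + 1, k * t)

k = 0.1                                  # arbitrary decay rate for illustration
t = np.array([1.0, 10.0, 50.0, 100.0])
print("single event (n0=0):  ", np.round(retention_prob(0, k, t), 4))  # equals exp(-k t)
print("repeated items (n0=3):", np.round(retention_prob(3, k, t), 4))  # much slower decay
```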

These theoretical approaches all link the scaling exponent of forgetting to structural complexity, statistical interference, or the arrangement of underlying states.

2. Scaling Laws in Logical Formulas and Knowledge Representation

Forgetting in symbolic domains (e.g., propositional logic) is defined as variable elimination while preserving consequences over a visible alphabet. Paradoxically, forgetting may increase the size of the formula, producing nontrivial scaling laws (Liberatore, 2020):

  • For Horn formulas, deciding whether forgetting can be represented within size $k$ is $D^P$-hard and in $\Sigma^P_2$.
  • For unrestricted CNF formulas, the problem is $D^P_2$-hard and in $\Sigma^P_3$.
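
As an illustration of the variable-elimination definition above and of the size growth it can cause, the following sketch (using sympy, not the procedures from the cited paper) forgets a variable by disjoining its two cofactors, $\exists x.\,F \equiv F[x{=}\top] \lor F[x{=}\bot]$, which preserves all consequences over the remaining alphabet but can multiply the number of clauses once the result is put back into CNF.

```python
from sympy import symbols, Or, And
from sympy.logic.boolalg import to_cnf

# x will be forgotten; a1..a3 and b1..b3 form the visible alphabet.
x, a1, a2, a3, b1, b2, b3 = symbols('x a1 a2 a3 b1 b2 b3')

def forget(formula, var):
    """Forget `var` via existential quantification: F[var=True] | F[var=False]."""
    return Or(formula.subs(var, True), formula.subs(var, False))

# A "linking" pattern: x occurs positively in three clauses and negatively in three.
F = And(x | a1, x | a2, x | a3, ~x | b1, ~x | b2, ~x | b3)   # 6 clauses
G = to_cnf(forget(F, x), simplify=True)

print("clauses before forgetting:", len(F.args))   # 6
print("clauses after forgetting: ", len(G.args))   # expected 9 = 3 x 3 (cross product)
print(G)
```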

A complementary perspective uses common equivalence, where the scaling law follows:

  • The worst-case size of the forgotten formula grows exponentially with the number of forgotten variables, unless auxiliary variables are introduced (Liberatore, 2020).

Computational procedures (body_replace, head_implicates) allow exponentially large outputs to be computed in polynomial space, while minimizing formula size under common equivalence is NP-hard if variable introduction is permitted.

Loss functions for inferential strength augment this with quantitative scaling (Doherty et al., 3 Apr 2024):

  • For strong (existential) forgetting: $loss^{NC}_m = P(Th) - P(F^{NC}(Th; \vec{p}))$
  • For weak (universal) forgetting: $loss^{SC}_m = P(F^{SC}(Th; \vec{p})) - P(Th)$

Both monotonicity and additivity properties hold: forgetting more symbols always increases the loss, which quantifies the gap in inferential power as a function of forgotten content.

3. Scaling Laws in Neural Network Architectures and Recurrent Models

Standard LSTMs employ exponential decay in their forget gates, resulting in rapid information loss. The power-law forget gate replaces this with:

  • $c_t = c_0 \cdot (t - t_0 + 1)^{-p}$

Here, $p$ is a learnable decay factor, enabling slower decay that can be tuned per unit, task, and timescale (Chien et al., 2021). Experimentally, power-law forgetting in LSTMs (pLSTM) allows retention of memory traces over hundreds to thousands of steps, outperforming vanilla models on long-range tasks. The parameter $p$ directly scales the “memory length,” providing a controllable retention-fading spectrum.
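
A minimal numpy sketch (illustrative exponents, not trained values) contrasts the cell-state magnitude under the power-law gate with a constant exponential forget gate over a long horizon:

```python
import numpy as np

def power_law_trace(c0, p, steps, t0=0):
    """Cell-state magnitude under power-law forgetting: c_t = c0 * (t - t0 + 1)^(-p)."""
    t = np.arange(t0, t0 + steps)
    return c0 * np.power(t - t0 + 1.0, -p)

def exponential_trace(c0, f, steps):
    """Standard LSTM-style decay with a constant forget-gate value f in (0, 1)."""
    return c0 * np.power(f, np.arange(steps))

steps = 1000
for p in (0.1, 0.5, 1.0):   # per-unit learnable decay exponents (illustrative)
    print(f"p={p}: fraction retained after {steps} steps = "
          f"{power_law_trace(1.0, p, steps)[-1]:.2e}")
print(f"constant gate f=0.99: fraction retained = {exponential_trace(1.0, 0.99, steps)[-1]:.2e}")
```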

Selective memory expiration in Transformers (Expire-Span) generalizes this principle to attention-based models:

  • Each token gets a learnable expiration span $e_i = L \cdot \sigma(w^T h_i + b)$
  • Memory is masked as $m_{t,i} = \max(0, \min(1, 1 + r_{t,i}/R))$ with $r_{t,i} = e_i - (t - i)$ (Sukhbaatar et al., 2021)

This creates dynamic, context-dependent retention windows that scale to tens of thousands of timesteps, enhancing both efficiency and memory management.
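
A rough numpy sketch of the masking rule (random hidden states and untrained parameters $w$, $b$; the span limit $L$ and ramp $R$ are illustrative) shows how the mask partitions past tokens into retained, ramping, and expired memory:

```python
import numpy as np

rng = np.random.default_rng(0)

def expire_span_mask(h, t, L=1024.0, R=32.0, w=None, b=0.0):
    """Soft retention mask over past tokens i = 0..t-1 at current step t:
    e_i   = L * sigmoid(w^T h_i + b)     expiration span per token
    r_t,i = e_i - (t - i)                remaining lifetime
    m_t,i = clip(1 + r_t,i / R, 0, 1)    1 = keep, 0 = expired, linear ramp between
    """
    if w is None:
        w = rng.normal(size=h.shape[-1]) / np.sqrt(h.shape[-1])
    e = L / (1.0 + np.exp(-(h @ w + b)))
    r = e - (t - np.arange(h.shape[0]))
    return np.clip(1.0 + r / R, 0.0, 1.0)

h = rng.normal(size=(2048, 64))          # hidden states of 2048 past tokens (illustrative)
m = expire_span_mask(h, t=2048)
print("fully retained:", np.mean(m == 1.0), " fully expired:", np.mean(m == 0.0))
```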

4. Forgetting during Fine-Tuning of LLMs

Forgetting in LLMs manifests as the loss of pretraining knowledge and safety guardrails during fine-tuning. Two recent studies establish quantitative scaling laws:

  • There is an inverse linear relationship between fine-tuning loss and forgetting: $\mathcal{L}_f(\mathcal{L}_{ft}) = -c_{f,ft} \cdot \mathcal{L}_{ft} + s_{f,ft}$ (Kalajdzievski, 11 Jan 2024)
  • Both forgetting and fine-tuning loss scale as shifted power laws in the number of parameters fine-tuned ($P$) and the number of update steps ($N$): $\mathcal{L}_{ft}(P, N) = c_{ft} \cdot \left[(a_{ft} P)^{\alpha_{ft}} + (b_{ft} N)^{\beta_{ft}}\right]^{\rho} + s_{ft}$

These laws predict that more aggressive fine-tuning (larger P, more steps, lower fine-tuning loss) unavoidably increases forgetting. Parameter-efficient fine-tuning methods (LoRA) mitigate computational cost but do not eliminate this effect.
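
The sketch below plugs hypothetical coefficients into the two laws purely to illustrate their functional form (the fitted constants in the cited work differ); it shows forgetting growing as more parameters are tuned and more steps are taken.

```python
# Hypothetical coefficients, chosen only so the curves behave qualitatively as described.
c_ft, a_ft, b_ft, alpha, beta, rho, s_ft = 1.0, 1e-9, 1e-3, -0.3, -0.3, 1.0, 0.5
c_f, s_f = 1.0, 4.0   # inverse linear relation between forgetting and fine-tuning loss

def finetune_loss(P, N):
    """Shifted power law: L_ft(P, N) = c_ft * [(a_ft P)^alpha + (b_ft N)^beta]^rho + s_ft."""
    return c_ft * ((a_ft * P) ** alpha + (b_ft * N) ** beta) ** rho + s_ft

def forgetting(L_ft):
    """L_f = -c_f * L_ft + s_f: the lower the fine-tuning loss, the larger the forgetting."""
    return -c_f * L_ft + s_f

for P, N in [(1e8, 1e3), (1e9, 1e3), (1e9, 1e4)]:
    L_ft = finetune_loss(P, N)
    print(f"P={P:.0e}, N={N:.0e}: L_ft={L_ft:.3f}, forgetting={forgetting(L_ft):.3f}")
```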

Mitigation through pretraining data injection during fine-tuning (mixing as little as 1% pretraining data) arrests forgetting and anchors retention of generic knowledge (Bethune et al., 9 Feb 2025):

  • Pretraining loss after fine-tuning scales as $\mathcal{L}_{pt} = \mathcal{L}_{pt}^0 + A \cdot D_{ft}^{\beta} / \big((1 + Bp)\, N^{\alpha}\big) + E$
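
A small numerical sketch of this law (all constants hypothetical; only the functional form comes from the text) makes the anchoring effect visible: even a 1% mixing fraction $p$ sharply shrinks the forgetting term.

```python
# Hypothetical constants for the law L_pt = L_pt0 + A * D_ft^beta / ((1 + B*p) * N^alpha) + E,
# where p is the fraction of pretraining data mixed into the fine-tuning stream.
L_pt0, A, B, E = 2.0, 0.5, 300.0, 0.0
alpha, beta = 0.3, 0.2
D_ft, N = 1e7, 1e9          # illustrative magnitudes for the remaining quantities in the law

def pretrain_loss_after_ft(p):
    return L_pt0 + A * D_ft ** beta / ((1.0 + B * p) * N ** alpha) + E

for p in (0.0, 0.01, 0.05, 0.10):
    print(f"p={p:.2f}: L_pt = {pretrain_loss_after_ft(p):.4f}")
```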

These results highlight the universality and practical relevance of scaling laws, enabling extrapolation and optimization of fine-tuning regimes to minimize undesired memory drift.

5. Scaling Limits and Tradeoffs in Continual and Reinforcement Learning

Continual learning exposes networks to sequences of non-stationary tasks, leading to catastrophic forgetting. The impact of scaling is nuanced:

  • In “lazy” regimes (minimal feature learning, $\gamma_0 \to 0$), increasing width reduces forgetting (Graldi et al., 20 Jun 2025).
  • In “rich” regimes (high feature learning, $\gamma_0 \to 1$), overparameterization does not alleviate forgetting; the optimal stability-plasticity tradeoff is found at an intermediate $\gamma_0^\star$ (often $\approx 0.1$).

Dynamical mean field theory provides self-consistent equations linking model scale, feature evolution, and task similarity to retention performance.

In reinforcement learning, primacy bias (overfitting early experiences) can be addressed through controlled forgetting (Experience Replay Decay) and dynamic network expansion (Kang et al., 3 Jul 2025):

  • ER Decay applies an exponential decay to sampling weights: $W_{t,i} = \max(T, (1-\epsilon)^{t-2})$
  • Network Expansion adds blocks to neural architectures during training, akin to “infantile amnesia,” where new neurons promote forgetting of outdated traces.

This dual “forget-and-grow” mechanism unlocks scalability and generalization, empirically bounding the sample count for any transition and facilitating robust scaling to large networks and diverse environments.
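
A minimal sketch of the ER Decay weighting (reading $t$ as the age of a stored transition, which is one plausible interpretation; the floor $T$ and rate $\epsilon$ are illustrative) shows how sampling is biased away from the oldest experiences while never dropping them entirely:

```python
import numpy as np

rng = np.random.default_rng(0)

def er_decay_weights(ages, eps=1e-3, floor=1e-2):
    """Sampling weights W = max(T, (1 - eps)^(t - 2)), with t the age of each
    transition and T a floor that prevents old transitions from vanishing entirely."""
    return np.maximum(floor, (1.0 - eps) ** (ages - 2))

ages = np.arange(1, 100_001)                         # replay buffer of 100k transitions
w = er_decay_weights(ages)
probs = w / w.sum()
batch = rng.choice(len(ages), size=256, p=probs)     # decayed (recency-biased) sampling
print(f"mean age in sampled batch: {ages[batch].mean():.0f}")
print(f"mean age in full buffer:   {ages.mean():.0f}")
```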

6. Unlearning and Stability in LLMs

Machine unlearning requires selectively removing the influence of undesirable data without destabilizing the model. The gradient difference method, which applies descent on retained data and ascent on forget data, is unstable under cross-entropy loss—theoretically shown to cause unbounded growth of weights and gradients (Garg et al., 29 Sep 2025):

  • If $\|a_{L-1}(t)\| \leq C_1$, then $\|z(t)\| \leq \|W_L(t)\| \cdot C_1$; as $\mathcal{L} \to \infty$, $\|W_L(t)\|$ must diverge.

Bounding the LoRA adapter update with a function $\phi$ (e.g., $\tanh$ or $\sin(\omega\,\cdot)$) stabilizes the forgetting operation:

  • Modified adapter: $h = W_0 x + \phi(AB^T)x$, with $\phi$ bounded.

Empirically, this enables stable, scalable parameter-efficient unlearning across architectures (GPT-Neo, Phi, LLaMA families) and sizes (125M–8B), with improved forget quality, privacy, and retention. The rank-agnostic nature further establishes a scaling law: efficiency of forgetting remains robust as model size and adapter parameters increase, provided the bounded update stabilizes optimization.
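
A minimal numpy sketch of the bounded adapter (shapes, initialization, and the choice $\phi = \tanh$ are illustrative; this is not the released implementation) shows the forward pass $h = W_0 x + \phi(AB^T)x$ with the low-rank update clamped elementwise:

```python
import numpy as np

rng = np.random.default_rng(0)

def bounded_lora_forward(x, W0, A, B, phi=np.tanh):
    """h = W0 x + phi(A B^T) x, with phi applied elementwise so the low-rank
    update stays bounded even under gradient-ascent unlearning."""
    delta = phi(A @ B.T)          # bounded update, shape (d_out, d_in)
    return W0 @ x + delta @ x

d_out, d_in, r = 64, 32, 4                             # illustrative layer size and adapter rank
W0 = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)    # frozen pretrained weight
A = rng.normal(size=(d_out, r)) * 0.01                 # trainable factor
B = np.zeros((d_in, r))                                # zero-init so the adapter starts as a no-op

h = bounded_lora_forward(rng.normal(size=d_in), W0, A, B)
print("max |phi(A B^T)| =", np.abs(np.tanh(A @ B.T)).max(), "(always <= 1)")
print("output shape:", h.shape)
```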

7. Conclusion

Scaling laws for forgetting permeate biological, cognitive, logical, and machine learning systems. Universal exponents, power-law relationships, and exponential size blow-ups define the limits and capabilities of memory retention and controlled forgetting. Recent results rigorously quantify these laws in large-scale neural and symbolic models, revealing both practical constraints (memory, computation, tradeoffs with retention) and design principles (anchoring mechanisms, dynamic architectures, bounded updates). These findings collectively dictate how future systems can be tuned or adapted for optimal balance between learning efficiency and memory stability across domains.
