Primacy Bias in Deep RL

Updated 24 April 2026

This overview identifies primacy bias as the over-representation of early experiences in replay buffers, which causes loss of neural plasticity and degraded task adaptability.
It explains the underlying mechanisms, such as gradient collapse, activation sparsity, and high-curvature regions in parameter space quantified by the Fisher Information Matrix.
It covers proposed remedies such as periodic resets, activation function replacements, and experience replay decay, which mitigate the bias and improve continual learning.

Primacy bias in deep reinforcement learning (RL) is the empirically grounded tendency for neural network–based agents, when trained continually or with high replay ratios, to overfit their earliest experiences and consequently lose the ability to adapt to new data or tasks. This effect manifests across value-based and actor–critic methods, on both stationary and non-stationary task streams, and is associated with a loss of plasticity in the neural function approximator. Modern research describes and quantifies primacy bias, analyzes network-level mechanisms, and proposes algorithmic modifications to prevent plasticity collapse and restore continual learning capacity (Nikishin et al., 2022, Abbas et al., 2023, Chung et al., 2024, Kang et al., 3 Jul 2025, Falzari et al., 2 Feb 2025).

1. Formal Definitions and Theoretical Foundations

In deep RL with experience replay, primacy bias is the property that early transitions in the replay buffer receive a disproportionate number of gradient updates, especially as the buffer grows. For a buffer of $N$ samples and a replay ratio $R$ (number of SGD updates per environment step), the expected number of updates involving the $i$-th sample (added at time $i$) up to time $T$ is approximately $c_i(T)=R\,[\ln T-\ln i]$, so the earliest experiences dominate the optimization trajectory (Nikishin et al., 2022, Kang et al., 3 Jul 2025).
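The logarithmic form can be checked numerically. The sketch below is not taken from the cited papers. It assumes uniform sampling, one sampled transition per update, and a buffer that holds exactly $t$ transitions at step $t$ with nothing evicted. Under those assumptions, transition $i$ is drawn with probability $1/t$ in each of the $R$ updates at step $t$, and summing over $t=i,\dots,T$ gives roughly $R\,[\ln T-\ln i]$. The function names are illustrative.

```python
import math


def expected_updates(i: int, T: int, R: float) -> float:
    """Exact expected number of updates that touch transition i by step T.

    Assumptions (illustrative): uniform sampling, one sample per update,
    and a buffer of size t at step t (no eviction).
    """
    return R * sum(1.0 / t for t in range(i, T + 1))


def log_approximation(i: int, T: int, R: float) -> float:
    """Closed-form estimate c_i(T) = R * (ln T - ln i)."""
    return R * (math.log(T) - math.log(i))


if __name__ == "__main__":
    T, R = 100_000, 4
    for i in (1, 100, 10_000, 90_000):
        exact = expected_updates(i, T, R)
        approx = log_approximation(i, T, R)
        print(f"i={i:>6}  exact={exact:8.3f}  approx={approx:8.3f}")
```

Under these assumptions, the earliest transitions accumulate many times more updates than late ones, which is the skew described above. For very small $i$ the exact sum exceeds the logarithmic estimate by a roughly constant offset.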

In continual RL, where an agent cycles through tasks or environments without resets, primacy bias arises as “experiences seen early in training exert an outsized and enduring influence on the network parameters, to the point that the agent stops adapting to new tasks or environment variations” (Abbas et al., 2023). The parameterization of the Q-function or policy becomes inert in regions required for later tasks; gradients with respect to new data vanish due to dead activations or ill-conditioned weights (Abbas et al., 2023, Chung et al., 2024).

The phenomenon can also be formalized via the Fisher Information Matrix (FIM): early experiences induce high-curvature regions in parameter space, locking network weights so that subsequent updates are less effective. In the trace of the FIM, this appears as a “memorization phase” followed by a sharp drop and a plateau, which signals reduced plasticity (Falzari et al., 2 Feb 2025).
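As a toy illustration of the quantity being tracked, the sketch below computes the trace of the Fisher information for a logistic model $p(y=1\mid x)=\sigma(w^\top x)$. The per-sample score is $(y-p)\,x$, so taking the expectation over labels drawn from the model gives a trace of $p(1-p)\,\|x\|^2$. This is an assumption-laden stand-in. The estimator used for deep actor and critic networks in the cited work may differ, and the function name is hypothetical.

```python
import numpy as np


def fisher_trace_logistic(w: np.ndarray, X: np.ndarray) -> float:
    """Trace of the Fisher information of a logistic model, averaged over rows of X.

    Uses the closed form E_y[||(y - p) x||^2] = p (1 - p) ||x||^2, where the
    expectation over y is taken under the model itself.
    """
    p = 1.0 / (1.0 + np.exp(-X @ w))          # predicted probabilities
    sq_norms = np.sum(X ** 2, axis=1)         # ||x||^2 per input
    return float(np.mean(p * (1.0 - p) * sq_norms))


if __name__ == "__main__":
    X = np.array([[1.0, 0.0], [0.0, 2.0]])
    print(fisher_trace_logistic(np.zeros(2), X))            # unsaturated model
    print(fisher_trace_logistic(np.array([10.0, 10.0]), X)) # saturated model
```

In this toy model the trace shrinks as the predictions saturate, so logging such a scalar over training is one simple way to watch for a drop-and-plateau pattern.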

2. Empirical Manifestations and Diagnostics

Primacy bias is consistently observed in standard off-policy deep RL with high replay ratios, in continual learning scenarios, and with non-stationary reward/task streams. The following metrics and indicators are reported:

Performance Degradation: Episodic returns on revisited tasks in continual RL settings decline over successive visits, relative to a reset baseline (Abbas et al., 2023).
Gradient Collapse: The $\ell_p$-norms of gradients, especially in critical layers, fall by over an order of magnitude as the agent cycles through tasks; $\|\nabla_\theta L\|_p\rightarrow 0$ indicates the network becomes untrainable (Abbas et al., 2023).
Activation Sparsity: The fraction of active post-activation units plummets in downstream MLP layers (to $<1\%$ in some cases), corresponding to “activation collapse” and cessation of learning (Abbas et al., 2023).
Plasticity Loss: In sequential-task regimes, success rate on new tasks typically drops to near zero and recovers poorly (Chung et al., 2024).
Replay Bias Quantification: The primacy ratio (sampling weight of the earliest fraction $p$ of the data divided by that of the latest fraction $p$) greatly exceeds one in unmitigated agents, reflecting disproportionate sampling of initial experiences (Nikishin et al., 2022, Kang et al., 3 Jul 2025).
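Several of these indicators are cheap to log during training. The following is a minimal sketch with illustrative definitions rather than the exact ones from the cited papers. It assumes a unit counts as active if it is positive for at least one input in a batch, gradients are supplied as a list of arrays, and per-transition sample counts are ordered by insertion time. All function names are hypothetical.

```python
import numpy as np


def active_fraction(post_activations: np.ndarray) -> float:
    """Fraction of units that fire for at least one input.

    post_activations has shape (batch, units).
    """
    return float(np.mean(np.any(post_activations > 0, axis=0)))


def grad_norm(grads: list, p: float = 2.0) -> float:
    """l_p norm of all gradient arrays, flattened and concatenated."""
    flat = np.concatenate([np.asarray(g).ravel() for g in grads])
    return float(np.linalg.norm(flat, ord=p))


def primacy_ratio(sample_counts: np.ndarray, frac: float = 0.1) -> float:
    """Sampling mass of the earliest `frac` of the buffer over the latest `frac`.

    sample_counts[i] is how often transition i was sampled; index 0 is oldest.
    """
    counts = np.asarray(sample_counts, dtype=float)
    k = max(1, int(len(counts) * frac))
    latest = counts[-k:].sum()
    if latest == 0.0:
        return float("inf")
    return float(counts[:k].sum() / latest)


if __name__ == "__main__":
    acts = np.array([[0.0, 1.0, 0.0], [0.0, 2.0, 0.0]])
    print(active_fraction(acts))
    print(grad_norm([np.array([3.0]), np.array([[4.0]])]))
    print(primacy_ratio(np.array([10, 8, 6, 4, 2, 1, 1, 1, 1, 1]), frac=0.5))
```

Tracking these values per layer and per task switch makes it easier to see whether a drop in return coincides with shrinking gradients or dying units.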

Table: Typical Primacy Bias Signatures Across Scenarios

Manifestation	Metric or Signal	Reference
Early experience overfit	Primacy ratio $>1$	(Nikishin et al., 2022)
Gradient collapse	Diminishing $\|\nabla_\theta L\|_p$	(Abbas et al., 2023)
"Dead" activations	Active units $<1\%$ in value/advantage layers	(Abbas et al., 2023)
Loss of task plasticity	Declining per-task return, poor recovery	(Chung et al., 2024)

3. Mechanistic Analyses: Network Dynamics and Replay

The primacy bias is driven by a combination of replay buffer effects, optimizer dynamics, and neural function approximation properties:

Replay Buffer Chronology: In uniform replay with fixed-rate updates, early samples persist longer and thus dominate learning. By the estimate $c_i(T)=R\,[\ln T-\ln i]$, the expected cumulative “usage count” of the oldest transition grows as $R\ln T$, whereas a transition added at a time $i$ close to $T$ has barely been used, since $\ln T-\ln i\approx 0$ (Kang et al., 3 Jul 2025, Nikishin et al., 2022).
Optimizer Effects: Under high update-to-data (UTD) ratios, Adam’s second-moment estimates amplify spurious large gradients on early or out-of-distribution actions, driving Q-values to diverge and locking weights to early modes (“value divergence”) (Hussing et al., 2024).
Activation Collapse: ReLU-based networks in continual settings develop extreme sparsity, with most downstream units clamped to zero; no gradient is passed, so representations needed for new tasks never emerge (Abbas et al., 2023).
Weight Matrix Conditioning: During sequential task learning, weight-matrix rank and stable rank decrease, neuron-weight correlations rise, and input–output Jacobian variance increases, resulting in a network less able to update along novel task directions (Chung et al., 2024).
Parameter Geometry: Early data create high-curvature regions (FIM eigenvalues increase), rendering weights locally “sticky” and later innovations ineffective unless plasticity is explicitly restored (Falzari et al., 2 Feb 2025).
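The weight-conditioning signal can be made concrete with the stable rank. The sketch below uses the standard definition $\|W\|_F^2/\|W\|_2^2$ and assumes this matches the quantity reported in the cited work. The function name is illustrative.

```python
import numpy as np


def stable_rank(W: np.ndarray) -> float:
    """Stable rank ||W||_F^2 / ||W||_2^2, computed from singular values."""
    s = np.linalg.svd(W, compute_uv=False)   # sorted in descending order
    return float(np.sum(s ** 2) / s[0] ** 2)


if __name__ == "__main__":
    print(stable_rank(np.eye(4)))                            # well conditioned
    print(stable_rank(np.outer([1.0, 2.0], [3.0, 4.0])))     # rank-one matrix
```

A value near the matrix dimension indicates that many directions carry comparable weight, while a value near one indicates that a single direction dominates.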

4. Algorithmic Remedies and Regularization Approaches

Several mitigation strategies have been introduced and empirically validated:

Partial or Periodic Resets: Re-initializing parts or all of the network at fixed intervals (without resetting the buffer) “breaks” the effect of primacy bias, allowing rapid relearning of relevant representations while leveraging stored data (Nikishin et al., 2022, Kim et al., 2023). The method improves sample efficiency but can cause performance collapses immediately after reset unless ensemble averaging is employed (Kim et al., 2023).
Deep Ensemble Resetting: Resetting individual members of a policy or value ensemble sequentially and aggregating actions using the Q-value of the oldest head removes abrupt collapse after reset and provides both diversity and continual access to prior experience (Kim et al., 2023).
Activation Function Replacement: Swapping ReLU for Concatenated ReLU (CReLU), which guarantees at least half of units remain active, restores gradient flow, prevents complete activation sparsity, and maintains performance across task switches (Abbas et al., 2023).
Weight Matrix Regularization: Parseval regularization penalizes deviations from orthogonality in weight matrices, preserving isometry and thus plasticity. Networks retain high stable rank, near-zero neuron-weight correlation, and uniform Jacobian variance under sequential tasks (Chung et al., 2024).
Fisher-Guided Selective Forgetting (FGSF): Periodically injecting FIM-shaped noise into parameters selectively disrupts early “memorization modes” and reestablishes capacity to represent new data, especially effective for the critic (Falzari et al., 2 Feb 2025).
Experience Replay Decay (ER Decay): Assigns time-decaying weights to replay buffer contents so that older transitions are sampled with exponentially decreasing probability. This bounds each transition’s expected usage and flattens the replay-driven training skew (Kang et al., 3 Jul 2025).
Network Expansion: Dynamically enlarging the critic by adding new residual blocks rejuvenates plasticity after capacity collapse, as fresh parameters are unexposed to overfit early data (Kang et al., 3 Jul 2025).
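Three of the remedies above reduce to a few lines each. The sketch below is illustrative and not the authors' code. CReLU is written as the concatenation of ReLU(x) and ReLU(-x). The Parseval-style penalty is written as $\|WW^\top - sI\|_F^2$ with an assumed scale parameter `scale`. The decayed sampling rule assumes a weight of `decay ** age`, with age measured in environment steps since insertion. The exact schedules and coefficients in the cited papers may differ.

```python
import numpy as np


def crelu(x: np.ndarray) -> np.ndarray:
    """Concatenated ReLU: [relu(x), relu(-x)] along the feature axis.

    For any non-zero pre-activation, exactly one of the two halves is active.
    """
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=-1)


def parseval_penalty(W: np.ndarray, scale: float = 1.0) -> float:
    """Squared Frobenius distance of W W^T from scale * I (assumed form)."""
    gram = W @ W.T
    return float(np.sum((gram - scale * np.eye(W.shape[0])) ** 2))


def decayed_sampling_probs(ages, decay: float = 0.999) -> np.ndarray:
    """Sampling probabilities proportional to decay ** age (assumed schedule)."""
    weights = decay ** np.asarray(ages, dtype=float)
    return weights / weights.sum()


if __name__ == "__main__":
    print(crelu(np.array([[1.0, -2.0]])))
    print(parseval_penalty(2.0 * np.eye(2)))
    rng = np.random.default_rng(0)
    probs = decayed_sampling_probs(ages=np.arange(5), decay=0.5)
    batch_indices = rng.choice(5, size=3, p=probs)   # indices into the buffer
    print(probs, batch_indices)
```

Note that CReLU doubles the width of the layer output, so the next layer's input dimension has to be adjusted accordingly.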

5. Primacy Bias in Model-Based RL and Continual Learning

In model-free RL, primacy bias resides mainly in the agent's parameters, and resetting the agent network disrupts overfitting to initial experience. In model-based RL, however, “primacy” most strongly affects the world model, that is, the learned dynamics function. Overfitting to early buffer transitions in the world model impairs subsequent policy performance, even if the agent’s networks remain flexible (Qiao et al., 2023). Resetting only the world model’s last layers, on a schedule tuned to the model UTD ratio, preserves adaptation and yields significant gains in continuous- and discrete-control domains (Qiao et al., 2023).
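A partial reset of this kind can be sketched as follows. The code is a hypothetical illustration, not the cited implementation. It assumes the model is stored as a list of `(W, b)` pairs with `W` of shape `(fan_in, fan_out)`, that fresh weights are drawn uniformly from $\pm 1/\sqrt{\text{fan\_in}}$, and that resets fire at a fixed step interval. The replay buffer is left untouched.

```python
import numpy as np


def should_reset(step: int, interval: int) -> bool:
    """Fixed-interval schedule (assumed): reset every `interval` steps, not at step 0."""
    return step > 0 and step % interval == 0


def reset_last_layers(layers: list, num_reset: int, rng: np.random.Generator) -> list:
    """Return a new layer list whose last `num_reset` (W, b) pairs are re-initialized.

    Earlier layers are reused as-is, so learned low-level features are kept.
    """
    if not 0 <= num_reset <= len(layers):
        raise ValueError("num_reset must be between 0 and len(layers)")
    keep = len(layers) - num_reset
    new_layers = list(layers[:keep])
    for W, b in layers[keep:]:
        bound = 1.0 / np.sqrt(W.shape[0])            # assumed init scale
        fresh_W = rng.uniform(-bound, bound, size=W.shape)
        new_layers.append((fresh_W, np.zeros_like(b)))
    return new_layers


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model = [(np.ones((4, 8)), np.ones(8)), (np.ones((8, 2)), np.ones(2))]
    for step in range(0, 3001, 500):
        if should_reset(step, interval=1000):
            model = reset_last_layers(model, num_reset=1, rng=rng)
            print(f"step {step}: reset last layer")
```

In practice the reset interval would be chosen relative to the update-to-data ratio, since a higher ratio means more gradient steps between resets.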

In continual RL, primacy bias is tightly linked with loss of plasticity: agents show sensitivity to task order, declining improvement on successive tasks, and failure to learn or recover on new tasks. Keeping weight matrices close to orthogonal (Chung et al., 2024) and keeping feature activations from collapsing (Abbas et al., 2023) are central to retaining plasticity in continual regimes.

6. Comparative Empirical Insights and Benchmarks

A body of experiments across Atari 100k, DeepMind Control Suite, MuJoCo, MetaWorld, CARL, and HumanoidBench demonstrates that uncorrected primacy bias severely degrades final return, learning speed, and adaptability. For example, in off-policy RL, combining ER Decay with network expansion (“Forget+Grow”, FoG) yields a state-of-the-art normalized return of 0.92, surpassing the strong baselines SimBa, BRO, and TD-MPC2 (0.69–0.76) over 40+ tasks (Kang et al., 3 Jul 2025). Parseval regularization delivers a 20 pp absolute increase in continual RL success on MetaWorld 20–10, a 50% gain in median GridWorld success rate, and a 10–15 pp gain on CARL tasks (Chung et al., 2024). FGSF leads to a 25–50% improvement on complex DMC tasks over standard and reset-based SAC (Falzari et al., 2 Feb 2025). Reset Deep Ensemble methods eliminate post-reset performance collapses and improve safety, halving constraint violations in Safe RL settings relative to vanilla resets (Kim et al., 2023).

7. Practical Recommendations and Future Directions

Best practices for mitigating primacy bias include:

Monitor replay sampling ratios and plasticity-related metrics (gradient norms, activation sparsity, weight matrix rank, FIM trace) to detect early onset of bias.
Implement periodic partial resets or ensemble-based resets, tuning interval and layers for network size and task nonstationarity (Nikishin et al., 2022, Kim et al., 2023).
Employ architectural and regularization changes (CReLU, Parseval regularization, network expansion) to keep units active and avoid capacity collapse (Abbas et al., 2023, Chung et al., 2024, Kang et al., 3 Jul 2025).
Use replay and optimizer schemes (ER Decay, output feature normalization (OFN), FIM-based noise, Adam parameter tuning) that reduce the dominance of early data (Hussing et al., 2024, Kang et al., 3 Jul 2025, Falzari et al., 2 Feb 2025).
In model-based RL, prioritize regular re-initialization of the world model, not the agent policy, to ensure adaptation to new buffer evidence (Qiao et al., 2023).
Adjust replay ratios, buffer sizes, and task shuffling to balance stability and adaptability; avoid excessive replay of early transitions.

The collective results suggest that overcoming primacy bias is fundamental for robust, sample-efficient continual and off-policy deep RL. Ongoing research pinpoints the control of experience replay, maintenance of parameter plasticity, and architectural regularization as key design axes for future algorithm development (Nikishin et al., 2022, Abbas et al., 2023, Chung et al., 2024, Kang et al., 3 Jul 2025, Falzari et al., 2 Feb 2025).