Continual Learning with Experience and Replay
- The paper introduces CLEAR, a framework that leverages experience replay to counteract catastrophic forgetting during sequential learning.
- It combines on- and off-policy updates with behavioral cloning to balance fast adaptation and long-term stability.
- Empirical results on DeepMind Lab and Atari illustrate CLEAR's robust retention of past skills even under strict memory constraints.
Continual Learning with Experience And Replay (CLEAR) is a foundational approach devised to address the problem of catastrophic forgetting in neural networks during sequential task learning, especially within reinforcement learning (RL) frameworks. Catastrophic forgetting denotes the tendency of models, particularly those trained via stochastic gradient descent, to lose proficiency on previous tasks when adapted to new, possibly dissimilar tasks. CLEAR's central contribution lies in leveraging an experience replay buffer—traditionally used for sample efficiency in RL—to harmonize the competing needs of stability (retaining old knowledge) and plasticity (quick adaptation to new experiences), under both unconstrained and memory-limited regimes.
1. Experience Replay and Architectural Foundations
The core mechanism in CLEAR is an experience replay buffer that accumulates trajectories or environmental interactions—comprising states, actions, rewards, and policy outputs—over the agent's entire learning history. During every step of training, the agent does not train solely on its most recent trajectories, but rather samples both new and stored data, enabling continued rehearsal of prior experiences.
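As an illustration of the data such a buffer needs to retain, here is a minimal Python sketch; the class and field names are illustrative assumptions rather than the paper's actual implementation, and a memory-bounded variant appears in the section on memory constraints below.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Step:
    # One step of agent-environment interaction, as stored for later replay.
    state: object
    action: int
    reward: float
    behavior_logits: object  # policy outputs mu(.|state) recorded at collection time
    behavior_value: float    # value estimate V(state) recorded at collection time

@dataclass
class ReplayBuffer:
    capacity: int
    storage: list = field(default_factory=list)  # trajectory segments (lists of Steps)

    def add(self, segment: list) -> None:
        # Store a trajectory segment; this simple version stops adding once full
        # (see the reservoir-sampling variant below for the memory-constrained case).
        if len(self.storage) < self.capacity:
            self.storage.append(segment)

    def sample(self, n: int) -> list:
        # Uniform sampling over the stored history for off-policy replay.
        return random.sample(self.storage, min(n, len(self.storage)))
```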
CLEAR is built atop the V-Trace off-policy actor-critic algorithm (see Espeholt et al., 2018), which enables effective learning from off-policy (replayed) data. The V-Trace value target for state $x_s$ is

$$v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \Big( \prod_{i=s}^{t-1} c_i \Big) \delta_t V, \qquad \delta_t V = \rho_t \big( r_t + \gamma V(x_{t+1}) - V(x_t) \big),$$

where the temporal-difference errors $\delta_t V$ are importance-weighted by the truncated ratios $\rho_t = \min\!\big(\bar{\rho}, \tfrac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\big)$ and $c_i = \min\!\big(\bar{c}, \tfrac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\big)$; these weights correct for the discrepancy between the policy currently being optimized ($\pi$) and the behavior policy under which the experience in the buffer was generated ($\mu$).
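To make the target concrete, the following is a NumPy sketch of the V-Trace recursion for a single trajectory segment; the function name and argument layout are assumptions for illustration, not code from the paper.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, is_ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-Trace value targets v_s for one trajectory segment.

    rewards, values, is_ratios are length-T arrays; is_ratios[t] is the raw
    importance ratio pi(a_t|x_t) / mu(a_t|x_t) between the current policy pi
    and the behavior policy mu recorded in the buffer. bootstrap_value is the
    value estimate for the state following the segment.
    """
    T = len(rewards)
    rhos = np.minimum(rho_bar, is_ratios)  # truncated rho_t weights
    cs = np.minimum(c_bar, is_ratios)      # truncated c_t ("trace-cutting") weights

    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * next_values - values)  # rho_t * TD error

    # Backward recursion: v_s = V(x_s) + delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1})).
    vs = np.zeros(T)
    acc = 0.0  # carries v_{s+1} - V(x_{s+1})
    for t in reversed(range(T)):
        vs[t] = values[t] + deltas[t] + gamma * cs[t] * acc
        acc = vs[t] - values[t]
    return vs
```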
On-policy and off-policy learning are mixed by sampling a ratio of recent experiences (on-policy) and stored buffer experiences (off-policy).
2. On- and Off-Policy Integration for Stability and Plasticity
CLEAR achieves a principled trade-off between the preservation of old skills and the adoption of new behaviors by blending on-policy and off-policy updates. A typical configuration is a 50:50 ratio of new to replayed samples in each minibatch, though alternative splits (e.g., 75:25) are also effective. On-policy samples ensure continued plasticity—rapid learning adapted to the prevailing environment—while off-policy replay stabilizes earlier-acquired knowledge by preventing drift as new data is absorbed.
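A minimal sketch of how such a mixed minibatch could be assembled, reusing the hypothetical ReplayBuffer above; the 0.5 default reproduces the 50:50 split, while 0.25 would give 75:25.

```python
import random

def build_minibatch(new_segments, buffer, batch_size=32, replay_fraction=0.5):
    """Mix freshly collected (on-policy) segments with replayed (off-policy) ones.

    replay_fraction=0.5 gives the 50:50 new-to-replay split discussed above;
    replay_fraction=0.25 corresponds to a 75:25 split.
    """
    n_replay = int(batch_size * replay_fraction)
    n_new = batch_size - n_replay
    batch = list(new_segments[:n_new]) + buffer.sample(n_replay)
    random.shuffle(batch)  # avoid ordering effects within the minibatch
    return batch
```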
Empirical studies in both DeepMind Lab and Atari domains confirm that this mixture is crucial. Without replay (i.e., pure on-policy learning), catastrophic forgetting is pronounced; as replay increases, forgetting diminishes and stability across sequential tasks improves, even in scenarios lacking explicit task switch signals.
3. Behavioral Cloning as an Auxiliary Constraint
In addition to passing replayed experiences through its standard V-Trace policy and value losses, CLEAR applies behavioral cloning regularization to replayed data using two distinct losses:
- Policy cloning (KL divergence): $L_{\text{policy-cloning}} = \sum_a \mu(a \mid x_s)\, \log \frac{\mu(a \mid x_s)}{\pi_\theta(a \mid x_s)}$
This term encourages the current policy $\pi_\theta$ to mimic the behavior policy $\mu$ that was active during the original experience.
- Value function cloning (L2 loss): $L_{\text{value-cloning}} = \big\lVert V_\theta(x_s) - V_{\text{replay}}(x_s) \big\rVert_2^2$
This penalty compels the agent to preserve the same value function predictions on replayed experiences as were made when those experiences were first collected.
The addition of these cloning losses was shown to further mitigate forgetting and stabilize policy outputs, yielding performance comparable to multitask learners that require explicit task labels.
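The two regularizers can be written compactly; below is a small NumPy sketch for the discrete-action case, with array shapes and function names chosen for illustration.

```python
import numpy as np

def policy_cloning_loss(mu_probs, pi_probs, eps=1e-8):
    """KL(mu || pi) between the stored behavior policy mu and the current
    policy pi, averaged over replayed states (arrays of shape [batch, actions])."""
    kl = np.sum(mu_probs * (np.log(mu_probs + eps) - np.log(pi_probs + eps)), axis=-1)
    return kl.mean()

def value_cloning_loss(current_values, replayed_values):
    """Squared-error penalty keeping the current value predictions close to
    those recorded when the replayed experiences were first collected."""
    return np.mean((current_values - replayed_values) ** 2)
```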
4. Handling Memory Constraints
Experience replay in CLEAR is robust to buffer size limitations. When it is infeasible to store all historic experiences, reservoir sampling is used: this algorithm randomly overwrites buffer entries such that every experience has an equal likelihood of retention. Results indicate that even a buffer holding only a small fraction of the observed data delivers strong resistance to forgetting, with performance deteriorating significantly only for exceedingly small buffers.
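A standard reservoir-sampling buffer (Algorithm R) that realizes this uniform-retention property might look as follows; this is a generic sketch, not the paper's code.

```python
import random

class ReservoirBuffer:
    """Fixed-capacity buffer in which every item seen so far has an equal
    probability (capacity / num_seen) of currently being stored."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []
        self.num_seen = 0

    def add(self, item):
        self.num_seen += 1
        if len(self.storage) < self.capacity:
            self.storage.append(item)
        else:
            # Overwrite a random slot with probability capacity / num_seen.
            idx = random.randrange(self.num_seen)
            if idx < self.capacity:
                self.storage[idx] = item
```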
5. Generalization and Task-Agnostic Operation
CLEAR is uniquely designed to function without knowledge of task boundaries or identities. Unlike regularization-based approaches that rely on side information about task structure, CLEAR simply treats all data as a non-stationary stream, allowing the network to naturally revisit and maintain utility on prior tasks through stochastic replay. Old tasks reappear as a natural consequence of sampling from a chronologically diverse buffer.
Notably, probe task evaluations demonstrate that continually training on old and new samples can foster shared representations that not only protect prior knowledge but also accelerate adaptation to related new tasks, further evidencing the generality and efficiency of this method.
6. Empirical Performance and Comparative Analysis
CLEAR was benchmarked on both DeepMind Lab and Atari, matching or exceeding the performance of leading methods such as Progress & Compress (P&C) and Elastic Weight Consolidation (EWC), with the clearest advantage over EWC, while not requiring explicit task boundaries or identity indicators.
For example:
- DeepMind Lab: CLEAR nearly eliminates forgetting even when tasks are switched cyclically.
- Atari: CLEAR equals or outperforms baselines, particularly when compared to regularization approaches.
This compatibility with task-agnostic settings, combined with strong empirical results, positions the algorithm as an effective and scalable method for real-world continual RL.
| Component | Mechanism / Loss | Contribution |
|---|---|---|
| Experience Replay | Buffer of past experiences | Memory of old tasks |
| On-policy Learning | Current interactions | Plasticity |
| Off-policy Learning | V-Trace-corrected replay | Stability |
| Behavioral Cloning | KL & L2 on replay | Higher task retention |
| Buffer Constraints | Reservoir sampling | Uniform history sampling |
| Task-Agnostic Handling | No task indicators | Generalization |
7. Applicability, Limitations, and Deployment
CLEAR is readily deployable in RL settings where task structure is implicit or absent, as in many real-world environments. Its reliance on replay buffers and behavioral cloning introduces modest computational overhead compared to standard RL architectures, but empirical evidence suggests favorable scaling, including under strict storage constraints.
A potential limitation is the risk of overfitting to the small replay buffer under severe memory constraints, highlighting the importance of diversity in buffer content. Behavioral cloning mitigates some of this effect by regularizing policy and value outputs, but buffer sampling strategies and overrepresentation of older data remain practical considerations.
Conclusion
CLEAR constitutes a practical, general continual reinforcement learning method that offers robust resistance to catastrophic forgetting through experience replay, with or without explicit task structure. It combines on-policy and off-policy actor-critic updates with behavioral cloning to maintain both plasticity and stability, and, through reservoir sampling, remains effective even in constrained-memory settings. As such, CLEAR represents a foundational baseline and a flexible deployment candidate for continual RL in dynamic, real-world scenarios.