Contrastive Experience Replay
- Contrastive Experience Replay is a technique that selects transitions based on their causal impact to improve credit assignment and sample efficiency.
- The term covers the replay of key transitions paired with contrastive samples in RL as well as generative synthetic replay in continual learning, both aimed at overcoming memory and scalability challenges.
- Empirical evaluations show faster convergence and reduced catastrophic forgetting compared to conventional uniform or error-prioritized replay methods.
Contrastive Experience Replay (CER) denotes a family of sample selection and rehearsal mechanisms in reinforcement learning and continual learning that prioritize transitions or synthetic samples based on their contrastive or causally informative properties. CER algorithms emphasize counterfactual or high-impact transitions in the experience replay buffer, thereby addressing issues in credit assignment, catastrophic forgetting, and memory resource constraints. Emerging variants leverage both real and synthetic contrastive samples, differentiating CER from conventional experience replay, which samples uniformly or by prediction error.
1. Motivation and Contrasts with Standard Experience Replay
Standard experience replay (ER), as introduced in the early 1990s, stores experienced transitions in a buffer to enable decorrelated and more stable updates under off-policy RL and continual learning. Classic ER treats all stored transitions uniformly or, in prioritized variants, selects based on prediction error magnitude. Memory demands scale with dataset size, constraining utility in streaming or lifelong learning contexts.
Contrastive Experience Replay advances this paradigm by focusing replay on transitions that are hypothesized, via task-specific measures, to be causally influential for the agent’s reward. CER thereby addresses two main challenges: (i) improving sample efficiency by targeting informative or causally significant samples and (ii) circumventing the need to store large volumes of data by leveraging generative models or targeted selection (Mocanu et al., 2016, Khadilkar et al., 2022).
2. Methodologies of Contrastive Experience Replay
Two principal mechanisms have been formulated under the CER umbrella:
2.1. Contrastive Sample-Based Replay (Reinforcement Learning Focus)
In CER for reinforcement learning (Khadilkar et al., 2022), the method involves:
- Key Transition Identification: After each episode, transitions exhibiting the greatest state deviations and terminal rewards in the top or bottom percentile are identified.
- Contrastive Sample Construction: For each key transition, a contrastive sample originating from the same state but taking an alternate action is sourced from the ordinary buffer.
- Dual-Buffer Sampling: Training batches are constructed by sampling a proportion (up to 25% as exploration decays) from the CER buffer, with the remainder from the standard buffer.
Mathematical criteria for key transition insertion involve thresholds on state deviation magnitude and reward percentiles. The operational distinction from prioritized ER is sample selection by causal potential rather than prediction error.
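To make the selection step concrete, the following is a minimal Python sketch, assuming array-valued states, discrete actions, and an externally supplied flag indicating whether the episode's terminal reward fell in an extreme percentile; the function and parameter names (`update_cer_buffer`, `deviation_quantile`) are illustrative, not taken from Khadilkar et al. (2022).

```python
import numpy as np


def update_cer_buffer(episode, standard_buffer, cer_buffer,
                      deviation_quantile=0.9, reward_is_extreme=True):
    """Sketch of CER buffer construction after one episode.

    `episode` is a list of (state, action, reward, next_state) tuples and
    `standard_buffer` a list of previously stored transitions; thresholds
    and the state-matching rule are illustrative assumptions.
    """
    if not reward_is_extreme:
        # Only episodes whose terminal reward lies in the top/bottom
        # percentile contribute key transitions.
        return cer_buffer

    # Key transition identification: largest jumps in state space.
    deviations = np.array([np.linalg.norm(np.asarray(s2) - np.asarray(s1))
                           for s1, _, _, s2 in episode])
    threshold = np.quantile(deviations, deviation_quantile)

    for (s, a, r, s2), dev in zip(episode, deviations):
        if dev < threshold:
            continue
        cer_buffer.append((s, a, r, s2))
        # Contrastive sample: same state, alternate action, sourced from
        # the ordinary buffer if one exists.
        for (sb, ab, rb, sb2) in standard_buffer:
            if np.allclose(sb, s) and ab != a:
                cer_buffer.append((sb, ab, rb, sb2))
                break
    return cer_buffer
```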
2.2. Generative Replay for Continual Learning
In continual learning, OCD_GR (Online Contrastive Divergence with Generative Replay) (Mocanu et al., 2016) constructs replay batches via:
- Synthetic Replay Generation: At each learning step, a Restricted Boltzmann Machine (RBM) trained online generates samples from its current estimated data distribution using Gibbs sampling; real past samples are not stored.
- Augmented Batch: These synthetic samples are concatenated with new incoming data, forming the input to contrastive divergence updates.
- Buffer-Free Training: The system possesses only model parameters and random seed/state for sampling; memory requirements scale with model size rather than data volume.
This generative replay regime facilitates retention of past knowledge without explicit storage, distinguishing it from buffer-based ER.
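As a rough illustration, synthetic replay from a binary RBM can be produced by block Gibbs sampling over the current parameters. The sketch below assumes a binary RBM with weight matrix `W` and biases `b_vis`, `b_hid`; the chain initialization and step count are arbitrary choices rather than the settings of Mocanu et al. (2016).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def generate_replay(W, b_vis, b_hid, n_samples, n_gibbs=25, rng=None):
    """Generate synthetic replay samples from a binary RBM via block Gibbs
    sampling. W has shape (n_vis, n_hid); names are illustrative."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n_vis = W.shape[0]
    # Start chains from random visible configurations.
    v = (rng.random((n_samples, n_vis)) < 0.5).astype(float)
    for _ in range(n_gibbs):
        p_h = sigmoid(v @ W + b_hid)                     # P(h = 1 | v)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(h @ W.T + b_vis)                   # P(v = 1 | h)
        v = (rng.random(p_v.shape) < p_v).astype(float)
    return v  # samples approximating the RBM's current data distribution
```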
3. Algorithmic Details and Pseudocode Structure
3.1. CER in RL
- For each episode:
  - Collect the transitions $(s_t, a_t, r_t, s_{t+1})$, $t = 1, \dots, T$.
  - For each transition $t$:
    - Compute the state deviation $\Delta_t = \lVert s_{t+1} - s_t \rVert$.
    - If $\Delta_t$ exceeds the chosen quantile threshold and the episode's terminal reward falls in the top or bottom percentile, insert $(s_t, a_t, r_t, s_{t+1})$ together with a contrastive sample (same state, alternate action, drawn from the standard buffer) into the CER buffer.
- During training:
  - Draw each batch with a fraction (up to 25%) from the CER buffer and the remainder from the standard buffer.
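A minimal sketch of the dual-buffer batch construction, assuming both buffers are plain Python lists and that the standard buffer holds at least a full batch; names and the final shuffle are illustrative choices.

```python
import random


def sample_mixed_batch(standard_buffer, cer_buffer, batch_size,
                       cer_fraction=0.25):
    """Draw a capped fraction of the batch from the CER buffer and the
    remainder from the standard buffer (assumes the standard buffer
    contains at least `batch_size` transitions)."""
    n_cer = min(int(cer_fraction * batch_size), len(cer_buffer))
    batch = random.sample(cer_buffer, n_cer)
    batch += random.sample(standard_buffer, batch_size - n_cer)
    random.shuffle(batch)  # avoid ordering effects between the two sources
    return batch
```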
3.2. OCD_GR for Continual Unsupervised/Supervised Learning
- For each incoming minibatch $\mathcal{B}$:
  - Generate synthetic samples $\tilde{\mathcal{B}}$ from the RBM via Gibbs sampling.
  - Form the training batch $\mathcal{B} \cup \tilde{\mathcal{B}}$.
  - Apply the contrastive divergence update

$$\Delta W \propto \langle v h^{\top} \rangle_{\text{data}} - \langle v h^{\top} \rangle_{\text{model}},$$

where $\langle \cdot \rangle_{\text{data}}$ is estimated on the augmented batch and $\langle \cdot \rangle_{\text{model}}$ on its Gibbs-sampled reconstructions.
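Assuming a binary RBM and a single Gibbs step (CD-1), the update on the augmented batch can be sketched as follows; variable names and the learning rate are illustrative, and the batch is expected to concatenate the incoming minibatch with the generated replay samples.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cd1_update(W, b_vis, b_hid, batch, lr=0.05, rng=None):
    """One CD-1 step on an augmented batch of binary visible vectors.

    batch: (n, n_vis) array of new data concatenated with synthetic replay.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    v0 = batch
    p_h0 = sigmoid(v0 @ W + b_hid)                       # positive phase
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    p_v1 = sigmoid(h0 @ W.T + b_vis)                     # one Gibbs step
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b_hid)                       # negative phase

    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / n
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid
```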
4. Empirical Evaluation and Results
4.1. RL Task Evaluation
In 2D grid navigation tasks (up to 12x12 grids), CER consistently outperforms vanilla DQN and prioritized experience replay (PER), exhibiting:
- Faster and higher reward convergence.
- Superior performance in sparse/reward-complex environments.
- Enhanced separation of Q-values at critical decision points.
- The integration of contrastive samples yields further improvements over CER-nocontrast (key transitions only).
Network architectures in experiments include 3 input features, hidden layers (32, 8 units), and ReLU activations. Monte-Carlo backup is applied, but methods generalize to TD learning and actor-critic algorithms.
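For concreteness, the reported layer sizes correspond to a network of the following shape (a PyTorch sketch; the width of the output head, one Q-value per action, is an assumption about the action space).

```python
import torch.nn as nn


def make_q_network(n_actions):
    """Q-network matching the reported architecture: 3 input features,
    hidden layers of 32 and 8 units, ReLU activations."""
    return nn.Sequential(
        nn.Linear(3, 32), nn.ReLU(),
        nn.Linear(32, 8), nn.ReLU(),
        nn.Linear(8, n_actions),  # one Q-value per discrete action (assumed)
    )
```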
4.2. Continual Learning Evaluation
OCD_GR matches or exceeds OCD_ER (online contrastive divergence with buffer-based experience replay) across nine real-world datasets:
- In 64.28% of the evaluated cases, OCD_GR outperforms OCD_ER; in the remainder, performance is nearly equal.
- Negative log-likelihood (NLL) and classification accuracy largely match buffer-based ER.
- Memory footprint is substantially reduced, scaling only with RBM parameter size.
- Catastrophic forgetting is mitigated by synthetic replay.
5. Comparative Summary
| Aspect | CER (RL) / OCD_GR (CL) | Conventional ER / OCD_ER |
|---|---|---|
| Contrastive selection | Key transitions + counterfactuals | Uniform or prioritized by TD error |
| Replay sample type | Real transitions (+ synthetic in CL) | Real transitions |
| Memory scaling | Model parameters only (CL); standard plus small CER buffer (RL) | Buffer size (proportional to data) |
| Performance on tasks | Superior sample efficiency, retention | Plateau or inferior in sparse tasks |
| Retention of past knowledge | Yes (causally informative or synthetic) | Yes (direct sample buffering) |
6. Implications and Broader Impact
Contrastive Experience Replay, in both RL and continual learning contexts, demonstrates that focused sampling (by causal informativeness or generative replay) improves sample efficiency, learning speed, and long-term retention relative to traditional replay methods. Implications include:
- Feasibility of replay in privacy- or memory-constrained domains via generative models (Mocanu et al., 2016).
- Applicability of contrastive replay to a wide spectrum of off-policy RL algorithms without architectural changes (Khadilkar et al., 2022).
- Potential for improved credit assignment in temporally extended, sparse-reward environments.
- Scalability to lifelong learning regimes and generalizability to various model architectures.
A plausible implication is that future research may extend CER to high-dimensional, continuous action spaces and combine contrastive replay ideas with advanced generative models for hybrid memory systems.
7. Practical Considerations and Limitations
CER and OCD_GR implementations require only modifications to replay buffer management, not policy or network architecture. Hyperparameters include state deviation thresholds, reward percentiles, and CER buffer sampling fractions. No relabeling or explicit causal modeling is necessary; all mechanisms operate on observed trajectories and original reward functions.
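A hedged sketch of how these hyperparameters might be grouped in an implementation; the field names and default values are illustrative, not settings reported in either paper.

```python
from dataclasses import dataclass


@dataclass
class CERConfig:
    """Illustrative grouping of the CER hyperparameters discussed above."""
    deviation_quantile: float = 0.9   # state-deviation threshold for key transitions
    reward_percentile: float = 10.0   # top/bottom percentile for terminal rewards
    cer_fraction_max: float = 0.25    # maximum share of a batch drawn from the CER buffer
```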
Limitations include reliance on effective identification of causally informative transitions in CER, and on the generative capacity and continual training stability of models such as RBMs in OCD_GR. CER performance gains are more pronounced in tasks with sparse rewards and complex credit assignment and may be less evident in dense or trivial domains.
References:
- "Online Contrastive Divergence with Generative Replay: Experience Replay without Storing Data" (Mocanu et al., 2016)
- "Using Contrastive Samples for Identifying and Leveraging Possible Causal Relationships in Reinforcement Learning" (Khadilkar et al., 2022)