Contrastive Experience Replay

Updated 7 November 2025
  • Contrastive Experience Replay is a technique that selects transitions based on their causal impact to improve credit assignment and sample efficiency.
  • It combines key transitions and contrastive samples in RL with generative synthetic replay in continual learning to overcome memory and scalability challenges.
  • Empirical evaluations show faster convergence and reduced catastrophic forgetting compared to conventional uniform or error-prioritized replay methods.

Contrastive Experience Replay (CER) denotes a family of sample selection and rehearsal mechanisms in reinforcement learning and continual learning that prioritize transitions or synthetic samples based on their contrastive or causally informative properties. CER algorithms emphasize counterfactual or high-impact transitions in the experience replay buffer, thereby addressing issues in credit assignment, catastrophic forgetting, and memory resource constraints. Emerging variants leverage both real and synthetic contrastive samples, differentiating CER from conventional experience replay which samples uniformly or based on prediction error.

1. Motivation and Contrasts with Standard Experience Replay

Standard experience replay (ER), as introduced in the early 1990s, stores experienced transitions in a buffer to enable decorrelated and more stable updates under off-policy RL and continual learning. Classic ER treats all stored transitions uniformly or, in prioritized variants, selects based on prediction error magnitude. Memory demands scale with dataset size, constraining utility in streaming or lifelong learning contexts.

Contrastive Experience Replay advances this paradigm by focusing replay on transitions that are hypothesized, via task-specific measures, to be causally influential for the agent’s reward. CER thereby addresses two main challenges: (i) improving sample efficiency by targeting informative or causally significant samples and (ii) circumventing the need to store large volumes of data by leveraging generative models or targeted selection (Mocanu et al., 2016, Khadilkar et al., 2022).

2. Methodologies of Contrastive Experience Replay

Two principal mechanisms have been formulated under the CER umbrella:

2.1. Contrastive Sample-Based Replay (Reinforcement Learning Focus)

In CER for reinforcement learning (Khadilkar et al., 2022), the method involves:

  • Key Transition Identification: After each episode, transitions exhibiting the greatest state deviations and terminal rewards in the top or bottom percentile are identified.
  • Contrastive Sample Construction: For each key transition (s_t, a_t, \tau_t), a contrastive sample from the same state s_t but with an alternate action a' \neq a_t is sourced from the ordinary buffer.
  • Dual-Buffer Sampling: Training batches are constructed by sampling a proportion (up to 25% as exploration decays) from the CER buffer, with the remainder from the standard buffer.

Mathematical criteria for key transition insertion involve thresholds on state deviation magnitude and reward percentiles. The operational distinction from prioritized ER is sample selection by causal potential rather than prediction error.
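For concreteness, a minimal NumPy sketch of this insertion criterion under stated assumptions: `delta_q` and `psi` stand in for the \delta-quantile and \psi-percentile thresholds, `returns` holds the terminal reward credited to each step, and the function name and interface are illustrative rather than taken from the reference implementation.

```python
import numpy as np

def select_key_transitions(states, actions, returns, delta_q=0.9, psi=0.1):
    """Flag transitions whose state change is large and whose credited return
    falls in the top or bottom psi tail (illustrative sketch)."""
    states = np.asarray(states, dtype=float)    # shape (T+1, state_dim)
    returns = np.asarray(returns, dtype=float)  # shape (T,)

    # State deviation ||s_{t+1} - s_t||_2 for each step t
    deltas = np.linalg.norm(states[1:] - states[:-1], axis=1)
    big_change = deltas >= np.quantile(deltas, delta_q)

    # Returns in the top or bottom psi percentile
    lo, hi = np.quantile(returns, [psi, 1.0 - psi])
    extreme_return = (returns <= lo) | (returns >= hi)

    keep = np.where(big_change & extreme_return)[0]
    return [(states[t], actions[t], returns[t]) for t in keep]
```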

2.2. Generative Replay for Continual Learning

In continual learning, OCD_GR (Online Contrastive Divergence with Generative Replay) (Mocanu et al., 2016) constructs replay batches via:

  • Synthetic Replay Generation: At each learning step, a Restricted Boltzmann Machine (RBM) trained online generates samples from its current estimated data distribution using Gibbs sampling; real past samples are not stored.
  • Augmented Batch: These synthetic samples are concatenated with new incoming data, forming the input to contrastive divergence updates.
  • Buffer-Free Training: The system retains only the model parameters and the random state used for sampling; memory requirements scale with model size rather than data volume.

This generative replay regime facilitates retention of past knowledge without explicit storage, distinguishing it from buffer-based ER.
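A compact NumPy sketch of the synthetic-sample generation step, assuming a binary RBM with weight matrix `W` (visible × hidden) and biases `b_v`, `b_h`; the chain length `k` and the function name are illustrative choices, not values fixed by the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_synthetic_replay(W, b_v, b_h, n_samples, k=25, rng=None):
    """Draw synthetic replay samples from the RBM's current model
    distribution via block Gibbs sampling (no stored past data)."""
    rng = np.random.default_rng() if rng is None else rng
    n_visible = W.shape[0]
    v = rng.integers(0, 2, size=(n_samples, n_visible)).astype(float)  # random start
    for _ in range(k):
        p_h = sigmoid(v @ W + b_h)                  # P(h = 1 | v)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(h @ W.T + b_v)                # P(v = 1 | h)
        v = (rng.random(p_v.shape) < p_v).astype(float)
    return v                                        # synthetic batch S_t
```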

3. Algorithmic Details and Pseudocode Structure

3.1. CER in RL

  • For episode e:
    • Collect transitions (s_t, a_t, r_t).
    • For each step t:
      • Compute the state deviation \Delta(s_t, s_{t+1}) = \|s_{t+1} - s_t\|_2.
      • If \Delta(s_t, s_{t+1}) exceeds the \delta-quantile and \tau_t lies in the top or bottom \psi percentile, insert (s_t, a_t, \tau_t) and the contrastive sample (s_t, a', \tau') into the CER buffer.
  • During training:
    • Draw each batch with a fraction from the CER buffer and the remainder from the standard buffer (see the sketch below).
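A minimal sketch of the dual-buffer batch construction, assuming both buffers expose a `sample(n)` method and support `len()`; the linear exploration-decay schedule for the CER share is an assumption, with only the 25% ceiling taken from the description above.

```python
import random

def sample_training_batch(cer_buffer, std_buffer, batch_size, epsilon, max_frac=0.25):
    """Mix contrastive and ordinary transitions: the CER share grows
    toward max_frac as exploration (epsilon) decays."""
    cer_frac = max_frac * (1.0 - epsilon)                # assumed decay schedule
    n_cer = min(int(round(cer_frac * batch_size)), len(cer_buffer))
    batch = list(cer_buffer.sample(n_cer)) + list(std_buffer.sample(batch_size - n_cer))
    random.shuffle(batch)
    return batch
```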

3.2. OCD_GR for Continual Unsupervised/Supervised Learning

  • For each incoming minibatch B_t:
    • Generate synthetic samples S_t from the RBM via Gibbs sampling.
    • Form the training batch B_t \cup S_t.
    • Apply the contrastive divergence update:

      \Delta w = \eta \left( \langle v h \rangle_{B_t \cup S_t} - \langle v h \rangle_{\text{model}} \right)
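A sketch of this update as a single CD-1 step on the augmented batch, with `eta` playing the role of \eta; the one-step reconstruction and the use of hidden probabilities in the statistics are standard CD choices assumed here, not details taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b_v, b_h, batch, eta=0.01, rng=None):
    """One CD-1 step on the augmented batch (new data plus synthetic replay),
    following dW = eta * (<vh>_data - <vh>_model)."""
    rng = np.random.default_rng() if rng is None else rng
    v0 = np.asarray(batch, dtype=float)        # rows of B_t union S_t
    p_h0 = sigmoid(v0 @ W + b_h)               # data-driven hidden probabilities
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    p_v1 = sigmoid(h0 @ W.T + b_v)             # one-step reconstruction
    p_h1 = sigmoid(p_v1 @ W + b_h)

    n = v0.shape[0]
    W = W + eta * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
    b_v = b_v + eta * (v0 - p_v1).mean(axis=0)
    b_h = b_h + eta * (p_h0 - p_h1).mean(axis=0)
    return W, b_v, b_h
```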

4. Empirical Evaluation and Results

4.1. RL Task Evaluation

In 2D grid navigation tasks (up to 12x12 grids), CER consistently outperforms vanilla DQN and prioritized experience replay (PER), exhibiting:

  • Faster and higher reward convergence.
  • Superior performance in sparse/reward-complex environments.
  • Enhanced separation of Q-values at critical decision points.
  • Further improvement from the integration of contrastive samples over CER-nocontrast (key transitions only).

Network architectures in the experiments use 3 input features, hidden layers of 32 and 8 units, and ReLU activations. Monte-Carlo backups are applied, but the method generalizes to TD learning and actor-critic algorithms.
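As an illustration, a minimal PyTorch sketch of a Q-network with the reported shape (3 input features, hidden layers of 32 and 8 units, ReLU); PyTorch itself and the `n_actions` output head are assumptions, not details given in the source.

```python
import torch.nn as nn

def build_q_network(n_actions=4):
    """Small MLP matching the reported shape: 3 inputs, hidden layers
    of 32 and 8 units, ReLU activations (output head assumed)."""
    return nn.Sequential(
        nn.Linear(3, 32), nn.ReLU(),
        nn.Linear(32, 8), nn.ReLU(),
        nn.Linear(8, n_actions),   # one Q-value per action
    )
```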

4.2. Continual Learning Evaluation

OCD_GR matches or exceeds OCD_ER (online contrastive divergence with buffer-based experience replay) on nine real-world datasets (including Gaussian mixture data, binary sequences, binarized MNIST, Fashion-MNIST, and permuted MNIST):

  • In 64.28% of cases, OCD_GR outperforms ER; otherwise, performance is nearly equal.
  • Negative log-likelihood (NLL) and classification accuracy largely match buffer-based ER.
  • Memory footprint is substantially reduced, scaling only with RBM parameter size.
  • Catastrophic forgetting is mitigated by synthetic replay.

5. Comparative Summary

Aspect | CER (RL) / OCD_GR (CL) | Conventional ER / OCD_ER
--- | --- | ---
Contrastive selection | Key transitions plus counterfactual samples | Uniform or prioritized by TD error
Replay sample type | Real transitions (plus synthetic samples in CL) | Real transitions
Memory scaling | Model size (CL) or a small focused buffer (RL) | Buffer size (proportional to data volume)
Performance on tasks | Superior sample efficiency and retention | Plateaus or is inferior in sparse tasks
Retention of past knowledge | Yes (causally informative or synthetic replay) | Yes (direct sample buffering)

6. Implications and Broader Impact

Contrastive Experience Replay, in both RL and continual learning contexts, demonstrates that focused sampling (by causal informativeness or generative replay) improves sample efficiency, learning speed, and long-term retention relative to traditional replay methods. Implications include:

  • Feasibility of replay in privacy- or memory-constrained domains via generative models (Mocanu et al., 2016).
  • Applicability of contrastive replay to a wide spectrum of off-policy RL algorithms without architectural changes (Khadilkar et al., 2022).
  • Potential for improved credit assignment in temporally extended, sparse-reward environments.
  • Scalability to lifelong learning regimes and generalizability to various model architectures.

A plausible implication is that future research may extend CER to high-dimensional, continuous action spaces and combine contrastive replay ideas with advanced generative models for hybrid memory systems.

7. Practical Considerations and Limitations

CER and OCD_GR implementations require only modifications to replay buffer management, not policy or network architecture. Hyperparameters include state deviation thresholds, reward percentiles, and CER buffer sampling fractions. No relabeling or explicit causal modeling is necessary; all mechanisms operate on observed trajectories and original reward functions.
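A sketch of how these hyperparameters might be gathered into a configuration; the concrete values below are illustrative placeholders, not settings prescribed by either paper.

```python
# Illustrative hyperparameters (placeholder values, not taken from the papers)
cer_config = {
    "state_deviation_quantile": 0.90,  # delta: threshold on ||s_{t+1} - s_t||_2
    "reward_percentile": 0.10,         # psi: top/bottom tail of terminal reward
    "cer_batch_fraction_max": 0.25,    # ceiling on the CER-buffer share of each batch
    "cer_buffer_capacity": 10_000,     # assumed capacity of the contrastive buffer
}
```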

Limitations include reliance on effective identification of causally informative transitions in CER, and on the generative capacity and continual training stability of models such as RBMs in OCD_GR. CER performance gains are more pronounced in tasks with sparse rewards and complex credit assignment and may be less evident in dense or trivial domains.


References:

  • "Online Contrastive Divergence with Generative Replay: Experience Replay without Storing Data" (Mocanu et al., 2016)
  • "Using Contrastive Samples for Identifying and Leveraging Possible Causal Relationships in Reinforcement Learning" (Khadilkar et al., 2022)