Combined Experience Replay (CER)
- CER is a reinforcement learning technique that guarantees the inclusion of the most recent transition in every training batch.
- It improves sample efficiency and stability, particularly in non-stationary or sparse-reward environments, by reducing the delay before freshly collected data influences learning.
- Its simple and composable design allows seamless integration with methods like PER, HER, and continual learning strategies to enhance performance.
Combined Experience Replay (CER) is a technique in reinforcement learning and sequential decision-making that ensures each training batch includes the most recent experience collected by the agent. It complements standard experience replay strategies by addressing the delayed incorporation of fresh transitions that arises with large replay buffers. CER has since been extended across domains including continual learning, causal RL, and LLM agents, often serving as a foundational or compositional mechanism for self-improvement, stability, and sample efficiency.
1. Concept and Motivation
Standard experience replay (ER) samples batches for training uniformly or based on importance metrics from a buffer of stored transitions. In large buffers, recent experiences may rarely be selected, which impedes rapid learning from up-to-date environmental or policy dynamics. CER guarantees inclusion of the latest transition in every training mini-batch, ensuring that the most recent information is incorporated immediately. This property is beneficial in environments characterized by non-stationarity, sparse rewards, and domains requiring rapid adaptation (Wan et al., 2018, Yenicesu et al., 13 Jun 2024).
2. Algorithmic Implementation
The canonical CER mechanism operates as follows:
- For every gradient update, sample (N-1) transitions from the replay buffer using the current sampling strategy (uniform, prioritized, etc.).
- Always append the most recent experience—i.e., the transition most recently added to the buffer—to form a complete batch of size N.
- Perform the parameter update on this batch.
Pseudocode representation:
```python
def sample_batch(buffer, batch_size):
    # Sample N-1 transitions via the usual strategy (uniform, prioritized, etc.)
    batch = buffer.sample(batch_size - 1)
    # Always add the latest transition (the CER step)
    batch.append(buffer.latest())
    return batch
```
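To make the pseudocode self-contained, the following is a minimal buffer sketch supplying the `sample()` and `latest()` operations it assumes. The class name, transition format, and uniform sampling are illustrative choices, not taken from the cited papers.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal illustrative buffer; any buffer exposing sample() and latest() works with CER."""

    def __init__(self, capacity):
        # Oldest transitions are evicted automatically once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        # A transition is typically a (state, action, reward, next_state, done) tuple.
        self.storage.append(transition)

    def sample(self, n):
        # Uniform sampling without replacement; a prioritized scheme could be
        # substituted here without changing the CER step in sample_batch().
        return random.sample(self.storage, n)

    def latest(self):
        # The transition most recently added to the buffer.
        return self.storage[-1]
```

With this buffer, `sample_batch(buffer, 64)` returns 63 uniformly drawn transitions plus the newest one; note that the newest transition can occasionally appear twice in a batch, which matches the simple formulation above.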
3. Comparative Analysis with Other Experience Replay Methods
| Technique | Key Principle | Batch Formation |
|---|---|---|
| CER | Always includes latest transition | (N-1) randomly sampled + latest |
| PER | Samples based on TD error priority | N sampled by priority metric |
| HER | Relabels transitions by achieved goals | N sampled (may include relabelled goals) |
- CER is orthogonal to and easily composable with PER and HER. With CER+PER, PER selects the prioritized portion of the batch and CER ensures the latest transition is always included; with CER+HER, the batch includes the latest transition, with optional goal relabelling (a sketch of the CER+PER composition follows this list).
- CER guarantees immediate learning from new experiences, a property not ensured by PER, which may stochastically omit the latest transition even if it is high priority.
- CER is particularly beneficial in settings where stale experiences are problematic due to slow buffer turnover, or when learning signal from new transitions is critical for sample efficiency.
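As a concrete illustration of the CER+PER composition, the sketch below assumes a hypothetical prioritized buffer exposing `sample_by_priority()` and `latest()`; the method names are placeholders, not an API from the cited works.

```python
def cer_per_batch(per_buffer, batch_size):
    # N-1 transitions chosen by TD-error priority (the PER step)...
    batch = per_buffer.sample_by_priority(batch_size - 1)
    # ...plus the most recent transition, forced in by CER.
    batch.append(per_buffer.latest())
    return batch
```

A full PER implementation would also return the tree indices and importance-sampling weights of the sampled transitions, and would assign an index and weight to the appended latest transition so that its priority can be updated after the gradient step.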
4. Extensions and Hybridizations
CER has inspired or been integrated with enhanced replay and continual learning methods in several research directions:
- Contextual Experience Replay (LLM agents): The CER principle underlies "Contextual Experience Replay," which enables language agents to accumulate and synthesize prior experiences (environment dynamics, decision-making skills) in a dynamic memory buffer and retrieve relevant knowledge in context for new tasks. Unlike classic RL, no parameter updates occur; instead, the agent's prompt/context is augmented with natural-language summaries of past experiences during inference, yielding substantial improvements on the WebArena and VisualWebArena benchmarks (success rate up to 36.7%, +51% over baseline) (Liu et al., 7 Jun 2025). The Contextual Experience Replay buffer stores generalized high-level skills and environment summaries, supporting online, offline, and hybrid learning paradigms.
- Continual Learning and Drift Adaptation: CER-inspired mechanisms form the basis for sophisticated continual learning algorithms, such as centroid-driven memory with reactive subspace buffers. These approaches unify memory retention and adaptation to concept drift by organizing experiences into clusters (centroids) and enabling cluster updating, label switching, and splitting based on drift in incoming data. Clusters are sampled for replay only when sufficiently "pure," enforcing both retention and forgetting of outdated concepts (Korycki et al., 2021).
- Combined and Corrected Replay: Corrected Uniform Experience Replay (CUER) refines CER and uniform replay methods by stochastically adjusting transition priorities to promote fairness, ensure recent transitions are sampled early, and reduce off-policy update frequency. CUER may be combined with CER (CER+CUER), resulting in even faster convergence and lower variance in deep RL benchmarks (Yenicesu et al., 13 Jun 2024).
- Contrastive Replay for Causal RL: CER is extended to "Contrastive Experience Replay," which identifies transitions with large state jumps correlated with extreme rewards and augments the replay buffer with contrastive samples (similar states, different actions), improving credit assignment and learning in delayed-reward settings (Khadilkar et al., 2022); an illustrative sketch of this filtering step follows this list.
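For the contrastive extension, a minimal filter in the spirit of the description above might flag transitions whose state change and reward magnitude are both large; the thresholds and the Euclidean jump metric are assumptions for illustration, not details from Khadilkar et al. (2022).

```python
import numpy as np

def contrastive_candidates(transitions, jump_threshold, reward_threshold):
    # Flag transitions with a large state jump and an extreme reward; these are
    # the candidates around which contrastive samples (similar states, different
    # actions) would be collected and added to the replay buffer.
    flagged = []
    for state, action, reward, next_state, done in transitions:
        jump = np.linalg.norm(np.asarray(next_state) - np.asarray(state))
        if jump > jump_threshold and abs(reward) > reward_threshold:
            flagged.append((state, action, reward, next_state, done))
    return flagged
```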
5. Experimental Results
Evidence from diverse domains demonstrates that CER and CER-like approaches provide notable gains in sample efficiency, stability, and final performance, especially in sparse or non-stationary environments.
- Classic RL (DQN/DDPG): In CartPole-v0 and MountainCar-v0, CER reduced the number of episodes to convergence from 33,000 (baseline) to 15,000. In LunarLander-v2, CER was slightly slower to converge (4,500 vs. 3,500 episodes), indicating that its benefit is environment-dependent (Wan et al., 2018).
- LLM Agent Benchmarks: On WebArena, hybrid CER achieved 36.7% success rate (+51% over baseline), with strong stability (93% retention, +41% generalization to cross-template problems), and high token efficiency compared to tree search agents (Liu et al., 7 Jun 2025).
- Continual Learning: In class-incremental and drift-affected tasks, CER-inspired cluster-based buffers outperform standard replay by rapidly adapting to concept drift while mitigating catastrophic forgetting (Korycki et al., 2021).
- Off-policy RL (CUER): CER+CUER outperforms CER alone and other methods in continuous control benchmarks, yielding higher sample efficiency, stable policy updates, and better robustness to buffer bias (Yenicesu et al., 13 Jun 2024).
- Causal RL: Contrastive CER improved both learning speed and Q-value discrimination compared to PER and standard ER in 2D navigation with delayed rewards (Khadilkar et al., 2022).
6. Limitations and Context of Application
CER is not universally optimal. In environments with dense rewards or rapid buffer turnover, forced inclusion of the latest transition may disrupt smooth learning dynamics or provide negligible benefit. In such cases, sophisticated prioritization or goal relabelling (PER, HER) may be more advantageous. Nonetheless, CER remains a compositional mechanism, often forming the basis for hybrid or advanced experience replay approaches where immediate adaptation is required (Wan et al., 2018, Liu et al., 7 Jun 2025).
7. Practical Considerations and Future Directions
- CER's simplicity—requiring only the addition of the latest experience to each batch—yields broad applicability without added complexity or tuning parameters.
- CER is synergistic with prioritized, contrastive, and dynamic replay strategies, and can be integrated with buffer management schemes (e.g., CUER) for further stability and fairness.
- Extensions to semantically rich replay (natural language skills, centroid clustering, causality detection) suggest promising directions for continual adaptation in non-stationary, data-rich, or LLM-agent environments.
- The fundamental principle of guaranteeing immediate incorporation of up-to-date experiences underlies ongoing research into scalable, adaptive experience replay frameworks for deep and sequential RL in diverse application domains.
References
- "Advances in Experience Replay" (Wan et al., 2018)
- "Contextual Experience Replay for Self-Improvement of Language Agents" (Liu et al., 7 Jun 2025)
- "CUER: Corrected Uniform Experience Replay for Off-Policy Continuous Deep Reinforcement Learning Algorithms" (Yenicesu et al., 13 Jun 2024)
- "Off-Policy Actor-Critic with Shared Experience Replay" (Schmitt et al., 2019)
- "Class-Incremental Experience Replay for Continual Learning under Concept Drift" (Korycki et al., 2021)
- "Using Contrastive Samples for Identifying and Leveraging Possible Causal Relationships in Reinforcement Learning" (Khadilkar et al., 2022)