Map-based Experience Replay (GWR-R)

Updated 14 February 2026
  • The paper introduces a GWR-based adaptive replay memory that compacts redundant experiences while preserving reinforcement learning performance.
  • It leverages a self-organizing network to form an adaptive map of state prototypes and temporal transitions, optimizing both memory and sampling efficiency.
  • Experimental results on MuJoCo tasks show up to 80% memory reduction with performance loss under 10%, highlighting its practicality in resource-limited settings.

Map-based Experience Replay (GWR-R) is a cognitive-inspired replay memory for Deep Reinforcement Learning (RL) that addresses the problem of catastrophic forgetting by compactly storing transitions in a structured, adaptive map. It leverages the Grow-When-Required (GWR) self-organizing network to merge redundant experiences and maintain a concise, environment-model-like graph memory, significantly reducing memory requirements while retaining most of the original performance (Hafez et al., 2023).

1. Grow-When-Required (GWR) Network for Transition Storage

The GWR backbone forms an adaptive, growing graph of nodes ("prototypes"), each representing clusters of similar states in the environment. Directed temporal edges encode transitions between states, storing averaged actions, rewards, and visit counts.

  • Best Matching Unit (BMU) Selection: For a given input state $x \in \mathbb{R}^n$, the BMU $b$ is identified by minimum Euclidean distance: $d(x, w_i) = \|x - w_i\|_2$.
  • Node Structure: Each node $i$ contains a prototype $w_i \in \mathbb{R}^n$, an activation $a_i = \exp(-d(x, w_i))$, and a habituation counter $h_i \in (0, 1]$, with $h_i$ decaying after each BMU selection.
  • Node Insertion and Adaptation: Two thresholds regulate map growth: the activation threshold $a_T$ and the habituation threshold $h_T$. If $a_b < a_T$ and $h_b < h_T$, a new node $k$ (with $w_k := x$, $h_k := 1$) is created and connected to $b$ and the second-best matching unit $s$.
  • Weight and Edge Aging Updates: If adaptation occurs instead of insertion, the BMU and its neighbors are nudged toward $x$, and neighborhood edges are reset or aged. Stale edges (age $> \mathrm{age}_{\max}$) and isolated nodes are pruned.

The habituation counter update follows $h_i \leftarrow h_i + \tau_i k (1 - h_i) - k$, encouraging node specialization over repeated usage.
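To make the growth rule concrete, here is a minimal NumPy sketch of one GWR step (BMU selection, insertion test, adaptation). The threshold and rate values are illustrative defaults, the habituation decay uses a standard GWR-style formulation (a close variant of the update above), and edge creation/aging is omitted:

```python
import numpy as np

def gwr_step(x, prototypes, habituations,
             a_T=0.95, h_T=0.9, eps_b=0.1, tau=0.3, kappa=1.05):
    """One GWR growth step: BMU selection, then insert or adapt."""
    dists = np.linalg.norm(prototypes - x, axis=1)  # distance to every node
    b, s = np.argsort(dists)[:2]                    # BMU b and second BMU s
    a_b = np.exp(-dists[b])                         # activation of the BMU

    if a_b < a_T and habituations[b] < h_T:
        # Novel state and well-trained BMU: insert node k with w_k := x, h_k := 1.
        prototypes = np.vstack([prototypes, x])
        habituations = np.append(habituations, 1.0)
        return prototypes, habituations, len(prototypes) - 1

    # Otherwise adapt: nudge the BMU toward x and habituate it.
    prototypes[b] += eps_b * habituations[b] * (x - prototypes[b])
    habituations[b] += tau * (kappa * (1.0 - habituations[b]) - 1.0)
    return prototypes, habituations, b
```

In a full implementation the second BMU $s$ would also be used to create or refresh the edge $(b, s)$, and neighbor prototypes would be moved with the smaller rate $\epsilon_n$.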

2. Temporal Edges and Transition Encoding

Beyond graph structure, GWR-R encodes temporal transitions via three adjacency matrices of size $N \times N$ (where $N$ is the current node count), plus a per-node done flag:

  • Transition count: $TC_{i,j}$ records the frequency of observed $i \rightarrow j$ transitions.
  • Action: $TA_{i,j}$ stores exponentially averaged actions for transitions from $i$ to $j$ via $TA_{i,j} \leftarrow \xi a_t + (1 - \xi)\, TA_{i,j}$, where $a_t$ is the observed action and $\xi \approx 0.7$.
  • Reward: $TR_{i,j}$ tracks exponentially averaged rewards similarly.
  • Done Flag: $d_j \leftarrow \varphi d_t + (1 - \varphi) d_j$, integrating episode termination information ($\varphi \approx 0.7$).

This abstraction maintains a high-density summary of the transition dynamics while avoiding redundancy.
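Under the simplifying assumption of scalar actions (so that $TA$ and $TR$ fit in $N \times N$ arrays), the per-transition updates above can be sketched as:

```python
import numpy as np

def record_transition(i, j, a_t, r_t, done_t, TC, TA, TR, d, xi=0.7, phi=0.7):
    """Update transition count, averaged action/reward, and done flag
    for an observed transition from node i to node j."""
    TC[i, j] += 1                                   # transition frequency
    TA[i, j] = xi * a_t + (1 - xi) * TA[i, j]       # exponential action average
    TR[i, j] = xi * r_t + (1 - xi) * TR[i, j]       # exponential reward average
    d[j] = phi * float(done_t) + (1 - phi) * d[j]   # soft episode-end flag
```

Vector-valued actions would simply make $TA$ an $N \times N \times \dim(a)$ array; the update rule is unchanged.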

3. GWR-R Algorithmic Implementation

The approach substitutes the conventional replay buffer in off-policy RL algorithms (e.g., DDPG):

  • Experience Storage: Upon each environment interaction, GWR_R_add($s_{t+1}, a_t, r_t, d_t$) either inserts a new node/edge or updates existing structures.
  • Batch Sampling: During training, sampled batches $(s, a, s', r, d)$ are generated by uniformly picking a node $i$, randomly selecting a successor $j$ with probability $\propto TC_{i,j}$, and returning averaged transitions.

Summary pseudocode (extract):

For each episode:
    1. Select action (random during warmup, otherwise learned policy)
    2. Take step, observe (s_next, r, d)
    3. GWR_R_add(s_next, a, r, d)
    4. If sufficient data, sample batch via GWR_R_sample and train
    5. On episode end, store initial state as new node if needed

Transitions are provided to the RL algorithm in tuple form $(s, a, s', r, d)$, matching conventional replay buffer interfaces (Hafez et al., 2023).
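The batch-sampling step, a uniform node pick followed by a successor drawn with probability proportional to the transition counts, can be sketched as follows; the array shapes and the exclusion of nodes without successors are illustrative assumptions:

```python
import numpy as np

def gwr_r_sample(batch_size, prototypes, TC, TA, TR, d, rng=None):
    """Sample (s, a, s', r, d) tuples from the GWR-R map."""
    rng = np.random.default_rng(rng)
    # Only nodes with at least one outgoing transition can serve as sources.
    valid = np.flatnonzero(TC.sum(axis=1) > 0)
    batch = []
    for _ in range(batch_size):
        i = rng.choice(valid)                 # uniform over source nodes
        p = TC[i] / TC[i].sum()               # successor distribution
        j = rng.choice(len(p), p=p)           # j with probability prop. to TC[i, j]
        batch.append((prototypes[i], TA[i, j], prototypes[j], TR[i, j], d[j]))
    return batch
```

Because each tuple is built from prototypes and edge averages, the returned batch matches the interface of a plain replay buffer while never storing raw transitions.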

4. Hyperparameters and Memory-Performance Trade-offs

Critical hyperparameters include:

| Parameter | Typical Range | Effect |
| --- | --- | --- |
| $a_T$ (activation) | 0.9–0.98 | High $a_T$ yields little merging (baseline-like memory); lower $a_T$ increases merging and memory savings |
| $h_T$ (habituation) | $\approx 1$ | Strong influence on transition integrity; generally fixed unless carefully tuned |
| $\epsilon_b$ / $\epsilon_n$ | 0.1 / 0.005 | Learning rates; higher $\epsilon_n$ accelerates adaptation but increases prototype drift |
| $\mathrm{age}_{\max}$ | 10 | Edge staleness threshold |
| $\xi$, $\varphi$ | 0.1–0.9 (typ. 0.7) | Exponential averaging rates for actions and done flags |

Empirical evidence on MuJoCo tasks (e.g., InvertedPendulum, Reacher, HalfCheetah) demonstrates that reducing $a_T$ (increasing merging) achieves substantial memory compression, with 30–75% of baseline memory incurring performance loss below 10% in typical regimes. More aggressive compression degrades performance rapidly.

| Method | Memory Retained | Avg. Return | Relative Perf. |
| --- | --- | --- | --- |
| Uniform ER (baseline) | 100% | 1000 ± 20 | 1.00 |
| GWR-R ($a_T = 0.96$) | 75% | 970 ± 50 | 0.97 |
| GWR-R ($a_T = 0.92$) | 30% | 820 ± 120 | 0.82 |
| GWR-R ($a_T = 0.88$) | 15% | 300 ± 200 | 0.30 |

Reacher achieves up to 80% memory reduction without performance loss; HalfCheetah maintains baseline return at roughly 60% memory (Hafez et al., 2023).

5. Integration with Off-policy RL Algorithms

GWR-R is compatible with off-policy algorithms such as DQN, DDPG, SAC, and TD3 by substituting the replay buffer interface:

  • Experience Collection: The standard buffer insert is replaced by GWR_R_add.
  • Experience Sampling: Mini-batch sampling is replaced by GWR_R_sample, which returns batches in the $(s, a, s', r, d)$ format.

No additional modifications to learner architectures are necessary. Prioritized Experience Replay (PER) integration is possible by reweighting nodes by $TC_{i,j}$ or TD-error statistics.
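As a sketch of the drop-in substitution, the learner-facing interface stays unchanged; `GWRRMemory` and its method names are hypothetical placeholders, not the paper's reference code:

```python
class ReplayMemory:
    """Interface expected by off-policy learners (DQN, DDPG, SAC, TD3)."""
    def add(self, s, a, r, s_next, done): ...
    def sample(self, batch_size): ...

class GWRRMemory(ReplayMemory):
    """Wraps a GWR-R map behind the standard replay buffer interface."""
    def __init__(self, gwr_map):
        self.map = gwr_map  # the adaptive GWR-R graph described above

    def add(self, s, a, r, s_next, done):
        # Route the standard insert call to the map update (GWR_R_add).
        self.map.add_transition(s, a, r, s_next, done)

    def sample(self, batch_size):
        # Return (s, a, s', r, d) tuples, matching the plain-buffer format.
        return self.map.sample_batch(batch_size)
```

Since the learner only ever calls `add` and `sample`, swapping a uniform buffer for `GWRRMemory` requires no change to the actor, critic, or optimization loop.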

6. Empirical Findings and Computational Analysis

All experiments utilized DDPG on MuJoCo control tasks with 100K environment steps across 10 seeds. The key findings include:

  • Memory Efficiency: 40–80% replay memory reduction with less than 10% performance loss in standard regimes.
  • Computation: GWR-R incurs a 2–10× computational overhead compared to raw buffers, principally due to nearest-neighbor operations. This is considered acceptable in memory- or hardware-constrained settings.

The principal performance bottleneck is prototype drift, where the averaged action on an edge may not exactly reproduce a real environment transition as the partition grows. The activation threshold $a_T$ offers control over this effect by limiting region size (Hafez et al., 2023).

7. Discussion, Limitations, and Extensions

Activation-based compression preserves transition quality better than habituation-driven node insertions. The primary limitation is drift caused by excessive merging, which must be balanced against memory constraints. Potential extensions include:

  • Vision-based RL: Learning latent GWR maps over image representations.
  • Continual and Multi-task RL: Leveraging long-range temporal edges for multi-task replay.
  • Multimodal Maps: Incorporating different sensory modalities such as sound and vision.

A plausible implication is that map-based replay schemes like GWR-R could be particularly beneficial in embedded or robotic contexts where memory is limited and storage of all raw experiences is infeasible (Hafez et al., 2023).
