Map-based Experience Replay (GWR-R)
- The paper introduces a GWR-based adaptive replay memory that compacts redundant experiences while preserving reinforcement learning performance.
- It leverages a self-organizing network to form an adaptive map of state prototypes and temporal transitions, optimizing both memory and sampling efficiency.
- Experimental results on MuJoCo tasks show up to 80% memory reduction with performance loss under 10%, highlighting its practicality in resource-limited settings.
Map-based Experience Replay (GWR-R) is a cognitively inspired replay memory for deep reinforcement learning (RL) that counters catastrophic forgetting by compactly storing transitions in a structured, adaptive map. It leverages the Grow-When-Required (GWR) self-organizing network to merge redundant experiences and maintain a concise, environment-model-like graph memory, significantly reducing memory requirements while retaining most of the original performance (Hafez et al., 2023).
1. Grow-When-Required (GWR) Network for Transition Storage
The GWR backbone forms an adaptive, growing graph of nodes ("prototypes"), each representing clusters of similar states in the environment. Directed temporal edges encode transitions between states, storing averaged actions, rewards, and visit counts.
- Best Matching Unit (BMU) Selection: For a given input state $s$, the BMU $b$ is identified by minimum Euclidean distance: $b = \arg\min_i \|s - w_i\|_2$.
- Node Structure: Each node $i$ contains a prototype vector $w_i$, an activation $a_i$ (a decreasing function of $\|s - w_i\|$, e.g. $a_i = \exp(-\|s - w_i\|)$), and a habituation counter $h_i \in [0, 1]$, with $h_i$ decaying after each BMU selection.
- Node Insertion and Adaptation: Two thresholds regulate map growth: the activation threshold $a_T$ and the habituation threshold $h_T$. If $a_b < a_T$ and $h_b < h_T$, a new node (with prototype $w_{\text{new}} = (w_b + s)/2$) is created and connected to the BMU $b$ and the second-best unit $b_2$.
- Weight and Edge Aging Updates: If adaptation occurs instead of insertion, the BMU and its neighbors are nudged toward $s$ (with learning rates $\epsilon_b$ and $\epsilon_n$, respectively), the BMU–$b_2$ edge is reset to age zero, and the BMU's other edges are aged. Stale edges (age $> \text{age}_{\max}$) and resulting isolated nodes are pruned.
The habituation counter update follows the standard GWR rule $\tau\,\frac{dh_i}{dt} = \kappa\,(h_0 - h_i) - S(t)$, so $h_i$ decays toward a small asymptote under repeated BMU selection, encouraging node specialization over repeated usage.
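The node dynamics above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exponential activation function, the midpoint insertion rule, and all hyperparameter values are assumptions, and neighbor updates plus edge aging are omitted for brevity.

```python
import numpy as np

class GWRMap:
    """Minimal sketch of the GWR node dynamics: BMU selection,
    activation, habituation, and node insertion."""

    def __init__(self, dim, a_T=0.95, h_T=0.3, eps_b=0.1, tau=0.3, kappa=1.05):
        self.a_T, self.h_T = a_T, h_T        # activation / habituation thresholds
        self.eps_b = eps_b                   # BMU learning rate
        self.tau, self.kappa = tau, kappa    # habituation dynamics
        self.W = np.empty((0, dim))          # prototype vectors w_i
        self.h = np.empty(0)                 # habituation counters h_i in [0, 1]

    def bmu(self, s):
        """Index of the best matching unit (minimum Euclidean distance)."""
        return int(np.argmin(np.linalg.norm(self.W - s, axis=1)))

    def add_state(self, s):
        """Insert a new node or adapt the BMU; returns the winning node index."""
        if len(self.W) < 2:                  # bootstrap the map with first states
            self.W = np.vstack([self.W, s])
            self.h = np.append(self.h, 1.0)
            return len(self.W) - 1
        b = self.bmu(s)
        a = np.exp(-np.linalg.norm(s - self.W[b]))   # activation of the BMU
        if a < self.a_T and self.h[b] < self.h_T:
            # Poor match by an already-habituated node: grow the map
            self.W = np.vstack([self.W, (self.W[b] + s) / 2.0])
            self.h = np.append(self.h, 1.0)
            return len(self.W) - 1
        # Otherwise adapt: nudge the BMU toward s and habituate it
        self.W[b] += self.eps_b * self.h[b] * (s - self.W[b])
        self.h[b] += self.tau * (self.kappa * (1.0 - self.h[b]) - 1.0)
        return b
```

With these defaults, repeated presentations of a state habituate the winning node ($h_i$ decays toward a small asymptote), after which a sufficiently distant input triggers insertion rather than adaptation.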
2. Temporal Edges and Transition Encoding
Beyond the graph structure, GWR-R encodes temporal transitions via adjacency matrices of size $N \times N$ (where $N$ is the current node count):
- Transition count: $C_{ij}$ records how often the transition from node $i$ to node $j$ has been observed.
- Action: $A_{ij}$ stores exponentially averaged actions for transitions from $i$ to $j$ via $A_{ij} \leftarrow \beta\,a + (1-\beta)\,A_{ij}$, where $a$ is the observed action and $\beta \in (0, 1]$ is the averaging weight.
- Reward: $R_{ij}$ tracks exponentially averaged rewards in the same way.
- Done flag: $D_{ij}$ integrates episode-termination information ($d \in \{0, 1\}$), again via exponential averaging.
This abstraction maintains a high-density summary of the transition dynamics while avoiding redundancy.
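A minimal sketch of these matrices follows. The first-observation initialization (write the raw value before averaging) is an assumption for the sketch; the source only specifies the exponential-averaging rule itself.

```python
import numpy as np

# Transition-encoding matrices for a map with N nodes; beta is the
# (assumed) exponential-averaging weight.
N, beta = 4, 0.7
C = np.zeros((N, N))        # transition counts C_ij
A = np.zeros((N, N, 1))     # averaged actions (1-D action space here)
R = np.zeros((N, N))        # averaged rewards R_ij
D = np.zeros((N, N))        # averaged done flags D_ij

def record_transition(i, j, action, reward, done):
    """Fold one observed transition i -> j into the running averages."""
    first = C[i, j] == 0
    C[i, j] += 1
    if first:               # initialize on the first observation (assumed)
        A[i, j], R[i, j], D[i, j] = action, reward, float(done)
    else:                   # exponential moving average afterwards
        A[i, j] = beta * action + (1 - beta) * A[i, j]
        R[i, j] = beta * reward + (1 - beta) * R[i, j]
        D[i, j] = beta * float(done) + (1 - beta) * D[i, j]

record_transition(0, 1, np.array([0.5]), 1.0, False)
record_transition(0, 1, np.array([1.5]), 0.0, False)
# count 2; action ≈ [1.2]; reward ≈ 0.3
```

Because each edge stores a single averaged payload, memory stays bounded by the number of node pairs rather than the number of raw transitions.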
3. GWR-R Algorithmic Implementation
The approach substitutes the conventional replay buffer in off-policy RL algorithms (e.g., DDPG):
- Experience Storage: Upon each environment interaction, `GWR_R_add` either inserts a new node/edge or updates existing structures.
- Batch Sampling: During training, batches are generated by uniformly picking a node $i$, selecting a successor $j$ with probability proportional to the transition count $C_{ij}$, and returning the averaged transition $(w_i, A_{ij}, R_{ij}, w_j, D_{ij})$.
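The sampling rule can be sketched as below; the function name and the restriction to nodes that have at least one outgoing transition are assumptions made for the sketch.

```python
import numpy as np

def gwr_r_sample(W, C, A, R, D, batch_size, rng):
    """Sample averaged transitions from the map: pick a node uniformly
    among nodes with successors, pick a successor j with probability
    C[i, j] / sum_k C[i, k], and emit the edge's averaged payload."""
    valid = np.flatnonzero(C.sum(axis=1) > 0)   # nodes with outgoing edges
    batch = []
    for _ in range(batch_size):
        i = int(rng.choice(valid))
        p = C[i] / C[i].sum()                   # successor distribution
        j = int(rng.choice(len(C), p=p))
        batch.append((W[i], A[i, j], R[i, j], W[j], D[i, j]))
    return batch
```

Each returned tuple plays the role of an ordinary $(s, a, r, s', d)$ transition, except that the states are node prototypes and the actions/rewards are edge averages.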
Summary pseudocode (extract):

```
For each episode:
  1. Select action (random during warmup, otherwise learned policy)
  2. Take step, observe (s_next, r, d)
  3. GWR_R_add(s_next, a, r, d)
  4. If sufficient data, sample batch via GWR_R_sample and train
  5. On episode end, store initial state as new node if needed
```
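A runnable version of this skeleton is sketched below, with the GWR-R routines stubbed by a plain list and a toy one-dimensional environment standing in for MuJoCo; all names, constants, and dynamics here are illustrative, not the paper's.

```python
import random

random.seed(0)
memory = []                                   # stand-in for the GWR-R map

def gwr_r_add(s_next, a, r, d):
    memory.append((s_next, a, r, d))          # real GWR-R merges into the graph

def gwr_r_sample(k):
    return random.sample(memory, k)           # real GWR-R samples graph edges

WARMUP, BATCH = 20, 8
for episode in range(3):
    s, done, t = 0.0, False, 0
    while not done and t < 30:
        # Random exploration during warmup, then a fixed "learned" policy
        a = random.choice([-1.0, 1.0]) if len(memory) < WARMUP else 1.0
        s_next = s + a                        # toy 1-D dynamics
        done = abs(s_next) >= 5.0             # terminate at the boundary
        r = 1.0 if done else 0.0
        gwr_r_add(s_next, a, r, done)
        if len(memory) >= BATCH:
            batch = gwr_r_sample(BATCH)       # the agent would train on this
        s, t = s_next, t + 1
```

Swapping the list stub for a GWR-R map changes only the two helper functions; the loop itself is the standard off-policy template.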
4. Hyperparameters and Memory-Performance Trade-offs
Critical hyperparameters include:
| Parameter | Typical Range | Effect |
|---|---|---|
| $a_T$ (activation) | 0.9 – 0.98 | Higher $a_T$ yields little merging (baseline-like memory); lower $a_T$ increases merging and memory savings |
| $h_T$ (habituation) | — | Strong influence on transition integrity; generally kept fixed unless carefully tuned |
| $\epsilon_b$ / $\epsilon_n$ | 0.1 / 0.005 | Learning rates for the BMU and its neighbors; higher values accelerate adaptation but increase prototype drift |
| $\text{age}_{\max}$ | 10 | Edge staleness threshold |
| $\beta$ (actions / done flags) | 0.1–0.9 (typ. 0.7) | Exponential-averaging weight for edge actions and done flags |
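These settings can be bundled into a single configuration. The numeric values below follow the table's typical settings; the habituation threshold value is an assumed placeholder, since the table does not fix one.

```python
# Illustrative GWR-R hyperparameter defaults; h_T is an assumed placeholder
# because no numeric range is given for it in the table above.
GWR_R_DEFAULTS = {
    "a_T": 0.95,     # activation threshold: lower -> more merging
    "h_T": 0.3,      # habituation threshold (assumed value)
    "eps_b": 0.1,    # BMU learning rate
    "eps_n": 0.005,  # neighbor learning rate
    "age_max": 10,   # prune edges older than this
    "beta": 0.7,     # exponential-averaging weight (actions / done flags)
}
```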
Empirical evidence on MuJoCo tasks (e.g., InvertedPendulum, Reacher, HalfCheetah) demonstrates that reducing $a_T$ (i.e., increasing merging) achieves substantial memory compression: retaining 30–75% of baseline memory typically incurs a performance loss below 10%. More aggressive compression degrades performance rapidly.
| Method | Memory Retained | Avg. Return | Relative Perf. |
|---|---|---|---|
| Uniform ER (baseline) | 100% | 1000±20 | 1.00 |
| GWR-R (high $a_T$, mild merging) | 75% | 970±50 | 0.97 |
| GWR-R (lower $a_T$, moderate merging) | 30% | 820±120 | 0.82 |
| GWR-R (low $a_T$, aggressive merging) | 15% | 300±200 | 0.30 |
Reacher achieves up to 80% memory reduction without performance loss; HalfCheetah maintains baseline return at roughly 60% memory (Hafez et al., 2023).
5. Integration with Off-policy RL Algorithms
GWR-R is compatible with off-policy algorithms such as DQN, DDPG, SAC, and TD3 by substituting the replay buffer interface:
- Experience Collection: The standard buffer insert is replaced by `GWR_R_add(s_next, a, r, d)`.
- Experience Sampling: Mini-batch sampling is replaced by `GWR_R_sample(batch_size)`, which returns batches in the standard $(s, a, r, s', d)$ format.
No additional modifications to the learner architecture are necessary. Prioritized Experience Replay (PER) integration is possible by reweighting nodes by visit counts or TD-error statistics.
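A drop-in adapter illustrating the interface substitution is sketched below. The method names on the wrapped map (`add`, `sample`) are assumptions about a GWR-R implementation, not an API from the paper.

```python
import numpy as np

class GWRReplayAdapter:
    """Replay-buffer-shaped wrapper around a GWR-R map, so an off-policy
    learner (DQN/DDPG/SAC/TD3 style) can use it unchanged. The wrapped
    map is assumed to expose add(s_next, a, r, d) and
    sample(batch_size) -> list of (s, a, r, s_next, d) tuples."""

    def __init__(self, gwr_map):
        self.map = gwr_map

    def add(self, s, a, r, s_next, done):
        # GWR-R tracks the source node internally, so only the successor
        # state and the transition payload need to be passed through.
        self.map.add(s_next, a, r, done)

    def sample(self, batch_size):
        # Re-stack the list of tuples into batched arrays, matching the
        # (s, a, r, s', d) layout standard learners expect.
        s, a, r, s2, d = map(np.array, zip(*self.map.sample(batch_size)))
        return s, a, r, s2, d
```

Because the adapter preserves the usual `add`/`sample` signatures, swapping it for a uniform buffer requires no change to the learner's update code.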
6. Empirical Findings and Computational Analysis
All experiments utilized DDPG on MuJoCo control tasks with 100K environment steps across 10 seeds. The key findings include:
- Memory Efficiency: 40–80% replay memory reduction with less than 10% performance loss in standard regimes.
- Computation: GWR-R incurs a 2–10× computational overhead compared to raw buffers, principally due to nearest-neighbor operations. This is considered acceptable in memory- or hardware-constrained settings.
The principal performance bottleneck is prototype drift: as each node's region grows, the averaged action on an edge may no longer exactly reproduce a real environment transition. The activation threshold $a_T$ offers control over this effect by limiting region size (Hafez et al., 2023).
7. Discussion, Limitations, and Extensions
Activation-based compression preserves transition quality better than habituation-driven node insertions. The primary limitation is drift caused by excessive merging, which must be balanced against memory constraints. Potential extensions include:
- Vision-based RL: Learning latent GWR maps over image representations.
- Continual and Multi-task RL: Leveraging long-range temporal edges for multi-task replay.
- Multimodal Maps: Incorporating different sensory modalities such as sound and vision.
A plausible implication is that map-based replay schemes like GWR-R could be particularly beneficial in embedded or robotic contexts where memory is limited and storage of all raw experiences is infeasible (Hafez et al., 2023).