Map-based Experience Replay (GWR-R)
- The paper introduces a GWR-based adaptive replay memory that compacts redundant experiences while preserving reinforcement learning performance.
- It leverages a self-organizing network to form an adaptive map of state prototypes and temporal transitions, optimizing both memory and sampling efficiency.
- Experimental results on MuJoCo tasks show up to 80% memory reduction with performance loss under 10%, highlighting its practicality in resource-limited settings.
Map-based Experience Replay (GWR-R) is a cognitively inspired replay memory for deep reinforcement learning (RL) that counters catastrophic forgetting by compactly storing transitions in a structured, adaptive map. It leverages the Grow-When-Required (GWR) self-organizing network to merge redundant experiences and maintain a concise, environment-model-like graph memory, significantly reducing memory requirements while retaining most of the original performance (Hafez et al., 2023).
1. Grow-When-Required (GWR) Network for Transition Storage
The GWR backbone forms an adaptive, growing graph of nodes ("prototypes"), each representing clusters of similar states in the environment. Directed temporal edges encode transitions between states, storing averaged actions, rewards, and visit counts.
- Best Matching Unit (BMU) Selection: For a given input state $s$, the BMU $b$ is identified by minimum Euclidean distance: $b = \arg\min_i \|s - w_i\|_2$.
- Node Structure: Each node $i$ contains a prototype vector $w_i$, an activation $a_i$ (a decreasing function of $\|s - w_i\|$, e.g. $a_i = \exp(-\|s - w_i\|)$), and a habituation counter $h_i \in [0, 1]$, with $h_i$ decaying after each BMU selection.
- Node Insertion and Adaptation: Two thresholds regulate map growth: the activation threshold $a_T$ and the habituation threshold $h_T$. If $a_b < a_T$ and $h_b < h_T$, a new node (with prototype $w_{\text{new}} = (w_b + s)/2$) is created and connected to the BMU $b$ and the second-best unit $b_2$.
- Weight and Edge Aging Updates: If adaptation occurs instead of insertion, the BMU and its neighbors are nudged toward $s$ (with learning rates $\epsilon_b$ and $\epsilon_n$, respectively), the BMU–$b_2$ edge is reset to age zero, and the BMU's other edges are aged. Stale edges (age $> \text{age}_{\max}$) and resulting isolated nodes are pruned.
The habituation counter update follows the standard GWR rule $\tau\,\frac{dh_i}{dt} = \kappa\,(h_0 - h_i) - S(t)$, so $h_i$ decays toward a small asymptote under repeated BMU selection, encouraging node specialization over repeated usage.
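The node dynamics above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exponential activation function, the midpoint insertion rule, and all hyperparameter values are assumptions, and neighbor updates plus edge aging are omitted for brevity.

```python
import numpy as np

class GWRMap:
    """Minimal sketch of the GWR node dynamics: BMU selection,
    activation, habituation, and node insertion."""

    def __init__(self, dim, a_T=0.95, h_T=0.3, eps_b=0.1, tau=0.3, kappa=1.05):
        self.a_T, self.h_T = a_T, h_T        # activation / habituation thresholds
        self.eps_b = eps_b                   # BMU learning rate
        self.tau, self.kappa = tau, kappa    # habituation dynamics
        self.W = np.empty((0, dim))          # prototype vectors w_i
        self.h = np.empty(0)                 # habituation counters h_i in [0, 1]

    def bmu(self, s):
        """Index of the best matching unit (minimum Euclidean distance)."""
        return int(np.argmin(np.linalg.norm(self.W - s, axis=1)))

    def add_state(self, s):
        """Insert a new node or adapt the BMU; returns the winning node index."""
        if len(self.W) < 2:                  # bootstrap the map with first states
            self.W = np.vstack([self.W, s])
            self.h = np.append(self.h, 1.0)
            return len(self.W) - 1
        b = self.bmu(s)
        a = np.exp(-np.linalg.norm(s - self.W[b]))   # activation of the BMU
        if a < self.a_T and self.h[b] < self.h_T:
            # Poor match by an already-habituated node: grow the map
            self.W = np.vstack([self.W, (self.W[b] + s) / 2.0])
            self.h = np.append(self.h, 1.0)
            return len(self.W) - 1
        # Otherwise adapt: nudge the BMU toward s and habituate it
        self.W[b] += self.eps_b * self.h[b] * (s - self.W[b])
        self.h[b] += self.tau * (self.kappa * (1.0 - self.h[b]) - 1.0)
        return b
```

With these defaults, repeated presentations of a state habituate the winning node ($h_i$ decays toward a small asymptote), after which a sufficiently distant input triggers insertion rather than adaptation.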
2. Temporal Edges and Transition Encoding
Beyond the graph structure, GWR-R encodes temporal transitions via adjacency matrices of size $N \times N$ (where $N$ is the current node count):
- Transition count: $C_{ij}$ records how often the transition from node $i$ to node $j$ has been observed.
- Action: $A_{ij}$ stores exponentially averaged actions for transitions from $i$ to $j$ via $A_{ij} \leftarrow \beta\,a + (1-\beta)\,A_{ij}$, where $a$ is the observed action and $\beta \in (0, 1]$ is the averaging weight.
- Reward: $R_{ij}$ tracks exponentially averaged rewards in the same way.
- Done flag: $D_{ij}$ integrates episode-termination information ($d \in \{0, 1\}$), again via exponential averaging.
This abstraction maintains a high-density summary of the transition dynamics while avoiding redundancy.
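A minimal sketch of these matrices follows. The first-observation initialization (write the raw value before averaging) is an assumption for the sketch; the source only specifies the exponential-averaging rule itself.

```python
import numpy as np

# Transition-encoding matrices for a map with N nodes; beta is the
# (assumed) exponential-averaging weight.
N, beta = 4, 0.7
C = np.zeros((N, N))        # transition counts C_ij
A = np.zeros((N, N, 1))     # averaged actions (1-D action space here)
R = np.zeros((N, N))        # averaged rewards R_ij
D = np.zeros((N, N))        # averaged done flags D_ij

def record_transition(i, j, action, reward, done):
    """Fold one observed transition i -> j into the running averages."""
    first = C[i, j] == 0
    C[i, j] += 1
    if first:               # initialize on the first observation (assumed)
        A[i, j], R[i, j], D[i, j] = action, reward, float(done)
    else:                   # exponential moving average afterwards
        A[i, j] = beta * action + (1 - beta) * A[i, j]
        R[i, j] = beta * reward + (1 - beta) * R[i, j]
        D[i, j] = beta * float(done) + (1 - beta) * D[i, j]

record_transition(0, 1, np.array([0.5]), 1.0, False)
record_transition(0, 1, np.array([1.5]), 0.0, False)
# count 2; action ≈ [1.2]; reward ≈ 0.3
```

Because each edge stores a single averaged payload, memory stays bounded by the number of node pairs rather than the number of raw transitions.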
3. GWR-R Algorithmic Implementation
The approach substitutes the conventional replay buffer in off-policy RL algorithms (e.g., DDPG):
- Experience Storage: Upon each environment interaction, `GWR_R_add` either inserts a new node/edge or updates existing structures.
- Batch Sampling: During training, batches are generated by uniformly picking a node $i$, selecting a successor $j$ with probability proportional to the transition count $C_{ij}$, and returning the averaged transition $(w_i, A_{ij}, R_{ij}, w_j, D_{ij})$.
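The sampling rule can be sketched as below; the function name and the restriction to nodes that have at least one outgoing transition are assumptions made for the sketch.

```python
import numpy as np

def gwr_r_sample(W, C, A, R, D, batch_size, rng):
    """Sample averaged transitions from the map: pick a node uniformly
    among nodes with successors, pick a successor j with probability
    C[i, j] / sum_k C[i, k], and emit the edge's averaged payload."""
    valid = np.flatnonzero(C.sum(axis=1) > 0)   # nodes with outgoing edges
    batch = []
    for _ in range(batch_size):
        i = int(rng.choice(valid))
        p = C[i] / C[i].sum()                   # successor distribution
        j = int(rng.choice(len(C), p=p))
        batch.append((W[i], A[i, j], R[i, j], W[j], D[i, j]))
    return batch
```

Each returned tuple plays the role of an ordinary $(s, a, r, s', d)$ transition, except that the states are node prototypes and the actions/rewards are edge averages.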
Summary pseudocode (extract):

```
For each episode:
  1. Select action (random during warmup, otherwise learned policy)
  2. Take step, observe (s_next, r, d)
  3. GWR_R_add(s_next, a, r, d)
  4. If sufficient data, sample batch via GWR_R_sample and train
  5. On episode end, store initial state as new node if needed
```
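A runnable version of this skeleton is sketched below, with the GWR-R routines stubbed by a plain list and a toy one-dimensional environment standing in for MuJoCo; all names, constants, and dynamics here are illustrative, not the paper's.

```python
import random

random.seed(0)
memory = []                                   # stand-in for the GWR-R map

def gwr_r_add(s_next, a, r, d):
    memory.append((s_next, a, r, d))          # real GWR-R merges into the graph

def gwr_r_sample(k):
    return random.sample(memory, k)           # real GWR-R samples graph edges

WARMUP, BATCH = 20, 8
for episode in range(3):
    s, done, t = 0.0, False, 0
    while not done and t < 30:
        # Random exploration during warmup, then a fixed "learned" policy
        a = random.choice([-1.0, 1.0]) if len(memory) < WARMUP else 1.0
        s_next = s + a                        # toy 1-D dynamics
        done = abs(s_next) >= 5.0             # terminate at the boundary
        r = 1.0 if done else 0.0
        gwr_r_add(s_next, a, r, done)
        if len(memory) >= BATCH:
            batch = gwr_r_sample(BATCH)       # the agent would train on this
        s, t = s_next, t + 1
```

Swapping the list stub for a GWR-R map changes only the two helper functions; the loop itself is the standard off-policy template.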
4. Hyperparameters and Memory-Performance Trade-offs
Critical hyperparameters include:
| Parameter | Typical Range | Effect |
|---|---|---|
| $a_T$ (activation) | 0.9 – 0.98 | Higher $a_T$ yields little merging (baseline-like memory); lower $a_T$ increases merging and memory savings |
| $h_T$ (habituation) | — | Strong influence on transition integrity; generally kept fixed unless carefully tuned |
| $\epsilon_b$ / $\epsilon_n$ | 0.1 / 0.005 | Learning rates for the BMU and its neighbors; higher values accelerate adaptation but increase prototype drift |
| $\text{age}_{\max}$ | 10 | Edge staleness threshold |
| $\beta$ (actions / done flags) | 0.1–0.9 (typ. 0.7) | Exponential-averaging weight for edge actions and done flags |
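These settings can be bundled into a single configuration. The numeric values below follow the table's typical settings; the habituation threshold value is an assumed placeholder, since the table does not fix one.

```python
# Illustrative GWR-R hyperparameter defaults; h_T is an assumed placeholder
# because no numeric range is given for it in the table above.
GWR_R_DEFAULTS = {
    "a_T": 0.95,     # activation threshold: lower -> more merging
    "h_T": 0.3,      # habituation threshold (assumed value)
    "eps_b": 0.1,    # BMU learning rate
    "eps_n": 0.005,  # neighbor learning rate
    "age_max": 10,   # prune edges older than this
    "beta": 0.7,     # exponential-averaging weight (actions / done flags)
}
```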
Empirical evidence on MuJoCo tasks (e.g., InvertedPendulum, Reacher, HalfCheetah) demonstrates that reducing $a_T$ (i.e., increasing merging) achieves substantial memory compression: retaining 30–75% of baseline memory typically incurs a performance loss below 10%. More aggressive compression degrades performance rapidly.
| Method | Memory Retained | Avg. Return | Relative Perf. |
|---|---|---|---|
| Uniform ER (baseline) | 100% | 1000±20 | 1.00 |
| GWR-R (high $a_T$, mild merging) | 75% | 970±50 | 0.97 |
| GWR-R (lower $a_T$, moderate merging) | 30% | 820±120 | 0.82 |
| GWR-R (low $a_T$, aggressive merging) | 15% | 300±200 | 0.30 |
Reacher achieves up to 80% memory reduction without performance loss; HalfCheetah maintains baseline return at roughly 60% memory (Hafez et al., 2023).
5. Integration with Off-policy RL Algorithms
GWR-R is compatible with off-policy algorithms such as DQN, DDPG, SAC, and TD3 by substituting the replay buffer interface:
- Experience Collection: The standard buffer insert is replaced by `GWR_R_add(s_next, a, r, d)`.
- Experience Sampling: Mini-batch sampling is replaced by `GWR_R_sample(batch_size)`, which returns batches in the standard $(s, a, r, s', d)$ format.
No additional modifications to the learner architecture are necessary. Prioritized Experience Replay (PER) integration is possible by reweighting nodes by visit counts or TD-error statistics.
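A drop-in adapter illustrating the interface substitution is sketched below. The method names on the wrapped map (`add`, `sample`) are assumptions about a GWR-R implementation, not an API from the paper.

```python
import numpy as np

class GWRReplayAdapter:
    """Replay-buffer-shaped wrapper around a GWR-R map, so an off-policy
    learner (DQN/DDPG/SAC/TD3 style) can use it unchanged. The wrapped
    map is assumed to expose add(s_next, a, r, d) and
    sample(batch_size) -> list of (s, a, r, s_next, d) tuples."""

    def __init__(self, gwr_map):
        self.map = gwr_map

    def add(self, s, a, r, s_next, done):
        # GWR-R tracks the source node internally, so only the successor
        # state and the transition payload need to be passed through.
        self.map.add(s_next, a, r, done)

    def sample(self, batch_size):
        # Re-stack the list of tuples into batched arrays, matching the
        # (s, a, r, s', d) layout standard learners expect.
        s, a, r, s2, d = map(np.array, zip(*self.map.sample(batch_size)))
        return s, a, r, s2, d
```

Because the adapter preserves the usual `add`/`sample` signatures, swapping it for a uniform buffer requires no change to the learner's update code.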
6. Empirical Findings and Computational Analysis
All experiments utilized DDPG on MuJoCo control tasks with 100K environment steps across 10 seeds. The key findings include:
- Memory Efficiency: 40–80% replay memory reduction with less than 10% performance loss in standard regimes.
- Computation: GWR-R incurs a 2–10× computational overhead compared to raw buffers, principally due to nearest-neighbor operations. This is considered acceptable in memory- or hardware-constrained settings.
The principal performance bottleneck is prototype drift: as each node's region grows, the averaged action on an edge may no longer exactly reproduce a real environment transition. The activation threshold $a_T$ offers control over this effect by limiting region size (Hafez et al., 2023).
7. Discussion, Limitations, and Extensions
Activation-based compression preserves transition quality better than habituation-driven node insertions. The primary limitation is drift caused by excessive merging, which must be balanced against memory constraints. Potential extensions include:
- Vision-based RL: Learning latent GWR maps over image representations.
- Continual and Multi-task RL: Leveraging long-range temporal edges for multi-task replay.
- Multimodal Maps: Incorporating different sensory modalities such as sound and vision.
A plausible implication is that map-based replay schemes like GWR-R could be particularly beneficial in embedded or robotic contexts where memory is limited and storage of all raw experiences is infeasible (Hafez et al., 2023).