Memory-Augmented Self-Play
- The paper introduces memory-augmented self-play, where an external memory module guides task generation to improve exploration and convergence in reinforcement learning.
- It compares multiple memory encoding strategies—last episode, averaged, and LSTM—with LSTM-based memory yielding over a fivefold increase in task diversity and faster performance gains.
- Empirical results in Mazebase and Acrobot show that memory augmentation enhances sample efficiency and state-space coverage, leading to superior downstream fine-tuning.
Memory-Augmented Self-Play is an extension of intrinsic motivation-based reinforcement learning wherein an agent leverages a persistent external memory to enhance the diversity and efficacy of unsupervised task generation during self-play pretraining. This enhancement equips the task-generating agent with the ability to recall and condition its current exploratory behavior on a summary of previously proposed tasks, leading to accelerated state-space exploration and superior fine-tuning performance on downstream objectives (Sodhani et al., 2018).
1. Self-Play with Memory: Formulation and Motivation
The underlying paradigm derives from the unsupervised self-play setting introduced by Sukhbaatar et al. (2017), in which two copies of an agent, denoted Alice and Bob, take turns. Alice constructs an implicit "task" by traversing the environment from a random initial state $s_0$ and terminating when she issues a designated \texttt{stop} action. Bob is then assigned the challenge of reaching Alice's terminal state from the same starting configuration. Rewards are purely intrinsic: Alice is rewarded when Bob fails and penalized when Bob succeeds too quickly, which pushes her toward tasks of intermediate difficulty, while Bob is rewarded strictly for efficient completion.
Memory-augmented self-play modifies this architecture by introducing an external memory module associated with Alice. At episode index $k$, Alice's policy conditions not only on the starting state $s_0$ and the current state $s_t$ at each time $t$, but also on a memory encoding $m_{k-1}$ that summarizes high-level statistics or representations of all prior tasks (episodes $1, \dots, k-1$). This explicit conditioning allows Alice to generate a progressively more diverse curriculum of tasks, reducing redundancy and encouraging more comprehensive state-space coverage. Bob's policy is unaltered by memory.
2. Reinforcement Learning Objectives and Training Regime
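Both agents are trained using REINFORCE with a learned value-function baseline. Let Alice's policy be $\pi_A(a_t \mid s_0, s_t, m_{k-1}; \theta_A)$ and Bob's policy $\pi_B(a_t \mid s_t, s^*; \theta_B)$, where $s^*$ is Alice's terminal state. For either agent, with parameters $\theta$ and policy input $x_t$ at step $t$, the expected return to maximize is

$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]$$

where each $r_t$ is determined by the intrinsic self-play game: for Alice, a binary reward at episode termination that depends on whether Bob fails or succeeds too easily; for Bob, a per-step negative penalty and a terminal success bonus. The policy gradient is given by

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t}\, \nabla_\theta \log \pi_\theta(a_t \mid x_t)\,\bigl(G_t - b_t\bigr)\right]$$

with $G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ the return from step $t$ onward and $b_t$ a baseline supplied by the learned value function. The key change with memory augmentation is the input to Alice's policy network, not the training objective itself.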
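The quantities that weight the log-probability gradients can be computed with a few lines of code. The following is a minimal sketch, not the authors' implementation: the reward values, the baseline values, and the function names `returns_to_go` and `advantages` are illustrative assumptions.

```python
from typing import List, Sequence


def returns_to_go(rewards: Sequence[float], gamma: float) -> List[float]:
    """Discounted return from each step onward: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages(rewards: Sequence[float], baselines: Sequence[float],
               gamma: float) -> List[float]:
    """G_t - b_t for every step of one episode."""
    return [g - b for g, b in zip(returns_to_go(rewards, gamma), baselines)]


# Hypothetical Bob episode: a step penalty for three steps, then a success bonus.
rewards = [-0.1, -0.1, -0.1, 1.0]   # assumed values, not from the source
baselines = [0.5, 0.6, 0.7, 0.9]    # made-up critic outputs
print(advantages(rewards, baselines, gamma=1.0))
```

Subtracting the baseline leaves the expected gradient unchanged while reducing its variance, which is why the critic can be trained alongside the actor.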
3. Memory Architecture and Update Mechanisms
Experimental analysis compared three memory modules for encoding prior task summaries:
- Last Episode: Store only the feature vector of the immediately preceding episode: $m_k = e_k$.
- Last $n$ Episodes (Averaged): Store the average feature vector over the $n$ most recent episodes: $m_k = \frac{1}{n} \sum_{j=k-n+1}^{k} e_j$.
- LSTM Episode Memory: Process the sequence of episode summaries with an LSTM cell: $(m_k, c_k) = \mathrm{LSTM}(e_k, (m_{k-1}, c_{k-1}))$.
Here $m_k$ denotes the memory after episode $k$, $c_k$ is the LSTM cell state, and each episode summary $e_k$ typically consists of a feature representation of the (start, end) state pair computed via a feed-forward network. Empirically, LSTM-based memory yielded the most rapid convergence and consistently stronger exploration.
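The three update rules can be compared side by side in a small sketch. This is an illustration under stated assumptions rather than the paper's code: the class names are hypothetical, the LSTM weights are random where a real system would learn them, and the hidden size is set equal to the summary size for simplicity.

```python
import math
import random
from collections import deque
from typing import Deque, List

Vector = List[float]


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class LastEpisodeMemory:
    """Memory is the summary of the most recent episode."""

    def __init__(self, dim: int) -> None:
        self.value: Vector = [0.0] * dim

    def update(self, summary: Vector) -> Vector:
        self.value = list(summary)
        return self.value


class AveragedMemory:
    """Memory is the mean of the last `window` episode summaries."""

    def __init__(self, dim: int, window: int) -> None:
        self.buffer: Deque[Vector] = deque(maxlen=window)
        self.value: Vector = [0.0] * dim

    def update(self, summary: Vector) -> Vector:
        self.buffer.append(list(summary))
        n = len(self.buffer)
        self.value = [sum(column) / n for column in zip(*self.buffer)]
        return self.value


class LSTMMemory:
    """Memory is the hidden state of an LSTM cell fed one summary per episode."""

    def __init__(self, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.dim = dim
        # One weight row per gate unit over [summary, hidden, bias].
        # Random here for illustration; learned in practice.
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(2 * dim + 1)]
                        for _ in range(4 * dim)]
        self.hidden: Vector = [0.0] * dim
        self.cell: Vector = [0.0] * dim

    def update(self, summary: Vector) -> Vector:
        x = list(summary) + self.hidden + [1.0]
        pre = [sum(w * v for w, v in zip(row, x)) for row in self.weights]
        d = self.dim
        i_gate = [_sigmoid(z) for z in pre[0:d]]
        f_gate = [_sigmoid(z) for z in pre[d:2 * d]]
        cand = [math.tanh(z) for z in pre[2 * d:3 * d]]
        o_gate = [_sigmoid(z) for z in pre[3 * d:4 * d]]
        self.cell = [f * c + i * g
                     for f, c, i, g in zip(f_gate, self.cell, i_gate, cand)]
        self.hidden = [o * math.tanh(c) for o, c in zip(o_gate, self.cell)]
        return self.hidden


summaries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up episode summaries
averaged = AveragedMemory(dim=2, window=2)
lstm = LSTMMemory(dim=2)
for summary in summaries:
    print(averaged.update(summary), lstm.update(summary))
```

The first two variants discard everything outside a fixed window, whereas the recurrent variant can in principle retain information from arbitrarily old episodes through its cell state.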
4. Neural Architectures and Feature Representations
The core neural modules are:
- Feature Extractor $\phi$: Single hidden-layer feed-forward network with ReLU nonlinearity. Alice’s input: concatenated $(s_0, s_t)$; Bob’s input: $(s_t, s^*)$, with $s^*$ the target state.
- Actor Network: Single feed-forward layer plus softmax, outputting action probabilities. For Alice, the output of the feature extractor is concatenated with memory $m_{k-1}$.
- Critic Network: Single feed-forward layer producing scalar value estimate $V(s_t)$.
Feature dimensionality is environment-dependent:
| Environment | Alice (no memory) | Alice (with memory) | Bob |
|---|---|---|---|
| Mazebase | 50 | 100 | 50 |
| Acrobot | 10 | 20 | 10 |
The memory module augments only Alice’s input dimensionality; Bob’s feature encoder is unchanged.
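A forward pass through Alice's actor might look like the sketch below. It assumes that the 100-dimensional Mazebase figure arises from concatenating a 50-dimensional feature vector with a 50-dimensional memory vector; the source does not state this split. The state size, action count, random weights, and all function names are likewise illustrative assumptions.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (weight rows, bias)


def linear(weights: List[Vector], bias: Vector, x: Vector) -> Vector:
    """Affine map W x + b with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def softmax(x: Vector) -> Vector:
    shift = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_out: int, n_in: int, rng: random.Random) -> Layer:
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def alice_action_probs(s0: Vector, st: Vector, memory: Vector,
                       extractor: Layer, actor: Layer) -> Vector:
    """Features of (s0, st), concatenated with memory, then a softmax head."""
    features = relu(linear(*extractor, list(s0) + list(st)))
    return softmax(linear(*actor, features + list(memory)))


rng = random.Random(0)
STATE_DIM, FEATURE_DIM, MEMORY_DIM, N_ACTIONS = 8, 50, 50, 5  # assumed sizes
extractor = init_layer(FEATURE_DIM, 2 * STATE_DIM, rng)
actor = init_layer(N_ACTIONS, FEATURE_DIM + MEMORY_DIM, rng)  # 100 inputs

probs = alice_action_probs([0.0] * STATE_DIM, [1.0] * STATE_DIM,
                           [0.0] * MEMORY_DIM, extractor, actor)
print(len(probs), sum(probs))
```

Bob's actor would use the same structure without the memory concatenation, so its head takes only the feature vector as input.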
5. Self-Play Training Protocol
A typical meta-episode proceeds as follows:
- Sample random initial state $s_0$.
- Alice rollout: For $t = 0$ to max steps, sample $a_t \sim \pi_A(\cdot \mid s_0, s_t, m_{k-1})$. Terminate on \texttt{stop} or max steps; final state is $s^*$.
- Assign Bob: Task is to traverse from $s_0$ to $s^*$.
- Bob rollout: For up to max steps, penalize each time step and provide a success reward on completion.
- Collect trajectories, compute intrinsic rewards, and update parameters via REINFORCE with baseline.
- Extract episode summary $e_k$ and update Alice's memory: $m_k = \mathrm{Update}(m_{k-1}, e_k)$.
- Periodically, train Bob on extrinsic "target task" batches.
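The rollout portion of this protocol can be sketched on a toy environment. Everything below is an assumption made for illustration: the one-dimensional chain world, the fixed hand-written policies, the reward values, and the use of the raw (start, end) pair as the episode summary. The REINFORCE update and the periodic target-task training are omitted.

```python
from typing import Callable, Dict, Tuple

STOP, LEFT, RIGHT = 0, -1, 1
CHAIN_LENGTH = 10  # toy environment: integer positions 0..9

Memory = Tuple[int, int]  # (start, end) of the previous task


def step(state: int, action: int) -> int:
    """Move along the chain, clipping at both ends."""
    return min(max(state + action, 0), CHAIN_LENGTH - 1)


def self_play_episode(
    s0: int,
    alice: Callable[[int, int, Memory], int],
    bob: Callable[[int, int], int],
    memory: Memory,
    max_steps: int = 8,
    step_penalty: float = -0.1,  # assumed value
    success_bonus: float = 1.0,  # assumed value
) -> Dict[str, object]:
    # Alice rollout: act until STOP or until the step budget runs out.
    state = s0
    for _ in range(max_steps):
        action = alice(s0, state, memory)
        if action == STOP:
            break
        state = step(state, action)
    target = state

    # Bob rollout: restart from s0 and try to reach Alice's final state.
    state, bob_return, solved = s0, 0.0, s0 == target
    for _ in range(max_steps):
        if solved:
            break
        state = step(state, bob(state, target))
        bob_return += step_penalty
        solved = state == target
    if solved:
        bob_return += success_bonus

    return {
        "target": target,
        "bob_solved": solved,
        "bob_return": bob_return,
        "alice_reward": 0.0 if solved else 1.0,  # assumed binary values
        "memory": (s0, target),  # last-episode summary as the new memory
    }


def alice(s0: int, st: int, memory: Memory) -> int:
    # Hypothetical fixed policy: walk three cells to the right, then stop.
    return RIGHT if st - s0 < 3 else STOP


def bob(st: int, target: int) -> int:
    return RIGHT if st < target else LEFT


result = self_play_episode(2, alice, bob, memory=(0, 0))
print(result)
```

In a full training loop, the returned memory would be passed to Alice on the next episode, and the collected rewards would drive the policy-gradient updates for both agents.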
Experiments used PyTorch 0.3.1 with the Adam optimizer and discounted returns; the environments studied were Mazebase (8×8 grid) and Acrobot (an OpenAI Gym task with a continuous state space). Memory-augmented self-play used larger feature representations for Alice.
6. Empirical Results and Exploration Analysis
Quantitative metrics indicate memory-augmented self-play produces faster convergence and richer exploration than standard variants.
Episodic Reward Comparison
Mazebase (average episodic reward, last 10K episodes):
| # episodes | No self-play | Self-play | Mem-aug self-play |
|---|---|---|---|
| 100K | -4.841 | -4.650 | -4.580 |
| 200K | -4.419 | -4.253 | -3.899 |
| 300K | -3.861 | -3.686 | -3.130 |
| 400K | -3.357 | -3.265 | -2.758 |
| 500K | -3.076 | -2.942 | -2.541 |
| 600K | -2.782 | -2.802 | -2.528 |
| 700K | -2.669 | -2.564 | -2.516 |
Acrobot (average episodic reward, last 2K episodes):
| # episodes | No self-play | Self-play | Mem-aug self-play |
|---|---|---|---|
| 10K | -826.31 | -778.74 | -678.14 |
| 20K | -986.14 | -712.65 | -778.31 |
| 30K | -999.24 | -949.59 | -924.37 |
| 40K | -999.98 | -999.79 | -992.29 |
| 50K | -1000.00 | -1000.00 | -996.77 |
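In Mazebase, memory-augmented self-play attains the highest average reward at every checkpoint; its margin over plain self-play is largest between 300K and 500K episodes (for example, -3.130 vs. -3.686 at 300K) and narrows to -2.516 vs. -2.564 by 700K. In Acrobot, it leads at every checkpoint except 20K, where plain self-play is ahead (-712.65 vs. -778.31).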
Exploration Metrics
State-space coverage analysis indicates that LSTM memory substantially increases task diversity. The mean Euclidean distance between the start and end states of Alice's tasks, measured in a 2D PCA embedding of the state space, grows from 0.0192 (no memory) to 0.1079 (LSTM memory), a more than fivefold increase. This supports the claim that Alice with memory proposes more diverse and spatially separated tasks, promoting rapid environment coverage and more effective unsupervised curriculum generation.
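The diversity metric and the reported ratio are easy to reproduce in outline. The sketch below assumes the states have already been projected to two dimensions (the PCA step is not shown), and the sample points and the function name `mean_task_length` are made up for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def mean_task_length(pairs: List[Tuple[Point, Point]]) -> float:
    """Mean Euclidean distance between embedded start and end states."""
    lengths = [math.hypot(end[0] - start[0], end[1] - start[1])
               for start, end in pairs]
    return sum(lengths) / len(lengths)


# Made-up 2D embeddings of (start, end) pairs, for illustration only.
pairs = [((0.0, 0.0), (3.0, 4.0)), ((1.0, 1.0), (1.0, 2.0))]
print(mean_task_length(pairs))

# Ratio of the two figures reported in the text (LSTM memory vs. no memory).
ratio = 0.1079 / 0.0192
print(round(ratio, 2))
```

The ratio comes out to roughly 5.6, consistent with the "more than fivefold" description.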
Empirical ablation reveals the superiority of LSTM-based episodic memory over alternatives, as measured by convergence speed and exploration diversity in both Mazebase and Acrobot.
7. Conclusions and Prospective Extensions
Memory-Augmented Self-Play enables an intrinsically motivated agent to actively recall and diversify the task curriculum during unsupervised exploration, while preserving the min–max self-play training framework. The approach is minimally invasive, requiring only the introduction of memory conditioning in the task-generating agent’s policy inputs and training of the memory module. The primary benefits are faster convergence, richer coverage of the state space, and improved sample efficiency when pretraining Bob for arbitrary target tasks.
Potential extensions include the development of hierarchical or more sophisticated differentiable memory architectures, supporting selective persistence of trajectories and richer forms of across-episode knowledge retention (Sodhani et al., 2018).