Memory-Augmented Self-Play

Updated 22 December 2025

The paper introduces memory-augmented self-play, where an external memory module guides task generation to improve exploration and convergence in reinforcement learning.
It compares multiple memory encoding strategies—last episode, averaged, and LSTM—with LSTM-based memory yielding over a fivefold increase in task diversity and faster performance gains.
Empirical results in Mazebase and Acrobot show that memory augmentation enhances sample efficiency and state-space coverage, leading to superior downstream fine-tuning.

Memory-Augmented Self-Play is an extension of intrinsic motivation-based reinforcement learning wherein an agent leverages a persistent external memory to enhance the diversity and efficacy of unsupervised task generation during self-play pretraining. This enhancement equips the task-generating agent with the ability to recall and condition its current exploratory behavior on a summary of previously proposed tasks, leading to accelerated state-space exploration and superior fine-tuning performance on downstream objectives (Sodhani et al., 2018).

1. Self-Play with Memory: Formulation and Motivation

The underlying paradigm derives from the unsupervised self-play setting introduced by Sukhbaatar et al. (2017), in which two copies of an agent, denoted Alice and Bob, alternate roles. Alice constructs an implicit "task" by traversing the environment from a random initial state $s_0$ and terminating when she issues a designated "stop" action. Bob is then challenged to reach Alice's terminal state $s_T$ from the same starting configuration. Rewards are purely intrinsic: Alice is rewarded when Bob fails and penalized when Bob succeeds too easily, which pushes her toward tasks of intermediate difficulty, while Bob is rewarded strictly for efficient completion.

Memory-augmented self-play modifies this architecture by introducing an external memory module $M$ associated with Alice. At episode index $k$, Alice's policy conditions not only on the starting state $s^{\mathrm{start}}_k$ and current state $s^{\mathrm{cur}}_{k,t}$ at each time $t$, but also on a memory encoding $m_k$ that summarizes high-level statistics or representations of all prior tasks (episodes $1,\ldots,k-1$). This explicit conditioning allows Alice to generate a progressively more diverse curriculum of tasks, reducing redundancy and encouraging more comprehensive state-space coverage. Bob's policy is unaltered by memory.

2. Reinforcement Learning Objectives and Training Regime

Both agents are trained using REINFORCE with a learned value-function baseline. Let Alice's policy be $\pi^A_\theta(a\mid s^{\mathrm{start}},s^{\mathrm{cur}},m)$ and Bob's policy be $\pi^B_\phi(a\mid s^{\mathrm{cur}},s^{\mathrm{target}})$. The expected return to maximize is

$J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^T \gamma^t r_t\right]$

where each $r_t$ is determined by the intrinsic self-play game: for Alice, a binary reward at episode termination ($+1$ if Bob fails, $-1$ if Bob succeeds too easily); for Bob, a per-step negative penalty and a terminal success bonus. The policy gradient is given by

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\left[\sum_{t=0}^T \nabla_\theta \ln\pi_\theta(a_t\mid s_t)\, (R_t - b(s_t))\right]$

with $R_t$ the reward-to-go (the return accumulated from step $t$ onward) and $b(s_t)$ the learned value-function baseline. The key change with memory augmentation is the input to Alice's policy network, not the training objective itself.
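
The weighting term $(R_t - b(s_t))$ in the gradient can be illustrated with a small sketch. It assumes $R_t$ is the discounted reward-to-go from step $t$; the reward sequence and baseline values below are made up for illustration, and the function names are hypothetical rather than taken from the paper's code.

```python
def discounted_returns(rewards, gamma=1.0):
    """Return R_t = sum over k >= t of gamma^(k-t) * r_k, for every step t."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, baselines, gamma=1.0):
    """Compute (R_t - b(s_t)), the factor that scales grad log pi(a_t | s_t)."""
    returns = discounted_returns(rewards, gamma)
    return [R - b for R, b in zip(returns, baselines)]


# Hypothetical Bob episode: -1 per step, then a success bonus on the last step.
bob_rewards = [-1.0, -1.0, -1.0, 5.0]
bob_baselines = [1.0, 1.5, 2.5, 4.0]  # made-up critic outputs b(s_t)

print(discounted_returns(bob_rewards))         # [2.0, 3.0, 4.0, 5.0]
print(advantages(bob_rewards, bob_baselines))  # [1.0, 1.5, 1.5, 1.0]
```

In an actual training loop these advantages would multiply the log-probability gradients of the sampled actions; that step depends on the network library and is left out here.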

3. Memory Architecture and Update Mechanisms

Experimental analysis compared three memory modules for encoding prior task summaries:

Last Episode: Store only the feature vector $\phi_{\mathrm{ep}}^{k-1}$ of the immediately preceding episode.
Last $n$ Episodes (Averaged): Store the average feature vector over the $n$ most recent episodes, where $n$ is the window size (written $n$ here to keep it distinct from the episode index $k$):

$m_k = \frac{1}{n} \sum_{i=k-n}^{k-1}\phi_{\mathrm{ep}}^i$

LSTM Episode Memory: Process the sequence of episode summaries with an LSTM cell,

$(h_k, c_k) = \mathrm{LSTMCell}(\phi_{\mathrm{ep}}^{k-1}, (h_{k-1}, c_{k-1})), \quad m_k = h_k$

where each episode summary $\phi_{\mathrm{ep}}^i$ typically consists of a feature representation of the (start, end) state pair computed via a feed-forward network. Empirically, LSTM-based memory yielded the most rapid convergence and consistently stronger exploration.
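
The three update rules can be sketched in plain Python as follows. This is a minimal illustration, not the paper's implementation: episode summaries are plain lists of floats, the averaging window is called `window` to keep it distinct from the episode index, and the LSTM step uses the standard gate equations with a hypothetical `params` dictionary holding one `(W, U, b)` triple per gate.

```python
import math
from collections import deque


def last_episode_memory(summaries):
    """m_k is the summary of the immediately preceding episode."""
    return list(summaries[-1])


def averaged_memory(summaries, window):
    """m_k is the element-wise mean of the most recent `window` summaries."""
    recent = list(summaries)[-window:]
    return [sum(col) / len(recent) for col in zip(*recent)]


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def _affine(W, U, b, x, h):
    """Compute W x + U h + b for one gate; W and U are lists of rows."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_j
        for W_row, U_row, b_j in zip(W, U, b)
    ]


def lstm_memory_step(params, summary, h, c):
    """One standard LSTM cell update; the new hidden state plays the role of m_k."""
    pre = {name: _affine(*params[name], summary, h) for name in ("i", "f", "g", "o")}
    i = [_sigmoid(v) for v in pre["i"]]   # input gate
    f = [_sigmoid(v) for v in pre["f"]]   # forget gate
    g = [math.tanh(v) for v in pre["g"]]  # candidate cell values
    o = [_sigmoid(v) for v in pre["o"]]   # output gate
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


# Hypothetical 2-D episode summaries.
summaries = deque(maxlen=3)
for phi in ([0.0, 2.0], [2.0, 4.0], [4.0, 0.0]):
    summaries.append(phi)

print(last_episode_memory(summaries))        # [4.0, 0.0]
print(averaged_memory(summaries, window=2))  # [3.0, 2.0]

# All-zero weights (hidden size 1, input size 2): every gate is 0.5, candidate is 0.
zero_params = {name: ([[0.0, 0.0]], [[0.0]], [0.0]) for name in ("i", "f", "g", "o")}
h, c = lstm_memory_step(zero_params, summaries[-1], h=[0.0], c=[1.0])
print(h, c)
```

The first two rules discard everything outside a fixed window, whereas the recurrent update can in principle carry information from any earlier episode through the cell state.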

4. Neural Architectures and Feature Representations

The core neural modules are:

Feature Extractor $\phi(\cdot)$: Single hidden-layer feed-forward network with ReLU nonlinearity. Alice’s input: concatenated $[s^{\mathrm{start}}, s^{\mathrm{cur}}]$; Bob’s input: $[s^{\mathrm{cur}}, s^{\mathrm{target}}]$.
Actor Network: Single feed-forward layer plus softmax, outputting action probabilities. For Alice, the output of the feature extractor is concatenated with memory $m$.
Critic Network: Single feed-forward layer producing scalar value estimate $V(s)$ .

Feature dimensionality is environment-dependent:

Environment	Alice (no memory)	Alice (with memory)	Bob
Mazebase	50	100	50
Acrobot	10	20	10

The memory module augments only Alice’s input dimensionality; Bob’s feature encoder is unchanged.
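
How the memory enters Alice's actor can be sketched as a concatenation followed by a linear layer and a softmax. The sketch assumes, as one possible reading of the table, that the 100-dimensional Mazebase figure is a 50-dimensional feature vector joined with a 50-dimensional memory vector; the source does not state this split. The action count, random weights, and function names are all illustrative.

```python
import math
import random


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def alice_action_probs(features, memory, weights, bias):
    """Actor head: concatenate state features with memory, then linear + softmax."""
    return softmax(linear(weights, bias, features + memory))


rng = random.Random(0)
feature_dim, memory_dim, num_actions = 50, 50, 5  # assumed widths; action count is made up
features = [rng.uniform(-1, 1) for _ in range(feature_dim)]
memory = [rng.uniform(-1, 1) for _ in range(memory_dim)]
weights = [
    [rng.uniform(-0.1, 0.1) for _ in range(feature_dim + memory_dim)]
    for _ in range(num_actions)
]
bias = [0.0] * num_actions

probs = alice_action_probs(features, memory, weights, bias)
print(len(features + memory), len(probs))
```

Bob's actor would use the same head without the `memory` argument, which is why only Alice's input width changes.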

5. Self-Play Training Protocol

A typical meta-episode proceeds as follows:

Sample a random initial state $s_0$.
Alice rollout: For $t=0$ up to the step limit, sample $a_t \sim \pi^A_\theta(a\mid s_0, s^{\mathrm{cur}}, m)$. Terminate on the "stop" action or at the step limit; the final state is $s_T$.
Assign Bob: Task is to traverse from $s_0$ to $s_T$.
Bob rollout: For up to the step limit, penalize each time step and provide a success reward on completion.
Collect trajectories, compute intrinsic rewards, and update parameters via REINFORCE with baseline.
Extract episode summary $\phi_{\mathrm{ep}}$ and update Alice's memory: $M \leftarrow f_{\mathrm{mem}}(M, \phi_{\mathrm{ep}})$.
Periodically, train Bob on extrinsic "target task" batches.
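
The control flow of one meta-episode can be sketched on a toy problem. Everything below is an assumption made for illustration: the environment is a 1-D corridor rather than Mazebase or Acrobot, both agents use fixed placeholder policies (no learning and no parameter update), Alice's reward is simplified to $-1$ for any Bob success, and the memory is a plain list of (start, end) pairs.

```python
import random

SIZE, MAX_STEPS = 8, 10  # toy corridor length and step limit; both illustrative


def step(state, action):
    """Move left (-1) or right (+1), clipped to the corridor."""
    return min(SIZE - 1, max(0, state + action))


def alice_rollout(rng, start, memory):
    """Placeholder Alice: random walk that may stop early.
    A learned policy would also condition on `memory`."""
    state = start
    for _ in range(MAX_STEPS):
        action = rng.choice([-1, 1, "stop"])
        if action == "stop":
            break
        state = step(state, action)
    return state


def bob_rollout(rng, start, target):
    """Placeholder Bob: noisy walk toward the target.
    Reward is -1 per step plus a +1 bonus on success."""
    state, rewards = start, []
    for _ in range(MAX_STEPS):
        if state == target:
            break
        direction = 1 if target > state else -1
        action = direction if rng.random() < 0.7 else -direction
        state = step(state, action)
        rewards.append(-1.0)
    success = state == target
    if success:
        rewards.append(1.0)
    return success, rewards


def meta_episode(rng, memory):
    start = rng.randrange(SIZE)                             # sample s_0
    target = alice_rollout(rng, start, memory)              # Alice proposes s_T
    success, bob_rewards = bob_rollout(rng, start, target)  # Bob attempts s_0 -> s_T
    alice_reward = -1.0 if success else 1.0                 # simplified intrinsic reward
    # A REINFORCE-with-baseline update for both agents would go here (omitted).
    memory.append((start, target))                          # update Alice's memory
    return alice_reward, sum(bob_rewards)


rng = random.Random(0)
memory = []
results = [meta_episode(rng, memory) for _ in range(5)]
print(results)
print(memory)
```

The periodic training of Bob on the extrinsic target task is not shown; it would interleave ordinary environment episodes with calls to `meta_episode`.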

The experiments used PyTorch 0.3.1 and the Adam optimizer (lr = $1\times 10^{-3}$) with discount $\gamma=1$. The environments were Mazebase (8×8 grid) and Acrobot (continuous OpenAI Gym task). The memory-augmented variant used larger feature representations for Alice.

6. Empirical Results and Exploration Analysis

Quantitative metrics indicate that memory-augmented self-play produces faster convergence and richer exploration than either self-play without memory or training without self-play.

Episodic Reward Comparison

Mazebase (average episodic reward, last 10K episodes):

# episodes	No self-play	Self-play	Mem-aug self-play
100K	-4.841	-4.650	-4.580
200K	-4.419	-4.253	-3.899
300K	-3.861	-3.686	-3.130
400K	-3.357	-3.265	-2.758
500K	-3.076	-2.942	-2.541
600K	-2.782	-2.802	-2.528
700K	-2.669	-2.564	-2.516

Acrobot (average episodic reward, last 2K episodes):

# episodes	No self-play	Self-play	Mem-aug self-play
10K	-826.31	-778.74	-678.14
20K	-986.14	-712.65	-778.31
30K	-999.24	-949.59	-924.37
40K	-999.98	-999.79	-992.29
50K	-1000.00	-1000.00	-996.77

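In Mazebase, memory augmentation gives the best average reward at every checkpoint; its margin over plain self-play is largest around 300K to 400K episodes (about 0.5) and narrows to about 0.05 by 700K. In Acrobot, it gives the best reward at every checkpoint except 20K, where plain self-play scores higher.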

Exploration Metrics

State-space coverage analysis shows that LSTM memory substantially increases task diversity. The mean Euclidean length of (start, end) trajectories in a 2D PCA embedding of the states grows from 0.0192 (no memory) to 0.1079 (LSTM memory), roughly a 5.6-fold increase. This indicates that Alice with memory proposes more diverse and spatially separated tasks, promoting rapid environment coverage and more effective unsupervised curriculum generation.
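
A minimal sketch of this diversity metric follows. It assumes that "length" means the straight-line distance between the embedded start and end states, and that the states have already been projected to 2D; the projection step and the sample points are not from the paper.

```python
import math


def mean_task_length(pairs):
    """Mean Euclidean distance between each task's start and end embedding."""
    return sum(math.dist(start, end) for start, end in pairs) / len(pairs)


# Hypothetical 2-D embeddings of (start, end) states for three tasks.
tasks = [
    ((0.0, 0.0), (3.0, 4.0)),  # length 5
    ((1.0, 1.0), (1.0, 1.0)),  # length 0: Alice stopped where she started
    ((0.0, 0.0), (0.0, 1.0)),  # length 1
]
print(mean_task_length(tasks))  # 2.0

# Ratio of the two reported means (LSTM memory vs. no memory), about 5.6.
print(0.1079 / 0.0192)
```

A larger mean indicates that proposed tasks end farther from where they start, which is the sense in which the reported numbers reflect more spatially separated tasks.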

Empirical ablation reveals the superiority of LSTM-based episodic memory over alternatives, as measured by convergence speed and exploration diversity in both Mazebase and Acrobot.

7. Conclusions and Prospective Extensions

Memory-Augmented Self-Play enables an intrinsically motivated agent to actively recall and diversify the task curriculum during unsupervised exploration, while preserving the min–max self-play training framework. The approach is minimally invasive, requiring only the introduction of memory conditioning in the task-generating agent’s policy inputs and training of the memory module. The primary benefits are faster convergence, richer coverage of the state space, and improved sample efficiency when pretraining Bob for arbitrary target tasks.

Potential extensions include the development of hierarchical or more sophisticated differentiable memory architectures, supporting selective persistence of trajectories and richer forms of across-episode knowledge retention (Sodhani et al., 2018).

References

Sodhani et al. (2018). Memory Augmented Self-Play.
