Memory-Augmented Self-Play

Updated 22 December 2025
  • The paper introduces memory-augmented self-play, where an external memory module guides task generation to improve exploration and convergence in reinforcement learning.
  • It compares multiple memory encoding strategies—last episode, averaged, and LSTM—with LSTM-based memory yielding over a fivefold increase in task diversity and faster performance gains.
  • Empirical results in Mazebase and Acrobot show that memory augmentation enhances sample efficiency and state-space coverage, leading to superior downstream fine-tuning.

Memory-Augmented Self-Play is an extension of intrinsic motivation-based reinforcement learning wherein an agent leverages a persistent external memory to enhance the diversity and efficacy of unsupervised task generation during self-play pretraining. This enhancement equips the task-generating agent with the ability to recall and condition its current exploratory behavior on a summary of previously proposed tasks, leading to accelerated state-space exploration and superior fine-tuning performance on downstream objectives (Sodhani et al., 2018).

1. Self-Play with Memory: Formulation and Motivation

The underlying paradigm derives from the unsupervised self-play setting introduced by Sukhbaatar et al. (2017), in which two copies of an agent, denoted Alice and Bob, alternate roles. Alice constructs an implicit "task" by traversing the environment from a random initial state $s_0$ and terminating when she issues a designated $\texttt{stop}$ action. Bob is then assigned the challenge of reaching Alice's terminal state $s_T$ from the same starting configuration. Rewards are purely intrinsic: Alice is rewarded when Bob fails and penalized when Bob consistently succeeds too rapidly, which pushes her toward tasks of intermediate difficulty. Bob is rewarded strictly for efficient completion.

Memory-augmented self-play modifies this architecture by introducing an external memory module $M$ associated with Alice. At episode index $k$, Alice's policy conditions not only on the starting state $s^{\mathrm{start}}_k$ and current state $s^{\mathrm{cur}}_{k,t}$ at each time $t$, but also on a memory encoding $m_k$ that summarizes high-level statistics or representations of all prior tasks (episodes $1,\ldots,k-1$). This explicit conditioning allows Alice to generate a progressively more diverse curriculum of tasks, reducing redundancy and encouraging more comprehensive state-space coverage. Bob's policy is unaltered by memory.

2. Reinforcement Learning Objectives and Training Regime

Both agents are trained using REINFORCE with a learned value-function baseline. Let Alice's policy be $\pi^A_\theta(a\mid s^{\mathrm{start}},s^{\mathrm{cur}},m)$, and Bob's policy $\pi^B_\phi(a\mid s^{\mathrm{cur}},s^{\mathrm{target}})$. The expected return to maximize is

$$J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^T \gamma^t r_t\right]$$

where each $r_t$ is determined by the intrinsic self-play game: for Alice, a binary reward at episode termination ($+1$ if Bob fails, $-1$ if Bob succeeds too easily); for Bob, a per-step negative penalty and a terminal success bonus. The policy gradient is given by

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\left[\sum_{t=0}^T \nabla_\theta \ln\pi_\theta(a_t\mid s_t)\, (R_t - b(s_t))\right]$$

with $R_t$ the time-to-goal return and $b(s_t)$ a baseline. The key change with memory augmentation is the input to Alice's policy network, not the training objective itself.
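The estimator above can be sketched in plain Python. This is a minimal illustration, assuming $R_t$ is the discounted sum of rewards from step $t$ onward and that per-step log-probabilities and baseline values are already available as floats. The function names are hypothetical and not taken from the paper; in a real implementation the surrogate would be built from autograd tensors so that differentiating it yields the gradient.

```python
def returns_to_go(rewards, gamma=1.0):
    """Return R_t = sum_{u >= t} gamma^(u - t) * r_u for every step t."""
    returns = []
    running = 0.0
    # Accumulate from the last step backwards so each R_t reuses R_{t+1}.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def reinforce_surrogate_loss(log_probs, rewards, baselines, gamma=1.0):
    """Scalar surrogate: minus the sum of log pi(a_t|s_t) * (R_t - b(s_t)).

    The baseline values are treated as constants here (an assumption of
    this sketch); minimizing this quantity follows the policy gradient.
    """
    returns = returns_to_go(rewards, gamma)
    advantages = [R - b for R, b in zip(returns, baselines)]
    return -sum(lp * adv for lp, adv in zip(log_probs, advantages))


# Example: two penalized steps followed by a terminal bonus, gamma = 1.
example_returns = returns_to_go([-1.0, -1.0, 5.0])
```

With $\gamma = 1$, as in the reported experiments, each $R_t$ is simply the sum of the remaining rewards in the episode.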

3. Memory Architecture and Update Mechanisms

Experimental analysis compared three memory modules for encoding prior task summaries:

  • Last Episode: Store only the feature vector $\phi_{\mathrm{ep}}^{k-1}$ of the immediately preceding episode.
  • Last $K$ Episodes (Averaged): Store the average feature vector over the $K$ most recent episodes, where the window size $K$ is distinct from the episode index $k$:

$$m_k = \frac{1}{K} \sum_{i=k-K}^{k-1}\phi_{\mathrm{ep}}^i$$

  • LSTM Episode Memory: Process the sequence of episode summaries with an LSTM cell,

$$(h_k, c_k) = \mathrm{LSTMCell}(\phi_{\mathrm{ep}}^{k-1}, (h_{k-1}, c_{k-1})), \quad m_k = h_k$$

where each episode summary $\phi_{\mathrm{ep}}^i$ typically consists of a feature representation of the (start, end) state pair computed via a feed-forward network. Empirically, LSTM-based memory yielded the most rapid convergence and consistently stronger exploration.
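The three update rules can be compared side by side in a small standard-library sketch. The class names, the sliding-window handling when fewer episodes than the window have been seen, and the randomly initialized (untrained) LSTM weights are all assumptions made for illustration; in the described method the LSTM parameters would be learned together with Alice's policy.

```python
import math
import random
from collections import deque


class LastEpisodeMemory:
    """m_k is the summary of the immediately preceding episode."""
    def __init__(self, dim):
        self.m = [0.0] * dim

    def update(self, phi_ep):
        self.m = list(phi_ep)
        return self.m


class AveragedMemory:
    """m_k is the mean summary over a sliding window of recent episodes."""
    def __init__(self, dim, window):
        self.dim = dim
        self.buf = deque(maxlen=window)  # oldest summaries drop out automatically

    def update(self, phi_ep):
        self.buf.append(list(phi_ep))
        n = len(self.buf)  # assumption: average over what is stored so far
        return [sum(v[j] for v in self.buf) / n for j in range(self.dim)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class LSTMMemory:
    """m_k = h_k from a standard LSTM cell fed one episode summary per step."""
    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        # One weight matrix per gate, applied to the concatenation [phi_ep, h].
        self.W = {g: [[rng.uniform(-0.1, 0.1) for _ in range(2 * dim)]
                      for _ in range(dim)] for g in "ifgo"}
        self.b = {g: [0.0] * dim for g in "ifgo"}
        self.h = [0.0] * dim
        self.c = [0.0] * dim

    def _affine(self, gate, x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.W[gate], self.b[gate])]

    def update(self, phi_ep):
        x = list(phi_ep) + self.h
        i = [sigmoid(z) for z in self._affine("i", x)]    # input gate
        f = [sigmoid(z) for z in self._affine("f", x)]    # forget gate
        g = [math.tanh(z) for z in self._affine("g", x)]  # candidate cell state
        o = [sigmoid(z) for z in self._affine("o", x)]    # output gate
        self.c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, self.c, i, g)]
        self.h = [oj * math.tanh(cj) for oj, cj in zip(o, self.c)]
        return self.h
```

All three expose the same `update(phi_ep)` interface, so Alice's policy can consume the returned vector without knowing which memory variant produced it.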

4. Neural Architectures and Feature Representations

The core neural modules are:

  • Feature Extractor $\phi(\cdot)$: Single hidden-layer feed-forward network with ReLU nonlinearity. Alice’s input: concatenated $[s^{\mathrm{start}}, s^{\mathrm{cur}}]$; Bob’s input: $[s^{\mathrm{cur}}, s^{\mathrm{target}}]$.
  • Actor Network: Single feed-forward layer plus softmax, outputting action probabilities. For Alice, the output of the feature extractor is concatenated with memory $m$.
  • Critic Network: Single feed-forward layer producing scalar value estimate $V(s)$.

Feature dimensionality is environment-dependent:

| Environment | Alice (no memory) | Alice (with memory) | Bob |
|---|---|---|---|
| Mazebase | 50 | 100 | 50 |
| Acrobot | 10 | 20 | 10 |

The memory module augments only Alice’s input dimensionality; Bob’s feature encoder is unchanged.
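A forward pass through Alice's feature extractor and actor can be sketched as follows, using only the standard library. The helper names, the toy state dimension, the number of actions, and the untrained random weights are assumptions for illustration. Splitting the Mazebase figure of 100 into 50 feature units plus a 50-dimensional memory vector is one possible reading of the table, not something the source states.

```python
import math
import random


def linear(x, weights, bias):
    """Affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, z) for z in v]


def softmax(v):
    top = max(v)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in v]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero bias (untrained, for shape checks only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def alice_action_probs(s_start, s_cur, memory, feat_layer, actor_layer):
    # Feature extractor on the concatenated [s_start, s_cur].
    features = relu(linear(s_start + s_cur, *feat_layer))
    # Memory is concatenated with the features before the actor layer.
    logits = linear(features + memory, *actor_layer)
    return softmax(logits)


rng = random.Random(0)
state_dim, feat_dim, mem_dim, n_actions = 4, 50, 50, 5  # hypothetical sizes
feat_layer = init_layer(rng, feat_dim, 2 * state_dim)
actor_layer = init_layer(rng, n_actions, feat_dim + mem_dim)
probs = alice_action_probs([0.0] * state_dim, [1.0] * state_dim,
                           [0.0] * mem_dim, feat_layer, actor_layer)
```

Bob's network would have the same shape minus the memory concatenation, which is why only Alice's actor input grows.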

5. Self-Play Training Protocol

A typical meta-episode proceeds as follows:

  1. Sample random initial state $s_0$.
  2. Alice rollout: For $t=0$ to max steps, sample $a_t \sim \pi^A_\theta(a\mid s^{\mathrm{cur}}, s_0, m)$. Terminate on $\texttt{stop}$ or max steps; final state is $s_T$.
  3. Assign Bob: Task is to traverse from $s_0$ to $s_T$.
  4. Bob rollout: For up to max steps, penalize each time step and provide a success reward on completion.
  5. Collect trajectories, compute intrinsic rewards, and update parameters via REINFORCE with baseline.
  6. Extract episode summary $\phi_{\mathrm{ep}}$ and update Alice's memory: $M \leftarrow f_{\mathrm{mem}}(M, \phi_{\mathrm{ep}})$.
  7. Periodically, train Bob on extrinsic "target task" batches.
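The control flow of steps 1 to 6 can be sketched on a toy problem. Everything specific here is an assumption for illustration: the 1-D chain environment, the reward constants, the uniformly random stand-ins for both policies, and the simplification that Alice is penalized whenever Bob succeeds (the source penalizes her only when Bob succeeds too easily). The parameter update of step 5 and the extrinsic training of step 7 are omitted because the placeholder policies have no parameters.

```python
import random

N_STATES = 10        # hypothetical 1-D chain with states 0..9
MAX_STEPS = 8
STEP_PENALTY = -0.1  # illustrative values, not from the source
SUCCESS_BONUS = 1.0


def move(state, delta):
    """Move left or right along the chain, clipping at the ends."""
    return min(N_STATES - 1, max(0, state + delta))


def alice_rollout(s0, memory, rng):
    """Step 2: Alice acts until she picks 'stop' or runs out of steps."""
    state = s0
    for _ in range(MAX_STEPS):
        # Placeholder for sampling from pi^A(a | s_cur, s_0, m);
        # this random stand-in ignores `memory`.
        action = rng.choice((-1, 1, "stop"))
        if action == "stop":
            break
        state = move(state, action)
    return state


def bob_rollout(s0, target, rng):
    """Step 4: per-step penalty, plus a bonus if the target is reached."""
    state, total = s0, 0.0
    for _ in range(MAX_STEPS):
        if state == target:
            break
        state = move(state, rng.choice((-1, 1)))  # placeholder for pi^B
        total += STEP_PENALTY
    success = state == target
    if success:
        total += SUCCESS_BONUS
    return total, success


def meta_episode(memory, rng):
    s0 = rng.randrange(N_STATES)                      # step 1
    s_T = alice_rollout(s0, memory, rng)              # step 2
    bob_return, success = bob_rollout(s0, s_T, rng)   # steps 3-4
    alice_reward = -1.0 if success else 1.0           # simplified binary reward
    new_memory = (s0, s_T)                            # step 6, last-episode memory
    return new_memory, alice_reward, bob_return, success
```

Calling `meta_episode` in a loop and feeding each returned memory into the next call reproduces the across-episode dependency that distinguishes this protocol from memoryless self-play.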

Experimental configurations utilized PyTorch 0.3.1, Adam optimizer (lr = $1\times 10^{-3}$), discount $\gamma=1$; environments studied included Mazebase (8×8 grid) and Acrobot (continuous OpenAI Gym task). Memory-augmented self-play notably used larger feature representations for Alice.

6. Empirical Results and Exploration Analysis

Quantitative metrics indicate memory-augmented self-play produces faster convergence and richer exploration than standard variants.

Episodic Reward Comparison

Mazebase (average episodic reward, last 10K episodes):

| Episodes | No self-play | Self-play | Mem-aug self-play |
|---|---|---|---|
| 100K | -4.841 | -4.650 | -4.580 |
| 200K | -4.419 | -4.253 | -3.899 |
| 300K | -3.861 | -3.686 | -3.130 |
| 400K | -3.357 | -3.265 | -2.758 |
| 500K | -3.076 | -2.942 | -2.541 |
| 600K | -2.782 | -2.802 | -2.528 |
| 700K | -2.669 | -2.564 | -2.516 |

Acrobot (average episodic reward, last 2K episodes):

| Episodes | No self-play | Self-play | Mem-aug self-play |
|---|---|---|---|
| 10K | -826.31 | -778.74 | -678.14 |
| 20K | -986.14 | -712.65 | -778.31 |
| 30K | -999.24 | -949.59 | -924.37 |
| 40K | -999.98 | -999.79 | -992.29 |
| 50K | -1000.00 | -1000.00 | -996.77 |

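In Mazebase, memory augmentation gives the highest average reward at every checkpoint, with the largest margins between 200K and 500K episodes; by 700K the three methods are within about 0.15 of one another. In Acrobot, the memory-augmented variant leads at every checkpoint except 20K, where plain self-play scores higher (-712.65 versus -778.31).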
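The size of the gap between memory-augmented and plain self-play can be read directly off the Mazebase table. The short sketch below does that arithmetic; the numbers are copied from the table, while the variable and function names are illustrative.

```python
# Mazebase average episodic rewards copied from the table above:
# checkpoint -> (no self-play, self-play, memory-augmented self-play)
mazebase = {
    "100K": (-4.841, -4.650, -4.580),
    "200K": (-4.419, -4.253, -3.899),
    "300K": (-3.861, -3.686, -3.130),
    "400K": (-3.357, -3.265, -2.758),
    "500K": (-3.076, -2.942, -2.541),
    "600K": (-2.782, -2.802, -2.528),
    "700K": (-2.669, -2.564, -2.516),
}


def margins(table):
    """Memory-augmented reward minus plain self-play reward per checkpoint."""
    return {k: round(mem - sp, 3) for k, (_, sp, mem) in table.items()}


gaps = margins(mazebase)
largest = max(gaps, key=gaps.get)  # checkpoint with the widest margin
```

On these figures the margin peaks at the 300K checkpoint and narrows again toward 700K, so the benefit is concentrated in the middle of training rather than at the end.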

Exploration Metrics

State-space coverage analysis demonstrates that LSTM memory substantially increases task diversity. The mean Euclidean length of (start, end) trajectories in 2D PCA state embedding grows from 0.0192 (no memory) to 0.1079 (LSTM memory), a more than fivefold increase. This quantitatively confirms that Alice with memory proposes more diverse and spatially separated tasks, promoting rapid environment coverage and more effective unsupervised curriculum generation.
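The diversity metric can be illustrated with a short sketch, assuming the (start, end) states have already been projected to 2-D (the PCA step itself is not shown). The function name and the toy coordinates are made up for illustration; only the two reported means come from the text.

```python
import math


def mean_start_end_distance(pairs):
    """Mean Euclidean distance between embedded start and end states."""
    return sum(math.dist(start, end) for start, end in pairs) / len(pairs)


# Toy 2-D embeddings, invented for this example.
toy_pairs = [((0.0, 0.0), (3.0, 4.0)), ((1.0, 1.0), (1.0, 1.0))]
toy_mean = mean_start_end_distance(toy_pairs)  # (5.0 + 0.0) / 2

# Ratio of the two means reported in the text (LSTM memory vs. no memory).
ratio = 0.1079 / 0.0192
```

The ratio comes out at roughly 5.6, which is the basis for the "more than fivefold" statement.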

Empirical ablation reveals the superiority of LSTM-based episodic memory over alternatives, as measured by convergence speed and exploration diversity in both Mazebase and Acrobot.

7. Conclusions and Prospective Extensions

Memory-Augmented Self-Play enables an intrinsically motivated agent to actively recall and diversify the task curriculum during unsupervised exploration, while preserving the min–max self-play training framework. The approach is minimally invasive, requiring only the introduction of memory conditioning in the task-generating agent’s policy inputs and training of the memory module. The primary benefits are faster convergence, richer coverage of the state space, and improved sample efficiency when pretraining Bob for arbitrary target tasks.

Potential extensions include the development of hierarchical or more sophisticated differentiable memory architectures, supporting selective persistence of trajectories and richer forms of across-episode knowledge retention (Sodhani et al., 2018).

References (1)
