Interpolated Experience Replay (IER)

Updated 24 May 2026
  • Interpolated Experience Replay (IER) is a reinforcement learning technique that augments training using synthetic transitions generated through convex interpolation or averaging of real experiences.
  • IER encompasses various algorithmic variants (IER-D, IER-C, NMER) tailored for discrete and continuous environments to enhance data coverage and expedite convergence.
  • IER improves sample efficiency and reduces reward variance, leading to faster learning and notable performance gains in both synthetic and real-world control tasks.

Interpolated Experience Replay (IER) is an experience replay augmentation paradigm for reinforcement learning whereby synthetic transitions are constructed from interpolated combinations of existing experiences, thereby increasing data density in replay buffers while promoting sample diversity and generalization. The methodology is particularly relevant for environments with continuous or stochastic dynamics, where the observed coverage of the state–action or state–action–goal manifold is sparse. IER encompasses a family of algorithms that generate synthetic transitions via convex combinations or averaging of stored samples; implementations and theoretical motivations differ based on environment structure, neural architecture, and replay buffer organization (Sander et al., 2022, Gerken et al., 2019, 2002.01370).

1. Core Principles and Mathematical Formalism

IER methods construct new, plausible transitions by either averaging observed transitions for redundancy reduction (in discrete domains) or convexly interpolating neighboring samples (in continuous domains). The generic form of a transition in the buffer is denoted as $\tau = (s, a, r, s')$ or, in multi-goal RL, as $t = (s, a, r, s', g)$. Two representative approaches are:

  • Averaged Transitions (Discrete IER): For each observed state–action pair $(s, a)$ in a stochastic discrete environment, aggregate all matching transitions and synthesize new samples with the mean reward and each unique successor state (2002.01370).

$$r^{\mathrm{avg}}(s, a) = \frac{1}{N} \sum_{i=1}^N r_i$$

Synthetic transitions are $(s, a, r^{\mathrm{avg}}, s'_j)$ for each $s'_j$ observed under $(s, a)$.

  • Convex Interpolation (Continuous IER, "Mixup"): In high-dimensional or continuous domains, a synthetic transition is formed by convexly combining two transitions, e.g.,

$$x_\lambda = \lambda x_i + (1 - \lambda) x_j$$

where $x_i = [s_i; a_i; r_i; s_i']$, $x_j = [s_j; a_j; r_j; s_j']$, and $\lambda \in [0, 1]$ is the mixing coefficient. In multi-goal settings, interpolation is extended to goal components (Sander et al., 2022, Gerken et al., 2019):

$$g_\lambda = \lambda g_i + (1 - \lambda) g_j$$

The core premise, justified via local linearity or manifold regularity assumptions, is that such mixed transitions are sufficiently close to the true support of the transition or goal-conditioned reward distribution to regularize value function estimators.
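
The convex combination above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from any of the cited papers: the helper names `pack` and `interpolate_transitions`, the state and action dimensions, and the numeric values are all assumptions chosen for the example.

```python
import numpy as np

def pack(s, a, r, s_next):
    """Concatenate (s, a, r, s') into one flat vector x = [s; a; r; s']."""
    return np.concatenate([np.ravel(s), np.ravel(a), [r], np.ravel(s_next)]).astype(float)

def interpolate_transitions(x_i, x_j, lam):
    """Return the convex combination lam * x_i + (1 - lam) * x_j."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1] for a convex combination")
    return lam * np.asarray(x_i, dtype=float) + (1.0 - lam) * np.asarray(x_j, dtype=float)

# Hypothetical transitions with a 2-D state and a 1-D action.
x_i = pack(s=[0.0, 1.0], a=[0.5], r=1.0, s_next=[0.1, 1.1])
x_j = pack(s=[0.2, 0.8], a=[0.3], r=0.0, s_next=[0.3, 0.9])

# lam = 0.25 places the synthetic transition closer to x_j than to x_i.
x_mix = interpolate_transitions(x_i, x_j, lam=0.25)
```

Every component of `x_mix`, including the reward, lies between the corresponding components of the two endpoints, which is the property the local-linearity argument relies on.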

2. Algorithmic Variants

Distinct algorithms instantiate IER according to domain and buffer constraints:

  • Discrete Interpolated Experience Replay (IER-D, as in FrozenLake): Synthetic transitions are created by averaging rewards for each $(s, a)$ pair across the replay buffer. This low-variance average better approximates the expected reward and is especially effective in nondeterministic environments. The synthetic buffer is maintained as a secondary queue and used in tandem with the primary buffer for DQN updates (2002.01370).
  • Imaginary Experience Replay (IER-C, Multi-Goal Continuous Action/State): Synthetic samples are generated by first kernel-weighting the proximity in state and goal space, then convexly interpolating all buffer components (states, actions, next states, goals), and recomputing the reward for the synthesized input. The replay buffer is densified with multiple synthetic transitions per real example to improve coverage (Gerken et al., 2019).
  • Neighborhood Mixup Experience Replay (NMER): Transitions are convexly interpolated exclusively with their $k$-nearest neighbors in Z-score standardized state–action space to preserve the local linearity of the unknown transition manifold. The neighbor set is dynamic and not constrained to intra-episode samples, distinguishing NMER from prior mixup methods. Interpolation parameters are drawn from a Beta distribution to control the synthetic transition's locus in the convex hull (Sander et al., 2022).
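
For the discrete variant, the averaging step can be sketched as follows. This is an illustrative reading of the description above, assuming hashable discrete states and actions and a buffer stored as a list of `(s, a, r, s')` tuples; the function name `synthesize_averaged_transitions` and the toy buffer are hypothetical.

```python
from collections import defaultdict

def synthesize_averaged_transitions(buffer):
    """Build (s, a, r_avg, s') tuples from a list of (s, a, r, s') tuples."""
    rewards = defaultdict(list)     # (s, a) -> all observed rewards
    successors = defaultdict(set)   # (s, a) -> unique observed next states
    for s, a, r, s_next in buffer:
        rewards[(s, a)].append(r)
        successors[(s, a)].add(s_next)

    synthetic = []
    for (s, a), rs in rewards.items():
        r_avg = sum(rs) / len(rs)   # mean reward over all matching transitions
        # One synthetic transition per unique successor state.
        for s_next in sorted(successors[(s, a)]):
            synthetic.append((s, a, r_avg, s_next))
    return synthetic

# Hypothetical buffer: (state 0, action 1) was seen three times with two outcomes.
buffer = [(0, 1, 0.0, 1), (0, 1, 1.0, 2), (0, 1, 0.0, 1), (3, 0, 0.0, 3)]
synthetic = synthesize_averaged_transitions(buffer)
```

In this toy buffer the pair `(0, 1)` yields two synthetic transitions that share the averaged reward, one for each distinct successor state.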

3. Neighbor and Kernel Selection Strategies

Because IER relies on a local-manifold approximation, its efficacy depends critically on how sample pairs are selected:

  • Discrete Domains: All observed transitions for a specific $(s, a)$ pair are aggregated, and synthetic transitions are constructed using observed successor states and the averaged reward (2002.01370).
  • Continuous/High-dimensional Domains: Neighboring transitions are determined via $k$-nearest neighbors in Z-score standardized feature space (NMER) or via kernelized similarities in both state and goal components (Imaginary Experience Replay) (Sander et al., 2022, Gerken et al., 2019).
  • Kernel Hyperparameters: Gaussian kernel bandwidths for selection, and Beta distribution shape parameters for the interpolation coefficients, determine whether the interpolation occurs close to endpoints or at midpoints. Bandwidths are often set to average nearest-neighbor distances in the buffer; inappropriate settings degrade performance by generating off-manifold or redundant samples (Gerken et al., 2019).
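
The two hyperparameter effects described above can be illustrated numerically. The sketch below assumes a Gaussian kernel over generic feature vectors, sets the bandwidth to the mean nearest-neighbor distance in the buffer, and compares two symmetric Beta distributions; the function names, feature dimensions, and shape values are assumptions made for the example.

```python
import numpy as np

def mean_nearest_neighbor_distance(points):
    """Average distance from each point to its closest other point."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)   # ignore self-distances
    return dists.min(axis=1).mean()

def gaussian_kernel_weights(query, points, bandwidth):
    """Normalized Gaussian similarity of `query` to every row of `points`."""
    sq_dists = np.sum((points - query) ** 2, axis=1)
    weights = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return weights / weights.sum()

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 4))        # hypothetical state-goal features
h = mean_nearest_neighbor_distance(points)  # bandwidth heuristic
w = gaussian_kernel_weights(points[0], points, h)

# Symmetric Beta(a, a): a < 1 concentrates mass near the endpoints (0 and 1),
# a > 1 concentrates mass near the midpoint (0.5).
lam_endpoints = rng.beta(0.2, 0.2, size=10_000)
lam_midpoint = rng.beta(5.0, 5.0, size=10_000)
```

A bandwidth much larger than `h` flattens `w` toward uniform sampling (pairs far apart get mixed), while a much smaller one concentrates nearly all weight on the query itself, which matches the off-manifold and redundancy failure modes noted above.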

4. Algorithmic Integration and Pseudocode

IER algorithms are inserted into standard off-policy training loops as augmentation during replay-based batch sampling. The synthetic transitions are treated equivalently to real transitions in Bellman error updates or Q-learning loss minimization.

For example, in NMER (Sander et al., 2022), the batch generation protocol is:

  1. Sample a batch of transitions from the buffer.
  2. For each, retrieve its $k$ nearest neighbors in standardized state–action space.
  3. Uniformly sample one neighbor and draw a mixing coefficient $\lambda$ from a Beta distribution.
  4. Interpolate all transition components.
  5. Assemble the batch from synthetic samples for training.
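
The five steps above can be sketched as a single NumPy function. This is an assumption-laden illustration rather than the authors' implementation: transitions are stored as flat rows whose first `sa_dim` columns hold the state and action, neighbors are found by brute force, each sample is excluded from its own neighbor set, and the Beta distribution is symmetric with shape `alpha`. The name `nmer_batch` and all parameter values are hypothetical.

```python
import numpy as np

def nmer_batch(transitions, sa_dim, batch_size, k, alpha, rng):
    """Sample a batch of neighbor-mixed transitions.

    transitions: array of shape (N, D); the first `sa_dim` columns are [s; a].
    """
    sa = transitions[:, :sa_dim]
    # Z-score standardize the state-action features (guard constant columns).
    std = sa.std(axis=0)
    z = (sa - sa.mean(axis=0)) / np.where(std > 0, std, 1.0)

    # Step 1: sample a batch of real transitions.
    idx = rng.integers(0, len(transitions), size=batch_size)

    # Step 2: brute-force k nearest neighbors in standardized space.
    d = np.linalg.norm(z[idx, None, :] - z[None, :, :], axis=-1)
    d[np.arange(batch_size), idx] = np.inf   # assumption: exclude the sample itself
    neighbors = np.argsort(d, axis=1)[:, :k]

    # Step 3: pick one neighbor uniformly and draw lam ~ Beta(alpha, alpha).
    pick = neighbors[np.arange(batch_size), rng.integers(0, k, size=batch_size)]
    lam = rng.beta(alpha, alpha, size=(batch_size, 1))

    # Steps 4 and 5: interpolate every component and return the synthetic batch.
    return lam * transitions[idx] + (1.0 - lam) * transitions[pick]

rng = np.random.default_rng(0)
# Hypothetical buffer: 3-D state, 1-D action, scalar reward, 3-D next state.
transitions = rng.normal(size=(500, 8))
batch = nmer_batch(transitions, sa_dim=4, batch_size=32, k=10, alpha=1.0, rng=rng)
```

For large buffers the brute-force distance matrix would normally be replaced by a spatial index or approximate nearest-neighbor search; the brute-force version is used here only to keep the sketch self-contained.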

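Generic IER, as used in multi-goal CVI (Gerken et al., 2019), follows the same pattern: for each real transition, select a partner by kernel-weighted proximity in state and goal space, convexly interpolate the states, actions, next states, and goals, recompute the reward for the synthesized input, and add the result to the replay buffer.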
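
A minimal sketch of one such generation step for the multi-goal case is given below. It is not the pseudocode from the cited paper: the dictionary buffer layout, the function names, the uniform mixing coefficient, and the sparse goal-reaching reward (with goals living in state space) are all assumptions introduced for illustration.

```python
import numpy as np

def sparse_goal_reward(s_next, g, tol=0.05):
    """Hypothetical reward: 0 if the next state is within `tol` of the goal, else -1."""
    return 0.0 if np.linalg.norm(s_next - g) <= tol else -1.0

def imagine_transition(buf, i, bandwidth, rng, reward_fn=sparse_goal_reward):
    """Create one synthetic transition around real transition `i`.

    buf: dict of arrays with keys "s", "a", "s_next", "g" (one row per transition).
    """
    # Gaussian kernel weights on proximity in joint state-goal space.
    feats = np.concatenate([buf["s"], buf["g"]], axis=1)
    sq = np.sum((feats - feats[i]) ** 2, axis=1)
    w = np.exp(-sq / (2.0 * bandwidth ** 2))
    w[i] = 0.0                      # never pair a transition with itself
    if w.sum() == 0.0:
        return None                 # no partner close enough under this bandwidth

    j = rng.choice(len(w), p=w / w.sum())
    lam = rng.uniform()             # assumption: uniform mixing coefficient in [0, 1)
    mix = {key: lam * buf[key][i] + (1.0 - lam) * buf[key][j]
           for key in ("s", "a", "s_next", "g")}
    # The reward is recomputed for the synthesized input, not interpolated.
    mix["r"] = reward_fn(mix["s_next"], mix["g"])
    return mix

rng = np.random.default_rng(0)
n = 100
buf = {
    "s": rng.uniform(size=(n, 2)),
    "a": rng.uniform(-1.0, 1.0, size=(n, 1)),
    "g": rng.uniform(size=(n, 2)),
}
buf["s_next"] = buf["s"] + 0.05 * buf["a"]   # toy dynamics for the example
synthetic = imagine_transition(buf, i=0, bandwidth=0.5, rng=rng)
```

Recomputing the reward avoids assigning a fractional reward to a synthetic goal that the interpolated next state does or does not actually reach.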

5. Empirical Results and Sample Efficiency Impact

IER approaches have been empirically validated across diverse RL domains, showing statistically significant improvements in sample efficiency, learning speed, and final episodic reward:

  • In discrete stochastic environments (e.g. 64-state FrozenLake), the interpolated replay algorithm improves mean total episode reward over the baseline and reaches threshold success rates more rapidly than standard replay, with statistical significance verified via Mann–Whitney $U$ tests (2002.01370).
  • For multi-goal continuous-control problems, adding imaginary transitions via convex interpolation speeds up convergence and yields a higher final success rate than Hindsight Experience Replay after fixed interaction budgets. In physical robot experiments (20,000 steps), CVI+IER reached more goals on average than plain CVI (Gerken et al., 2019).
  • NMER, applied to continuous-control MuJoCo tasks with strong off-policy learners (TD3, SAC), improves average performance over uniform replay for both learners after a fixed budget of interaction steps, with absolute gains reported on tasks such as Ant and Humanoid (Sander et al., 2022).
  • Learning curves in all cited works confirm faster reward accumulation and higher final returns when using interpolated or mixup-based synthetic data.

6. Theoretical Underpinnings and Limitations

All IER variants rely on either local linearity of the transition manifold or the smoothness of the value function and environment dynamics. The theoretical justification rests on two arguments:

  • Local Manifold Approximation: If the set of all transitions is (approximately) convex or locally linear in the vicinity of observed transitions, then convex combinations fall within or close to the true support, justifying synthetic sample usage. In NMER, the deviation of an interpolated transition from the manifold is controllable via neighbor selection (Sander et al., 2022).
  • Variance Reduction: In stochastic discrete domains, averaging over reward/outcome for each $(s, a)$ pair produces lower-variance target values and smoother Q-function convergence (2002.01370).
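
The variance-reduction argument can be made concrete with a short worked calculation. It assumes, for illustration only, that the $N$ rewards observed for a fixed $(s, a)$ are independent and identically distributed with mean $\mu$ and variance $\sigma^2$; the cited work does not necessarily state this assumption.

```latex
\mathbb{E}\!\left[r^{\mathrm{avg}}(s, a)\right]
  = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}[r_i] = \mu,
\qquad
\operatorname{Var}\!\left[r^{\mathrm{avg}}(s, a)\right]
  = \frac{1}{N^{2}} \sum_{i=1}^{N} \operatorname{Var}[r_i]
  = \frac{\sigma^{2}}{N}.
```

For a hypothetical Bernoulli reward with success probability $0.5$, $\sigma^2 = 0.25$, so a single sampled reward has standard deviation $0.5$, while the average over $N = 25$ matching transitions has standard deviation $\sqrt{0.25 / 25} = 0.1$. The averaged reward is therefore an unbiased but much less noisy component of the Q-learning target.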

A plausible implication is that effectiveness diminishes if the local manifold assumption is violated (e.g. highly nonconvex or discontinuous environments) or if interpolation bandwidth hyperparameters are miscalibrated. Excessive synthetic data may lead to off-manifold transitions and degraded generalization if not carefully managed (Gerken et al., 2019).

IER can be contrasted with other replay methods as follows:

| Method | Synthetic Transitions | Mixing Principle | Notable Characteristics |
|---|---|---|---|
| Classic Uniform Replay | No | N/A | Learns only from observed transitions |
| Prioritized Experience Replay (PER) | No | Priority by TD-error | No new transitions, just reweighting |
| Hindsight ER (HER) | No (relabel only) | Replays with alternate goals | Only changes goal label, not state/action |
| CT, S4RL Mixup | Yes | Adjacency or single-step pairing | No geometric proximity across episodes |
| Naive Mixup | Yes | Random pairs from buffer | Prone to off-manifold interpolation |
| NMER, IER | Yes | Geometric neighbor or kernel | Local linearity, dynamic, cross-episode |

NMER and related variants enforce geometric proximity, often across episodes and with dynamic neighbor sets, which balances regularization against on-manifold sample coverage. They are reported to outperform the baselines above in sample efficiency and final policy quality (Sander et al., 2022).

References

  • (Sander et al., 2022) "Neighborhood Mixup Experience Replay: Local Convex Interpolation for Improved Sample Efficiency in Continuous Control Tasks"
  • (Gerken et al., 2019) "Continuous Value Iteration (CVI) Reinforcement Learning and Imaginary Experience Replay (IER) for learning multi-goal, continuous action and state space controllers"
  • (2002.01370) "Bootstrapping a DQN Replay Memory with Synthetic Experiences"
