
Double-prioritized State Recycling (DPSR)

Updated 4 June 2026
  • Double-prioritized State Recycling is a reinforcement learning strategy that integrates prioritized sampling and prioritized replacement to maintain high-quality, up-to-date transitions.
  • It introduces state recycling to refresh stale transitions, thereby improving sample efficiency and achieving superior performance on Atari and classic control benchmarks.
  • The method incorporates importance-sampling corrections and controlled hyperparameter annealing to mitigate bias, ensuring robust learning progress in Deep Q-Networks.

Double-prioritized State Recycling (DPSR) is a reinforcement learning experience replay strategy that augments Deep Q-Networks (DQN) by integrating two axes of prioritization—sampling and replacement—as well as an explicit state-recycling mechanism. DPSR aims to ensure that a replay buffer contains transitions that are both highly informative and up-to-date with respect to the agent’s current policy, addressing limitations in prior experience replay methods such as uniform or prioritized experience replay (PER). Empirical evaluation demonstrates that DPSR significantly improves sample efficiency and ultimate performance on a broad set of Atari games and classic control tasks, outperforming both uniform and PER baselines (Bu et al., 2020).

1. Principle and Motivation

In standard DQN with uniform replay, an agent’s experience buffer $\mathcal{B}$ contains past transitions $(s_i, a_i, r_i, s'_i)$, each sampled and replayed with probability $1/|\mathcal{B}|$. Prioritized Experience Replay (PER) instead samples transitions with probability that increases with the magnitude of their TD error, concentrating updates on more “surprising” or important experiences. However, PER’s FIFO replacement policy ignores the value of old transitions whose learning potential may persist, and it overlooks the possibility that transitions initially deemed unimportant could become relevant after policy updates. DPSR addresses both issues by (a) employing double prioritization, that is, prioritized sampling together with prioritized replacement, and (b) refreshing “stale” or mis-prioritized transitions via state recycling.

Double prioritization involves:

  • Prioritized Sampling: Transitions are sampled for gradient updates according to their TD error, with the sampling exponent $\alpha(t)$ annealed over time.
  • Prioritized Replacement: When the buffer reaches capacity, transitions to be evicted are selected with probability favoring low-priority (low TD error) entries, controlled by an exponent $\gamma(t)$.
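
Both exponents are described as annealed over time, but the schedule itself is not given here. As a minimal sketch, assuming a linear schedule (the function name `linear_anneal` and all start/end values below are hypothetical placeholders, not values from Bu et al., 2020):

```python
def linear_anneal(step: int, total_steps: int, start: float, end: float) -> float:
    """Linearly interpolate from `start` to `end` over `total_steps`, then hold `end`."""
    if total_steps <= 0:
        return end
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def alpha_schedule(step: int) -> float:
    # Hypothetical sampling-exponent schedule; the endpoints are assumptions.
    return linear_anneal(step, total_steps=100_000, start=0.6, end=0.4)


def gamma_schedule(step: int) -> float:
    # Hypothetical replacement-exponent schedule; the endpoints are assumptions.
    return linear_anneal(step, total_steps=100_000, start=0.0, end=1.0)


if __name__ == "__main__":
    for t in (0, 50_000, 100_000, 200_000):
        print(t, alpha_schedule(t), gamma_schedule(t))
```

Other schedule shapes (exponential, piecewise constant) would slot into the same interface; the only property the surrounding text relies on is that the exponents are functions of the training step.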

State recycling periodically rejuvenates a subset of low-priority transitions by simulating from their stored states with the current policy, thereby repopulating the buffer with more informative transitions in regions that may have been stale or under-represented.

2. Mathematical Model

The replay buffer is formalized as $\mathcal{B} = \{\mathcal{T}_i = (s_i, a_i, r_i, s'_i, p_i, t_i)\}_{i=1}^N$, storing each transition's state, action, reward, next state, current priority, and insertion timestep.

2.1. Priority Assignment:

Transition $\mathcal{T}_i$ receives proportional priority $p_i = |\delta_i| + \varepsilon$, where $\delta_i = r_i + Q_{\text{target}}(s'_i,\;\arg\max_a Q(s'_i,a)) - Q(s_i,a_i)$ is the TD error and $\varepsilon>0$ ensures nonzero probability for all entries.
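
The priority rule can be sketched in a few lines. This is an illustration under assumptions: action values are passed in as plain lists, and the `discount` factor and terminal-state handling are standard DQN details that the formula in this section does not show. The helper names `td_error` and `priority` are hypothetical.

```python
def td_error(reward, q_sa, q_online_next, q_target_next, discount=0.99, done=False):
    """Double-DQN style TD error for a single transition.

    q_sa:           online estimate Q(s, a) for the stored action
    q_online_next:  online action values at s' (used to pick the argmax action)
    q_target_next:  target-network action values at s' (used to evaluate it)
    discount, done: assumptions added for this sketch
    """
    if done:
        return reward - q_sa
    best = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + discount * q_target_next[best] - q_sa


def priority(delta, eps=1e-2):
    """Proportional priority p = |delta| + eps, strictly positive for eps > 0."""
    return abs(delta) + eps


if __name__ == "__main__":
    delta = td_error(1.0, 0.5, [0.2, 0.9], [0.3, 0.4], discount=1.0)
    print(delta, priority(delta))
```

Note that the action is selected with the online values and evaluated with the target values, matching the $\arg\max$ structure of $\delta_i$.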

2.2. Sampling Probability:

At step $t$, the sampling probability for transition $\mathcal{T}_i$ is

$$P_i(t) = \frac{p_i^{\alpha(t)}}{\sum_{k=1}^{N} p_k^{\alpha(t)}}$$

with $\alpha(t)$ the sampling exponent, annealed over the course of training.
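
As a minimal sketch of prioritized sampling, assuming the standard proportional form used by PER, $P_i = p_i^{\alpha} / \sum_k p_k^{\alpha}$ (the function names are hypothetical, and drawing with replacement is a simplification of this sketch):

```python
import random


def sampling_probs(priorities, alpha):
    """Normalize priorities raised to the power alpha into a probability vector."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_indices(priorities, alpha, batch_size, rng=random):
    """Draw a minibatch of buffer indices (with replacement) by priority."""
    probs = sampling_probs(priorities, alpha)
    return rng.choices(range(len(priorities)), weights=probs, k=batch_size)


if __name__ == "__main__":
    prios = [0.1, 0.5, 2.0, 0.05]
    print(sampling_probs(prios, alpha=0.6))
    print(sample_indices(prios, alpha=0.6, batch_size=8, rng=random.Random(0)))
```

Setting `alpha=0` recovers uniform replay, while `alpha=1` samples in direct proportion to priority. A production buffer would typically use a sum-tree rather than renormalizing the whole list on each draw.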

2.3. Replacement Probability:

When $\mathcal{B}$ is full, a set of candidate transitions is drawn for eviction, with transition $\mathcal{T}_i$ selected with probability

$$P^{\text{rep}}_i(t) = \frac{p_i^{-\gamma(t)}}{\sum_{k=1}^{N} p_k^{-\gamma(t)}}$$

where the exponent $\gamma(t)$ biases selection towards low-error (low-priority) entries; eviction among candidates proceeds via oldest-first.
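
Prioritized replacement can be sketched as follows. The inverse-power weighting $p_i^{-\gamma}$ is an assumption chosen to match the description (low-priority entries favored, strength controlled by an exponent); the function names and the `num_candidates` parameter are likewise hypothetical.

```python
import random


def replacement_probs(priorities, gamma_rep):
    """Eviction distribution that favors low priorities (assumed inverse-power form)."""
    inverse = [p ** (-gamma_rep) for p in priorities]  # priorities are > 0 by construction
    total = sum(inverse)
    return [w / total for w in inverse]


def choose_eviction(priorities, timestamps, gamma_rep, num_candidates, rng=random):
    """Draw eviction candidates by low priority, then evict the oldest candidate."""
    probs = replacement_probs(priorities, gamma_rep)
    candidates = rng.choices(range(len(priorities)), weights=probs, k=num_candidates)
    return min(candidates, key=lambda i: timestamps[i])


if __name__ == "__main__":
    prios = [0.05, 1.2, 0.3, 2.5]
    stamps = [10, 3, 7, 1]
    print(replacement_probs(prios, gamma_rep=1.0))
    print(choose_eviction(prios, stamps, 1.0, num_candidates=2, rng=random.Random(0)))
```

With `gamma_rep=0` the candidates are drawn uniformly, so the oldest-first rule makes eviction approach FIFO as the candidate count grows; larger exponents shield high-error transitions from eviction regardless of age.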

2.4. State Recycling:

At regular intervals, rather than standard replacement, a set of candidates is drawn via the replacement probability of Section 2.3. For each candidate index $j$:

  1. Retrieve the stored state $s_j$ and discard the stale $(a_j, r_j, s'_j)$.
  2. Compute the current greedy action $\arg\max_a Q(s_j, a)$; if it coincides with the stored action $a_j$, force a random action instead. Denote the chosen action $a'_j$.
  3. Advance $(s_j, a'_j)$ in the environment to observe the new reward and next state.
  4. Construct the recycled transition from $s_j$, $a'_j$, and the observed reward and next state, and assign it an updated priority.

Of the recycled transitions, the lowest-ranked entry replaces its predecessor.
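
The recycling of a single stored transition can be sketched as below. Everything here is illustrative: `ChainEnv` is a toy simulator with a hypothetical `restore`/`step` interface standing in for an environment that supports state snapshots, `q_values` is any callable returning an action value, and the rule of forcing a different random action when the greedy action equals the stored one is an assumption of this sketch. Priority assignment for the recycled transition is left out.

```python
import random


class ChainEnv:
    """Toy deterministic chain with states 0..n-1; action 1 moves right, 0 moves left.

    Entering the last state yields reward 1. Hypothetical stand-in for a simulator
    whose state can be snapshotted and restored.
    """

    def __init__(self, n=5):
        self.n = n
        self.state = 0

    def restore(self, state):
        self.state = state

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n - 1)
        reward = 1.0 if self.state == self.n - 1 else 0.0
        return self.state, reward


def recycle(transition, q_values, env, num_actions, rng=random):
    """Regenerate a transition from its stored state under the current greedy policy.

    Assumes num_actions >= 2 so that an alternative action always exists.
    """
    state, old_action = transition[0], transition[1]
    action = max(range(num_actions), key=lambda a: q_values(state, a))
    if action == old_action:
        # Assumption: avoid regenerating the very same transition.
        action = rng.choice([a for a in range(num_actions) if a != old_action])
    env.restore(state)
    next_state, reward = env.step(action)
    return (state, action, reward, next_state)


if __name__ == "__main__":
    env = ChainEnv(5)
    prefer_right = lambda s, a: float(a)  # current policy prefers action 1
    print(recycle((3, 0, 0.0, 2), prefer_right, env, num_actions=2))
```

The key requirement this exposes is the `restore` call: without a simulator that can be reset to an arbitrary stored state, recycling cannot be performed.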

2.5. New Transitions:

A new transition is assigned the maximum priority currently in the buffer, $p_{\text{new}} = \max_k p_k$, for immediate eligibility in training.
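
A small sketch of this insertion rule, assuming (as in PER) that a new transition takes the largest priority currently stored; the `default` used for an empty buffer and the helper name `initial_priority` are hypothetical:

```python
def initial_priority(priorities, default=1.0):
    """Priority for a newly inserted transition: current maximum, or `default` if empty."""
    return max(priorities, default=default)


priorities = [0.3, 1.7, 0.05]
priorities.append(initial_priority(priorities))  # the new entry ties for the top priority
print(priorities)
```

Because the new entry ties for the highest priority, it is likely to be sampled soon after insertion, at which point its priority is replaced by one based on its actual TD error.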

2.6. Importance-Sampling Correction:

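To correct for nonuniform sampling, an importance weight is applied: $w_i = (N \cdot P_i)^{-\beta}$, where $P_i$ is the sampling probability of transition $\mathcal{T}_i$, with normalization $w_i \leftarrow w_i / \max_j w_j$ and $\beta$ annealed to 1 during training.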
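
The correction can be sketched as follows, assuming the standard PER form $w_i = (N \cdot P_i)^{-\beta}$ with max-normalization. Normalizing over the sampled minibatch rather than the whole buffer is a choice made for this sketch, and the function name is hypothetical.

```python
def importance_weights(sample_probs, buffer_size, beta):
    """Importance-sampling weights for sampled transitions, scaled so the largest is 1."""
    raw = [(buffer_size * p) ** (-beta) for p in sample_probs]
    max_weight = max(raw)
    return [w / max_weight for w in raw]


if __name__ == "__main__":
    # Two transitions from a buffer of size 2, sampled with probabilities 0.25 and 0.75.
    print(importance_weights([0.25, 0.75], buffer_size=2, beta=1.0))
    # beta = 0 switches the correction off: every weight becomes 1.
    print(importance_weights([0.25, 0.75], buffer_size=2, beta=0.0))
```

Each weight multiplies the corresponding transition's loss term, so frequently sampled (high-priority) transitions contribute smaller updates; at $\beta = 1$ the sampling bias is fully compensated.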

3. Algorithmic Realization

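The DPSR variant of DQN combines the components of Section 2 in a single training loop:

  1. Act in the environment and store each new transition with its initial priority (Section 2.5).
  2. When the buffer is full, make room by prioritized replacement (Section 2.3) or, at recycling intervals, by state recycling (Section 2.4).
  3. Sample a minibatch by prioritized sampling (Section 2.2) and apply the importance-sampling weights (Section 2.6) in the gradient update.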

Hyperparameters are tuned per task; the reported settings cover the minibatch size, learning rate, replay capacity, and the remaining prioritization and recycling parameters (Bu et al., 2020).

4. Theoretical Properties

No convergence proofs are provided for DPSR. The introduction of prioritized sampling (exponent $\alpha(t)$) and prioritized replacement (exponent $\gamma(t)$) alters the empirical data distribution, introducing bias that is mitigated by annealing the importance-sampling correction. State recycling is intended to reduce estimation variance and bias by refreshing low-priority entries under newer policies, preventing neglect of state–action regions whose value may not be accurately reflected by outdated TD errors. Empirically, these mechanisms collectively yield 1.4–2× improvements in sample efficiency on Atari benchmarks (Bu et al., 2020).

5. Empirical Evaluation

DPSR is evaluated on CartPole-v0 and 24 Atari 2600 games (NoFrameskip-v4). In these benchmarks, DPSR achieves:

  • Outperformance versus uniform replay (“Original”) in 23/24 games (“gold”), and 1/24 (“silver”).
  • Mean score improvement vs. Original: +161.1%; vs. PER: +137.1%.
  • Median improvement vs. Original: +87.0%; vs. PER: +92.1%.

Representative scores:

| Game | Original | PER | DPSR |
|---|---|---|---|
| Breakout | 87.0 | 136.4 | 281.4 |
| Freeway | 30.2 | 29.5 | 32.2 |
| VideoPinball | 7313.0 | 12025.1 | 51993.2 |
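
To put the representative scores in relative terms, the short script below computes the plain percentage gain of DPSR over each baseline for the three games listed. This is simple arithmetic on the scores shown; the aggregate mean and median figures quoted earlier cover all 24 games and may rest on a different normalization, so these per-game numbers are not expected to reproduce them.

```python
scores = {
    "Breakout": {"Original": 87.0, "PER": 136.4, "DPSR": 281.4},
    "Freeway": {"Original": 30.2, "PER": 29.5, "DPSR": 32.2},
    "VideoPinball": {"Original": 7313.0, "PER": 12025.1, "DPSR": 51993.2},
}


def relative_gain(new, base):
    """Percentage improvement of `new` over `base`."""
    return 100.0 * (new - base) / base


if __name__ == "__main__":
    for game, s in scores.items():
        vs_original = relative_gain(s["DPSR"], s["Original"])
        vs_per = relative_gain(s["DPSR"], s["PER"])
        print(f"{game}: {vs_original:+.1f}% vs Original, {vs_per:+.1f}% vs PER")
```

The spread is wide: the gain on Freeway is in the single digits, while the gain on VideoPinball is several hundred percent, which is worth keeping in mind when reading mean improvements across games.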

These improvements are robust across parameter choices and replay buffer sizes.

6. Practical Ramifications and Limitations

DPSR incurs additional computational and memory overhead. State recycling requires extra simulator steps for each batch of recycling candidates at every recycling interval, and the buffer must store environment state snapshots that the simulator can restore. With the candidate counts and recycling intervals used in practice, the overhead remains modest.

The necessity of storing full state snapshots may be ameliorated via state compression (e.g., latent-feature replay), but this is not explored in Bu et al. (2020). The bias introduced by double prioritization is manageable if the importance-sampling weights are annealed strongly; otherwise, disabling recycling can lead to inferior performance compared to PER due to unchecked bias. In highly stochastic or near-deterministic domains, recycling may revive noisy transitions, requiring careful tuning of the recycling and replacement hyperparameters.

DPSR systematically enriches the replay buffer by favoring high-error transitions during sampling, evicting low-error ones in replacement, and rejuvenating stale entries through recycling. This orchestrated approach delivers substantial gains over existing methods on established benchmarks (Bu et al., 2020).
