Prioritized Level Replay (PLR)
- Prioritized Level Replay is a framework that selects entire environment levels based on their estimated future learning potential using TD-error metrics.
- It adapts the training curriculum by emphasizing challenging levels, thereby improving sample efficiency and reducing overfitting in procedurally generated settings.
- Empirical results on benchmarks like Procgen show PLR enhances test performance and robustness, and it is compatible with and complements many state-of-the-art reinforcement learning methods.
Prioritized Level Replay (PLR) is a curriculum-inspired experience replay framework designed to improve sample efficiency and generalization in deep reinforcement learning, especially in procedurally generated environments where each “level” or instance can vary systematically. Unlike standard uniform level sampling, PLR adaptively prioritizes the replay of environment levels based on a principled estimate of their future learning potential—typically derived from temporal-difference (TD) errors or closely related value-based metrics. By integrating priority-driven level selection into the training pipeline, PLR induces an emergent learning curriculum, focusing computation on more challenging or informative environment instances and guarding against overfitting to less critical ones.
1. Motivation and Conceptual Foundations
The classical approach to experience replay in reinforcement learning involves sampling transitions or trajectories uniformly, irrespective of their learning utility. PLR departs from this paradigm by treating entire environment instances (“levels”) as the granularity of replay and assigning to each a prioritization score reflecting the agent's potential for future policy improvement when revisiting that level. In settings with procedurally generated content, such as the Procgen Benchmark, each level is an independently generated environment instance with its own configuration factors. Uniform level sampling ignores the non-stationary and heterogeneous “learning signal” distributed across levels, resulting in suboptimal sample efficiency and generalization.
PLR operationalizes a curriculum-learning intuition: replaying more challenging or less-mastered levels accelerates training and leads to superior test-time generalization. This structure is distinct from Prioritized Experience Replay (PER), which prioritizes individual transitions, and from trajectory-based methods that focus on sequences, since PLR centers on per-level difficulty and policy progress (Jiang et al., 2020). The approach is general and modular, requiring minimal assumptions about the environment or agent architecture.
2. Prioritization Metrics, Scoring Functions, and Sampling
The core of PLR is the assignment of a scalar priority score S(l) to each level l. The theoretical and algorithmic literature uses several related classes of scoring functions:
- TD-Error-Based Metrics: PLR typically uses the magnitude or average of TD errors accumulated during the most recent episode(s) on level $l$ as a proxy for “learning potential”. For instance, using the Generalized Advantage Estimate (GAE), the level score may be computed as
  $$S(l) = \frac{1}{T}\sum_{t=0}^{T-1}\left|\sum_{k=t}^{T-1}(\gamma\lambda)^{k-t}\,\delta_k\right|,$$
  where $\delta_k = r_k + \gamma V(s_{k+1}) - V(s_k)$ is the TD error at step $k$ in the episode (with $T$ the episode length) (Jiang et al., 2020).
- Average L1 Value Loss: Alternatively, the average absolute value loss (L1 advantage) can be used; for GAE-based agents this is functionally equivalent to the score above.
- Staleness Penalty/Recency Correction: To avoid over-prioritizing levels whose scores were computed under stale policies (i.e., whose priority was last updated many training iterations ago), PLR mixes the score-based priority distribution $P_S$ with a staleness-correcting component $P_C$:
  $$P_{\text{replay}}(l) = (1-\rho)\,P_S(l) + \rho\,P_C(l), \qquad P_C(l) = \frac{c - C_l}{\sum_{l' \in \Lambda_{\text{seen}}} (c - C_{l'})},$$
  with $\rho$ a small staleness coefficient, $C_l$ the per-level timestamp of the last score update, $c$ the current training iteration, and $\Lambda_{\text{seen}}$ the set of seen levels (Jiang et al., 2020).
Practical implementations compute $P_S$ either by rank-based prioritization, $P_S(l) \propto \mathrm{rank}(S(l))^{-1/\beta}$ with a temperature parameter $\beta$, or by direct proportionality to $S(l)$. Proper calibration of $\beta$, $\rho$, and the replay buffer size balances the recency of scoring information against curriculum diversity.
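To make the scoring and sampling rules above concrete, here is a minimal NumPy sketch. Function names, default hyperparameters, and edge-case handling are illustrative assumptions, not the reference implementation:

```python
import numpy as np

def gae_score(rewards, values, gamma=0.999, lam=0.95):
    """Level score S(l): average magnitude of the GAE over one episode.
    `values` has length T+1 (it includes the bootstrap value of the final state)."""
    T = len(rewards)
    deltas = np.array([rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)])
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running   # sum_{k>=t} (gamma*lam)^(k-t) * delta_k
        advantages[t] = running
    return float(np.abs(advantages).mean())

def replay_distribution(scores, timestamps, current_step, beta=0.1, rho=0.1):
    """Replay probabilities: mix rank-based score priorities P_S with staleness weights P_C."""
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    ranks = np.empty(n)
    ranks[np.argsort(-scores)] = np.arange(1, n + 1)       # rank 1 = highest score
    p_s = ranks ** (-1.0 / beta)                           # P_S(l) ~ rank(S(l))^(-1/beta)
    p_s /= p_s.sum()
    staleness = current_step - np.asarray(timestamps, dtype=float)
    p_c = staleness / staleness.sum() if staleness.sum() > 0 else np.full(n, 1.0 / n)
    return (1.0 - rho) * p_s + rho * p_c                   # P_replay = (1 - rho) P_S + rho P_C
```

With a small $\beta$ the rank-based term concentrates probability on the highest-scoring levels; the staleness term ensures that levels scored long ago are periodically revisited so their priorities do not go stale.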
3. Curriculum Emergence and Training Process
At the start of every episode or rollout, PLR selects a level for training by a two-stage strategy:
- If “unseen” levels remain (from a finite pool), sample uniformly among them.
- After all levels have been seen, sample levels for replay according to the current priority distribution as defined above.
Central to PLR is the use of TD-error magnitude as a proxy for a level's “future learning potential”. As the agent improves, the distribution of priority mass shifts: easier levels (lower TD error) gradually lose replay probability, while harder, less-mastered levels are encountered more frequently. This produces an emergent curriculum: the allocation of training effort dynamically shifts from simple to more complex levels without manual intervention or external reward shaping.
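A schematic version of this two-stage training loop, building on the scoring and sampling sketch in Section 2. Here `collect_episode`, the `agent` object, and the trajectory attributes (`traj.rewards`, `traj.values`) are hypothetical placeholders, and levels are treated as integer seeds:

```python
import numpy as np

def select_level(unseen, seen, scores, timestamps, step, rng):
    """Two-stage PLR sampling: uniform over unseen levels first, then prioritized replay."""
    if unseen:
        return rng.choice(sorted(unseen)), True              # stage 1: uniform over unvisited seeds
    probs = replay_distribution(scores, timestamps, step)    # stage 2: P_replay from the Section-2 sketch
    return rng.choice(seen, p=probs), False

def train_plr(agent, levels, num_episodes, rng=np.random.default_rng(0)):
    unseen, seen, scores, timestamps = set(levels), [], [], []
    for step in range(num_episodes):
        level, is_new = select_level(unseen, seen, scores, timestamps, step, rng)
        traj = collect_episode(agent, level)                 # placeholder rollout routine
        agent.update(traj)                                   # placeholder policy-gradient update (e.g. PPO)
        score = gae_score(traj.rewards, traj.values)         # reuse the scoring sketch from Section 2
        if is_new:                                           # move level from unseen to seen
            unseen.remove(level); seen.append(level)
            scores.append(score); timestamps.append(step)
        else:                                                # refresh score and timestamp for replayed level
            i = seen.index(level)
            scores[i], timestamps[i] = score, step
```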
Empirical validation in the MiniGrid and Procgen environments demonstrates this effect: over the course of training, replay probability mass transitions from bins associated with easy levels to those of hard levels (Jiang et al., 2020).
4. Empirical Performance and Generalization
PLR yields systematically improved sample efficiency and generalization performance on procedurally generated environments:
- Procgen Benchmark: Across 16 procedurally generated games, PLR yields statistically significant improvements in normalized test return compared to uniform level sampling. For many games, test returns increase from a 100% baseline to the 128–176% range or higher (Jiang et al., 2020).
- Generalization Gap: PLR consistently reduces the gap between training and test performance, a key measure of overfitting in non-i.i.d. RL settings.
- Combining with Other Methods: PLR is compatible with and enhances the performance of other recent state-of-the-art methods, such as UCB-DrAC. Combinations further elevate normalized test returns and provide robustness across a wider suite of games.
- Transfer and Robustness: On out-of-distribution tasks and zero-shot transfer settings, PLR-based curricula outperform both random and alternative active-learning strategies.
These gains are achieved without architectural changes to the policy/value networks, making PLR a widely applicable module.
5. Theoretical Foundations and Economic Interpretation
Extensions and accompanying theoretical analyses provide further justification for prioritizing levels by TD-error derived proxies:
- Expected Value of Backup (EVB) and related value-based metrics are demonstrably upper-bounded by the absolute TD error $|\delta|$ for Q-learning; for soft Q-learning, the bound is additionally scaled by the on-policyness $\pi(a \mid s)$ of the action (Li et al., 2021). This economic framing links the prioritized replay of levels to the maximum potential value improvement they admit.
- Upper and Lower Bounds: For soft Q-learning with entropy regularization, the value of an experience is bounded by its TD error weighted by the policy probability of the action, reinforcing the choice of levels with high TD error and high on-policyness for focused replay.
These insights suggest that PLR’s score function, when chosen as (possibly weighted) average or maximal TD error over a level, is a theoretically sound proxy for learning value, and that curriculum emergence in PLR can be interpreted as the algorithm greedily targeting levels with the greatest potential for value improvement.
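Schematically, and following the description above rather than the exact statements in Li et al. (2021), these bounds can be summarized as follows (notation is assumed for illustration; constants and the entropy temperature are omitted):

```latex
% Schematic form of the bounds discussed above; notation is assumed, not quoted from the paper.
% Q-learning: the expected value of backing up (s_t, a_t) is bounded by the TD-error magnitude.
\mathrm{EVB}(s_t, a_t) \;\le\; \lvert \delta_t \rvert
% Soft Q-learning: the bound is additionally scaled by the on-policyness of the action.
\mathrm{EVB}(s_t, a_t) \;\le\; \pi(a_t \mid s_t)\,\lvert \delta_t \rvert
```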
6. Relation to Other Prioritized Replay Techniques
PLR occupies a methodological position between several experience replay approaches:
- Compared to Prioritized Experience Replay (PER): PER prioritizes individual transitions or short sequences; PLR works at the level or episode granularity, answering a different form of the replay prioritization problem.
- Compared to Prioritized Trajectory Replay (PTR) and Sequence Replay: PLR shares the notion of non-uniform curriculum with trajectory-based prioritization schemes. Both frameworks exploit the benefit of structured, sequential data in propagating sparse credit efficiently, but PLR is specifically designed for level-based or curriculum-rich environments.
- Compared to Curriculum Generation and UED: In environments with rich procedural variation, PLR can be seen as an instantiation of “Dual Curriculum Design” (Jiang et al., 2021), where random level generators produce variation and the replay buffer “curates” by prioritizing hard or poorly performing levels for replay, explicitly connecting PLR to Unsupervised Environment Design (UED).
- Variants: PLR⊥ (PLR-perp) restricts policy updates to replayed (prioritized) levels only, omitting updates from newly drawn levels. Theoretical results show this variant converges to Nash-equilibrium robust policies in the dual curriculum game, improving zero-shot transfer and worst-case regret performance over standard PLR and domain randomization (Jiang et al., 2021).
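A minimal way to express this PLR⊥ gating, reusing the placeholder helpers (`select_level`, `collect_episode`, `gae_score`) and imports from the earlier sketches; this simplifies the published algorithm to just the change in the update rule described above:

```python
def train_plr_perp(agent, levels, num_episodes, rng=np.random.default_rng(0)):
    """Variant of the Section-3 sketch: new levels are still scored and tracked,
    but only rollouts from replayed (prioritized) levels update the policy."""
    unseen, seen, scores, timestamps = set(levels), [], [], []
    for step in range(num_episodes):
        level, is_new = select_level(unseen, seen, scores, timestamps, step, rng)
        traj = collect_episode(agent, level)          # placeholder rollout
        if not is_new:
            agent.update(traj)                        # gradient update only on replayed levels
        score = gae_score(traj.rewards, traj.values)  # scoring proceeds as in standard PLR
        if is_new:
            unseen.remove(level); seen.append(level)
            scores.append(score); timestamps.append(step)
        else:
            i = seen.index(level)
            scores[i], timestamps[i] = score, step
```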
7. Limitations, Extensions, and Future Directions
PLR as originally formulated assumes access to per-level trajectories and TD (or advantage) signals. Potential limitations and open questions include:
- Staleness and Drift: Because level priorities become outdated as the policy evolves, the staleness correction via $P_C$ is critical; choosing $\rho$ and the priority-update frequency is an important practical detail.
- Continuous and Infinite Level Spaces: PLR as described in (Jiang et al., 2020) is readily applicable to finite or enumerated level spaces; further work is needed to extend it to continuous or infinite procedural variants, or to domains without clear “levels”.
- Policy/Value Overfitting and Diversity: Solely maximizing the replay of high-TD error levels may lead to overfitting or curriculum collapse; carefully balanced curriculum mixtures and occasional uniform replay remain an open area of research.
- Integration with Unsupervised Environment Design: As demonstrated in “Replay-Guided Adversarial Environment Design” (Jiang et al., 2021), future methods may combine PLR-style curation with active adversarial generation to provide robust, adaptive curricula with worst-case guarantees.
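One pattern from the replay-guided UED line of work (Jiang et al., 2021) for coping with unbounded level spaces is a fixed-capacity level buffer that evicts the lowest-scoring entry when a more promising level appears. A hedged sketch of that bookkeeping, with an illustrative capacity value:

```python
def maybe_add_level(buffer, scores, timestamps, level, score, step, capacity=4000):
    """Fixed-capacity level buffer: admit a new level if there is room, or if it
    outscores the current minimum-score entry (which is then evicted)."""
    if len(buffer) < capacity:
        buffer.append(level); scores.append(score); timestamps.append(step)
        return True
    worst = min(range(len(scores)), key=scores.__getitem__)
    if score > scores[worst]:
        buffer[worst], scores[worst], timestamps[worst] = level, score, step
        return True
    return False                                      # low-potential level is discarded
```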
PLR’s modularity, compatibility with other policy optimization methods, and principled theoretical backing recommend it for continued use and extension in environments that require robust generalization and efficient credit propagation across diverse instance sets or curricula.