Procgen: RL Evaluation with Procedurally Generated Environments

Updated 22 August 2025
  • Procgen is a benchmark suite of 16 procedurally generated environments designed to assess reinforcement learning algorithms on sample efficiency and generalization.
  • The evaluation protocols measure efficiency by training on full level distributions and isolate generalization by testing on unseen, procedurally varied levels.
  • Empirical findings show that increasing training diversity and scalable model architectures, including separate policy/value networks and auxiliary losses, enhance generalization.

Procgen is a benchmark suite of 16 procedurally generated environments designed to evaluate reinforcement learning (RL) algorithms on the axes of sample efficiency and generalization. Each environment presents game-like challenges—such as platforming, maze navigation, and puzzle solving—with a key emphasis on high inter- and intra-environment diversity enforced through procedural content generation. This diversity compels agents to learn policies that transcend memorization of specific trajectories and instead focus on task-relevant invariants and strategies that transfer across unseen level configurations.

1. Procedural Content Generation in Procgen

Each Procgen environment leverages procedural content generation (PCG) to synthesize distinct level instances per episode. PCG encompasses randomized algorithms governing level topology, asset assignment, enemy placement, and domain-specific attributes. Mechanisms such as cellular automata and Kruskal’s algorithm are used to generate complex maze or platform layouts. By generating a nearly unbounded set of configurations, PCG ensures that agents trained on a subset of levels face a significant generalization burden, as opposed to the finite or deterministic scenarios in benchmarks like the Arcade Learning Environment.
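
To make the generation process concrete, the sketch below implements a seeded, Kruskal-style maze generator in Python. It is an illustrative standalone approximation: Procgen's actual level generators live inside the environments themselves, so the `generate_maze` function and its parameters are assumptions for exposition, not benchmark code.

```python
# Minimal sketch of seeded procedural maze generation in the spirit of
# Procgen's level synthesis (illustrative only; not the benchmark's generator).
import random

def generate_maze(width, height, seed):
    """Carve a perfect maze on a width x height cell grid with randomized
    Kruskal's algorithm; the seed fully determines the resulting layout."""
    rng = random.Random(seed)

    # Union-find: each cell starts in its own connected component.
    parent = {(x, y): (x, y) for x in range(width) for y in range(height)}

    def find(cell):
        while parent[cell] != cell:
            parent[cell] = parent[parent[cell]]  # path compression
            cell = parent[cell]
        return cell

    # Candidate walls between horizontally and vertically adjacent cells.
    walls = [((x, y), (x + 1, y)) for x in range(width - 1) for y in range(height)]
    walls += [((x, y), (x, y + 1)) for x in range(width) for y in range(height - 1)]
    rng.shuffle(walls)

    passages = set()
    for a, b in walls:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:        # removing this wall joins two components
            parent[root_a] = root_b
            passages.add((a, b))
    return passages                  # open connections defining the level layout

# The same seed always reproduces the same level; new seeds yield new levels.
level_42 = generate_maze(12, 12, seed=42)
```

Because every episode can draw a fresh seed, the space of distinct layouts is effectively unbounded, which is exactly what imposes the generalization burden described above.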

This variability in both visuals and dynamics prevents overfitting and allows researchers to partition level distributions, enabling the evaluation of not only sample efficiency (when trained on the full distribution) but also generalization (by training on a restricted subset and testing on held-out procedurally generated levels).
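
A minimal sketch of such a partition, using the publicly released procgen package's Gym registration (`pip install procgen`): the keyword arguments (`num_levels`, `start_level`, `distribution_mode`) follow the released interface, though their exact semantics should be verified against the installed version.

```python
# Hedged sketch of the train/test level split for the generalization protocol.
import gym  # the procgen package registers its environments with Gym

# Training: the agent only ever samples from a finite set of 500 levels.
train_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=500,              # restricted training set (hard-mode protocol)
    start_level=0,
    distribution_mode="hard",
)

# Evaluation: num_levels=0 requests the full (effectively unbounded) level
# distribution, so test episodes are overwhelmingly drawn from unseen levels.
test_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=0,
    start_level=0,
    distribution_mode="hard",
)
```

For the sample-efficiency protocol, the training environment itself is created with `num_levels=0`, so the agent never revisits a fixed training set.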

2. Evaluation Protocols and Normalized Metrics

Procgen introduces standardized protocols to quantify two primary axes: sample efficiency and generalization.

  • Sample Efficiency: Agents are trained across the full level distribution for a fixed number of timesteps (e.g., 200 million) without access to a fixed training set, typically using PPO or similar high-throughput RL methods.
  • Generalization: Agents are constrained during training to a finite set of levels (e.g., 500 on hard difficulty, 200 on easy) and are evaluated on a disjoint set sampled from the full distribution. This distinction isolates the agent’s ability to transfer.
  • Normalized Return: To enable cross-environment benchmarking, raw expected returns are normalized:

$$R_{\text{norm}} = \frac{R - R_{\min}}{R_{\max} - R_{\min}}$$

where $R$ is the expected return, and $R_{\min}$ and $R_{\max}$ are empirical, environment-specific bounds.

This design directly exposes the deleterious effects of overfitting and enables reproducible, scalable comparisons across methods and architectures.
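
A minimal worked example of this normalization, with placeholder bounds standing in for the official per-environment constants:

```python
# Normalized return: maps a raw expected return onto [0, 1] using
# environment-specific bounds. The bounds below are placeholders, not
# the benchmark's published constants.
def normalized_return(mean_return, r_min, r_max):
    return (mean_return - r_min) / (r_max - r_min)

# Hypothetical environment with R_min = 5 and R_max = 10:
score = normalized_return(mean_return=7.5, r_min=5.0, r_max=10.0)  # -> 0.5
```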

3. Empirical Findings on Diversity, Overfitting, and Model Scaling

Extensive experiments demonstrate that diversity in the training set is essential for successful generalization:

  • Overfitting: Training on small sets (100–500 levels) leads to severe overfitting—agents attain high training returns yet generalize poorly to unseen levels.
  • Closing the Gap: Increasing the number of training levels (up to 10,000) significantly narrows the generalization gap, indicating that procedural diversity, rather than just environment scale, is key.
  • Deterministic Training Failure: When trained solely on a fixed set of deterministic levels (as in prior benchmarks), agents perform poorly on randomized test instances, highlighting the necessity of PCG for true generalization assessment.

Moreover, scaling architectures (notably, increasing the width of IMPALA-style convolutional backbones by factors of $k = 2$ or $k = 4$, with parameter count scaling as $k^2$) results in significant gains in both sample efficiency and generalization. Scaling the learning rate by $1/\sqrt{k}$, in line with Glorot initialization, ensures stable optimization across capacities. Smaller models often fail to make meaningful progress in several environments, underscoring the importance of sufficient representational power in high-diversity domains.
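
The following PyTorch sketch illustrates this width-scaling recipe. The residual-block layout follows the IMPALA convolutional backbone and the base channel widths (16, 32, 32) are the commonly used defaults, but the base learning rate and optimizer here are illustrative assumptions rather than the exact published configuration.

```python
# Hedged PyTorch sketch of width scaling for an IMPALA-style backbone.
import math
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv0 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = self.conv0(torch.relu(x))
        out = self.conv1(torch.relu(out))
        return x + out

class ImpalaBackbone(nn.Module):
    def __init__(self, k=1, in_channels=3):
        super().__init__()
        layers, c_in = [], in_channels
        for c_out in (16 * k, 32 * k, 32 * k):   # widths multiplied by k
            layers += [
                nn.Conv2d(c_in, c_out, 3, padding=1),
                nn.MaxPool2d(3, stride=2, padding=1),
                ResidualBlock(c_out),
                ResidualBlock(c_out),
            ]
            c_in = c_out
        self.stack = nn.Sequential(*layers)

    def forward(self, x):
        return torch.relu(torch.flatten(self.stack(x), start_dim=1))

# Parameter count grows roughly as k^2; the learning rate shrinks as 1/sqrt(k).
k = 4
model = ImpalaBackbone(k=k)
base_lr = 5e-4                                    # illustrative base value
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr / math.sqrt(k))
```

Doubling `k` roughly quadruples the parameter count while the learning rate is reduced by a factor of $\sqrt{2}$, per the scaling rule above.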

4. Architectural and Algorithmic Considerations

Procgen's unified interface standardizes both the observation space (RGB arrays of dimension $64 \times 64 \times 3$) and the action space (15 discrete actions), enabling cross-environment hyperparameter transfer and systematic study of architectural choices. Key findings include:

  • Separate vs. Shared Representations: Sharing policy and value representations can induce overfitting to level-specific cues (e.g., background indicating episode length). Decoupling these pathways—via distinct networks or carefully engineered auxiliary losses—improves generalization (Raileanu et al., 2021).
  • Auxiliary Losses for Representation Invariance: Inclusion of adversarial or mutual information-based losses encourages latent representation invariance to features unrelated to the underlying task, further enhancing out-of-distribution performance.
  • Sample Reuse and Training Regimes: Strategies like Phasic Policy Gradient (PPG) decouple value and policy updates across separate phases, enabling aggressive value optimization without policy overfitting (Cobbe et al., 2020).

These methodological insights guide both the design of future RL algorithms and the selection of architectural priors for environments with extensive procedural variation.
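
As an illustration of the policy/value decoupling discussed above, the PyTorch sketch below gives each head its own encoder so that value-function gradients cannot shape the policy's representation; the encoder architecture and hidden size are simplifying assumptions, not the configuration used by Raileanu et al. (2021) or Cobbe et al. (2020).

```python
# Illustrative sketch of decoupled policy and value pathways.
import torch
import torch.nn as nn

def make_encoder(in_channels=3, hidden=256):
    # Small stand-in encoder; in practice this could be an IMPALA backbone.
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
        nn.Flatten(),
        nn.LazyLinear(hidden), nn.ReLU(),
    )

class DecoupledActorCritic(nn.Module):
    def __init__(self, num_actions=15, hidden=256):
        super().__init__()
        self.policy_encoder = make_encoder(hidden=hidden)   # separate parameters
        self.value_encoder = make_encoder(hidden=hidden)    # separate parameters
        self.policy_head = nn.Linear(hidden, num_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        logits = self.policy_head(self.policy_encoder(obs))
        value = self.value_head(self.value_encoder(obs)).squeeze(-1)
        return logits, value

# Procgen observations are 64x64 RGB; channels-first for PyTorch.
net = DecoupledActorCritic()
logits, value = net(torch.zeros(8, 3, 64, 64))
```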

5. Implications for Reinforcement Learning Research

Procgen re-frames RL evaluation by foregrounding generalization as a first-class research objective. The benchmark:

  • Highlights that past progress on deterministic benchmarks (e.g., ALE) often reflected memorization rather than genuine task understanding.
  • Emphasizes the need for algorithms capable of learning domain-invariant features and flexible control policies that do not rely on the idiosyncrasies of training environments.
  • Establishes reproducible evaluation standards—through open-source, computationally efficient platforms—that lower barriers for empirical research into robustness, sample efficiency, and transfer.

The use of PCG as a core mechanism for enforcing generalization in RL algorithms has already catalyzed the development of new regularization schemes, exploration strategies, and robust optimization frameworks that make learning policies tractable in domains with complex, high-dimensional, and variable input distributions.

6. Directions for Future Research

Procgen's open design and computational efficiency invite multiple research directions:

  • Scaling Studies: As demonstrated through systematic capacity scaling, further increases in model size (with appropriate optimization and regularization) may yield new generalization breakthroughs.
  • Algorithmic Innovation: The separation of training and test regimes, along with the high throughput of the environments, enables ablation studies on auxiliary objectives, regularization strategies, and new architectures.
  • Domain Transfer Evaluation: PCG's ability to define explicit partitions of context space creates natural opportunities for studying transfer learning, meta-RL, and context-adaptive policies.
  • Curriculum and Robustness: The ability to define tunable levels of difficulty and task variation supports research into curricula, robustness to domain shift, and diagnostic benchmarking for credit assignment and exploration.

7. Broader Impact and Extensions

The procedural paradigm instantiated in Procgen is foundational for progressing toward RL agents capable of real-world competence, where the diversity and unpredictability of new scenarios require robust generalization. The subsequent extensions—such as C-Procgen with explicit context vectors and fine-grained domain control—aim to further open the “black box” of procedural generation, enabling explicit manipulation and systematic variation for advanced experimental design (Tan et al., 2023). Such progress promotes deeper understanding of RL’s strengths and limits, informing both theory and practical application in increasingly dynamic and diverse operational spaces.
