A Survey of Zero-shot Generalisation in Deep Reinforcement Learning (2111.09794v6)

Published 18 Nov 2021 in cs.LG and cs.AI

Abstract: The study of zero-shot generalisation (ZSG) in deep Reinforcement Learning (RL) aims to produce RL algorithms whose policies generalise well to novel unseen situations at deployment time, avoiding overfitting to their training environments. Tackling this is vital if we are to deploy reinforcement learning algorithms in real world scenarios, where the environment will be diverse, dynamic and unpredictable. This survey is an overview of this nascent field. We rely on a unifying formalism and terminology for discussing different ZSG problems, building upon previous works. We go on to categorise existing benchmarks for ZSG, as well as current methods for tackling these problems. Finally, we provide a critical discussion of the current state of the field, including recommendations for future work. Among other conclusions, we argue that taking a purely procedural content generation approach to benchmark design is not conducive to progress in ZSG, we suggest fast online adaptation and tackling RL-specific problems as some areas for future work on methods for ZSG, and we recommend building benchmarks in underexplored problem settings such as offline RL ZSG and reward-function variation.

Citations (123)

Summary

  • The paper establishes a unifying formalism for zero-shot generalization in DRL by extending Contextual Markov Decision Processes to evaluate agents in unseen environments.
  • It categorizes benchmarks—including dynamics, reward, and observation variations—and reviews methodologies such as data augmentation, regularization, and ensemble methods.
  • The survey critically analyzes current limitations and advocates for structured benchmarks and RL-specific techniques to guide future research in robust DRL deployment.

This survey provides a comprehensive overview of the field of Zero-Shot Generalization (ZSG) in Deep Reinforcement Learning (DRL), addressing the critical challenge of training agents capable of performing effectively in novel, unseen environments at deployment time without additional learning (A Survey of Zero-shot Generalisation in Deep Reinforcement Learning, 2021). The work establishes a unifying formalism, categorizes benchmarks and methods, and offers a critical perspective on the state of the art and future research trajectories.

Formalism and Problem Definition

The survey grounds the ZSG problem within the framework of Contextual Markov Decision Processes (CMDPs), extending the standard MDP definition. A CMDP is defined as a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, \mu_0, \mathcal{C}, p_C)$, where:

  • $\mathcal{S}$ is the state space.
  • $\mathcal{A}$ is the action space.
  • $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \times \mathcal{C} \rightarrow [0, 1]$ is the context-dependent transition probability function.
  • $\mathcal{R}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \times \mathcal{C} \rightarrow \mathbb{R}$ is the context-dependent reward function.
  • $\gamma \in [0, 1)$ is the discount factor.
  • $\mu_0: \mathcal{S} \times \mathcal{C} \rightarrow [0, 1]$ is the context-dependent initial state distribution.
  • $\mathcal{C}$ is the space of possible contexts (or environment parameters).
  • $p_C$ is a distribution over the context space $\mathcal{C}$.
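
To ground this formalism, the minimal Python sketch below models a CMDP whose transition, reward, and initial-state samplers are all conditioned on a context drawn from $p_C$. The class and method names (`ContextualMDP`, `reset`, `step`) are illustrative assumptions, not an interface defined by the survey.

```python
from typing import Any, Callable, Tuple


class ContextualMDP:
    """Illustrative CMDP: dynamics, reward, and initial states all depend on a context c ~ p_C."""

    def __init__(self,
                 transition: Callable[[Any, Any, Any], Any],     # sampler for P(. | s, a, c)
                 reward: Callable[[Any, Any, Any, Any], float],  # R(s, a, s', c)
                 init_state: Callable[[Any], Any],               # sampler for mu_0(. | c)
                 context_dist: Callable[[], Any],                # sampler for p_C
                 gamma: float = 0.99):
        self.transition, self.reward = transition, reward
        self.init_state, self.context_dist = init_state, context_dist
        self.gamma = gamma

    def reset(self, context: Any = None) -> Tuple[Any, Any]:
        """Draw a context from p_C unless one is given, then draw s_0 from the context-dependent mu_0."""
        self.context = self.context_dist() if context is None else context
        self.state = self.init_state(self.context)
        return self.state, self.context

    def step(self, action: Any) -> Tuple[Any, float]:
        """Sample s' from the context-dependent P and return (s', R(s, a, s', c))."""
        next_state = self.transition(self.state, action, self.context)
        r = self.reward(self.state, action, next_state, self.context)
        self.state = next_state
        return next_state, r
```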

In the ZSG setting, an agent is trained on a set of contexts sampled from a training distribution $p_{C,\text{train}}$ and subsequently evaluated on a set of contexts sampled from a test distribution $p_{C,\text{test}}$. The core challenge arises because the support of the test distribution is typically disjoint from, or extends beyond, the support of the training distribution, i.e., $\mathrm{supp}(p_{C,\text{test}}) \not\subseteq \mathrm{supp}(p_{C,\text{train}})$. The objective is to learn a policy $\pi: \mathcal{S} \rightarrow \mathcal{A}$ (or $\pi: \mathcal{S} \times \mathcal{C} \rightarrow \mathcal{A}$ if the context is observable) that maximizes the expected return across the test distribution:

$$J(\pi) = \mathbb{E}_{c \sim p_{C,\text{test}}} \left[ \mathbb{E}_{\tau \sim p(\tau \mid \pi, c)} \left[ \sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t, a_t, s_{t+1}, c) \right] \right]$$

where $\tau = (s_0, a_0, s_1, a_1, \dots)$ represents a trajectory generated by policy $\pi$ in context $c$. Evaluation often focuses on the average performance across $p_{C,\text{test}}$, although worst-case performance over the test contexts is also a relevant, albeit more challenging, metric.
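
In practice this objective can only be estimated empirically. The sketch below is a minimal Monte Carlo estimator of $J(\pi)$ under the `ContextualMDP` interface assumed above; the function name and arguments are illustrative, and the infinite-horizon sum is truncated at a finite horizon.

```python
def estimate_zsg_return(env, policy, test_context_dist,
                        n_contexts=100, horizon=1000, gamma=0.99):
    """Monte Carlo estimate of J(pi): average discounted return over test contexts.

    `env` follows the ContextualMDP sketch above; `policy` maps a state to an action
    and is *not* updated at test time (the zero-shot constraint).
    """
    returns = []
    for _ in range(n_contexts):
        c = test_context_dist()                  # c ~ p_{C, test}
        state, _ = env.reset(context=c)          # s_0 ~ mu_0(. | c)
        discounted, discount = 0.0, 1.0
        for _ in range(horizon):                 # truncate the infinite-horizon sum
            action = policy(state)
            state, r = env.step(action)
            discounted += discount * r
            discount *= gamma
        returns.append(discounted)
    # The mean gives average-case test performance; min(returns) is the worst-case variant.
    return sum(returns) / len(returns)
```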

Benchmark Categorization

The survey systematically categorizes existing ZSG benchmarks based on the source of variation between training and testing environments. Key categories include:

  1. Dynamics Variation: Changes in the transition function $\mathcal{P}$. Examples include variations in physical parameters (mass, friction) in MuJoCo control tasks or modified game physics in Atari.
  2. Reward Variation: Alterations in the reward function $\mathcal{R}$. This is less explored but critical for real-world deployment where objectives might change.
  3. Observation Space Variation: Modifications to how the state $\mathcal{S}$ is perceived by the agent, often through visual changes (e.g., backgrounds, textures, lighting). Domain randomization techniques often target this.
  4. Initial State Distribution Variation: Changes in $\mu_0$, affecting the starting conditions of episodes.
  5. Goal Variation: Changes in the desired goal state or configuration, often linked to reward variations.

A significant portion of ZSG research utilizes benchmarks based on Procedural Content Generation (PCG), such as the Procgen Benchmark. While PCG allows for the generation of a large number of diverse environments, the survey critiques an over-reliance on purely PCG-based benchmarks. The argument posits that unstructured variations generated by PCG can make it difficult to isolate specific generalization challenges, measure progress systematically, or understand the underlying factors contributing to generalization failures. Benchmarks with more structured, controllable axes of variation are advocated for more insightful analysis. Examples discussed include variations of standard RL benchmarks like CartPole, MuJoCo suites with parameterized physics, and visual variations applied to Atari or DeepMind Control Suite tasks.
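
As a small example of the kind of structured, controllable variation the survey argues for, the sketch below builds CartPole instances whose physics are set by an explicit, interpretable context (pole length and mass). The attribute names (`length`, `masspole`, `masscart`, `total_mass`, `polemass_length`) assume the standard Gymnasium CartPole implementation, and the parameter ranges are illustrative assumptions rather than a benchmark from the survey.

```python
import random

import gymnasium as gym


def make_cartpole_context(pole_length_range=(0.3, 0.6),
                          pole_mass_range=(0.05, 0.3), seed=None):
    """Build a CartPole instance whose dynamics are set by an explicit, interpretable context."""
    rng = random.Random(seed)
    context = {
        "length": rng.uniform(*pole_length_range),    # half-pole length (m)
        "masspole": rng.uniform(*pole_mass_range),    # pole mass (kg)
    }
    env = gym.make("CartPole-v1")
    for name, value in context.items():
        setattr(env.unwrapped, name, value)
    # Keep derived quantities consistent with the new context.
    env.unwrapped.total_mass = env.unwrapped.masspole + env.unwrapped.masscart
    env.unwrapped.polemass_length = env.unwrapped.masspole * env.unwrapped.length
    return env, context


# Train on one interval of pole lengths, evaluate zero-shot on a disjoint interval.
train_env, train_context = make_cartpole_context(pole_length_range=(0.3, 0.6), seed=0)
test_env, test_context = make_cartpole_context(pole_length_range=(0.6, 0.9), seed=1)
```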

Methodologies for Zero-Shot Generalization

Existing methods for improving ZSG in DRL often draw inspiration from techniques used in supervised learning for domain generalization, but increasingly, RL-specific approaches are being developed. The survey categorizes methods as follows:

  1. Data Augmentation: Techniques applied to the agent's observations or experiences to expose it to wider variations during training.
    • Visual Augmentation: Methods like random convolutions, color jittering, noise injection, cropping (e.g., RAD, DrQ) applied to image-based observations. Domain Randomization (DR) is a prominent example where visual parameters (textures, lighting, camera position) are randomized during training.
    • Dynamics Randomization: Systematically varying simulation parameters (mass, friction, forces) during training to promote robustness to physical variations.
  2. Regularization Techniques: Methods imposing constraints or penalties during training to prevent overfitting to the training environments.
    • Standard techniques: L1/L2 regularization, dropout, batch normalization.
    • RL-specific regularization: Entropy maximization (as in SAC), information bottlenecks (e.g., IBAC-SNI) constraining the information flow between state and action/representation, or methods promoting policy smoothness.
  3. Architectural Modifications: Designing network architectures inherently suited for generalization.
    • Modular architectures where different components handle specific aspects of the task or environment.
    • Attention mechanisms to focus on relevant parts of the state representation.
    • Recurrent architectures (LSTMs, GRUs) to potentially capture temporal dynamics or implicitly infer context.
  4. Ensemble Methods: Training multiple policies and combining their predictions, often improving robustness and generalization. Averaging policy outputs or Q-values is a common strategy (a minimal sketch follows this list).
  5. Representation Learning: Focusing on learning state representations that are invariant to nuisance variations (e.g., visual changes) while retaining task-relevant information. Disentanglement aims to separate factors of variation in the learned representation. Methods often employ auxiliary losses or specific architectures to encourage desired representational properties.
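
For the ensemble category above, the following is a minimal sketch of action selection by averaging Q-value estimates over an ensemble of critics, assuming a discrete action space; the `q_functions` interface is a hypothetical stand-in for trained value networks.

```python
import numpy as np


def ensemble_act(q_functions, state, greedy=True):
    """Select an action by averaging Q-value estimates from an ensemble of critics.

    `q_functions` is a list of callables, each mapping a state to a vector of Q-values
    (one entry per discrete action). Averaging smooths out individual members'
    estimation errors, which can improve robustness in unseen contexts.
    """
    q_values = np.stack([q(state) for q in q_functions])  # (ensemble_size, num_actions)
    mean_q = q_values.mean(axis=0)
    return int(np.argmax(mean_q)) if greedy else mean_q
```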

The survey notes that many successful approaches combine multiple techniques, such as using data augmentation alongside specific regularization methods.
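
To illustrate the visual-augmentation category concretely, the sketch below implements a pad-and-random-crop ("random shift") augmentation in the spirit of RAD and DrQ, using NumPy only; it is a simplified, assumption-level sketch, not the reference implementation of either method.

```python
import numpy as np


def random_shift(obs_batch: np.ndarray, pad: int = 4, rng=None) -> np.ndarray:
    """Pad each image with replicated edge pixels and crop back at a random offset.

    obs_batch: array of shape (B, H, W, C), e.g. a minibatch of image observations
    drawn from the replay buffer just before a critic/actor update.
    """
    rng = np.random.default_rng() if rng is None else rng
    b, h, w, _ = obs_batch.shape
    padded = np.pad(obs_batch, ((0, 0), (pad, pad), (pad, pad), (0, 0)), mode="edge")
    shifted = np.empty_like(obs_batch)
    for i in range(b):
        top = rng.integers(0, 2 * pad + 1)     # independent random offset per image
        left = rng.integers(0, 2 * pad + 1)
        shifted[i] = padded[i, top:top + h, left:left + w, :]
    return shifted


# Usage: augment observations before computing losses; the environment, actions, and
# rewards are left untouched, so only the agent's perception is randomized.
# aug_obs = random_shift(obs_batch)
```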

Critical Analysis and Future Directions

The survey presents a critical assessment of the ZSG field in DRL, identifying limitations and suggesting promising avenues for future research.

  • Critique of PCG Benchmarks: As mentioned, the reliance on purely PCG benchmarks is questioned due to the unstructured nature of the variations, making systematic analysis difficult. The need for benchmarks with controlled, interpretable axes of variation (e.g., varying specific physical parameters, object properties, or clear visual factors) is emphasized to better understand generalization capabilities.
  • Beyond Supervised Learning Techniques: While techniques like data augmentation borrowed from supervised learning have proven useful, the survey argues for developing more RL-specific methods. ZSG in RL presents unique challenges related to sequential decision-making, credit assignment, and exploration under varying dynamics or rewards, which may require fundamentally different approaches.
  • Fast Online Adaptation: While the focus is ZSG (no adaptation at test time), the potential for fast online adaptation (adapting within a few interactions in the test environment) is highlighted as a practical and promising direction, bridging the gap between pure ZSG and traditional fine-tuning. Techniques like meta-learning could be relevant here, although careful framing is needed to distinguish this setting from few-shot learning.
  • Offline RL ZSG: Applying ZSG principles to the offline RL setting (learning from fixed datasets) is identified as an underexplored but important area, particularly for real-world applications where online interaction is limited or costly. Generalizing from a fixed dataset to unseen dynamics or rewards poses significant challenges.
  • Reward Function Variation: Generalization across different reward functions remains a relatively unexplored area compared to dynamics or observation variations. Developing benchmarks and methods specifically targeting reward generalization is crucial for deploying agents in scenarios with evolving objectives.
  • Theoretical Understanding: Deeper theoretical analysis of why certain methods improve ZSG in DRL, and the fundamental limits of generalization given specific training distributions and environment classes, is needed.

Conclusion

The survey (A Survey of Zero-shot Generalisation in Deep Reinforcement Learning, 2021) provides a valuable synthesis of the emerging field of zero-shot generalization in deep reinforcement learning. By establishing a common formalism, reviewing existing benchmarks and methods, and offering a critical perspective, it lays the groundwork for more structured research in this vital area. The emphasis on moving beyond purely PCG benchmarks towards more controlled variation, exploring RL-specific generalization techniques, and tackling under-explored areas like offline RL ZSG and reward variation highlights key challenges and opportunities for enabling the deployment of DRL agents in diverse and unpredictable real-world environments.