A Survey of Meta-Reinforcement Learning (2301.08028v3)

Published 19 Jan 2023 in cs.LG

Abstract: While deep reinforcement learning (RL) has fueled multiple high-profile successes in machine learning, it is held back from more widespread adoption by its often poor data efficiency and the limited generality of the policies it produces. A promising approach for alleviating these limitations is to cast the development of better RL algorithms as a machine learning problem itself in a process called meta-RL. Meta-RL is most commonly studied in a problem setting where, given a distribution of tasks, the goal is to learn a policy that is capable of adapting to any new task from the task distribution with as little data as possible. In this survey, we describe the meta-RL problem setting in detail as well as its major variations. We discuss how, at a high level, meta-RL research can be clustered based on the presence of a task distribution and the learning budget available for each individual task. Using these clusters, we then survey meta-RL algorithms and applications. We conclude by presenting the open problems on the path to making meta-RL part of the standard toolbox for a deep RL practitioner.

Authors (7)
  1. Jacob Beck
  2. Risto Vuorio
  3. Evan Zheran Liu
  4. Zheng Xiong
  5. Luisa Zintgraf
  6. Chelsea Finn
  7. Shimon Whiteson
Citations (94)

Summary

  • The survey frames meta-RL as a bilevel process in which an outer loop learns the parameters of an inner-loop adaptation procedure, with the aim of improving sample efficiency.
  • It categorizes meta-RL algorithms into few-shot and many-shot settings, detailing methods such as parameterized policy gradients, black-box approaches, and task inference.
  • It highlights applications in robotics and multi-agent systems while identifying key open challenges in generalization and optimization stability.

Meta-Reinforcement Learning (Meta-RL) represents a paradigm shift in developing reinforcement learning algorithms, moving from hand-designing RL methods to learning them, or parts of them, using machine learning techniques. This approach aims primarily to address the notorious sample inefficiency and limited generalization inherent in many standard deep RL algorithms. By framing RL algorithm development as a learning problem itself, Meta-RL seeks to produce agents capable of rapid adaptation to new tasks drawn from a specified distribution, leveraging experience accumulated across related tasks during a meta-training phase. The core idea involves an "outer loop" (meta-training) that learns the parameters (θ) of an "inner loop" (the adaptation process, f_θ), which in turn generates a task-specific policy (π_φ) using limited data (D) from the current task. This introduces a trade-off: increased computational and sample cost during meta-training is potentially offset by significantly improved sample efficiency during adaptation at meta-test time.
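
To make this loop structure concrete, the sketch below instantiates it on a deliberately tiny toy problem: each task is a hidden 1-D goal, the inner loop f_θ blends a meta-learned prior θ with a few noisy observations of the goal, and the outer loop improves θ with a finite-difference update. The task family, function names, and the choice of a zeroth-order outer-loop update are illustrative assumptions, not anything prescribed by the survey.

```python
import random

# Minimal, self-contained sketch of the bilevel meta-RL structure described
# above. Everything here is a toy stand-in: a task M is a hidden 1-D goal,
# the "policy" is a scalar phi, and an episode's return is the negative
# squared distance between phi and the goal.

def sample_task(rng):
    return rng.uniform(-1.0, 1.0)                       # M ~ p(M)

def inner_loop(theta, goal, rng, adapt_episodes=3):
    """f_theta: produce task-specific phi from limited data D and the prior theta."""
    observations = [goal + rng.gauss(0.0, 0.5) for _ in range(adapt_episodes)]
    # Posterior-mean-style blend of the meta-learned prior with the task data.
    return (theta + sum(observations)) / (1 + len(observations))

def trial_return(theta, goal, rng):
    phi = inner_loop(theta, goal, rng)                  # adaptation
    return -(phi - goal) ** 2                           # return of pi_phi on M

def meta_objective(theta, tasks, seed):
    rng = random.Random(seed)                           # shared noise for both evaluations
    return sum(trial_return(theta, g, rng) for g in tasks) / len(tasks)

# Outer loop: finite-difference ascent on the meta-objective, used here only
# so the sketch runs without an autodiff library.
rng = random.Random(0)
theta, meta_lr, eps = 2.0, 2.0, 1e-2
for step in range(300):
    tasks = [sample_task(rng) for _ in range(32)]
    grad = (meta_objective(theta + eps, tasks, step) -
            meta_objective(theta - eps, tasks, step)) / (2 * eps)
    theta += meta_lr * grad
print("meta-learned prior over goals:", round(theta, 2))
```

Over meta-training, θ drifts toward the mean of the goal distribution, i.e., the prior from which the toy inner loop adapts best on average; real meta-RL systems replace every piece of this toy with learned policies, environment rollouts, and a meta-gradient or evolutionary outer loop.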

Problem Formulation and Settings

The standard Meta-RL problem assumes access to a distribution of tasks p(M), where each task M is typically a Markov Decision Process (MDP) characterized by states S, actions A, transition dynamics P, reward function R, initial state distribution P0, and an episode horizon. Tasks within a distribution often share S and A but differ in P, R, or P0. The meta-objective is to find meta-parameters θ that maximize the expected return achieved by the policy π_φ produced by the learned inner-loop algorithm f_θ operating on tasks sampled from p(M). Formally, considering an interaction "lifetime" or "trial" of H episodes within a task M, and excluding the first K episodes, which serve as an adaptation "burn-in", the objective is:

\max_{\theta} \; \mathbb{E}_{M \sim p(M)} \left[ \sum_{k=K}^{H-1} R_k(\phi_k) \right]

where R_k(φ_k) is the return obtained in episode k using policy parameters φ_k, and the sequence of policy parameters φ_0, ..., φ_{H-1} is generated by the inner loop f_θ based on the data collected sequentially within the trial.
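
The objective credits only the episodes after the burn-in; the trivial helper below (names are illustrative, not from the survey) makes that explicit.

```python
def trial_objective(episode_returns, burn_in):
    """Sum of per-episode returns after discarding the first K burn-in episodes."""
    if not 0 <= burn_in <= len(episode_returns):
        raise ValueError("burn-in K must lie between 0 and H")
    return sum(episode_returns[burn_in:])

# A trial of H = 5 episodes with K = 2 burn-in episodes: only the last three
# returns are credited, so the objective is 0.9 + 1.0 + 1.0 = 2.9.
print(trial_objective([0.1, 0.4, 0.9, 1.0, 1.0], burn_in=2))
```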

Meta-RL research can be clustered based on two primary axes: the task horizon available for adaptation (determining the "shot" count) and the nature of the task distribution.

  1. Task Horizon:
    • Few-Shot: The agent must adapt within a minimal number of episodes (e.g., 1-10). This setting emphasizes extremely rapid adaptation and is the most commonly studied.
    • Many-Shot: Adaptation occurs over a significantly longer timescale (hundreds or thousands of episodes), focusing on learning more general-purpose, efficient RL algorithms rather than just fast parameter adaptation.
  2. Task Distribution:
    • Multi-Task: Meta-training explicitly utilizes a distribution p(M) encompassing multiple distinct tasks. The goal is to exploit structural similarities across tasks to accelerate adaptation to novel tasks from the same distribution.
    • Single-Task: Meta-training occurs within the context of a single, potentially very complex task. The objective here is to accelerate learning within that task's lifetime by adapting to local conditions or reusing learned components, effectively treating different phases or states within the single task as the "distribution".

These axes define three predominant practical settings:

  • Few-Shot Multi-Task: Learn rapid adaptation strategies for new tasks from a known distribution (e.g., MAML, RL2).
  • Many-Shot Multi-Task: Learn general RL algorithm components that provide efficiency over longer adaptation horizons, potentially generalizing beyond the meta-training distribution (e.g., LPG, MetaGenRL).
  • Many-Shot Single-Task: Accelerate learning within one complex task by meta-learning adaptable components online (e.g., STAC, FRODO).

Further variations exist, particularly within the few-shot setting, concerning the level of supervision (standard rewards; unsupervised meta-RL, with rewards available only at meta-test time; meta-testing without rewards, where adaptation must proceed unsupervised; and meta-RL via imitation learning from demonstrations), the handling of sparse rewards, and whether the approach is model-based (learning or adapting environment models) or model-free.

Algorithm Taxonomy

Meta-RL algorithms can be broadly categorized based on how they parameterize the inner-loop adaptation mechanism (f_θ) and the specific problem setting they target.

Few-Shot Meta-RL Algorithms

These algorithms are designed for rapid adaptation, often within one or a few episodes.

  • Parameterized Policy Gradients (PPG): These methods structure the inner loop f_θ based on components of existing policy gradient algorithms. A prominent example is Model-Agnostic Meta-Learning (MAML), where the meta-parameters θ represent the initial policy parameters (φ_0). Adaptation consists of applying one or more standard policy gradient updates within the new task. The meta-objective optimizes θ such that these few gradient steps lead to high performance. Variations meta-learn other components such as learning rates, gradient preconditioning matrices, or parts of the loss function. Meta-gradient estimation (computing gradients through the inner-loop updates) is a central challenge, often addressed via approximations or specific algorithm structures. PPG methods tend to exhibit better out-of-distribution (OOD) generalization compared to black-box methods but can be less sample-efficient for extremely rapid, within-episode adaptation. A minimal MAML-style sketch appears after this list.
  • Black Box Methods: These approaches impose minimal structural assumptions on f_θ, treating it as a generic function approximator, typically a Recurrent Neural Network (RNN) as in RL2, or increasingly a Transformer or Hypernetwork. The entire adaptation process, including exploration and policy updates, is implicitly learned and encoded within the network's hidden state or parameter generation process. These methods can learn highly specialized and efficient adaptation strategies tailored to the specific meta-training distribution, potentially adapting within a single episode. However, they often struggle with OOD generalization and can face difficult optimization landscapes (e.g., vanishing/exploding gradients in RNNs). A minimal sketch of this recurrent interface also follows this list.
  • Task Inference Methods: These methods explicitly structure f_θ to infer properties of the current task, often represented as a latent context vector z or task embedding φ. Adaptation involves first inferring this latent variable based on initial interactions and then conditioning the policy on it. This often resembles Bayesian RL approaches, where z represents a belief state over possible task parameters. Task inference can be trained end-to-end or use auxiliary objectives related to predicting task parameters or maximizing mutual information. These methods can leverage privileged task information during meta-training if available. They are closely linked to exploration, as efficient task inference requires informative data collection.
  • Exploration Strategies: Effective adaptation fundamentally relies on gathering relevant information about the current task. Meta-RL methods implicitly or explicitly learn exploration strategies. Approaches include end-to-end learning (where exploration arises implicitly), posterior sampling methods (e.g., Thompson sampling adapted for meta-RL), methods leveraging task inference objectives (e.g., maximizing information gain about the latent task variable), and specific meta-learned exploration bonuses. The Bayes-Adaptive MDP (BAMDP) formalism provides a theoretical framework for optimal exploration-exploitation trade-offs in this setting, though exact solutions are often intractable.
  • Model-Based Meta-RL: These approaches incorporate learned models of the environment dynamics P and/or reward function R. Meta-learning can occur at the level of model parameters (adapting a base model quickly) or by learning a model adaptation procedure itself. The learned models can then be used for planning or training model-free policies, potentially improving sample efficiency.
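
As an illustration of the PPG idea, the following is a minimal MAML-style sketch on a toy objective, using JAX's autodiff (jax.grad) to differentiate through the inner update. The toy uses an exact, differentiable return in place of the sampled policy-gradient estimate a real implementation would need; the task family and hyperparameters are illustrative assumptions, not the survey's experiments.

```python
import jax
import jax.numpy as jnp

# Toy "task": the return of acting at phi in a task with hidden goal g is
# R(phi, g) = -||phi - g||^2, a differentiable stand-in for an RL objective.
def episode_return(phi, goal):
    return -jnp.sum((phi - goal) ** 2)

def inner_adapt(theta, goal, inner_lr=0.1):
    # Inner loop: one gradient-ascent step on the return, starting from the
    # meta-learned initialization theta (the MAML adaptation rule).
    return theta + inner_lr * jax.grad(episode_return)(theta, goal)

def post_adaptation_return(theta, goal):
    return episode_return(inner_adapt(theta, goal), goal)

def meta_objective(theta, goals):
    # Expected post-adaptation return over a batch of sampled tasks.
    return jnp.mean(jax.vmap(lambda g: post_adaptation_return(theta, g))(goals))

# The meta-gradient differentiates *through* the inner update above.
meta_grad = jax.grad(meta_objective)

key = jax.random.PRNGKey(0)
theta = jnp.array([2.0, -2.0])                      # initial meta-parameters
for _ in range(200):
    key, sub = jax.random.split(key)
    goals = jax.random.uniform(sub, (8, 2), minval=-1.0, maxval=1.0)
    theta = theta + 0.1 * meta_grad(theta, goals)   # outer-loop ascent
print("meta-learned initialization:", theta)
```

Because jax.grad differentiates through inner_adapt, the outer update accounts for how the initialization will be changed by the inner-loop step, which is the defining feature of MAML-style PPG methods; here θ converges toward the initialization from which a single inner step is most effective on average.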

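To make the black-box interface concrete, the sketch below shows the input construction and hidden-state handling typically used by RL2-style recurrent policies: the network consumes the observation together with the previous action, previous reward, and an episode-boundary flag, and its hidden state persists across episodes within a trial. The tiny vanilla RNN cell, dimensions, and random "environment" are stand-ins, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, NUM_ACTIONS, HIDDEN = 4, 3, 16

# Untrained toy weights for a vanilla RNN cell and a softmax policy head.
W_in = rng.normal(0, 0.1, (HIDDEN, OBS_DIM + NUM_ACTIONS + 2))
W_h = rng.normal(0, 0.1, (HIDDEN, HIDDEN))
W_pi = rng.normal(0, 0.1, (NUM_ACTIONS, HIDDEN))

def rnn_step(h, obs, prev_action, prev_reward, prev_done):
    """One recurrent step over the augmented (obs, action, reward, done) input."""
    a_onehot = np.eye(NUM_ACTIONS)[prev_action]
    x = np.concatenate([obs, a_onehot, [prev_reward], [float(prev_done)]])
    h = np.tanh(W_in @ x + W_h @ h)
    logits = W_pi @ h
    probs = np.exp(logits - logits.max())
    return h, probs / probs.sum()

# One trial = several episodes in the SAME task; the hidden state is carried
# across episode boundaries and only reset when a new task/trial begins.
h = np.zeros(HIDDEN)                       # reset at the start of the trial
prev_action, prev_reward, prev_done = 0, 0.0, True
for episode in range(3):                   # a trial of H = 3 episodes
    for t in range(5):                     # toy episode of 5 steps
        obs = rng.normal(size=OBS_DIM)     # stand-in for the environment
        h, policy = rnn_step(h, obs, prev_action, prev_reward, prev_done)
        prev_action = int(rng.choice(NUM_ACTIONS, p=policy))
        prev_reward = rng.normal()         # stand-in reward from the task
        prev_done = (t == 4)               # boundary flag fed at the next step
print("final policy over actions:", np.round(policy, 3))
```

Training such a policy end-to-end over whole trials is what turns the hidden state into an implicit, learned adaptation procedure.
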
Many-Shot Meta-RL Algorithms

These algorithms focus on learning components that improve RL efficiency over longer adaptation horizons, often within a single complex task or across broadly related tasks.

  • Learning Reusable Components: Rather than learning the entire adaptation algorithm, these methods often meta-learn specific parts that augment standard RL algorithms:
    • Intrinsic Rewards: Meta-learning reward bonuses to guide exploration or define sub-goals/skills.
    • Auxiliary Tasks: Meta-learning self-supervised objectives or their weighting to improve representation learning.
    • Objective Functions: Meta-learning modifications or replacements for standard RL objectives, such as value function estimators or advantage calculations.
    • Hierarchies: Meta-learning components of hierarchical RL systems, like options, skills, or manager policies.
  • Specialized Architectures: Developing architectures (e.g., specialized RNNs) designed for better credit assignment and generalization over long temporal dependencies encountered in many-shot settings.
  • Outer-Loop Optimization: Training over long inner-loop horizons poses significant optimization challenges. Techniques include using truncated surrogate objectives to manage computational cost and variance, employing actor-critic methods adapted for meta-learning, bootstrapping value estimates, or using gradient-free methods such as Evolution Strategies (ES), which can bypass issues related to vanishing/exploding gradients through the inner loop. A minimal population-based ES sketch follows this list.
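
As one concrete example of a gradient-free outer loop, the sketch below applies an Evolution-Strategies-style update to the meta-parameters, treating the inner loop as a black box that returns a trial return. The toy inner loop, task family, and hyperparameters are illustrative assumptions, not a method from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)

def inner_loop_return(theta, task):
    """Black-box stand-in for 'run the inner loop with meta-parameters theta
    on this task and report the resulting trial return'."""
    return -np.sum((theta - task) ** 2)

def es_update(theta, tasks, pop_size=32, sigma=0.1, lr=0.05):
    eps = rng.normal(size=(pop_size, theta.size))       # Gaussian perturbations
    fitness = np.array([
        np.mean([inner_loop_return(theta + sigma * e, t) for t in tasks])
        for e in eps
    ])
    # Normalize fitness for scale invariance (a common ES trick).
    fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
    grad_estimate = (eps.T @ fitness) / (pop_size * sigma)
    return theta + lr * grad_estimate                   # outer-loop ascent

theta = np.array([1.5, -1.0])                           # initial meta-parameters
for _ in range(200):
    tasks = rng.uniform(-1.0, 1.0, size=(8, 2))         # tasks ~ p(M)
    theta = es_update(theta, tasks)
print("ES-trained meta-parameters:", np.round(theta, 2))
```

Because only trial returns are needed, the cost and variance of this estimator do not depend on backpropagating through many inner-loop updates, which is precisely why ES is attractive in the many-shot setting; here θ simply drifts toward the mean of the toy task distribution.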

Applications

Meta-RL has found significant application in domains where rapid adaptation or improved sample efficiency is critical.

  • Robotics: This is a major driver and beneficiary of Meta-RL research. Applications include enabling robots to quickly adapt manipulation skills to new objects, adjust locomotion gaits to different terrains or payloads, or transfer policies learned in simulation to the real world (sim-to-real) with minimal real-world fine-tuning. Both model-free and model-based meta-RL approaches are prevalent.
  • Multi-Agent Reinforcement Learning (MARL): Meta-RL addresses key MARL challenges:
    • Ad Hoc Teamwork: Training agents that can quickly adapt their strategies to cooperate or compete effectively with previously unseen partners or opponents.
    • Non-stationarity: Enabling agents to adapt to the changing policies of other learning agents in the environment by treating opponent/teammate strategies as part of the task distribution.
  • Other Domains: Meta-RL concepts have been applied to diverse areas such as adaptive traffic signal control, optimizing building energy consumption, automatic grading of programming assignments based on dynamic test generation, mitigating catastrophic forgetting in continual learning, and automating curriculum generation or unsupervised environment design.

Challenges and Open Problems

Despite progress, several significant challenges hinder the widespread adoption of Meta-RL.

  • Few-Shot Meta-RL Challenges:
    • Generalization to Diverse Task Distributions: Current methods often excel on narrow, synthetically generated task distributions (e.g., varying parameters like goal locations or physics coefficients). Scaling to broader, more complex, and structurally diverse task distributions remains difficult. This requires developing more robust algorithms and better benchmark environments that capture realistic task variability.
    • Out-of-Distribution (OOD) Generalization: A critical requirement for real-world deployment is the ability to perform reasonably and adapt efficiently even when encountering tasks significantly different from those seen during meta-training. This involves developing methods that can detect distribution shift, assess the reliability of the learned prior/adaptation strategy, and potentially fall back to more robust learning methods or request human intervention.
  • Many-Shot Meta-RL Challenges:
    • Optimization Stability and Efficiency: Long inner-loop horizons exacerbate optimization difficulties, including vanishing/exploding gradients, high computational costs, and potential bias introduced by necessary approximations like truncated backpropagation or surrogate objectives. Addressing non-stationarity in the single-task setting also remains challenging.
    • Benchmarking General RL Algorithms: Evaluating progress in learning general-purpose RL algorithms (the goal of many-shot multi-task meta-RL) is hampered by the lack of standardized benchmarks and evaluation protocols. Defining a representative distribution over "all interesting MDPs" is conceptually and practically difficult.
  • Leveraging Offline Data:
    • Reducing the need for extensive online interaction during meta-training or adaptation is crucial. Exploring meta-RL in settings that involve offline datasets (offline meta-training with offline adaptation, offline meta-training with online adaptation, and online meta-training with offline adaptation) is an active research area.
    • Key challenges include effective return estimation from fixed datasets, handling limited exploration inherent in offline data, mitigating distribution shift between the offline data policy and the policies explored during meta-training or adaptation, and the fundamental difficulty of learning effective offline RL algorithms via meta-learning.

In conclusion, Meta-RL offers a powerful framework for enhancing the adaptability and efficiency of RL agents. While significant algorithmic advances have been made, particularly in few-shot settings, major challenges remain in achieving broad generalization, robust OOD performance, stable optimization over long horizons, and effective utilization of offline data. Addressing these open problems is key to integrating Meta-RL into the standard toolkit for RL practitioners.
