Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
The paper "Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables" addresses significant challenges in the domain of meta-reinforcement learning (meta-RL), particularly focusing on enhancing sample efficiency and effective adaptation. The proposed algorithm, Probabilistic Embeddings for Actor-Critic RL (PEARL), presents a methodological advancement in disentangling task inference from control through a novel framework that exploits probabilistic context variables.
Key Contributions
- Off-Policy Meta-RL Framework: The paper introduces an off-policy meta-RL algorithm that substantially improves sample efficiency during meta-training. By separating task inference from control, it avoids the on-policy data requirements typical of prior meta-RL methods.
- Probabilistic Context Variables: The core innovation is the probabilistic modeling of a latent task variable. Posterior sampling over this variable enables structured, temporally extended exploration and rapid adaptation to new tasks, and the probabilistic treatment provides a principled way to represent task uncertainty, which is especially valuable in sparse-reward settings (a sketch of such a context encoder follows this list).
- Improved Sample Efficiency: PEARL requires 20-100x fewer samples during meta-training than prior meta-RL methods while matching or exceeding their asymptotic performance, as demonstrated on several continuous control benchmarks.
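The following PyTorch sketch illustrates one way to realize the probabilistic context variable described above: a permutation-invariant encoder maps each transition to an independent Gaussian factor over the latent task variable z, and the factors are combined by a product of Gaussians, in the spirit of the inference network the paper describes. The class and variable names, layer sizes, and the softplus variance parameterization are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a permutation-invariant context encoder in the spirit
# of PEARL's q(z | c). Names and sizes are illustrative, not the authors' code.
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, context_dim: int, latent_dim: int, hidden: int = 128):
        super().__init__()
        # Maps one transition (s, a, r, s') to the mean and raw variance of
        # its Gaussian factor over the latent task variable z.
        self.net = nn.Sequential(
            nn.Linear(context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),
        )
        self.latent_dim = latent_dim

    def forward(self, context: torch.Tensor):
        """context: (num_transitions, context_dim) for a single task."""
        params = self.net(context)
        mu = params[..., : self.latent_dim]
        sigma_sq = torch.nn.functional.softplus(params[..., self.latent_dim:]) + 1e-6

        # Product of independent Gaussian factors (precision-weighted average):
        # the result is permutation-invariant in the transitions.
        precision = 1.0 / sigma_sq
        post_var = 1.0 / precision.sum(dim=0)
        post_mu = post_var * (precision * mu).sum(dim=0)
        return post_mu, post_var


def sample_z(post_mu, post_var):
    # Reparameterized sample so gradients flow back into the encoder.
    return post_mu + post_var.sqrt() * torch.randn_like(post_mu)
```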
Methodological Insights
- Latent Context Modeling: The use of a probabilistic latent context variable z allows the algorithm to maintain a structured belief over tasks. This variable is inferred from past experience (the "context"), enabling the policy to adapt efficiently to new tasks.
- Off-Policy Integration: The method integrates with off-policy RL algorithms, in particular soft actor-critic (SAC), through a decoupled data sampling strategy: context batches and RL batches are sampled separately from the replay buffer, so the encoder sees data resembling what it will encounter at test time while the actor and critic retain the efficiency of off-policy training (see the training-step sketch after this list).
- Posterior Sampling for Exploration: By sampling context variables from the posterior distribution, the algorithm supports temporally extended exploration strategies. This probabilistic approach allows the agent to explore new tasks effectively, a critical feature in environments with sparse rewards.
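A minimal outline of one meta-training step under these design choices might look as follows. It assumes a `sac` object exposing `losses` and `optimizer`, per-task replay buffers with `sample` and `sample_recent` methods, and the `ContextEncoder` sketched above; these names and the KL weight are assumptions made for illustration, not the paper's implementation.

```python
# Hypothetical outline of one PEARL-style meta-training step with SAC as the
# off-policy learner. The key point: context batches and RL batches are drawn
# separately, and a KL term regularizes the inferred posterior toward the prior.
import torch

def meta_train_step(task_ids, buffers, encoder, sac, kl_weight=0.1,
                    context_size=64, rl_batch_size=256):
    losses = []
    for task in task_ids:
        # Context batch: recently collected transitions, used only for inference.
        context = buffers[task].sample_recent(context_size)          # assumed API
        post_mu, post_var = encoder(context)
        z = post_mu + post_var.sqrt() * torch.randn_like(post_mu)

        # KL(q(z|c) || N(0, I)) keeps the posterior close to the prior.
        kl = 0.5 * (post_var + post_mu ** 2 - 1.0 - post_var.log()).sum()

        # RL batch: sampled from the whole buffer, decoupled from the context.
        batch = buffers[task].sample(rl_batch_size)                   # assumed API
        actor_loss, critic_loss = sac.losses(batch, task_latent=z)    # assumed API
        losses.append(actor_loss + critic_loss + kl_weight * kl)

    total = torch.stack(losses).mean()
    sac.optimizer.zero_grad()
    total.backward()
    sac.optimizer.step()
    return total.item()
```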
Experimental Evaluation
The experimental evaluation highlights PEARL's superior performance across a suite of continuous control tasks. Key metrics include:
- Sample Efficiency: The algorithm requires 20-100x fewer meta-training samples than prior methods to reach comparable performance.
- Asymptotic Performance: It matches or exceeds the performance of baselines on continuous control benchmarks such as the Half-Cheetah and Ant environments.
- Exploration in Sparse Rewards: In a 2D navigation task with sparse rewards, PEARL uses posterior sampling to explore coherently and adapt quickly, as sketched below.
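At meta-test time, adaptation via posterior sampling can be sketched as follows: the agent first acts under a z drawn from the prior, then repeatedly folds the transitions it collects into the posterior and resamples. The `collect_trajectory` callable, the episode count, and the function signature are hypothetical placeholders, not the authors' API.

```python
# Hypothetical sketch of PEARL-style adaptation at meta-test time via
# posterior sampling. `collect_trajectory(env, policy, z)` is an assumed
# helper that rolls out one episode with z held fixed and returns transitions.
import torch

def adapt(env, policy, encoder, collect_trajectory, latent_dim, num_episodes=3):
    context = []  # transitions gathered on this task so far
    for episode in range(num_episodes):
        if not context:
            # No evidence yet: sample z from the N(0, I) prior to explore.
            z = torch.randn(latent_dim)
        else:
            mu, var = encoder(torch.stack(context))
            z = mu + var.sqrt() * torch.randn_like(mu)

        # Temporally extended exploration: z is held fixed for a whole episode,
        # and the new transitions sharpen the posterior for the next one.
        context.extend(collect_trajectory(env, policy, z))
    return z, context
```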
Implications and Future Directions
The capability of PEARL to disentangle task inference from control through probabilistic modeling opens avenues for more efficient meta-learning techniques in RL. The approach could be extended to more complex, real-world scenarios where rapid adaptation and efficient learning are crucial.
Future research could explore the integration of PEARL with hierarchical reinforcement learning frameworks, leveraging its probabilistic task inference mechanism to improve planning and decision-making in complex environments.
Conclusion
The paper presents a significant methodological advance in meta-RL with the PEARL algorithm, addressing the core challenges of sample efficiency and task adaptability. By exploiting probabilistic context variables, it achieves state-of-the-art sample efficiency in off-policy meta-learning while enabling structured exploration and rapid adaptation in complex RL environments.