Analysis and Implications of "Recovery RL: Safe Reinforcement Learning With Learned Recovery Zones"
In "Recovery RL: Safe Reinforcement Learning with Learned Recovery Zones," the authors propose an innovative approach to address a significant challenge in reinforcement learning (RL): the balance between exploratory behavior, essential for learning, and safety constraints, required for real-world applicability. This work introduces Recovery RL, a method that delineates between task performance optimization and constraint satisfaction by employing two separate policies—a task policy and a recovery policy.
At the core of Recovery RL is this partitioning of objectives. The task policy focuses solely on maximizing task reward, but it acts in an MDP whose dynamics are effectively modified by the recovery policy's interventions. A learned safety critic monitors each proposed action; when it estimates too high a risk of constraint violation, the recovery policy takes over and steers the agent back toward safer states, as sketched below. By keeping these objectives separate, Recovery RL avoids the limitations of approaches that jointly optimize a single policy for reward and safety, where constraint penalties can degrade task performance.
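The following is a minimal sketch of this action-selection rule under assumed names: the callables `task_policy`, `recovery_policy`, and `safety_critic`, and the threshold `eps_risk`, are illustrative stand-ins, not the authors' code.

```python
def select_action(state, task_policy, recovery_policy, safety_critic, eps_risk=0.3):
    """Execute the task action if the safety critic judges it safe enough,
    otherwise fall back to the recovery policy."""
    a_task = task_policy(state)
    # safety_critic(s, a) estimates the discounted probability of a future
    # constraint violation if action a is taken in state s.
    if safety_critic(state, a_task) <= eps_risk:
        return a_task                   # low estimated risk: pursue the task
    return recovery_policy(state)       # high estimated risk: recover to safety
```

Because the executed action may differ from the one the task policy proposed, the task policy effectively learns in an environment whose dynamics include these recovery overrides.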
Notably, Recovery RL uses offline data to learn about potential constraint violations before the policy is deployed online, reducing the need for dangerous exploratory phases. In this offline phase, the safety critic is trained to estimate the probability of future constraint violations from transitions that include examples of prior violations; a sketch of such an update appears below. This pretraining improves safety during online operation and also reduces the amount of online learning required.
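One way to write this pretraining step is as a Bellman-style regression on the constraint-violation indicator: the target is 1 on a violating transition and otherwise bootstraps the discounted violation probability from the next state-action pair. The sketch below is an illustration under assumptions; the network architecture, `gamma_risk`, and the batch layout are hypothetical, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SafetyCritic(nn.Module):
    """Small MLP mapping (state, action) to an estimated probability of a
    future constraint violation."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)


def safety_critic_loss(q_risk, q_risk_target, batch, gamma_risk=0.8):
    s, a, c, s_next, a_next = batch   # c = 1.0 where (s, a) led to a violation
    with torch.no_grad():
        # Target: 1 on violation, otherwise the discounted probability of
        # violating a constraint from the next state-action pair onward.
        target = c + (1.0 - c) * gamma_risk * q_risk_target(s_next, a_next)
    return nn.functional.mse_loss(q_risk(s, a), target)
```

After this offline phase, the same critic is the gating signal used during online action selection.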
The empirical evaluations of Recovery RL span six simulated domains, ranging from 2D navigation to contact-rich manipulation, and a physical robotic system, comparing against five existing safe RL methods on the ratio of task successes to constraint violations. Recovery RL achieved a 2 to 20 times better tradeoff across the simulation domains and a 3 times better tradeoff on the physical robot.
These results underscore how effectively the dual-policy structure addresses the safety-performance tradeoff. Practically, they suggest that Recovery RL can scale to complex, real-world environments, improving the safety and robustness of autonomous systems. The work also opens several future research directions, including formal safety guarantees for learned recovery zones and the integration of offline RL techniques to improve recovery policy training.
Overall, "Recovery RL: Safe Reinforcement Learning with Learned Recovery Zones" makes a substantial contribution to the RL landscape by presenting a novel approach to a pervasive problem. Its implications are vast, suggesting that future developments in AI—specifically in safe RL—could greatly benefit from using separate policies for task optimization and safety, particularly in dynamic and uncertain real-world environments where safety cannot be compromised.