
Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning (2109.11978v3)

Published 24 Sep 2021 in cs.RO and cs.LG

Abstract: In this work, we present and study a training set-up that achieves fast policy generation for real-world robotic tasks by using massive parallelism on a single workstation GPU. We analyze and discuss the impact of different training algorithm components in the massively parallel regime on the final policy performance and training times. In addition, we present a novel game-inspired curriculum that is well suited for training with thousands of simulated robots in parallel. We evaluate the approach by training the quadrupedal robot ANYmal to walk on challenging terrain. The parallel approach allows training policies for flat terrain in under four minutes, and in twenty minutes for uneven terrain. This represents a speedup of multiple orders of magnitude compared to previous work. Finally, we transfer the policies to the real robot to validate the approach. We open-source our training code to help accelerate further research in the field of learned legged locomotion.

Authors (4)
  1. Nikita Rudin (13 papers)
  2. David Hoeller (15 papers)
  3. Philipp Reist (2 papers)
  4. Marco Hutter (165 papers)
Citations (439)

Summary

  • The paper demonstrates a novel massively parallel deep reinforcement learning approach that trains robotic locomotion policies on GPUs in minutes.
  • The methodology leverages NVIDIA’s Isaac Gym to simulate thousands of robots concurrently with custom PPO adjustments, maximizing simulation throughput.
  • Experimental results show ANYmal successfully learning flat terrain in 4 minutes and complex terrains in 20 minutes, with effective sim-to-real transfer.

Massively Parallel Deep Reinforcement Learning for Rapid Robotic Locomotion

The paper "Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning" presents an innovative approach to expedite the training of deep reinforcement learning (DRL) models for robotic locomotion tasks. The work, conducted by researchers at ETH Zurich and NVIDIA, leverages massive parallelism to achieve rapid training of policies on a single GPU, significantly reducing the time typically needed for such tasks.

Methodological Insights

The primary methodological advancement in this paper is the use of NVIDIA's Isaac Gym simulation environment for massively parallel simulation. The approach simulates thousands of robotic agents concurrently, substantially increasing the data throughput available for policy training. This setup circumvents the drawbacks of CPU-bound simulation, where CPU-GPU communication bottlenecks typically limit training throughput. By keeping both simulation and learning on the GPU, the authors eliminate significant data transfer overheads and improve processing efficiency.
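The pattern can be sketched as follows. This is a minimal illustration of GPU-resident batched rollouts written in PyTorch, not the authors' Isaac Gym code; the environment class, observation/action dimensions, and rollout horizon are placeholder assumptions chosen only to show how all data stays on the device.

```python
import torch

# Minimal sketch: every environment's observations, actions, and rewards live
# in a single batched tensor on the GPU, so nothing crosses the CPU-GPU
# boundary between simulation and learning. All names are illustrative.

NUM_ENVS = 4096                      # thousands of robots simulated concurrently
OBS_DIM, ACT_DIM = 48, 12            # placeholder dimensions
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


class BatchedSim:
    """Stand-in for a GPU physics engine such as Isaac Gym."""

    def __init__(self):
        self.obs = torch.zeros(NUM_ENVS, OBS_DIM, device=DEVICE)

    def step(self, actions):
        # A real simulator would integrate physics here; this stub only
        # returns tensors of the right shapes, already resident on the GPU.
        rewards = torch.randn(NUM_ENVS, device=DEVICE)
        dones = torch.rand(NUM_ENVS, device=DEVICE) < 0.01
        self.obs = torch.randn(NUM_ENVS, OBS_DIM, device=DEVICE)
        return self.obs, rewards, dones


policy = torch.nn.Sequential(
    torch.nn.Linear(OBS_DIM, 128), torch.nn.ELU(),
    torch.nn.Linear(128, ACT_DIM),
).to(DEVICE)

sim = BatchedSim()
obs = sim.obs
for _ in range(24):                   # short rollout horizon before each update
    with torch.no_grad():
        actions = policy(obs)         # one forward pass acts for all envs at once
    obs, rewards, dones = sim.step(actions)
```

Because the policy network, the simulator state, and the rollout buffers all live in GPU memory, the per-step cost is dominated by a handful of large batched kernels rather than thousands of small per-robot calls.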

The training uses a custom implementation of the Proximal Policy Optimization (PPO) algorithm, which collects data across all simulated environments and then performs policy updates on the accumulated batch. The approach is complemented by a game-inspired curriculum that adapts task difficulty to the robots' current performance, accelerating learning across complex terrains.
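A hedged sketch of such a performance-driven curriculum is shown below: each of the parallel robots carries a terrain difficulty level that is raised when it walks far enough and lowered when it fails. The promotion/demotion thresholds, level count, and the reassignment rule for robots that master the hardest level are illustrative assumptions, not the paper's exact scheme.

```python
import torch

NUM_ENVS, MAX_LEVEL = 4096, 9
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# One difficulty level per parallel robot, all starting on the easiest terrain.
levels = torch.zeros(NUM_ENVS, dtype=torch.long, device=DEVICE)


def update_curriculum(distance_walked, target_distance):
    """Update per-robot terrain levels at episode reset (thresholds are assumed)."""
    promote = distance_walked > 0.8 * target_distance   # walked far enough: harder terrain
    demote = distance_walked < 0.4 * target_distance     # struggled: easier terrain
    levels[promote] += 1
    levels[demote] -= 1
    # Robots that master the hardest terrain are respawned on a random level so
    # they keep contributing useful gradient signal instead of idling at the cap.
    maxed = levels > MAX_LEVEL
    levels[maxed] = torch.randint(0, MAX_LEVEL + 1, (int(maxed.sum()),), device=DEVICE)
    levels.clamp_(min=0, max=MAX_LEVEL)


# Example call with per-robot walked distances for the environments that just reset.
dist = torch.rand(NUM_ENVS, device=DEVICE) * 5.0
update_curriculum(dist, target_distance=5.0)
```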

Key modifications to the PPO setup include adjustments to hyper-parameters like batch sizes and mini-batch sizes, which are expanded significantly relative to typical configurations. Essential optimizations focus on maximizing the simulation throughput, crucial for leveraging the full computational capabilities of the GPU.
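To make the scale concrete, a back-of-the-envelope calculation is given below; the environment count and rollout length are of the order reported for this setup, but the exact values here are assumptions used only to illustrate the arithmetic.

```python
# Rough scale of one PPO update in the massively parallel regime.
num_envs = 4096                            # parallel robots (assumed)
steps_per_env = 24                         # short rollout before each update (assumed)
batch_size = num_envs * steps_per_env      # 98,304 transitions per update
num_mini_batches = 4                       # assumed split for gradient steps
mini_batch_size = batch_size // num_mini_batches   # 24,576 transitions per mini-batch
print(batch_size, mini_batch_size)
```

Batches of this size are orders of magnitude larger than typical single-environment PPO configurations, which is why the hyper-parameters governing batch and mini-batch sizes must be rescaled accordingly.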

Experimental Outcomes

Training was demonstrated on the quadrupedal robot ANYmal, with the aim of handling various terrain challenges. The results are compelling: policies capable of navigating flat terrain are trained within four minutes, while those for uneven, complex terrains are developed within twenty minutes. This constitutes an extraordinary improvement in training speed, reported to be several orders of magnitude faster than preceding benchmarks.

For validation, the trained policies were applied in real-world scenarios using physical robots, highlighting the system's practical applicability. The robustness and successful transfer from simulation to reality affirm the efficacy of the methodology, although some adjustments such as scaling velocity commands were necessary to account for imperfections in real-world sensing.

Implications and Future Directions

The implications of this research are manifold. Practically, the ability to train policies quickly opens new possibilities for adaptive and customized robot deployment, self-training within specific environments, and faster iteration toward more robust and refined policies during development. Theoretically, the work provides a case study on the advantages of integrating simulation and policy training on the same device, prompting further exploration of massively parallel architectures.

Future directions may explore broader applications, extending beyond robotic locomotion to other domains where DRL can benefit from rapid, high-throughput training. Additionally, further refinement of the sim-to-real transfer, including improved handling of sensor inaccuracies, could broaden the range of tasks these policies can robustly manage.

This paper marks a significant stride in the domain of DRL for robotics, demonstrating the potential of synchronous parallel environments driven by state-of-the-art hardware, setting a precedent for high-speed learning across complex, real-world tasks.
