Asynchronous Off-Policy Deep Reinforcement Learning for Robotic Manipulation
The paper "Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates" by Gu et al. presents a method to enhance the efficiency of deep reinforcement learning (DRL) applied to robotic manipulation tasks. Addressing the high sample-complexity often associated with DRL, the authors propose an asynchronous learning framework that allows multiple robots to learn a shared policy concurrently.
Summary of Approach
The core contribution of the paper is the integration of asynchronous updates in the DRL training process. The authors employ a centralized learner that asynchronously updates a shared policy network, while distributed worker threads collect data by executing the policy on physical robots. This parallelized approach aims to mitigate the traditionally prohibitive training times associated with applying DRL to real-world robotic tasks.
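The division of labor can be illustrated with a minimal sketch: collector threads push transitions into a shared replay buffer while a single training thread samples minibatches and applies gradient updates. The `env`/`policy` interfaces, locking scheme, and buffer size below are illustrative assumptions, not the authors' exact implementation.

```python
import random
import threading
from collections import deque

# Shared replay buffer: worker threads append transitions, the trainer samples minibatches.
replay_buffer = deque(maxlen=1_000_000)
buffer_lock = threading.Lock()

def collector(env, policy, num_steps):
    """Worker thread: executes the current policy on one robot (or one simulator instance)."""
    state = env.reset()
    for _ in range(num_steps):
        action = policy.act(state)                   # exploration noise would be added here
        next_state, reward, done = env.step(action)
        with buffer_lock:
            replay_buffer.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state

def trainer(policy, num_updates, batch_size=64):
    """Training thread: asynchronously updates the shared network from the pooled experience."""
    for _ in range(num_updates):
        with buffer_lock:
            if len(replay_buffer) < batch_size:
                continue
            batch = random.sample(replay_buffer, batch_size)
        policy.update(batch)                         # e.g. one off-policy (NAF) gradient step
```

Because the collectors only read the latest policy parameters and write transitions, while the trainer only reads transitions and writes parameters, neither side has to wait for the other, which is what makes the updates asynchronous.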
To maximize sample efficiency, the authors build on the Normalized Advantage Function (NAF) algorithm, an off-policy method that extends Q-learning to continuous action spaces. NAF is chosen over alternatives such as Deep Deterministic Policy Gradient (DDPG) for its simplicity and smaller number of hyperparameters. The asynchronous variant (termed Asynchronous NAF) lets multiple robots pool their experiences into a shared replay buffer from which the central training thread samples. Because each physical robot can only gather experience in real time, data collection is the bottleneck, and pooling experience across robots directly shortens wall-clock training time.
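Concretely, NAF constrains the Q-function so that its maximizing action is available in closed form. In the notation of the NAF formulation (states x, actions u, parameters θ), the value decomposes as

```latex
Q(x, u \mid \theta^{Q}) = V(x \mid \theta^{V}) + A(x, u \mid \theta^{A}),
\qquad
A(x, u \mid \theta^{A}) = -\tfrac{1}{2}\,
  \bigl(u - \mu(x \mid \theta^{\mu})\bigr)^{\top}
  P(x \mid \theta^{P})\,
  \bigl(u - \mu(x \mid \theta^{\mu})\bigr),
```

where P(x) is a state-dependent positive-definite matrix. Since the quadratic advantage term is maximized at u = μ(x), the greedy action needed for the Q-learning target is simply the network output μ(x), which is what makes standard Q-learning tractable with continuous actions.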
Implementation and Experimental Evaluation
Simulation Environments
The paper provides a thorough assessment via simulated tasks modeled in the MuJoCo physics simulator. The environments reflect real-world complexities:
- Reaching Task: A 7-degree-of-freedom (DoF) robotic arm learns to reach target positions randomly sampled within a predefined space.
- Door Manipulation Task: The same robotic arm learns either pushing or pulling to open a door. The reward structure includes terms for the end-effector's distance to the handle and the door's angular displacement (see the reward sketch after this list).
- Pick & Place Task: A Kinova JACO arm learns to grasp a suspended stick and place it at various target positions.
The simulation results emphasize that deep neural network representations for policies significantly outperform simpler linear models, particularly for more complex tasks requiring nuanced interaction dynamics, such as door manipulation.
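A minimal PyTorch-style sketch of such a deep NAF policy representation is given below: a shared trunk feeds three heads producing V(x), μ(x), and the entries of a lower-triangular matrix L(x), with P(x) = L(x) L(x)ᵀ guaranteeing positive definiteness. Layer sizes and other details are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NAFNetwork(nn.Module):
    """Sketch of a NAF-style network (illustrative sizes, not the paper's exact architecture)."""

    def __init__(self, state_dim, action_dim, hidden=100):
        super().__init__()
        self.action_dim = action_dim
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value_head = nn.Linear(hidden, 1)           # V(x)
        self.mu_head = nn.Linear(hidden, action_dim)     # greedy action mu(x)
        self.diag_head = nn.Linear(hidden, action_dim)   # diagonal entries of L(x)
        self.offdiag_head = nn.Linear(hidden, action_dim * (action_dim - 1) // 2)

    def forward(self, x, u):
        h = self.trunk(x)
        V = self.value_head(h)                           # (batch, 1)
        mu = self.mu_head(h)                             # (batch, action_dim)

        # Lower-triangular L with a positive diagonal, so P = L L^T is positive definite
        # and the advantage -0.5 (u - mu)^T P (u - mu) is concave in the action.
        L = torch.diag_embed(F.softplus(self.diag_head(h)) + 1e-6)
        rows, cols = torch.tril_indices(self.action_dim, self.action_dim,
                                        offset=-1, device=x.device)
        L[:, rows, cols] = self.offdiag_head(h)
        P = L @ L.transpose(1, 2)

        diff = (u - mu).unsqueeze(-1)                    # (batch, action_dim, 1)
        A = -0.5 * (diff.transpose(1, 2) @ P @ diff).squeeze(-1)
        return V + A, mu, V                              # Q(x, u), greedy action, V(x)
```

Parameterizing P through its triangular factor L is the standard way to keep the advantage term concave in the action, so the greedy action is always the μ(x) output regardless of how the weights evolve during training.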
Real-World Application
The real-world applicability of the proposed method is demonstrated through:
- Random Target Reaching: Multiple robots running Asynchronous NAF learn the reaching task with randomly varying target positions. Adding parallel workers markedly shortened the wall-clock time to convergence and improved the final policy's performance.
- Door Opening: Conducted with a 7-DoF arm, this task required the robot to learn to pull open a door autonomously. Utilizing two robots in parallel, the method achieved a 100% success rate in approximately 2.5 hours, demonstrating the practical applicability of asynchronously learned policies.
Implications and Future Directions
The implications of this asynchronous parallel learning framework are twofold. Practically, the method shows substantial promise for reducing training times, making DRL for robotic manipulation more feasible within real-world constraints. Theoretically, the results underscore the value of combining sample-efficient off-policy learning with asynchronous updates and parallel experience collection: the former reduces the amount of data required, while the latter reduces the wall-clock time needed to collect it.
For future developments, there are several avenues to explore:
- Sparse Reward Structures: Investigating the scalability of the method to more challenging tasks characterized by sparse rewards could broaden the applicability to new, less-defined tasks.
- Multi-robot Experience Generalization: Expanding the framework to handle diverse experiences collected across different robotic platforms and environments could further enhance the generalizability and robustness of learned policies.
- Hardware Acceleration: Improving the computational architecture, such as leveraging more advanced parallel processing units (e.g., GPUs), could unlock further efficiency gains, potentially pushing the boundaries of policy complexity and training speed.
Overall, this paper provides a significant step towards practical, efficient DRL for complex robotic manipulation, promoting future advancements both in the theoretical underpinnings of DRL and its real-world applications.