Motion Planner Augmented Reinforcement Learning for Robot Manipulation in Obstructed Environments (2010.11940v1)

Published 22 Oct 2020 in cs.RO, cs.AI, and cs.LG

Abstract: Deep reinforcement learning (RL) agents are able to learn contact-rich manipulation tasks by maximizing a reward signal, but require large amounts of experience, especially in environments with many obstacles that complicate exploration. In contrast, motion planners use explicit models of the agent and environment to plan collision-free paths to faraway goals, but suffer from inaccurate models in tasks that require contacts with the environment. To combine the benefits of both approaches, we propose motion planner augmented RL (MoPA-RL) which augments the action space of an RL agent with the long-horizon planning capabilities of motion planners. Based on the magnitude of the action, our approach smoothly transitions between directly executing the action and invoking a motion planner. We evaluate our approach on various simulated manipulation tasks and compare it to alternative action spaces in terms of learning efficiency and safety. The experiments demonstrate that MoPA-RL increases learning efficiency, leads to a faster exploration, and results in safer policies that avoid collisions with the environment. Videos and code are available at https://clvrai.com/mopa-rl .

Authors (8)
  1. Jun Yamada (11 papers)
  2. Youngwoon Lee (23 papers)
  3. Gautam Salhotra (10 papers)
  4. Karl Pertsch (35 papers)
  5. Max Pflueger (2 papers)
  6. Gaurav S. Sukhatme (88 papers)
  7. Joseph J. Lim (36 papers)
  8. Peter Englert (9 papers)
Citations (41)

Summary

This paper, "Motion Planner Augmented Reinforcement Learning for Robot Manipulation in Obstructed Environments" (Yamada et al., 2020), introduces MoPA-RL, a framework that combines model-free reinforcement learning (RL) with motion planning (MP) to tackle complex robot manipulation tasks in environments cluttered with obstacles. The core challenge addressed is that standard model-free RL struggles with exploration and collision avoidance in such environments, while traditional motion planners, effective for navigating static obstacles, fail on tasks requiring dynamic interaction and contact.

The key idea behind MoPA-RL is to augment the action space of an RL agent with the capabilities of a motion planner. The agent's policy outputs a desired joint displacement $\tilde{a}$. The magnitude of this displacement determines the execution method:

  • If the maximum absolute joint displacement in $\tilde{a}$ (i.e., $\|\tilde{a}\|_{\infty}$) is below a threshold $\Delta q_\text{step}$, the action is executed directly, typically via a low-level feedback controller, allowing for precise, contact-rich movements.
  • If $\|\tilde{a}\|_{\infty}$ exceeds $\Delta q_\text{step}$, indicating a large desired movement, a motion planner is invoked. The planner attempts to find a collision-free path from the current robot joint state to the target joint state $q + \tilde{a}$. If a path is found, the sequence of joint displacements along this path is executed. This allows the robot to navigate efficiently through complex, obstructed spaces while avoiding collisions (a minimal sketch of this dispatch follows the list).
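
A minimal sketch of this dispatch logic is shown below; the threshold name `delta_q_step` and the `controller`/`planner` interfaces are placeholders for illustration, not the authors' actual API.

```python
import numpy as np

def execute_action(q, a_tilde, delta_q_step, controller, planner):
    """Dispatch a policy action: direct execution for small displacements,
    motion planning for large ones (hypothetical interfaces)."""
    if np.max(np.abs(a_tilde)) <= delta_q_step:
        # Small displacement: hand the target joints to the low-level controller.
        controller.step(q + a_tilde)
        return [a_tilde]
    # Large displacement: ask the motion planner for a collision-free joint-space
    # path from the current state q to the target q + a_tilde.
    path = planner.plan(q, q + a_tilde)
    if path is None:
        return []  # planning failed; the caller decides how to handle it
    displacements = np.diff(np.vstack([q, path]), axis=0)
    for q_waypoint in path:
        controller.step(q_waypoint)
    return list(displacements)
```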

The framework can be viewed as training an RL agent on an augmented Markov Decision Process (MDP) or a semi-MDP, where a single action from the policy can correspond to a sequence of primitive actions executed over multiple time steps by the motion planner. The reward received for a motion planning action is the discounted sum of rewards accumulated along the generated path.
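
For concreteness, the return credited to a single motion-planning action can be computed from the per-step rewards collected while executing the planned path, roughly as follows (a sketch under that reading, not the authors' code):

```python
def motion_plan_reward(step_rewards, gamma=0.99):
    """Discounted sum of the primitive-step rewards r_0, ..., r_{T-1} collected
    along the executed path; this single scalar is the reward assigned to the
    one augmented action that invoked the planner."""
    return sum((gamma ** t) * r for t, r in enumerate(step_rewards))
```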

A practical challenge arises because the action space for motion planning ($[-\Delta q_\text{MP}, \Delta q_\text{MP}]^d$) is typically much larger than the space for direct execution ($[-\Delta q_\text{step}, \Delta q_\text{step}]^d$). With a naive uniform distribution, the probability of sampling a direct action becomes very low, especially in high-dimensional spaces. To counter this, the paper proposes an action space rescaling function. This piecewise linear function maps the policy output (in $[-1, 1]^d$) to joint displacements $\tilde{a}$, dedicating a larger proportion of the policy output space to the direct action range $[-\Delta q_\text{step}, \Delta q_\text{step}]$. A parameter $\omega \in [0, 1]$ controls the desired ratio, ensuring a more balanced exploration between direct actions and motion planning actions.
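
One way to realize such a piecewise linear rescaling is sketched below, interpreting $\omega$ as the fraction of the policy output range devoted to direct actions; the exact parameterization in the paper may differ.

```python
import numpy as np

def rescale_action(a_pi, omega, dq_step, dq_mp):
    """Map a policy output in [-1, 1]^d to a joint displacement.
    The inner fraction omega of the input range fills the direct-action range
    [-dq_step, dq_step]; the remainder fills the motion-planning range
    (dq_step, dq_mp]. Assumes 0 < omega < 1."""
    a_pi = np.asarray(a_pi, dtype=float)
    sign, mag = np.sign(a_pi), np.abs(a_pi)
    inner = mag <= omega
    out = np.empty_like(a_pi)
    # |a_pi| <= omega  ->  linearly fill [0, dq_step]
    out[inner] = dq_step * mag[inner] / omega
    # |a_pi| > omega   ->  linearly fill (dq_step, dq_mp]
    out[~inner] = dq_step + (mag[~inner] - omega) / (1.0 - omega) * (dq_mp - dq_step)
    return sign * out
```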

For implementation, the paper uses Soft Actor-Critic (SAC) as the base model-free RL algorithm, training standard feedforward neural networks for the policy and critic. To improve sample efficiency, especially for motion planning actions which cover long sequences of states and actions, the method samples $M$ intermediate sub-trajectories from each generated motion plan and adds them to the replay buffer.
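
One plausible reading of this sub-trajectory reuse is sketched below: each sampled segment of the planned path is stored as an extra transition whose action is the net displacement over the segment and whose reward is the segment's discounted return. The buffer interface and segment-sampling scheme here are assumptions.

```python
import random
import numpy as np

def add_subtrajectories(buffer, states, displacements, rewards, gamma, M):
    """Store M randomly sampled sub-segments of a motion-planned trajectory.
    states: joint states q_0, ..., q_T along the path; displacements and rewards
    both have length T. Each segment (i, j) becomes one transition (s_i, a, R, s_j)."""
    T = len(displacements)
    for _ in range(M):
        i, j = sorted(random.sample(range(T + 1), 2))  # 0 <= i < j <= T
        net_action = np.sum(displacements[i:j], axis=0)
        seg_return = sum((gamma ** t) * r for t, r in enumerate(rewards[i:j]))
        buffer.add(states[i], net_action, seg_return, states[j])
```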

The motion planner used is RRT-Connect from the Open Motion Planning Library (OMPL), chosen for its speed. Collision checking leverages the MuJoCo physics engine. Several optimizations make the MP integration efficient (the first two are sketched after the list):

  1. Attempting a fast linear interpolation check before invoking the more expensive sampling-based planner.
  2. Handling invalid target joint states (e.g., in collision or unreachable) by iteratively reducing the magnitude of the target displacement until a valid, collision-free goal state is found. This prevents the policy from getting stuck generating invalid targets.
  3. Treating a grasped object as part of the robot for collision checking during planning.
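
A rough sketch of how the first two optimizations might fit together is given below; `is_collision_free`, `straight_line_valid`, and `rrt_connect` are placeholder functions standing in for MuJoCo collision checks and the OMPL planner.

```python
import numpy as np

def plan_with_fallbacks(q, a_tilde, is_collision_free, straight_line_valid,
                        rrt_connect, shrink=0.8, max_tries=10):
    """Shrink an invalid target displacement until the goal state is valid,
    try a cheap straight-line joint-space path first, and only then fall back
    to the sampling-based planner."""
    q_goal = None
    for _ in range(max_tries):
        candidate = q + a_tilde
        if is_collision_free(candidate):
            q_goal = candidate
            break
        a_tilde = shrink * a_tilde  # reduce the displacement magnitude
    if q_goal is None:
        return None  # no valid goal found within max_tries

    # Cheap check: if the straight joint-space segment is collision-free,
    # skip the sampling-based planner entirely.
    if straight_line_valid(q, q_goal):
        return np.linspace(q, q_goal, num=10)

    return rrt_connect(q, q_goal)  # e.g., OMPL's RRT-Connect
```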

The effectiveness of MoPA-RL is demonstrated through experiments on four simulated robotic manipulation tasks in obstructed environments: 2D Push, Sawyer Push, Sawyer Lift, and Sawyer Assembly. These tasks require navigating around obstacles and performing contact-rich maneuvers.

The experimental results show that MoPA-SAC (MoPA-RL using SAC) significantly outperforms baselines like standard SAC, SAC with linearly interpolated large actions (SAC Large), and SAC with inverse kinematics (SAC IK). MoPA-SAC is the only method that consistently solves all four challenging tasks, converging faster and achieving higher success rates. The paper attributes this to improved exploration: MoPA-SAC agents explore a much wider state space early in training due to the long-range, collision-free movements provided by the motion planner, as shown in the 2D Push example.

Furthermore, MoPA-RL leads to safer policies. By leveraging the collision-avoidance capabilities of the motion planner for large movements, the learned policies exhibit significantly lower average contact forces during successful task execution compared to conventional RL baselines. This is critical for real-world robot deployment.

Ablation studies highlight the importance of key components:

  • The choice of motion planning action range $\Delta q_\text{MP}$ impacts performance: a range that is too small hinders exploration, while one that is too large increases planning complexity.
  • Action space rescaling (controlled by $\omega$) is crucial for balancing exploration between direct actions and motion planning, preventing the agent from neglecting the direct action space needed for contact-rich tasks.
  • The hybrid action space (allowing both direct execution and MP) is essential; training without direct action execution fails on tasks requiring precise contact.
  • Reusing intermediate trajectories from motion plans (parameter $M$) improves sample efficiency, but using too many samples can bias the replay buffer and slightly hinder learning.
  • The method for handling invalid target joint states by reducing displacement magnitude is critical for training stability and efficiency.

In conclusion, MoPA-RL provides a practical and effective way to combine the strengths of model-free RL and motion planning, enabling robots to learn complex, contact-rich manipulation skills efficiently and safely in environments with significant obstacles. The framework is general enough to be combined with different model-free RL algorithms and motion planners, although performance may vary depending on the specific choices and hyperparameter tuning.