- The paper introduces a technique that learns a residual function to improve non-differentiable base policies using deep reinforcement learning.
- It trains a residual network whose final layer is zero-initialized, using methods like DDPG and HER, so the combined policy initially reproduces the base policy while learning corrections.
- Experiments on robotic manipulation tasks demonstrate that RPL outperforms both learning from scratch and an expert-exploration baseline, improving data efficiency and final performance.
Residual Policy Learning (RPL) is a method for improving existing policies, which can be non-differentiable, using model-free deep reinforcement learning. The core idea is to augment an initial policy $\pi(s)$ with a learned residual function $f_\theta(s)$, resulting in a combined policy $\pi_\theta(s) = \pi(s) + f_\theta(s)$. The residual function $f_\theta(s)$ is parameterized by a neural network and learned using standard deep reinforcement learning techniques. A key practical advantage is that the gradient of the combined policy with respect to the learned parameters $\theta$ only depends on the residual, $\nabla_\theta \pi_\theta(s) = \nabla_\theta f_\theta(s)$, allowing non-differentiable initial policies to be incorporated.
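To make the combination concrete, here is a minimal sketch of the combined policy, assuming PyTorch; the names `ResidualPolicy`, `base_policy`, and `residual_net` are illustrative rather than taken from the paper's code:

```python
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    """Combined policy pi_theta(s) = pi(s) + f_theta(s) (illustrative sketch)."""

    def __init__(self, base_policy, residual_net):
        super().__init__()
        self.base_policy = base_policy    # any callable state -> action; may be non-differentiable
        self.residual_net = residual_net  # f_theta, the learned correction network

    def forward(self, state):
        # The base action is computed outside the autograd graph, so it acts as a
        # constant with respect to theta: grad_theta pi_theta(s) = grad_theta f_theta(s).
        with torch.no_grad():
            base_action = torch.as_tensor(self.base_policy(state), dtype=torch.float32)
        return base_action + self.residual_net(state)
```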
The motivation behind RPL stems from the challenges of applying deep RL to complex robotic manipulation tasks. Learning policies from scratch can be data-intensive or intractable for long-horizon, sparse-reward problems. Conversely, while hand-designed or model-predictive controllers exist, they often struggle with real-world complexities like sensor noise, model inaccuracies, or environmental variations. RPL aims to bridge this gap by using the initial controller as a starting point or "hint," and letting RL learn the necessary corrections or refinements through the residual function.
For practical implementation, the residual function $f_\theta$ is typically represented by a deep neural network. The paper uses Deep Deterministic Policy Gradients (DDPG) combined with Hindsight Experience Replay (HER) for training, which are well-suited to the continuous action spaces and sparse rewards prevalent in robotic tasks. To ensure that the learned policy initially behaves exactly like the base policy, the final layer of the residual network is initialized with zero weights and zero bias, so the residual outputs zero for every state.
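A possible zero initialization, again a PyTorch sketch with assumed network sizes, looks like this; zeroing both the weights and the bias of the last layer guarantees zero output for every input:

```python
import torch.nn as nn

def make_residual_net(obs_dim, act_dim, hidden=256):
    """MLP for f_theta; the hidden size of 256 is an assumption, not from the paper."""
    net = nn.Sequential(
        nn.Linear(obs_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, act_dim),
    )
    nn.init.zeros_(net[-1].weight)  # zero weights ...
    nn.init.zeros_(net[-1].bias)    # ... and zero bias => f_theta(s) = 0 for all s
    return net
```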
When using actor-critic methods like DDPG, the critic (value function) is trained alongside the actor (policy). The paper suggests a "burn-in" period where only the critic is trained initially, while the actor (residual policy) remains fixed to the base policy. This helps the critic learn meaningful values before the policy starts changing based on potentially poor initial value estimates. For partially observable tasks, RPL can be extended by making the residual network recurrent or, as an approximation used in the paper, feeding a history of recent states to the network.
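The burn-in idea can be expressed as a schematic training loop; the `agent` interface (`act`, `store`, `update_critic`, `update_actor`) and the burn-in length are hypothetical placeholders, not the paper's actual implementation:

```python
def train_rpl(agent, env, total_steps, burn_in_steps=10_000):
    """Schematic DDPG+HER-style loop with a critic-only burn-in phase."""
    state = env.reset()
    for step in range(total_steps):
        # With a zero-initialized residual, this acts exactly like the base policy
        # until the actor starts updating.
        action = agent.act(state)
        next_state, reward, done, info = env.step(action)
        agent.store(state, action, reward, next_state, done)

        agent.update_critic()            # always train the value function
        if step >= burn_in_steps:
            agent.update_actor()         # only adapt the residual after burn-in

        state = env.reset() if done else next_state
```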
The paper evaluates RPL on six complex robotic manipulation tasks implemented in MuJoCo, featuring challenges like partial observability, sensor noise, model misspecification, controller miscalibration, and complex objects/environments:
- Push: A standard task to push a cube to a target. Uses a Discrete MPC policy as the initial controller, which struggles with limited search depth and action discretization.
- SlipperyPush: A variation of Push with lower object friction, designed to test model misspecification. Uses a hand-designed reactive policy tuned for high friction as the initial controller, which often pushes the object off the table.
- PickAndPlace: A task requiring picking up a cube and placing it at a target, possibly in the air. Uses a hand-designed reactive policy with artificially miscalibrated gains, leading to oscillatory behavior.
- NoisyHook: A task where the robot must use a hook to move a block, featuring observation noise on object and hook positions/rotations. Uses a hand-designed reactive policy designed for a noiseless environment, which is sensitive to noise.
- ComplexHook: A task using the hook to move objects with diverse shapes/physics properties in environments with randomly placed obstacles (bumps). The state does not include information about the object type or obstacles, testing structured uncertainty and partial observability. Uses the same noise-free reactive hook policy.
- MBRLPusher: A task using a 7-DOF arm to push a cylinder to a target, taken from a model-based RL benchmark. Uses a cached version of a PETS (model-based RL) controller as the initial policy, exploring combination with MBRL.
The evaluation compares RPL against three baselines: the initial policy alone, learning from scratch using DDPG+HER, and an "Expert Explore" baseline in which the learned policy is trained from scratch but, with probability ϵ, the initial policy's action is used during data collection (ϵ-greedy style).
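For contrast with RPL's residual parameterization, the Expert Explore baseline's action selection can be sketched as follows; the ϵ value is illustrative, not taken from the paper:

```python
import random

def expert_explore_action(learned_policy, initial_policy, state, epsilon=0.2):
    """Expert Explore baseline: the initial policy only guides data collection;
    the learned policy is trained from scratch, with no residual structure."""
    if random.random() < epsilon:
        return initial_policy(state)   # occasionally follow the expert/initial controller
    return learned_policy(state)       # otherwise act with the policy being learned
```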
The results consistently demonstrate that RPL significantly improves the performance of the initial policies across all tasks. Furthermore, RPL often achieves the same or better asymptotic performance compared to learning from scratch but with substantially higher data efficiency. In tasks like NoisyHook and ComplexHook, where the horizon is long and rewards are sparse, learning from scratch with DDPG+HER fails to achieve non-trivial success rates, while RPL successfully learns policies with high success rates by building on the initial imperfect controllers. The Expert Explore baseline typically performs better than learning from scratch but worse than RPL, suggesting that RPL's advantage comes not only from improved exploration guided by the initial policy but also from the specific residual parameterization and initialization.
The paper discusses three main reasons for RPL's success:
- Initialization: Starting with the initial policy's output (by initializing the residual to zero) provides a strong starting point if the base policy is even moderately good.
- Improved Exploration: The initial policy provides a non-random way to explore the environment, potentially reaching states where rewards are encountered more frequently than with purely random actions.
- Easier Problem: The residual learning problem may be inherently simpler than learning the full policy from scratch, focusing on correcting errors rather than discovering the entire behavior.
RPL is presented as a general method applicable to any domain with continuous actions and gradient-based policy learning, but it is particularly well-suited to complex robotic manipulation due to the common availability of initial controllers and the inherent difficulty of learning from scratch. The combination with model-based RL (shown with the cached PETS controller on MBRLPusher) suggests a path to combining the data efficiency of MBRL with the asymptotic performance of model-free methods.
In summary, RPL provides a practical and effective framework for leveraging existing imperfect controllers to significantly accelerate and improve deep reinforcement learning in challenging domains like robotic manipulation, particularly in the presence of real-world complexities and sparse rewards.