Residual Policy Learning in Robotics
- Residual Policy Learning is a hybrid approach that combines a fixed baseline controller with a learnable residual policy to improve control performance.
- It uses an additive control policy in which the residual component is trained with gradient-based RL while the baseline remains fixed, so learning starts from the baseline's competence and never requires gradients through the baseline controller.
- Empirical studies in robotic manipulation tasks show that RPL significantly enhances sample efficiency and overcomes challenges like sensor noise and model miscalibration.
Residual Policy Learning (RPL) is a class of algorithms that address the limitations of both traditional robotic controllers and deep reinforcement learning (RL) by integrating a fixed, possibly non-differentiable baseline controller with a learnable, data-driven residual policy. This methodology enables rapid, sample-efficient adaptation and correction of existing control strategies, particularly in complex manipulation domains characterized by partial observability, sensor noise, and model or actuator miscalibration. RPL has emerged as a unifying paradigm that leverages the strengths of classical control for safety and initial competence, while harnessing the flexibility of model-free RL for adaptation and fine-tuning in settings where pure RL methods are often data-inefficient or intractable.
1. Principle and Formalization
Residual Policy Learning defines a composite control policy as the sum of a fixed baseline controller $\pi_0$ and a trainable residual function $f_\theta$:

$$\pi(s) = \pi_0(s) + f_\theta(s),$$

where $s$ is the system state, $\pi_0$ may be any externally supplied, possibly non-differentiable policy (such as a hand-designed or model-predictive controller), and $f_\theta$ is a neural network parameterized by $\theta$.
A salient feature is that $\pi_0$ is held fixed and may be non-differentiable; this construction permits the use of gradient-based RL on $f_\theta$ exclusively. The transition dynamics for the induced residual Markov Decision Process (MDP) are reparameterized as:

$$T'(s, a, s') = T(s,\ \pi_0(s) + a,\ s').$$

This recasting allows the RL agent to produce residual actions and effectively "wrap" the base policy as an environment. The overall action is thus generated by summing the baseline control output and the learned residual at every timestep.
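A minimal sketch of this residual-MDP construction is shown below, assuming a Gym-style environment interface; the wrapper class, its method names, and the clipping to actuator bounds are illustrative choices rather than details fixed by the formulation above.

```python
import numpy as np


class ResidualWrapper:
    """Folds a fixed base policy into the environment so that the RL agent
    only ever selects residual actions (the residual MDP above)."""

    def __init__(self, env, base_policy):
        self.env = env                    # Gym-style environment (assumed API)
        self.base_policy = base_policy    # any callable s -> action; may be black-box
        self._last_obs = None

    def reset(self):
        self._last_obs = self.env.reset()
        return self._last_obs

    def step(self, residual_action):
        # Composite action: pi(s) = pi_0(s) + f_theta(s)
        base_action = self.base_policy(self._last_obs)
        action = np.clip(base_action + residual_action,
                         self.env.action_space.low,
                         self.env.action_space.high)
        obs, reward, done, info = self.env.step(action)
        self._last_obs = obs
        return obs, reward, done, info
```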
2. Algorithmic Methods and Training
Implementation of RPL typically utilizes actor-critic algorithms such as Deep Deterministic Policy Gradient (DDPG) in conjunction with replay buffer techniques such as Hindsight Experience Replay (HER), which are particularly effective for sparse-reward, long-horizon robotic tasks.
Key implementation considerations include:
- Residual Initialization: The network $f_\theta$ is initialized such that $f_\theta(s) = 0$ for all $s$ (e.g., last-layer weights and biases set to zero). This guarantees that $\pi(s) = \pi_0(s)$ at the outset, preventing destabilization of a good baseline (see the sketch at the end of this section).
- Partial Observability Handling: To address non-Markovian observations, a short history (such as concatenating the previous state) may be incorporated into the input of $f_\theta$.
- Gradient Computation: Gradients are computed with respect to $\theta$ only; derivatives of $\pi_0$ are never required, so arbitrary controllers, including black-box and non-differentiable ones, can serve as the baseline.
The optimization is performed in the "residual MDP," where the agent selects residual actions and observes the resulting transitions after the base policy output is added.
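The zero-residual initialization mentioned above can be implemented by zeroing the final layer of the residual network. The sketch below assumes PyTorch; the layer sizes, and the option of enlarging the input to include the previous observation, are illustrative choices.

```python
import torch
import torch.nn as nn


class ResidualActor(nn.Module):
    """Residual network f_theta; outputs zero everywhere at initialization."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        # obs_dim can be doubled to append the previous observation as a
        # simple remedy for partial observability.
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )
        last = self.net[-1]
        nn.init.zeros_(last.weight)   # f_theta(s) = 0 for all s at the outset,
        nn.init.zeros_(last.bias)     # so the composite policy equals the baseline.

    def forward(self, obs):
        return self.net(obs)
```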
3. Classes of Baseline Controllers
RPL is applicable to a broad spectrum of baseline policies:
- Hand-designed/reactive policies: These controllers are often computationally efficient and robust under nominal conditions but brittle in the face of unmodeled dynamics (e.g., friction variation or sensor faults); a minimal example of such a controller is sketched at the end of this section.
- Model-Predictive Controllers (MPCs): Both discretized MPCs (e.g., DiscreteMPCPush) and MPCs with learned transition models (e.g., CachedPETS) serve as base controllers. While model-based methods excel at exploiting known system invariants, they are limited by model bias and by computational tractability in high dimensions.
RPL “inherits” the strengths of these systems and learns to compensate for their idiosyncratic failures.
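To make the first class concrete, the following is a minimal sketch of a hand-designed reactive pushing controller of the kind RPL can wrap; the observation layout, threshold, and gain are assumptions for illustration, not the controllers used in the tasks below.

```python
import numpy as np


def reactive_push_policy(obs, gain=5.0, contact_threshold=0.05):
    """Illustrative hand-designed baseline: approach the object, then push it
    toward the goal. Assumes obs is a dict of 3-D positions (hypothetical)."""
    gripper, obj, goal = obs["gripper"], obs["object"], obs["goal"]
    if np.linalg.norm(gripper - obj) > contact_threshold:
        direction = obj - gripper      # phase 1: move the gripper to the object
    else:
        direction = goal - obj         # phase 2: push the object toward the goal
    return np.clip(gain * direction, -1.0, 1.0)
```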
4. Task Domains and Empirical Results
RPL has been validated on a suite of challenging MuJoCo-based robotic manipulation tasks:
- Push and SlipperyPush: Object pushing under nominal and altered friction.
- PickAndPlace (miscalibrated): Pick-and-place with a baseline controller that oscillates due to poorly tuned gains.
- NoisyHook and ComplexHook: Hook-based manipulation with severe sensor noise, heterogeneous object dynamics, and unobservable obstacles.
- MBRLPusher: Continuous, 7-DOF pushing task using model-based controllers as baseline.
Empirical findings demonstrate that RPL:
- Significantly improves performance over all baselines, often achieving near-optimal success rates even from high-quality starting controllers.
- Enhances data efficiency: On sparse-reward tasks, RPL achieves target performance with an order of magnitude fewer samples compared to RL from scratch.
- Outperforms pure RL in adversity: On tasks with partial observability, sensor corruption, or model mismatch, RPL succeeds where pure RL agents and non-adaptive controllers fail (e.g., increasing success from 15-55% baseline to >80%).
- Elevates model-based controllers: On MBRLPusher, combining RPL with CachedPETS improves asymptotic performance and reduces required environment interaction steps.
A controlled "Expert Explore" baseline experiment establishes that the performance gains of RPL are not limited to improved exploration, but are fundamentally due to the beneficial additive parameterization and zero-residual initialization.
5. Implementation and Scalability Considerations
RPL demands little modification to standard deep RL infrastructure:
- As the base controller can interface via any black-box API, integration is practical for both simulation and real robot systems.
- Initializing the residual function at zero preserves early safe operation—a critical feature for hardware deployment.
- Scaling to high-dimensional action spaces is feasible, especially in continuous control domains, due to the reduction of the effective learning task to local corrections rather than global policy specification.
However, strong reliance on the baseline controller may limit the capacity to outperform it in regions far outside its competence, and the parameterization of $f_\theta$ must be tailored to the underlying action and observation spaces.
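One practical way to keep corrections local and hardware-safe, complementing the zero-residual initialization discussed above, is to bound the residual's magnitude relative to the actuator limits. This is a common deployment heuristic rather than part of the original formulation; the scale factor below is an assumption.

```python
import numpy as np


def compose_action(base_action, residual, action_low, action_high, residual_scale=0.3):
    """Add a magnitude-bounded residual to the base action and clip to actuator limits."""
    half_range = (np.asarray(action_high) - np.asarray(action_low)) / 2.0
    limit = residual_scale * half_range          # residual may move the action by at most
    bounded = np.clip(residual, -limit, limit)   # residual_scale of the half-range
    return np.clip(base_action + bounded, action_low, action_high)
```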
6. Implications and Extensions
RPL constitutes a general paradigm for fusing explicit, structured knowledge with adaptable, model-free learning. Implications include:
- Hybridization of Control and Learning: RPL forms a bridge between classical (possibly analytical or heuristically designed) controllers and data-driven reinforcement learning, enabling practical improvements without abundant data collection or system identification.
- Robustness to Model Misspecification: By correcting for sensor noise, miscalibration, and unmodeled disturbances, RPL enables reliable deployment of RL in domains previously dominated by model-based approaches.
- Framework Generality: The additive residual architecture is applicable to any off-policy, gradient-based method and can be extended to other settings, such as shared autonomy, imitation learning, or meta-learning contexts.
Potential future directions encompass:
- Transferring the approach to real-world robotic platforms and complex, non-stationary environments.
- Integration with meta-learning algorithms for faster adaptation in rapidly changing domains.
- Exploration of more expressive or compositional residual functions for enhanced adaptability beyond local corrections.
RPL thereby provides a theoretically justified and empirically validated approach for augmenting and future-proofing classic controllers with deep, adaptive learning architectures.