Compliant Residual Policy Learning
- Compliant Residual Policy Learning is a strategy that enhances an existing base controller by adding a learned, differentiable residual function.
- This method effectively bridges classical control and deep reinforcement learning, enabling robust performance in real-world robotics challenges like noise, partial observability, and model errors.
- Empirical results demonstrate significant gains in data efficiency and task performance compared to training a policy from scratch, while preserving the initial controller's competence.
A compliant residual policy is a control and learning strategy that augments the output of an existing arbitrary policy—often a hand-designed or non-differentiable controller—by adding a learned, differentiable residual function. This formulation enables rapid, reliable, and data-efficient improvement of policies in complex domains, particularly in robotics and reinforcement learning, where initial controllers are good but imperfect. The approach serves as a principled bridge between high-capacity model-free reinforcement learning and classical control, facilitating robust performance in settings with partial observability, sensor noise, model misspecification, and controller miscalibration.
1. Formulation and Mathematical Framework
Residual Policy Learning (RPL) introduces a residual function $f_\theta(s)$, parameterized by $\theta$, into the overall policy:

$$\pi(s) = \pi_0(s) + f_\theta(s)$$

where:
- $\pi_0$ is the initial, potentially non-differentiable or hand-crafted policy,
- $f_\theta$ is a differentiable neural network (the "residual"),
- $s$ is the state, and $\pi(s)$ is the action.

The learning problem is recast in a "residual MDP," $M' = (S, A, P', R)$, whose transition function folds the base policy into the dynamics:

$$P'(s' \mid s, a) = P(s' \mid s, \pi_0(s) + a),$$

so the residual policy learns in the space of corrections relative to the base policy.

A crucial property is that, since $\pi_0$ is fixed and independent of $\theta$, gradients with respect to $\theta$ reduce to

$$\nabla_\theta \pi(s) = \nabla_\theta f_\theta(s),$$

allowing direct application of policy gradient or actor-critic algorithms.
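A minimal sketch of this composition, assuming a PyTorch residual network and a hypothetical black-box `base_policy` callable (names and shapes here are illustrative, not taken from the original implementation):

```python
import torch
import torch.nn as nn


class ResidualPolicy(nn.Module):
    """pi(s) = pi_0(s) + f_theta(s): a fixed base policy plus a learned residual."""

    def __init__(self, base_policy, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.base_policy = base_policy  # arbitrary callable, may be non-differentiable
        self.residual = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs):
        # The base action is computed outside the autograd graph, so
        # grad_theta pi(s) = grad_theta f_theta(s), exactly as in the formulation above.
        with torch.no_grad():
            base = torch.as_tensor(
                self.base_policy(obs.detach().cpu().numpy()),
                dtype=torch.float32,
            )
        return base + self.residual(obs)
```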
2. Practical Role and Integration with Non-Differentiable Controllers
Compliant residual policies are particularly suited for scenarios where:
- The base policy is hand-coded logic or a model-predictive controller, which may be non-differentiable or closed-source.
- Model-free RL from scratch is either intractable or data inefficient due to exploration requirements, long horizons, or sparse rewards.
By learning only a correction, $f_\theta(s)$, rather than the entire mapping, RPL:
- Inherits exploration structure from the prior policy, facilitating faster discovery of reward in difficult tasks.
- Provides fine-grained corrections where the base policy fails (e.g., due to unmodeled friction, partial observability, or calibration errors).
- Ensures compliance, in that if $f_\theta(s)$ is zero, the system defaults to $\pi_0(s)$, never making an initially good policy worse.
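In practice, the residual MDP is often realized by wrapping the environment so that the agent's action is interpreted as a correction added to the base controller's output before execution. The sketch below assumes a Gymnasium-style interface, a Box action space, and a hypothetical `base_controller` callable:

```python
import gymnasium as gym
import numpy as np


class ResidualWrapper(gym.Wrapper):
    """Exposes the residual MDP: the agent outputs a correction a,
    and the environment executes pi_0(s) + a."""

    def __init__(self, env, base_controller):
        super().__init__(env)
        self.base_controller = base_controller  # any callable s -> action (black box)
        self._last_obs = None

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_obs = obs
        return obs, info

    def step(self, residual_action):
        base_action = self.base_controller(self._last_obs)
        # Clip the combined action to the valid range (assumes a Box action space).
        combined = np.clip(base_action + residual_action,
                           self.env.action_space.low,
                           self.env.action_space.high)
        obs, reward, terminated, truncated, info = self.env.step(combined)
        self._last_obs = obs
        return obs, reward, terminated, truncated, info
```

Because the wrapper only needs to call the base controller and add its output, the controller itself can remain a black box, which is what makes this compatible with hand-coded or legacy systems.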
3. Addressing Real-World Robotics and Control Challenges
Empirical research demonstrates that compliant residual policies address several core challenges:
- Partial Observability: RPL can incorporate state histories or recurrent architectures, mitigating the effect of unobserved task variables.
- Sensor Noise: By aggregating or filtering recent state inputs, the residual policy can smooth out corruptions in state estimation.
- Model Misspecification: When environmental dynamics diverge from assumptions (e.g., unknown friction, mass, or obstacles), the residual policy adapts the base controller to actual task conditions.
- Controller Miscalibration: The residual compensates for parameters in the base policy (such as gain or set-point errors), yielding stable and robust behavior.
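As a concrete illustration of the first two points, one simple option is to feed the residual network a short window of recent observations. The helper below is a hypothetical sketch assuming flat observation vectors, not part of any published implementation:

```python
from collections import deque

import numpy as np


class HistoryBuffer:
    """Keeps the last k observations so the residual sees a short history,
    which helps under partial observability and sensor noise."""

    def __init__(self, k, obs_dim):
        self.k, self.obs_dim = k, obs_dim
        self.buf = deque(maxlen=k)

    def reset(self, obs):
        self.buf.clear()
        for _ in range(self.k):
            self.buf.append(obs)          # pad with the first observation
        return self.stacked()

    def append(self, obs):
        self.buf.append(obs)
        return self.stacked()

    def stacked(self):
        return np.concatenate(self.buf)   # shape (k * obs_dim,), input to f_theta
```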
4. Experimental Evaluation and Performance Gains
RPL has been benchmarked in six MuJoCo manipulation environments, encompassing:
- Block pushing with MPC and reactive controllers,
- Pick-and-place and hook-based manipulation with noise and hidden variables,
- A 7-DOF arm executing model-based behaviors under uncertain conditions.
Comparative Performance Table:
| Task | Initial Policy | RL-from-Scratch | RPL Performance | Key Improvement |
|---|---|---|---|---|
| Push | ∼0.5 | ∼1.0 (slow) | Converges to a near-perfect policy much faster | Data efficiency |
| SlipperyPush | ∼0.45 | ∼1.0 (slow) | Matches RL from scratch, much faster | Corrects misspecification |
| PickAndPlace | ∼0.5 | Needs ∼10x more data | Reaches 1.0, ∼10x faster | Sample efficiency |
| NoisyHook | ∼0.15 | Fails | Reaches ∼0.8 | Handles noise and partial observability |
| ComplexHook | ∼0.55 | Fails | Reaches ∼0.8 | Surpasses base capability |
| MBRLPusher | PETS (MBRL) | DDPG+HER (slow) | Beats the MBRL baseline, converges quickly | Hybrid model-based/model-free |
Notable findings:
- RPL always improved the base controller.
- It delivered order-of-magnitude gains in data efficiency and solved tasks out of reach for RL alone.
5. Implementation Considerations and Strategies
- Initialization: The residual is initialized to produce zero output, ensuring no immediate deviation from the base policy and preserving initial competence.
- Burn-in for Critic: In actor-critic frameworks (e.g., DDPG), the critic is pretrained on the initial policy’s returns before learning the actor, guarding against premature degradation.
- Base Policy Flexibility: Any controller—hand-coded, model-predictive (with known or learned dynamics), or even a learned policy—can serve as the baseline.
- Deployment: The approach requires only the ability to sample and add outputs from the base and learned residual at runtime, making it compatible with black-box or legacy systems.
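The zero-initialization point translates directly into code. A minimal sketch, assuming the residual is a PyTorch `nn.Sequential` ending in a linear layer (illustrative only):

```python
import torch
import torch.nn as nn


def zero_init_last_layer(residual_net: nn.Sequential) -> None:
    """Zero the final linear layer so the residual outputs 0 everywhere,
    meaning the combined policy initially reproduces the base controller."""
    last = residual_net[-1]
    assert isinstance(last, nn.Linear)
    nn.init.zeros_(last.weight)
    nn.init.zeros_(last.bias)


# Example: after zero-initialization, the residual contributes nothing.
residual = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
zero_init_last_layer(residual)
assert torch.allclose(residual(torch.randn(5, 8)), torch.zeros(5, 2))
```

Critic burn-in is typically handled in the same spirit: actor updates are simply skipped for an initial number of environment steps while the critic fits the returns of the unmodified base behavior.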
6. Broader Implications and Synthesis
Compliant residual policy learning functions as a bridge between traditional control and deep RL:
- It leverages the structure, safety, and robustness of existing controllers as a foundation, while extending their capabilities to unanticipated operating regimes or environments.
- The formulation reduces the effective complexity of the learning task by focusing on corrections rather than full policy discovery.
- The approach generalizes across RL algorithms (applicable to DDPG, HER, and others), and is not limited to manipulation: it has been successfully extended to antenna control, powertrain optimization, autonomous driving, and shared autonomy contexts in later works.
Conclusion:
Compliant residual policy learning is a practical, general, and effective strategy for improving arbitrary initial controllers in reinforcement learning and robotics. By combining the strengths of classical control and high-capacity learning, it delivers robust gains in speed, efficiency, and performance, particularly in domains characterized by complex dynamics, imperfect models, and partial observation.