Compliant Residual Policy Learning

Updated 1 July 2025
  • Compliant Residual Policy Learning is a strategy that enhances an existing base controller by adding a learned, differentiable residual function.
  • This method effectively bridges classical control and deep reinforcement learning, enabling robust performance in real-world robotics challenges like noise, partial observability, and model errors.
  • Empirical results demonstrate significant gains in data efficiency and task performance compared to training a policy from scratch, while preserving the initial controller's competence.

A compliant residual policy is a control and learning strategy that augments the output of an existing arbitrary policy—often a hand-designed or non-differentiable controller—by adding a learned, differentiable residual function. This formulation enables rapid, reliable, and data-efficient improvement of policies in complex domains, particularly in robotics and reinforcement learning, where initial controllers are good but imperfect. The approach serves as a principled bridge between high-capacity model-free reinforcement learning and classical control, facilitating robust performance in settings with partial observability, sensor noise, model misspecification, and controller miscalibration.

1. Formulation and Mathematical Framework

Residual Policy Learning (RPL) introduces a residual function $f_\theta : \mathcal{S} \rightarrow \mathcal{A}$, parameterized by $\theta$, into the overall policy: $\pi_\theta(s) = \pi(s) + f_\theta(s)$, where:

  • $\pi(s)$ is the initial, potentially non-differentiable or hand-crafted policy,
  • $f_\theta(s)$ is a differentiable neural network (the "residual"),
  • $s \in \mathcal{S}$ is the state, and $a \in \mathcal{A} \subseteq \mathbb{R}^d$ is the action.

The learning problem is recast in a "residual MDP," $M^{(\pi)}$, with the transition $T^{(\pi)}(s, a, s') = T(s, \pi(s) + a, s')$, so the residual policy learns in the space of corrections relative to the base policy.

A crucial property is that, since $\pi$ is fixed, the gradients with respect to $\theta$ satisfy $\nabla_\theta \pi_\theta(s) = \nabla_\theta f_\theta(s)$, allowing direct application of policy gradient or actor-critic algorithms.
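
To make the formulation concrete, here is a minimal PyTorch sketch of a residual policy wrapping a fixed, possibly non-differentiable base controller; the class name, network architecture, and tensor conventions are illustrative assumptions rather than details from the original work.

```python
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    """pi_theta(s) = pi(s) + f_theta(s); gradients flow only through f_theta."""

    def __init__(self, base_policy, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.base_policy = base_policy  # arbitrary callable s -> a, treated as a black box
        self.residual = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state):
        # The base action is computed outside the autograd graph, so
        # grad_theta pi_theta(s) = grad_theta f_theta(s), as in the text.
        with torch.no_grad():
            base_action = torch.as_tensor(self.base_policy(state), dtype=torch.float32)
        return base_action + self.residual(state)
```

Because the base action enters the sum as a constant, any off-the-shelf policy-gradient or actor-critic update can be applied to the residual parameters unchanged.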

2. Practical Role and Integration with Non-Differentiable Controllers

Compliant residual policies are particularly suited for scenarios where:

  • The base policy $\pi$ is hand-coded logic or a model-predictive controller, which may be non-differentiable or closed-source.
  • Model-free RL from scratch is either intractable or data inefficient due to exploration requirements, long horizons, or sparse rewards.

By learning only a correction, $f_\theta$, rather than the entire mapping, RPL:

  • Inherits exploration structure from the prior policy, facilitating faster discovery of reward in difficult tasks.
  • Provides fine-grained corrections where the base policy fails (e.g., due to unmodeled friction, partial observability, or calibration errors).
  • Ensures compliance, in that if $f_\theta(s)$ is zero, the system defaults to $\pi(s)$ and never makes an initially good policy worse (the environment-wrapper sketch after this list makes this concrete).
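
One convenient way to realize the residual MDP $M^{(\pi)}$ in practice is to fold the base controller into the environment, so that a standard RL library sees only the correction space. The sketch below is an assumption about how this could be wired up, using a Gymnasium environment with a bounded Box action space; the wrapper name and details are hypothetical.

```python
import numpy as np
import gymnasium as gym

class ResidualMDPWrapper(gym.Wrapper):
    """Steps the underlying environment with pi(s) + a, where a is the agent's
    residual action, implementing T^(pi)(s, a, s') = T(s, pi(s) + a, s')."""

    def __init__(self, env, base_controller):
        super().__init__(env)
        self.base_controller = base_controller  # black-box callable: observation -> action
        self._last_obs = None

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_obs = obs
        return obs, info

    def step(self, residual_action):
        base_action = self.base_controller(self._last_obs)
        combined = np.clip(base_action + residual_action,
                           self.env.action_space.low,
                           self.env.action_space.high)
        obs, reward, terminated, truncated, info = self.env.step(combined)
        self._last_obs = obs
        return obs, reward, terminated, truncated, info
```

A zero residual action then reproduces the base controller exactly, which is the compliance property noted above.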

3. Addressing Real-World Robotics and Control Challenges

Empirical research demonstrates that compliant residual policies address several core challenges:

  • Partial Observability: RPL can incorporate state histories or recurrent architectures, mitigating unobserved aspects of the task.
  • Sensor Noise: By aggregating or filtering recent state inputs, the residual policy can smooth out corruptions in state estimation (a simple history-stacking sketch follows this list).
  • Model Misspecification: When environmental dynamics diverge from assumptions (e.g., unknown friction, mass, or obstacles), the residual policy adapts the base controller to actual task conditions.
  • Controller Miscalibration: The residual compensates for parameters in the base policy (such as gain or set-point errors), yielding stable and robust behavior.
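
As a deliberately simple illustration of the first two points, the residual policy can be fed a short window of recent observations rather than a single noisy frame. The buffer class below is an assumed utility, not code from the paper, and a recurrent architecture would be an equally valid choice.

```python
from collections import deque
import numpy as np

class ObservationHistory:
    """Concatenates the last k observations so the residual network can
    average out sensor noise and infer state that a single frame hides."""

    def __init__(self, k=4):
        self.k = k
        self.buffer = deque(maxlen=k)

    def reset(self, obs):
        self.buffer.clear()
        for _ in range(self.k):          # pad the window with the first observation
            self.buffer.append(obs)
        return np.concatenate(self.buffer)

    def observe(self, obs):
        self.buffer.append(obs)
        return np.concatenate(self.buffer)
```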

4. Experimental Evaluation and Performance Gains

RPL has been benchmarked in six MuJoCo manipulation environments, encompassing:

  • Block pushing with MPC and reactive controllers,
  • Pick-and-place and hook-based manipulation with noise and hidden variables,
  • A 7-DOF arm executing model-based behaviors under uncertain conditions.

Comparative Performance Table:

| Task | Initial Policy | RL from Scratch | RPL Performance | Key Improvement |
|---|---|---|---|---|
| Push | ∼0.5 | ∼1.0 (slow) | Converges faster to a near-perfect policy | Data efficiency |
| SlipperyPush | ∼0.45 | ∼1.0 (slow) | Matches RL, much faster | Corrects misspecification |
| PickAndPlace | ∼0.5 | Needs ∼10x more data | Reaches 1.0, ∼10x faster | Sample efficiency |
| NoisyHook | ∼0.15 | Fails | Reaches ∼0.8 | Handles noise and partial observability |
| ComplexHook | ∼0.55 | Fails | Reaches ∼0.8 | Surpasses base capability |
| MBRLPusher | PETS (MBRL) | DDPG+HER (slow) | Outperforms the MBRL baseline, fast convergence | Hybrid model-based/model-free |

Notable findings:

  • RPL always improved the base controller.
  • It delivered order-of-magnitude gains in data efficiency and solved tasks out of reach for RL alone.

5. Implementation Considerations and Strategies

  • Initialization: The residual is initialized to produce zero output, ensuring no immediate deviation from the base policy and preserving initial competence (see the zero-initialization sketch after this list).
  • Burn-in for Critic: In actor-critic frameworks (e.g., DDPG), the critic is pretrained on the initial policy’s returns before learning the actor, guarding against premature degradation.
  • Base Policy Flexibility: Any controller—hand-coded, model-predictive (with known or learned dynamics), or even a learned policy—can serve as the baseline.
  • Deployment: The approach requires only the ability to sample and add outputs from the base and learned residual at runtime, making it compatible with black-box or legacy systems.
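
The zero-output initialization in the first bullet is typically achieved by zeroing the last layer of the residual network. A minimal PyTorch sketch of that idea follows; the helper name is illustrative.

```python
import torch.nn as nn

def zero_init_residual_head(hidden_dim, action_dim):
    """Final layer of the residual network, initialized so that f_theta(s) = 0
    everywhere; the combined policy therefore starts out identical to the
    base controller."""
    head = nn.Linear(hidden_dim, action_dim)
    nn.init.zeros_(head.weight)
    nn.init.zeros_(head.bias)
    return head
```

The hidden layers can be initialized normally; only the output layer needs to be zeroed for the combined policy to start at $\pi$.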

6. Broader Implications and Synthesis

Compliant residual policy learning functions as a bridge between traditional control and deep RL:

  • It leverages the structure, safety, and robustness of existing controllers as a foundation, while extending their capabilities to unanticipated operating regimes or environments.
  • The formulation reduces the effective complexity of the learning task by focusing on corrections rather than full policy discovery.
  • The approach generalizes across RL algorithms (applicable to DDPG, HER, and others), and is not limited to manipulation: it has been successfully extended to antenna control, powertrain optimization, autonomous driving, and shared autonomy contexts in later works.

Conclusion:

Compliant residual policy learning is a practical, general, and effective strategy for improving arbitrary initial controllers in reinforcement learning and robotics. By combining the strengths of classical control with high-capacity learning, it delivers robust gains in speed, data efficiency, and task performance, particularly in domains characterized by complex dynamics, imperfect models, and partial observability.