Residual Reinforcement Learning
- Residual RL is a control strategy that decomposes decision-making by combining a fixed base policy with a learnable residual for task-specific corrections.
- It leverages RL algorithms with constraints on residual actions to ensure safety, stability, and improved sample efficiency compared to learning from scratch.
- This approach has been successfully applied in robotics, autonomous vehicles, and other control systems, enabling robust adaptation and effective sim-to-real transfer.
Residual Reinforcement Learning (RL) is a principled methodology for adapting, augmenting, or accelerating policy learning in control systems or decision-making problems by learning corrections on top of a fixed or pretrained "base" policy. This approach modularizes control: a conventional controller (model-based, demonstration-based, or otherwise handcrafted) is responsible for the "easy" aspects of the task or provides strong prior knowledge, while the RL agent focuses its capacity on learning a lightweight, task-specific residual that enables adaptation, robust performance, and sample-efficient learning. Below, the key principles, algorithmic architectures, theoretical properties, and notable empirical advances in residual RL are presented.
1. Formal Foundations and Core Architectures
The central construct in residual RL is the decomposition of the overall policy into a (fixed or frozen) base policy $\pi_b$ and a learnable residual policy $\pi_\theta$:

$$a_t = \pi_b(s_t) + \pi_\theta(s_t),$$

where $s_t$ is the state at time $t$ and $a_t$ is the action. In some frameworks, the combination can include a scaling factor $\alpha$ on the residual or be implemented via convex interpolation (mixing): $a_t = (1-\alpha)\,\pi_b(s_t) + \alpha\,\pi_\theta(s_t)$. The base policy may be a conventional feedback controller (e.g., PD, LQR, H-infinity) (Johannink et al., 2018, Zuo et al., 2023, Staessens et al., 2021), a policy trained via behavior cloning from demonstrations (Alakuijala et al., 2021, Ankile et al., 23 Sep 2025, Ankile et al., 23 Jul 2024), a skill-based latent variable model (Rana et al., 2022), a model-predictive control (MPC) planner (Jeon et al., 14 Oct 2025), or a model-based policy (e.g., an analytical, physics-informed policy) (Sheng et al., 30 Aug 2024, Kalaria et al., 9 Oct 2024, Nakhaei et al., 12 Jun 2024).
The residual policy $\pi_\theta$ is learned through reinforcement learning, typically initialized as a zero-mean module and updated in either an off-policy or on-policy manner. In the dominant approach, only the residual is learned (i.e., $\pi_b$ is held fixed), which ensures that the system cannot deviate far from known-good behavior, improving safety, exploration efficiency, and often the stability of training.
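A minimal sketch of this decomposition in PyTorch (the network size and interface are illustrative, not taken from any specific paper): the base policy is any fixed callable, and the residual network's final layer is zero-initialized so the combined policy initially reproduces the base behavior.

```python
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    """Additive residual on top of a frozen base policy: a_t = pi_b(s_t) + pi_theta(s_t)."""

    def __init__(self, base_policy, state_dim, action_dim, hidden=256):
        super().__init__()
        self.base_policy = base_policy  # fixed controller or pretrained policy (not trained)
        self.residual = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )
        # Zero-initialize the output layer so pi_theta(s) = 0 at the start of
        # training and the combined policy matches the base policy exactly.
        nn.init.zeros_(self.residual[-1].weight)
        nn.init.zeros_(self.residual[-1].bias)

    def forward(self, state):
        with torch.no_grad():            # the base policy is held fixed
            a_base = self.base_policy(state)
        return a_base + self.residual(state)
```

Because the residual starts at zero, early exploration stays close to the base policy's known-good behavior.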
Two principal additive compositions are:
- Command-level residual: The residual $\pi_\theta$ outputs direct action corrections in the action space (e.g., joint torques, velocities, or control accelerations).
- Feedback residual: Particularly when internal feedback loops exist in a base controller, the residual modifies an upstream feedback or reference signal rather than the raw output (Ranjbar et al., 2021).
Residual architectures may further be extended with constraints on residual magnitude (absolute or relative), as in safety-critical domains (Staessens et al., 2021), or with learned mixing factors over the respective policies (Zuo et al., 2023).
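A sketch of how such constraints might be enforced at execution time (the clip bound, relative option, and mixing factor are illustrative hyperparameters, not values from any cited work):

```python
import numpy as np

def combine_actions(a_base, a_res, delta=0.1, relative=False, alpha=None):
    """Combine base and residual actions under a 'tube' constraint.

    delta: absolute clip bound on the residual, or a fraction of |a_base|
           when relative=True.
    alpha: optional mixing factor; if given, use convex interpolation
           (1 - alpha) * a_base + alpha * a_res instead of pure addition.
    """
    bound = delta * np.abs(a_base) if relative else delta
    a_res = np.clip(a_res, -bound, bound)   # keep the residual inside the tube
    if alpha is not None:
        return (1.0 - alpha) * a_base + alpha * a_res
    return a_base + a_res
```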
2. Algorithmic and Model Design
The algorithmic realization of residual RL revolves around (1) defining the joint policy architecture, (2) specifying a reward function aligned with the overall task (often sparse or delayed), and (3) implementing suitable RL algorithms for the residual. Key design choices include:
- Residual policy parameterization: Usually an MLP or convolutional network, depending on whether inputs are low-dimensional states or include vision and proprioception (Alakuijala et al., 2021). The residual may additionally be conditioned on the base policy's output (i.e., $\pi_\theta(s_t, \pi_b(s_t))$) to allow different corrections in response to the base policy's intended action.
- Choice of RL algorithm: Off-policy actor-critic methods (e.g., TD3, DDPG, SAC) are commonly used for continuous control (Johannink et al., 2018, Staessens et al., 2021, Alakuijala et al., 2021, Ankile et al., 23 Sep 2025). On-policy methods (e.g., PPO) are preferred in settings with nonstationary dynamics or where the combination policy invalidates strict off-policy assumptions (Zuo et al., 2023, Jeon et al., 14 Oct 2025, Kalaria et al., 9 Oct 2024).
- Residual constraints and safety: To preserve the properties of the base controller, residuals are often clipped into a “tube” around the base action, either by magnitude or as a percentage of the base policy's output (Staessens et al., 2021, Zuo et al., 2023). Lyapunov-based analyses demonstrate that such constraints yield uniformly ultimately bounded error and stability for generic mechanical systems (Staessens et al., 2021).
- Initialization and training loop: It is common to pre-populate the replay buffer with samples from the base policy before policy updates begin, particularly in safety-critical or critic-pretraining settings (Alakuijala et al., 2021, Ankile et al., 23 Sep 2025); a minimal training-loop sketch follows this list.
- Handling of stochastic or imperfect base policies: In settings with stochastic base policies (e.g., diffusion models for chunked actions or uncertainty-aware controllers), modifications to both the training updates (i.e., to observe base actions explicitly in the update) and exploration strategies (possibly focusing on high-uncertainty states) are applied (Dodeja et al., 21 Jun 2025).
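A compact sketch tying together several of the design choices above (buffer pre-population with base-policy rollouts, a clipped residual, and learning only the residual). The environment is assumed to follow the classic Gym API, and `agent` and `buffer` are hypothetical off-policy learner and replay-buffer interfaces:

```python
import numpy as np

def train_residual(env, base_policy, agent, buffer,
                   warmup_episodes=20, train_steps=100_000, clip=0.1):
    """Sketch of an off-policy residual RL loop.

    Assumed (hypothetical) interfaces: `env` follows the classic Gym API,
    `agent` is an off-policy learner (e.g., SAC/TD3) exposing select_action()
    and update(), and `buffer` stores (s, a_res, r, s', done) transitions.
    """
    # 1) Pre-populate the replay buffer with base-policy rollouts
    #    (the stored residual action is zero for these transitions).
    for _ in range(warmup_episodes):
        obs, done = env.reset(), False
        while not done:
            a_base = base_policy(obs)
            next_obs, reward, done, _ = env.step(a_base)
            buffer.add(obs, np.zeros_like(a_base), reward, next_obs, done)
            obs = next_obs

    # 2) Learn only the residual; the executed action is base + clipped residual.
    obs = env.reset()
    for _ in range(train_steps):
        a_base = base_policy(obs)
        a_res = np.clip(agent.select_action(obs), -clip, clip)   # tube constraint
        next_obs, reward, done, _ = env.step(a_base + a_res)
        buffer.add(obs, a_res, reward, next_obs, done)
        agent.update(buffer)
        obs = env.reset() if done else next_obs
```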
3. Theoretical Properties: Stability, Safety, and Sample Efficiency
Stability and safety: Residual RL inherits stability guarantees from the base policy under the assumption that the residual is bounded. Analyses on classical nonlinear systems (via Lyapunov or ISS theory) demonstrate that with sufficiently small residuals, error trajectories remain bounded within a tube around the nominal controller (Staessens et al., 2021). In robust control contexts (e.g., H-infinity), residuals are treated as input disturbances and explicit bounds on the residual mixing factor guarantee closed-loop stability even with adversarial residuals (Zuo et al., 2023).
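The flavor of these results can be restated schematically. Assuming, as a hedged simplification, that the nominal closed loop under $\pi_b$ is input-to-state stable with respect to input disturbances and that the residual is clipped so that $\|\pi_\theta(s)\| \le \varepsilon$, treating the residual as a bounded disturbance gives

```latex
% Schematic ISS bound: the clipped residual acts as a bounded input disturbance.
\|e(t)\| \le \beta\big(\|e(0)\|, t\big)
          + \gamma\Big(\sup_{\tau \le t} \|\pi_\theta(s_\tau)\|\Big)
          \le \beta\big(\|e(0)\|, t\big) + \gamma(\varepsilon)
```

with $\beta$ of class $\mathcal{KL}$ and $\gamma$ of class $\mathcal{K}$; as $t \to \infty$, the tracking error is ultimately confined to a tube of radius $\gamma(\varepsilon)$ around the nominal trajectory, and the tube shrinks as the residual bound $\varepsilon$ is tightened.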
Sample efficiency: By focusing RL on a low-dimensional, incrementally learnable residual action space, orders-of-magnitude improvements in sample complexity have been demonstrated relative to learning policies from scratch (Johannink et al., 2018, Li et al., 2021, Alakuijala et al., 2021). In high-dimensional tracking, decoupling the system into low-dimensional subsystems and running parallel critics for each residual enables scalable deployment on manipulators with up to 7 DoF (Li et al., 2021).
Generalization and sim-to-real transfer: Because the base policy dominates action selection initially, and the residual can be trained on modest amounts of real-world data, the sim-to-real gap is dramatically reduced. Real-world deployment after simulation training, with only minimal further adaptation, has been demonstrated in manipulation (Johannink et al., 2018, Ankile et al., 23 Sep 2025, Huang et al., 2 Aug 2025) and legged locomotion (Jeon et al., 14 Oct 2025).
4. Extensions: Hybridization, Skills, Model-Based Control, and Context Adaptation
Residual RL serves as an enabler for multiple advanced learning paradigms:
- Skill-based and hierarchical RL: Residuals can be applied to the atomic actions of decoded latent skills, giving the high-level skill agent rapid exploration and the residual policy fine-grained adaptation to unseen task variations (Rana et al., 2022).
- Trajectory and reference refinements: Residuals need not act at the single-step action level. Episodic frameworks (e.g., MoRe-ERL) learn trajectory-level residuals via B-splines or action chunks, enabling efficient refinement of preplanned (possibly long-horizon) motions while preserving critical reference segment structure (Huang et al., 2 Aug 2025, Ankile et al., 23 Jul 2024).
- Model-based residuals: Residuals may also be learned to directly correct the predictions of a hand-designed analytic or model-based dynamics model (e.g., the Intelligent Driver Model, rigid-body simulators), yielding a knowledge-informed, adaptive, and robust virtual environment (Sheng et al., 30 Aug 2024, Kalaria et al., 9 Oct 2024); see the sketch after this list. In these settings, residuals are crucial for handling unmodeled dynamics, process noise, and environmental perturbations, such as changing vehicle physics or external disturbances.
- Context adaptation via residuals: By incorporating online learned context encodings (e.g., via trajectory windowed embeddings), residuals can be made adaptive to rapid environment or system changes—particularly advantageous for offline-to-online RL with changing dynamics (Nakhaei et al., 12 Jun 2024).
- Augmenting advanced controllers: Residuals can be composed with robust, real-time, or whole-body planners such as MPC for legged robots or aerial vehicles, achieving interpretable, constraint-satisfying, yet highly adaptable control (Jeon et al., 14 Oct 2025, Zhang et al., 20 Sep 2024). The residual corrects for model mismatch, unmodeled contacts, and unanticipated failures, while the base controller ensures adherence to global operational constraints.
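To make the model-based residual pattern concrete, the sketch below adds a learned correction to an analytic one-step dynamics prediction; the interfaces are hypothetical, and the analytic model stands in for any physics-informed predictor such as an IDM or rigid-body step function.

```python
import torch
import torch.nn as nn

class ResidualDynamics(nn.Module):
    """Predicted next state = analytic model prediction + learned correction."""

    def __init__(self, analytic_model, state_dim, action_dim, hidden=128):
        super().__init__()
        self.analytic_model = analytic_model   # e.g., IDM or rigid-body step (fixed)
        self.correction = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        s_model = self.analytic_model(state, action)             # physics-informed prediction
        delta = self.correction(torch.cat([state, action], dim=-1))
        return s_model + delta                                    # residual captures unmodeled dynamics
```

Training minimizes the one-step prediction error on observed transitions, so the network only has to learn what the analytic model gets wrong.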
5. Empirical Gains Across Domains
Residual RL has demonstrated superior performance across a spectrum of domains:
- Robotic manipulation under contact, friction, and hardware imperfections: 80–100% success in challenging block insertion or assembly tasks is achieved with only thousands of real-world samples, a 3–5× data efficiency gain over RL from scratch (Johannink et al., 2018, Ankile et al., 23 Sep 2025, Ankile et al., 23 Jul 2024).
- High-dimensional robot tracking: Decoupled, parallelized, data-informed residual RL achieves real-time, robust control on 2–7 DoF manipulators, outperforming both RL-from-scratch and purely model-based approaches by wide margins (Li et al., 2021).
- Autonomous vehicles and traffic control: Knowledge-informed, model-based residual RL enables rapid dissipation of stop-and-go waves, superior sample efficiency, mobility, and traffic smoothness, compared to model-free RL baselines (Sheng et al., 30 Aug 2024).
- Aerial robotics and quadcopter flight: Learning a residual policy atop a cascaded controller yields over 40% gap closure to CFD-based compensation controllers, even without explicit aerodynamic models or inter-vehicle communication (Zhang et al., 20 Sep 2024).
- Legged locomotion and MPC+RL: Blended MPC-residual architectures expand the achievable command envelope (velocity domains up to +78%), improve asymptotic reward by 20–30%, and realize zero-shot adaptation to new gaits and terrains (Jeon et al., 14 Oct 2025).
- Safe RL under model uncertainty: The synthesis of residual learning with control-barrier functions and disturbance observers yields RL agents capable of safe exploration even under significant model-bias and process disturbances, outperforming baseline safe-RL approaches (Kalaria et al., 9 Oct 2024).
- Skill transfer and hierarchical RL: Residual correction of skill-space actions accelerates exploration in skill-based RL, leads to 5× fewer samples for convergence, and maintains adaptability under task variations not present in demo data (Rana et al., 2022).
An illustrative table of selected empirical results:
| Domain | Residual RL Gain | Sample Complexity Benefit |
|---|---|---|
| Real block assembly (Johannink et al., 2018) | 80–100% success vs. 0% (PID) | 8k steps vs. >20k (pure RL) |
| Humanoid dexterous hand (Ankile et al., 23 Sep 2025) | 64% (residual RL), 14% (BC base) | ~0.3M vs. ≥40M (on-policy) |
| Trajectory tracking (Li et al., 2021) | Milliradian error | Parallel critic, 2× faster |
| Model-based CAV traffic (Sheng et al., 30 Aug 2024) | +18% mobility over model-free | 5× faster convergence |
| Quadrotor proximity flight (Zhang et al., 20 Sep 2024) | 29%/44% (pos/attitude error drop) | Real-time, 15min wall-time |
6. Extensions, Limitations, and Open Questions
Several advanced directions and limitations have been explored:
- Feedback-based residuals: In cases where simply adding actions is ineffective (e.g., controllers with aggressive internal feedback), the residual policy should act by altering reference or feedback signals rather than the raw action, to avoid a direct "fight" with the base controller (Ranjbar et al., 2021); a minimal sketch contrasting the two appears after this list.
- Sim-to-real gap: Learning residuals on top of a robust controller or a sim-trained base policy enables extremely low sim-to-real transfer losses (≤5%) (Johannink et al., 2018, Huang et al., 2 Aug 2025).
- Residual mixing and stability: For some robust controllers (e.g., H-infinity), the fraction of control authority given to the residual must be bounded to retain input-to-state stability, and explicit formulae for these bounds are available (Zuo et al., 2023).
- Expressive limitation: Residual policies tend to provide only local corrections; discovering entirely novel global task strategies outside of the base policy’s support is nontrivial (Ankile et al., 23 Sep 2025).
- Tuning and interpretability: The division of responsibility between the base and residual policy (e.g., the choice of the scaling or mixing parameter $\alpha$) can affect both learning dynamics and the interpretability of the final policy (Jeon et al., 14 Oct 2025).
- Theory and guarantees: While Lyapunov and input-to-state stability results exist for constrained residuals, sharp theoretical guarantees under arbitrary base/residual blends, high-dimensional stochastic or partially observed settings, or when base policies are themselves suboptimal or degrade over time, remain under-explored.
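Returning to the feedback-based residual point above, the sketch below contrasts an output-level residual with a reference-level residual for a simple PD-controlled system (gains and function signatures are illustrative):

```python
KP, KD = 50.0, 5.0   # illustrative PD gains

def pd(q, qd, q_ref):
    """PD controller tracking a reference position q_ref."""
    return KP * (q_ref - q) - KD * qd

def output_residual_action(q, qd, q_ref, a_res):
    # Residual added to the controller output: the PD loop keeps driving the
    # state toward q_ref, so it can "fight" the correction.
    return pd(q, qd, q_ref) + a_res

def reference_residual_action(q, qd, q_ref, r_res):
    # Residual added upstream, to the reference signal: the PD loop then
    # cooperates with the correction instead of counteracting it.
    return pd(q, qd, q_ref + r_res)
```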
7. Historical Development and Cross-Domain Influences
The origins of residual RL date to residual gradient methods in value-approximation (Zhang et al., 2019), but the modern form, as additive correction to fixed or pretrained base controllers, was cemented in robotics (block assembly, frictional contacts, etc.) (Johannink et al., 2018), and later adopted for a wide spectrum of problems—demonstration-driven RL (Alakuijala et al., 2021), robust/optimal control (Zuo et al., 2023, Sheng et al., 30 Aug 2024), hierarchical/skill-based RL (Rana et al., 2022), adaptive tracking (Li et al., 2021), safe RL (Kalaria et al., 9 Oct 2024), and model-based RL in dynamic environments (Nakhaei et al., 12 Jun 2024).
The field continues to evolve: current research focuses on advancing residual architectures for high-dimensional systems, robust and safe RL under complex uncertainty, automated context adaptation, and scalable integration with planning, imitation, and causal RL frameworks.