Knowledge-Informed Model-based Residual RL
- Knowledge-informed Model-based Residual RL is a framework that integrates expert priors with learned residuals to boost data efficiency and safety.
- It employs a two-layer structure in which a known baseline policy is augmented by a residual correction learned to compensate for deviations between the prior and the actual environment.
- Empirical results show faster convergence and improved control in domains such as traffic management, robotics, and grid regulation.
Knowledge-informed Model-based Residual Reinforcement Learning (RRL) is a class of reinforcement learning (RL) methods that strategically integrates external knowledge—typically in the form of domain models, expert priors, or symbolic reasoning—with model-based and residual learning paradigms. These methods are distinguished by their layered structure: a prior or explicit (often imperfect) model furnishes the foundation for both modeling the environment and initializing the agent policy, while learning proceeds by estimating a residual component that incrementally closes the performance gap left by the prior, thus enabling data-efficient adaptation, robustness to unmodeled dynamics, and improved generalization. The following sections synthesize the formal, algorithmic, and empirical underpinnings of this field.
1. Principles of Knowledge Integration in Model-based Residual RL
Knowledge-informed model-based RRL combines three canonical ideas: 1) incorporation of prior knowledge, 2) explicit or learned environment models, and 3) residual learning atop a baseline. The integration can be instantiated at both the world-model and policy layers.
Policy Structure
The standard control law augments a known or expert-derived policy π_H (such as a physics-based controller, a rule set, or a learned baseline policy) with a learned corrective term π_θ:

π(s) = π_H(s) + π_θ(s)

Here, π_H may be deterministic, stochastic, hand-crafted, or the output of a previous model-based (e.g., model predictive control—MPC) or data-driven process. π_θ is parameterized by a function approximator (typically a neural network).
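A minimal sketch of this composition in PyTorch follows; the `prior_policy` callable, class name, and network sizes are illustrative assumptions rather than any cited implementation:

```python
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    """Compose a fixed prior policy pi_H with a learned residual pi_theta."""

    def __init__(self, prior_policy, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.prior_policy = prior_policy  # e.g., a physics-based or rule-based controller
        self.residual = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs):
        # pi(s) = pi_H(s) + pi_theta(s); only the residual receives gradients
        with torch.no_grad():
            base_action = self.prior_policy(obs)
        return base_action + self.residual(obs)
```

Because the prior is evaluated under `torch.no_grad()`, optimization touches only the residual term, mirroring the layered structure described above.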
World Model Structure
The system’s transition dynamics may be captured similarly by decomposing the next-state prediction as

s_{t+1} = f_H(s_t, a_t) + f_φ(s_t, a_t),

with f_H representing the known expert model (e.g., the Intelligent Driver Model for traffic), and f_φ a residual dynamics function (neural network) trained to fit systematic deviations or unmodeled behaviors (Sheng et al., 30 Aug 2024).
This decomposition ensures that well-established prior knowledge is retained and refined only where necessary.
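A corresponding sketch for the world-model layer, again with illustrative names (`known_model` stands in for an analytical prior such as the IDM):

```python
import torch
import torch.nn as nn

class ResidualDynamics(nn.Module):
    """Predict the next state as the known-model prediction plus a learned residual."""

    def __init__(self, known_model, state_dim, act_dim, hidden=128):
        super().__init__()
        self.known_model = known_model  # analytical prior, e.g., IDM car-following
        self.residual = nn.Sequential(
            nn.Linear(state_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        # s_{t+1} = f_H(s_t, a_t) + f_phi(s_t, a_t)
        prior_next = self.known_model(state, action)
        correction = self.residual(torch.cat([state, action], dim=-1))
        return prior_next + correction
```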
2. Algorithms and Theoretical Frameworks
Implementations of knowledge-informed model-based RRL typically instantiate the above structure in a model-based RL pipeline, using offline or simulation-based rollouts, truncated planning, and actor–critic updates with tailored residual corrections (Möllerstedt et al., 2022, Zhang et al., 2019).
Residual Policy Learning
Residual Policy Learning (RPL) (Silver et al., 2018) is a canonical approach that learns a function f_θ to augment a non-differentiable or heuristic baseline policy π:

π_θ(s) = π(s) + f_θ(s)

The network is initialized so that f_θ is effectively zero at the beginning of training, preserving the initial safety or task competence of the baseline (Möllerstedt et al., 2022, Silver et al., 2018, Ceola et al., 26 Jan 2024). Only the residual is updated during optimization: gradients are back-propagated solely through f_θ, providing modularity and stability.
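In practice, the zero-at-initialization property is typically obtained by zero-initializing the final layer of the residual network; a minimal sketch of this common trick (not tied to any one paper's code):

```python
import torch.nn as nn

def build_zero_init_residual(obs_dim, act_dim, hidden=64):
    """Residual network whose output is exactly zero before any gradient step."""
    net = nn.Sequential(
        nn.Linear(obs_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, act_dim),
    )
    last = net[-1]
    nn.init.zeros_(last.weight)  # f_theta(s) == 0 at initialization,
    nn.init.zeros_(last.bias)    # so the combined policy starts as the baseline
    return net
```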
Model-based Residual RL
In model-based settings, trajectory rollouts are generated not only with real-world transitions but also via the combined transition model f_H + f_φ, boosting sample efficiency by allowing synthetic data to be used for policy and value updates (Möllerstedt et al., 2022, Sheng et al., 30 Aug 2024). Policy evaluation steps are performed using a mixture of real and model-generated transitions, and regularization terms or KL-divergence bounds are used to ensure stability as the residual diverges from the prior (Möllerstedt et al., 2022).
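A schematic Dyna-style loop illustrating the mixture of real and model-generated transitions; the callables and the buffer interface (`sample_real`, `sample_mixed`) are assumptions for illustration, not a specific cited implementation:

```python
def train_residual_agent(collect_real, fit_residual, rollout_model,
                         update_actor_critic, buffer,
                         epochs=100, model_rollouts=10, real_ratio=0.1):
    """Sketch: real data fits the residual dynamics f_phi; short synthetic
    rollouts from f_H + f_phi augment actor-critic updates."""
    for _ in range(epochs):
        buffer.add(collect_real())                    # 1) real environment interaction
        fit_residual(buffer.sample_real())            # 2) train residual dynamics f_phi
        for _ in range(model_rollouts):               # 3) short synthetic rollouts
            buffer.add(rollout_model(horizon=5), synthetic=True)
        update_actor_critic(buffer.sample_mixed(real_ratio))  # 4) mixed-batch update
```

Truncated rollout horizons and the real/synthetic mixing ratio are the usual knobs for limiting compounding model error while still reaping the sample-efficiency benefit.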
Safety and Robustness
Safety is often addressed via the use of control barrier functions (CBFs), augmented with residual model learning and disturbance observers, where the safety constraints are enforced using both the nominal (known) dynamics and the learned correction terms (Kalaria et al., 9 Oct 2024). The control action is selected by projecting the RL-suggested input onto the set that satisfies the robust CBF constraint, thus guaranteeing constraint satisfaction even in the presence of disturbances and model uncertainties.
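The projection step can be written as a small quadratic program; the sketch below assumes control-affine dynamics ẋ = f(x) + g(x)u, a barrier function h(x), and a disturbance bound `d_max`, and uses cvxpy (all symbols are illustrative assumptions, not the cited papers' exact formulation):

```python
import cvxpy as cp
import numpy as np

def cbf_safety_filter(u_rl, x, f, g, h, grad_h, alpha=1.0, d_max=0.0):
    """Project the RL-proposed action u_rl onto the robust CBF constraint
    grad_h(x).(f(x) + g(x)u) - ||grad_h(x)|| * d_max >= -alpha * h(x)."""
    u = cp.Variable(len(u_rl))
    lie_f = grad_h(x) @ f(x)                 # drift term (nominal + learned residual)
    lie_g = grad_h(x) @ g(x)                 # control influence on the barrier
    robust_margin = np.linalg.norm(grad_h(x)) * d_max
    constraints = [lie_f + lie_g @ u - robust_margin >= -alpha * h(x)]
    objective = cp.Minimize(cp.sum_squares(u - u_rl))  # stay close to the RL action
    cp.Problem(objective, constraints).solve()
    return u.value
```

The learned residual and disturbance estimate tighten the constraint rather than replace it, so the filter remains valid even when the nominal model alone is inaccurate.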
3. Performance Benefits and Empirical Evidence
A consistent empirical finding is substantially improved sample efficiency and initial performance compared to model-free RL trained from scratch, especially in safety-critical, partially observable, or high-dimensional domains:
- In traffic control, the knowledge-informed model-based RRL (using the IDM and a PI controller as knowledge-based priors, plus neural residuals) achieved faster convergence, improved traffic flow stability, and higher mobility compared to baseline TRPO/PPO/SAC agents trained from scratch (Sheng et al., 30 Aug 2024).
- In antenna tilt optimization, model-based residual RL initialized from a hand-engineered baseline policy converged in roughly 5,500 steps versus 9,000 steps (SAC baseline) in a realistic telecom setting, with better initial performance and bounded risk (Möllerstedt et al., 2022).
- In high-dimensional manipulation, residual RL using a pre-trained DRL policy as the prior achieved a five-fold speedup in sample efficiency for multi-fingered grasping on the iCub robot (Ceola et al., 26 Jan 2024).
- Applications in grid control showed that an RRL agent reached the desired voltage regulation performance 7× faster than vanilla RL, and minimized unwanted power curtailment by relying on the structured SDC prior (Bouchkati et al., 24 Jun 2025).
The table below summarizes representative sample-efficiency and performance results; the hybrid approach consistently dominates traditional RL under comparable conditions.
| Domain | Initialization | Sample Efficiency | Final Performance Improvement |
|---|---|---|---|
| Traffic flow (CAV control) | IDM + PI controller | 3–5× faster convergence | Lower speed variance |
| Multi-fingered grasping (iCub) | Pretrained DRL policy | ~5× fewer steps to success | Equal or higher grasp rate |
| Power grid voltage regulation | Modified SDC | ~7× fewer training steps | Maximal power utilization |
| Antenna tilt (telecom) | Rule-based baseline | 30–40% fewer interactions | Comparable or better KPIs |
4. Knowledge Representations and Abstraction
Knowledge may be provided in diverse forms:
Analytical and Physics-based Models
Classical domain knowledge is frequently injected as analytical models (e.g., IDM for car-following (Sheng et al., 30 Aug 2024), rule-based SDC for voltage control (Bouchkati et al., 24 Jun 2025), parametric LQRs (Wang et al., 2023), hand-designed robotics controllers (Möllerstedt et al., 2022, Silver et al., 2018)). These models are typically robust, interpretable, and encode average-case dynamics, but they miss rare, uncertain, or context-dependent behaviors.
Symbolic and Logical Knowledge
Symbolic reasoning and declarative representations (such as reward machines, P-log, or domain-specific languages like RLang) can also be fused with RL agents (Yu et al., 2023, Rodriguez-Sanchez et al., 2022, Lu et al., 2018). These approaches formalize transition or reward abstraction, hierarchical options, and domain constraints, grounding RL in partial world models or task hierarchies and supporting construction of task-specific MDPs from combined statistical and logical information.
Abstraction and Latent Factors
Abstraction via hierarchical state representations or knowledge graphs (e.g., using class hierarchies in commonsense reasoning tasks (Höpner et al., 2022)) enables generalization across tasks and object classes. Parallel representation of perceptual and semantic information, typically via learned modules, supports disentanglement of the affordances underlying observed behaviors, further facilitating residual learning with structured priors (Schnürer et al., 2022).
5. Residual Learning in Partially Observable and High-dimensional Systems
Partially observable environments present a unique challenge to classical RL. Knowledge-informed RRL ameliorates this in two ways:
- By augmenting the agent’s observation with inferred or simulated states from learned or expert models (such as combining IoAlergia-learned automata states with immediate observations) (Muskardin et al., 2022); a minimal sketch follows this list.
- By leveraging recurrent, transformer-based, or graph-based neural architectures that exploit known system regularities (for example, Local Shared Linear layers and Transformer-encoders for distributed grid voltage control (Bouchkati et al., 24 Jun 2025)).
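A minimal illustration of the first mechanism, concatenating a learned automaton's state with the raw observation (the automaton interface here is hypothetical):

```python
import numpy as np

def augment_observation(obs, automaton, action):
    """Concatenate the raw observation with the (one-hot) state of a learned
    automaton that tracks latent context the sensor stream does not reveal."""
    automaton.step(obs, action)                  # advance the abstract state machine
    state_one_hot = np.zeros(automaton.num_states)
    state_one_hot[automaton.current_state] = 1.0
    return np.concatenate([obs, state_one_hot])  # enriched input for the RL agent
```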
Furthermore, in scenarios relying on visual and proprioceptive input, such as dexterous manipulation, residual RL frameworks exploit demonstration data to learn feature extractors or base policies and correct only sparse, high-dimensional errors (Alakuijala et al., 2021, Ceola et al., 26 Jan 2024).
6. Current Challenges and Future Research Directions
Despite strong quantitative results, key challenges remain:
- Determining the optimal balance between prior knowledge and learned residuals, both at the model and policy levels. Excessive reliance on the prior risks underfitting, while too much residual flexibility negates the benefits of knowledge injection (Möllerstedt et al., 2022).
- Guaranteeing safety and stability, particularly when prior models are imperfect or when exploring in the presence of unmodeled disturbances; robust CBFs with disturbance observers represent a current best practice (Kalaria et al., 9 Oct 2024).
- Automated and scalable abstraction: learning transferable intermediate representations (hierarchical, symbolic, object-centric, or latent) that support knowledge transfer remains an active research direction (Wernsdorfer et al., 2014, Höpner et al., 2022, Schnürer et al., 2022).
- Integrating logical reasoning and symbolic knowledge in deep RL systems at scale, particularly for real-world robotic and infrastructure domains (Lu et al., 2018, Yu et al., 2023, Rodriguez-Sanchez et al., 2022).
Open avenues include meta-learning for rapid task adaptation, sim-to-real transfer in robotics and traffic, scalable symbolic-connectionist integration, and principled error estimation for both model and policy residuals.
7. Representative Applications and Empirical Domains
Knowledge-informed model-based residual RL methods have demonstrated efficacy across a range of domains:
- Connected autonomous vehicle (CAV) trajectory control for mitigating stop-and-go waves in traffic (Sheng et al., 30 Aug 2024).
- Distributed voltage regulation in active power grids with PV inverters (Bouchkati et al., 24 Jun 2025).
- Antenna tilt optimization in telecommunication infrastructure (Möllerstedt et al., 2022).
- Multi-fingered robotic grasping and dexterous manipulation (Ceola et al., 26 Jan 2024, Alakuijala et al., 2021).
- Commonsense RL and transfer via knowledge graph abstraction (Höpner et al., 2022).
- RL for partially observable robotic and symbolic environments, leveraging learned automata or explicit memory modules (Muskardin et al., 2022).
These applications highlight the value of fusing domain knowledge—whether analytic, symbolic, or based on demonstration—into the policy/model learning pipeline, achieving improved sample efficiency, robustness, and higher asymptotic performance compared to model-free or naively model-based approaches.
In summary, Knowledge-informed Model-based Residual Reinforcement Learning unites the strengths of model-based planning, residual correction, and rich domain priors, yielding a flexible and theoretically grounded framework that is empirically validated across robotics, control, grid management, and cognitive domains. The integration of knowledge at the model, abstraction, and policy layers underpins recent gains in the scalability, safety, and generalizability of reinforcement learning in complex, real-world environments.