
RL-Augmented Adaptive MPC

Updated 27 March 2026
  • RL-augmented Adaptive MPC is a control framework that combines model predictive control’s receding horizon optimization with online reinforcement learning for dynamic adaptation.
  • It employs RL to tune critical parameters—including constraint tightening, prediction horizon, and cost function shaping—for improved disturbance rejection and reduced tracking error.
  • Empirical studies in domains such as autonomous driving and robotics show significant performance gains while maintaining safety through strict MPC constraint enforcement.

Reinforcement Learning (RL)-augmented Adaptive Model Predictive Control (MPC) refers to a family of control frameworks in which reinforcement learning methods dynamically tune, augment, or hybridize the operation of model predictive controllers, enabling contextual adaptation and improved closed-loop performance under model uncertainty, time-varying perturbations, or high-dimensional nonlinear dynamics. The approach sits at the intersection of model-based planning (MPC) and data-driven, policy-based adaptation (RL), providing a mechanism to optimize constraint handling, feasibility, disturbance rejection, and sample efficiency across domains such as autonomous driving, robotics, combustion processes, and human-robot collaboration.

1. Core Principles and Mathematical Formulation

RL-augmented adaptive MPC combines a nominal MPC—typically designed with fixed model parameters and receding-horizon optimization—with online RL-based modules that adapt critical controller hyperparameters, system dynamics surrogates, cost functions, or constraints based on observed system context and performance signals.

A canonical formulation appears in RL-adaptive Stochastic Nonlinear MPC (aSNMPC) (Zarrouki et al., 2023), in which the control problem is posed as:

$$
\min_{x(\cdot),\, u(\cdot)} \; J = \int_0^{T_p} \ell\big(\mathbb{E}[x(\tau)],\, u(\tau)\big)\, d\tau + m\big(\mathbb{E}[x(T_p)]\big)
$$

subject to nominal dynamics, hard and chance constraints, and, crucially, robustification of probabilistic constraints controlled by tunable parameters (e.g., robustification factor $\kappa$ and uncertainty propagation horizon $T_u$). RL agents are then tasked with determining these parameters online, based on system observations, predicted disturbances, and reference trajectories.
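The effect of the RL-tuned robustification factor $\kappa$ and uncertainty propagation horizon $T_u$ on constraint tightening can be sketched for a scalar state bound. This is a minimal illustration under assumed linear dynamics with additive noise, not the formulation of any specific paper:

```python
import numpy as np

def tightened_bounds(x_max, sigma_w, a, kappa, n_u, n_p):
    """Tighten a scalar state bound x <= x_max over an n_p-step horizon.

    State uncertainty is propagated through assumed linear dynamics
    x+ = a*x + w (Var_{k+1} = a^2 * Var_k + sigma_w^2), but only for the
    first n_u steps (the RL-tuned uncertainty propagation horizon); beyond
    that the variance is frozen. kappa is the RL-tuned robustification
    factor scaling the back-off from the nominal bound.
    """
    var = 0.0
    bounds = np.empty(n_p)
    for k in range(n_p):
        if k < n_u:
            var = a**2 * var + sigma_w**2
        bounds[k] = x_max - kappa * np.sqrt(var)
    return bounds

# A larger kappa and longer propagation horizon yield tighter
# (more conservative) constraints at every step of the horizon.
loose = tightened_bounds(x_max=1.0, sigma_w=0.05, a=0.9, kappa=1.0, n_u=3, n_p=10)
tight = tightened_bounds(x_max=1.0, sigma_w=0.05, a=0.9, kappa=3.0, n_u=10, n_p=10)
```

The RL agent's role is precisely to pick `kappa` and `n_u` online, trading conservatism against feasibility as disturbance conditions change.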

In adaptive-horizon settings (Bøhn et al., 2021), RL learns to select the prediction horizon $N_k$ at each time step, representing the depth of future planning as a function of system state.
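Horizon selection can be viewed as a discrete-action policy over candidate horizons. The sketch below assumes a trained linear policy head (the weights `W` here are random stand-ins, not learned values):

```python
import numpy as np

# Candidate prediction horizons the RL agent may choose from (assumed set).
HORIZONS = np.array([5, 10, 20, 40])

def select_horizon(features, W):
    """Pick the prediction horizon N_k for the current step.

    features: observation vector (e.g. tracking error, disturbance estimate,
    curvature of the upcoming reference). W: weights of a hypothetical
    trained policy head, one row of logits per candidate horizon.
    """
    logits = W @ features
    return int(HORIZONS[np.argmax(logits)])

rng = np.random.default_rng(0)
W = rng.normal(size=(len(HORIZONS), 3))   # stands in for trained weights
N_k = select_horizon(np.array([0.2, 0.0, 1.5]), W)
```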

In RL-robust NMPC (Esfahani et al., 2021), the RL module tunes parameters of uncertainty bounding sets (e.g., covariance matrices for ellipsoidal tubes), directly shaping the trade-off between robustness and conservatism in the MPC problem.

2. RL Policy Architecture and Integration

The integration strategy depends on the granularity and locus of adaptation:

  • Parameter Adaptation: RL agents, typically parameterized by multilayer perceptrons (MLPs) or recurrent networks, output hyperparameters governing the degree of constraint tightening ($\kappa$), how far ahead to propagate uncertainty ($T_u$), or horizon length ($N$). The observation vector encompasses recent performance statistics (e.g., tracking error, infeasibility flags), disturbance estimates, and the upcoming reference trajectory (Zarrouki et al., 2023).
  • Model Adaptation: In recurrent RL-infused MPC (Zhang et al., 2023), recurrent policies (LSTMs) output time-varying internal model parameters from the history of observed states and actions, allowing rapid adaptation to parameter drifts or changing environments.
  • Terminal Cost and Value Augmentation: RL is employed to learn tail value functions (e.g., $Q$-functions) that approximate the infinite-horizon cost beyond the planning window, which are then used as terminal penalties in the MPC optimization (Kovalev et al., 2023).
  • Residual or Correction Policy: RL policies output residuals to adjust planned actions from a tractable surrogate MPC, compensating for model mismatch or unmodeled effects (e.g., footstep adjustments in bipedal locomotion (Bang et al., 2024), force and trajectory compensation in quadruped locomotion (Chen et al., 2023, Kamohara et al., 22 Sep 2025)).
  • Switching/Fusion Mechanisms: Adaptive control authority allocation is realized via learned neural switchers that blend or switch between analytic MPC and RL policies based on risk metrics, feasibility, or proximity to constraints (Liu et al., 23 Jan 2026).
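The residual-policy pattern in particular admits a compact sketch: the RL output is clipped to a small trust region and added to the MPC action, so it can nudge but never override the model-based command (the trust-region size `delta_max` is an assumed design choice):

```python
import numpy as np

def apply_residual(a_mpc, residual, a_min, a_max, delta_max=0.1):
    """Combine an MPC action with an RL residual correction.

    The residual is clipped to a trust region of half-width delta_max, and
    the combined action is re-clipped to the actuator limits, so the RL
    policy can only make bounded corrections to the model-based action.
    """
    r = np.clip(residual, -delta_max, delta_max)
    return np.clip(a_mpc + r, a_min, a_max)

a = apply_residual(np.array([0.5, -0.2]), np.array([0.3, -0.05]),
                   a_min=-1.0, a_max=1.0)
# -> [0.6, -0.25]: the first residual is clipped to 0.1 before being added.
```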

The RL agent is invoked at a fixed or event-based interval to compute its adaptive output, which is then injected into the next MPC optimization cycle, creating a tightly coupled closed-loop system.
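The coupling described above can be sketched as a minimal closed loop. Both components here are illustrative stand-ins: `mpc_step` plays the role of the receding-horizon solver (reduced to a tuned proportional law) and `rl_adapt` plays the role of a trained policy mapping recent performance to controller parameters:

```python
import numpy as np

def mpc_step(x, params):
    """Stand-in for the MPC solve: a proportional law whose gain plays the
    role of an RL-tuned parameter (a real implementation solves the OCP)."""
    return -params["gain"] * x

def rl_adapt(history):
    """Stand-in for the RL agent: raises the gain when recent tracking
    error is large (a trained policy would map observations to parameters)."""
    err = np.mean(np.abs(history[-5:])) if history else 0.0
    return {"gain": 0.5 if err < 0.2 else 0.9}

x, params, history = 1.0, {"gain": 0.5}, []
for t in range(50):
    if t % 10 == 0:                  # interval-based RL invocation
        params = rl_adapt(history)   # adaptive output for the next cycles
    u = mpc_step(x, params)          # MPC uses the injected parameters
    x = 0.95 * x + u                 # simple stand-in plant
    history.append(x)
```

The essential structure survives any substitution of the stand-ins: the RL agent runs at a slower, fixed or event-based rate, and its output parameterizes the next MPC optimization cycles rather than actuating the plant directly.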

3. Learning Objectives and Training Paradigms

The training of RL modules embedded within adaptive MPC is shaped to maintain safety and constraint satisfaction:

  • Reward Design: Task rewards target tracking error, feasibility, and computational cost, often penalizing constraint violations sharply to enforce safe exploration (Zarrouki et al., 2023, Bøhn et al., 2021). Combined objectives may include terms for model identification error (system ID loss), energy efficiency, or specific task-oriented penalties (e.g., comfort zone, collision avoidance, torque/force smoothness).
  • RL Algorithms: Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) are prevalent for policy optimization, due to their sample efficiency and stability under continuous control settings (Zarrouki et al., 2023, Bang et al., 2024, Kamohara et al., 22 Sep 2025). Off-policy or on-policy updates are employed depending on the required feedback frequency and computational constraints.
  • Empirical Stability: While formal Lyapunov or regret guarantees are uncommon, empirical closed-loop tests under severe disturbances, mismodeled dynamics, or task generalization probe the robustness of the integrated algorithm (Zarrouki et al., 2023, Zhang et al., 2023).

Typically, constraints present in the model-predictive optimization are never relaxed by the RL agent; adaptation occurs strictly over parameters or references, protecting against unsafe RL exploration (Bedei et al., 23 Apr 2025).
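The reward structure described above (tracking error plus a sharp infeasibility penalty and a light computational-cost term) can be sketched as follows; the specific weights are assumptions for illustration, not values from the cited works:

```python
def step_reward(tracking_err, infeasible, solve_time,
                w_err=1.0, w_feas=10.0, w_time=0.01):
    """Illustrative per-step reward for the adaptive RL module: quadratic
    penalty on tracking error, a sharp fixed penalty on MPC infeasibility
    (to discourage unsafe parameter choices), and a small penalty on solver
    time. Weights are assumed, not taken from any cited paper.
    """
    r = -w_err * tracking_err**2 - w_time * solve_time
    if infeasible:
        r -= w_feas
    return r

r_ok = step_reward(tracking_err=0.1, infeasible=False, solve_time=2.0)
r_bad = step_reward(tracking_err=0.1, infeasible=True, solve_time=2.0)
# The infeasibility penalty dominates, steering exploration away from
# parameter choices that render the MPC problem infeasible.
```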

4. Representative Applications and Domains

Table 1 summarizes key applications and RL adaptation loci. Further context is provided in subsequent paragraphs.

| Application Domain | RL Adaptation Target | Notable Papers |
| --- | --- | --- |
| Autonomous vehicle motion control | Robustification factor, uncertainty propagation horizon | (Zarrouki et al., 2023) |
| Adaptive driving (CARLA) | Dynamics model parameters | (Zhang et al., 2023) |
| Bipedal/quadrupedal locomotion | Dynamics, swing, gait | (Chen et al., 2023; Kamohara et al., 22 Sep 2025; Bang et al., 2024) |
| H2-diesel engine control | Reference trajectory offset | (Bedei et al., 23 Apr 2025) |
| Collaborative robot navigation | Policy/safety fusion | (Liu et al., 23 Jan 2026) |

In autonomous driving, RL-augmented aSNMPC enables anticipatory adaptation of key robustness parameters, yielding reduced lateral deviation and improved feasibility under dynamic, uncertain disturbances (Zarrouki et al., 2023). In bipedal and quadrupedal locomotion, residual RL policies add corrections to MPC outputs to achieve agile behaviors over irregular terrain, enable rapid turning, and synchronize stance-swing behaviors (Chen et al., 2023, Kamohara et al., 22 Sep 2025, Bang et al., 2024). Industrial control applications (e.g., combustion engines) leverage RL to adapt MPC reference tracking in the presence of sensor/actuator drift, while safety is preserved by MPC-enforced hardware constraints (Bedei et al., 23 Apr 2025). In collaborative robotics, ARMS utilizes a neural switcher for adaptive fusion between RL agility and MPC safety in dynamic, cluttered environments (Liu et al., 23 Jan 2026).

5. Empirical Results, Performance, and Limits

  • Performance gains: RL-augmented adaptive MPC schemes demonstrated significant improvements over static or nominal MPC, including 18–39% reductions in tracking error and full restoration of feasibility under strong disturbances (Zarrouki et al., 2023). On physically realistic simulation platforms (CARLA, PyBullet, IsaacLab), adaptive frameworks maintained robustness under large parameter drifts, load uncertainty, unforeseen environmental transitions, or adversarial terrain (Zhang et al., 2023, Chen et al., 2023, Kamohara et al., 22 Sep 2025).
  • Sample efficiency: RL adaptation converged to high-performing policies with manageable data utilization (e.g., $10^6$–$10^7$ samples, roughly 10–25 minutes of data in simple systems, or a few days in high-dimensional robotics) (Bøhn et al., 2021, Bang et al., 2024).
  • Real-time feasibility: Controllers achieved low-latency inference (1–10 ms per step) compatible with tight control loops on embedded hardware (Chen et al., 2023, Kovalev et al., 2023).
  • Robustness and generalization: High transferability of learned policies and RL-MPC hybrids was observed across task instances or robot platforms with minimal adjustment, provided the underlying MPC remains model-consistent and the RL adaptation module is modular (Chen et al., 2023, Kamohara et al., 22 Sep 2025).
  • Safety and guarantees: By design, constraints remain strictly enforced by the MPC at all times; RL acts only through parameterization or reference shaping, not direct actuation, precluding unsafe exploratory actions (Bedei et al., 23 Apr 2025, Liu et al., 23 Jan 2026). Formal theoretical guarantees remain limited to the constraint properties of the underlying MPC, while the additional RL layer is empirically validated for stability but lacks comprehensive Lyapunov proofs.

6. Limitations, Engineering Trade-offs, and Future Directions

  • Scope of Adaptation: Many implementations restrict RL to a subset of MPC’s configuration space—such as reference signals, constraint margins, horizon length, or surrogate value functions—rather than full state feedback or direct action commands. This preserves MPC structure and guarantees but may limit adaptation potential in highly nonstationary environments (Bedei et al., 23 Apr 2025, Esfahani et al., 2021).
  • Model and Hardware Gaps: Controller performance relies on the representational fidelity of the underlying MPC model; severe model-plant mismatch may necessitate richer action spaces or deeper RL integration (e.g., dynamic cost reweighting, direct constraint adaptation) (Bedei et al., 23 Apr 2025).
  • Theoretical Guarantees: Formal global stability, safety, and constraint satisfaction analysis for RL-augmented adaptive MPC is an open topic. Most current work provides empirical results and references to the underlying MPC’s theoretical properties but lacks a full Lyapunov or control-barrier-function-based safety proof for the hybrid loop (Zarrouki et al., 2023, Liu et al., 23 Jan 2026).
  • Sample Efficiency and Complexity: Complex architectures (recurrent RL, high-dimensional observation spaces) increase sample and computational demands, necessitating careful architectural and training specification to achieve tractable adaptation (Zhang et al., 2023, Kamohara et al., 22 Sep 2025).
  • Suggested Research Directions: Expanding the RL action space to cost function shaping, constraint margin adaptation, or richer trajectory parameterization; formalizing safety layering (e.g., tube-MPC, control barrier filters) for Lyapunov-style guarantees; exploiting domain randomization and online identification for robustness to plant/model drift (Bedei et al., 23 Apr 2025, Kamohara et al., 22 Sep 2025, Bøhn et al., 2021).

7. Concluding Remarks

RL-augmented adaptive MPC synthesizes principled model-based receding-horizon optimization and experience-driven, context-sensitive policy adaptation in a modular framework. Empirical evidence across diverse tasks exhibits marked improvements in robustness, feasibility, agility, and sample efficiency, while retaining or augmenting the constraint satisfaction guarantees of MPC. The general framework applies well in domains with partially known or variable dynamics, frequent disturbances, or structure-constrained control environments. Forthcoming advances are anticipated in scaling safety assurances, extending adaptation to a broader class of controller parameters, and deepening sample efficiency for deployment in computationally and safety-critical real-world systems (Zarrouki et al., 2023, Bedei et al., 23 Apr 2025, Liu et al., 23 Jan 2026, Kamohara et al., 22 Sep 2025, Bang et al., 2024, Chen et al., 2023, Kovalev et al., 2023, Zhang et al., 2023, Esfahani et al., 2021, Bøhn et al., 2021).
