
Integrated MPC-RL Framework

Updated 27 March 2026
  • The paper presents an integrated control framework that fuses MPC's constraint handling with RL's adaptive policy search for event-triggered decision making.
  • It employs a hierarchical policy to jointly tune meta-parameters such as prediction horizon, cost weights, and event triggers, balancing performance and computation.
  • Empirical results report a 36% reduction in MPC computation time and an 18.4% improvement in control performance, underscoring its practical efficacy.

An integrated Model Predictive Control–Reinforcement Learning (MPC–RL) framework is an advanced closed-loop control architecture that combines classic MPC—known for constraint handling and model-based prediction—with reinforcement learning, which enables adaptive policy search, end-to-end performance optimization, and tuning of parameters that influence control structure, operation, and computational efficiency. Such hybrid approaches aim to simultaneously optimize control performance, resource utilization, and computational cost by leveraging both model-based and data-driven learning principles in a unified framework (Bøhn et al., 2021).

1. Architecture and Control Workflow

The integrated MPC–RL framework exposes meta-parameters of the MPC controller (such as prediction horizon, cost-function weights, and event-triggering thresholds) as decision variables for the RL agent. At each plant step $t$, the RL-based control loop operates as a hierarchical, event-triggered decision process:

  1. The agent observes an augmented state $s_t = [\bar x_i, \hat p_i, N_i, \bar x_t, \hat p_t, t-i]^\top$, encoding the last OCP solve, current state and input, and the “age” since the last solve.
  2. It samples a binary decision $c_t \sim \mathrm{Bernoulli}(w_t)$ (“recompute MPC?”), where $w_t=\sigma\bigl(\pi^{c}_\theta(s_t)\bigr)$ and $\sigma$ is the logistic sigmoid.
  3. If $c_t=1$, it samples a prediction horizon $N_t\in\{N_{\min},\ldots,N_{\max}\}$ (from policy $\pi^{N}_\theta$), solves the MPC open-loop optimal control problem (OCP), and applies the first MPC input $u_t = u_0^{\mathrm{MPC}}$, updating all predicted trajectories and LQR gains.
  4. If $c_t=0$, it applies $u_t = u_{t-i}^{\mathrm{MPC}} + K_{t-i}(\hat x_{t-i} - \bar x_t)$, using previously planned MPC actions with corrective LQR gains computed by linearizing around the last MPC trajectory.
  5. All four policy components $\{\pi^c, \pi^N, \pi^M, \pi^{ML}\}$ share parameters $\theta$ and are jointly optimized via PPO.

This structure enables the RL policy to govern both discrete (event-trigger and horizon) and continuous (input selection) meta-parameters, with the event-trigger logic adaptively trading off re-optimization frequency and control performance (Bøhn et al., 2021).
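The event-triggered decision process above can be sketched as a single control step. This is an illustrative skeleton, not the paper's implementation: `policy_logit`, `horizon_probs`, `solve_mpc`, and `lqr_fallback` are hypothetical stand-ins for the learned policy heads and the OCP solver.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def control_step(s_t, policy_logit, horizon_probs, solve_mpc, lqr_fallback):
    """One event-triggered decision: recompute the MPC plan or reuse it."""
    w_t = sigmoid(policy_logit(s_t))          # P(recompute | s_t)
    c_t = rng.random() < w_t                  # c_t ~ Bernoulli(w_t)
    if c_t:
        N_t = rng.choice(len(horizon_probs), p=horizon_probs)  # sample horizon
        u_t = solve_mpc(s_t, N_t)             # re-solve the OCP, apply u_0^MPC
    else:
        N_t = None
        u_t = lqr_fallback(s_t)               # shifted MPC plan + LQR correction
    return c_t, N_t, u_t
```

The two branches correspond exactly to steps 3 and 4 of the decision process: an OCP re-solve with a freshly sampled horizon, or the cheap fallback built from the previous plan.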

2. Meta-Parameter Optimization and Policy Representation

Key meta-parameters treated as RL actions include:

  • Prediction Horizon $N_t\in\{N_{\min},\ldots,N_{\max}\}$, controlling the look-ahead window and impacting both MPC performance and computational load.
  • MPC Cost-function Weights $(Q,R)$, shaping stage cost and control priorities; these also define the LQR feedback law between solves.
  • Event-trigger Threshold $\delta$, mapped to the Bernoulli parameter of $\pi^c$.

A hierarchical mixture-distribution policy factorizes as

$$\pi_\theta(a \mid s) = P^c(c \mid s) \times P^N(N \mid s) \times P^\mu(u \mid s, N)$$

where $c$ and $N$ select the structure of the OCP, and the continuous control $u$ is sampled from a Gaussian whose mean is determined by either the new MPC solution (if $c=1$) or the composite MPC–LQR feedback (if $c=0$). This policy parametrization ensures efficient exploration in both discrete and continuous spaces (Bøhn et al., 2021).
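The factorized log-probability of a hierarchical action follows directly from this product form. A minimal sketch, where `w` (trigger probability), `p_N` (horizon pmf), and `mu`, `sigma` (Gaussian head) are assumed stand-ins for network outputs:

```python
import numpy as np

def joint_log_prob(w, p_N, mu, sigma, c, N, u):
    """Log-probability of action (c, N, u) under the factored policy
    pi_theta(a|s) = P^c(c|s) * P^N(N|s) * P^mu(u|s,N)."""
    log_pc = np.log(w) if c == 1 else np.log(1.0 - w)       # Bernoulli trigger
    log_pu = (-0.5 * ((u - mu) / sigma) ** 2
              - np.log(sigma * np.sqrt(2.0 * np.pi)))       # Gaussian control
    if c == 1:                                              # horizon only on re-solve
        return log_pc + np.log(p_N[N]) + log_pu
    return log_pc + log_pu
```

Summing per-head log-probabilities like this is what lets a single PPO objective train the discrete and continuous components jointly.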

3. Optimal Control Problem and Dual-Mode Operation

At each event-triggered recomputation ($c_t=1$), the MPC OCP has the generic form:

$$\begin{aligned} &\min_{x_{0:N_t},\,u_{0:N_t-1}} \; \sum_{k=0}^{N_t-1}\rho^k\,\ell_{\theta^M}(x_k,u_k,\hat p_{t+k}) + \rho^{N_t}\,m_{\theta^M}(x_{N_t}) \\ &\text{s.t.} \quad x_0 = \bar x_t,\quad x_{k+1} = \hat f_{\theta^M}(x_k, u_k, \hat p_{t+k}), \\ &\qquad\quad\; h_{\theta^M}(x_k, u_k) \le 0, \quad u_k \in \mathcal U,\; x_k \in \mathcal X \end{aligned}$$

For linear-quadratic cases, the stage and terminal costs reduce to $\ell(x,u)=x^\top Q x + u^\top R u$ and $m(x)=x^\top P x$.

In between OCP solutions, the control law switches to a dual-mode architecture, applying $u_t = u_{t-i}^{\mathrm{MPC}} + K_{t-i}(\hat x_{t-i} - \bar x_t)$, where $K_{t-i}$ is computed by linearizing the MPC dynamics along the predicted trajectory (Bøhn et al., 2021).
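One standard way to obtain such time-varying gains is a backward Riccati recursion over the linearizations $(A_k, B_k)$ of the dynamics along the last predicted trajectory. The sketch below shows this recursion under that assumption; the paper's exact procedure may differ.

```python
import numpy as np

def lqr_gains_along_trajectory(A_list, B_list, Q, R, P_terminal):
    """Backward Riccati recursion yielding time-varying LQR gains K_k for
    linearized dynamics x_{k+1} = A_k x_k + B_k u_k along an MPC trajectory."""
    P = P_terminal
    gains = []
    for A, B in zip(reversed(A_list), reversed(B_list)):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # gain K_k
        P = Q + A.T @ P @ A - A.T @ P @ B @ K               # cost-to-go update
        gains.append(K)
    return gains[::-1]                                       # reorder to k = 0..N-1
```

Between solves, the stored gain for the relevant plan index supplies the corrective feedback term in the dual-mode control law.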

4. Reinforcement Learning Formulation

The RL Markov Decision Process is defined as:

  • State: $s_t$ includes all plant states and OCP “history” variables.
  • Action: $a_t=(c_t, N_t, u_t^{\mathrm{MPC}}, u_t^{\mathrm{ML}})$ comprises both the discrete recomputation/horizon decisions and the continuous MPC or LQR control perturbations.
  • Reward:

$$r_t = -\ell(\bar x_{t+1}, u_t) - \lambda_h\,[\#\text{ constraint violations at } t+1] - \lambda_c\, c_t N_t$$

with $\lambda_h > 0$ weighting the episode-ending constraint-violation penalty and $\lambda_c > 0$ penalizing time and effort spent on MPC computation.
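The reward can be computed directly from the formula above; the coefficient values below are illustrative defaults, not the paper's.

```python
def reward(stage_cost, n_violations, c_t, N_t, lam_h=100.0, lam_c=0.01):
    """Reward r_t = -l(x,u) - lam_h * [# violations] - lam_c * c_t * N_t.
    The compute penalty scales with the horizon, and is paid only when
    the OCP is actually re-solved (c_t = 1)."""
    compute_penalty = lam_c * N_t if c_t else 0.0
    return -stage_cost - lam_h * n_violations - compute_penalty
```

Tying the compute penalty to $c_t N_t$ is what gives the agent an incentive to skip solves and to choose short horizons whenever performance permits.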

End-to-end training is performed with Proximal Policy Optimization (PPO), updating all policy parameters through advantage-weighted policy gradients (Bøhn et al., 2021).

5. Event-Triggered Computation and Computational Efficiency

The event-triggered mechanism, governed by the Bernoulli policy’s logit, determines when to recompute the MPC solution. When $c_t=1$, the OCP is re-solved at a chosen horizon; when $c_t=0$, the controller relies on the shifted MPC plan and associated LQR gain. The RL agent thus learns to:

  • Invoke long horizons when necessary (e.g., instability, risky regions).
  • Avoid unnecessary computation when the system is close to nominal or can be stabilized cheaply by LQR.

Empirical evidence on the inverted pendulum task showed a 36% reduction in total MPC computation time (fewer OCP solves, longer intervals between solves) and an 18.4% improvement in control performance compared to the best fixed-horizon, always-recompute MPC baseline, validating the computational and performance synergy of the approach (Bøhn et al., 2021).

6. Training Procedure and Practical Implementation

The control framework is implemented with multiple parallel actors operating under PPO, with episodes terminated on constraint violation or after $T$ steps. Training involves the following loop:

  • Form state $s_t$, sample decision $c_t$.
  • If $c_t=1$, sample $N_t$, solve MPC, apply $u_t = u_0^{\mathrm{MPC}}$.
  • If $c_t=0$, apply the next input from the last MPC plan plus LQR correction.
  • Store transitions $(s_t, c_t, N_t, u_t, r_t, s_{t+1})$.
  • After $Z$ steps, batch PPO updates are performed on the collected trajectories.

Key hyperparameters: $\gamma=0.99$, learning rate $3\times 10^{-4}$, minibatch size 256, PPO clip $\epsilon=0.25$, value-loss coefficient 0.5, no entropy bonus.
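The batched updates use the standard PPO clipped surrogate; a minimal numpy sketch with the stated clip $\epsilon = 0.25$:

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.25):
    """Clipped PPO surrogate (to be minimized) over a batch of transitions.
    ratio = pi_theta(a|s) / pi_theta_old(a|s), per transition."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.minimum(unclipped, clipped).mean()   # pessimistic of the two
```

Because the hierarchical policy's log-probability is a sum over heads, the same ratio-based objective updates the trigger, horizon, and control components in one gradient step.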

Efficient real-time application is ensured by embedding the RL policy evaluation (for the event trigger and horizon) immediately before each control/optimization step, leveraging fast warm-started quadratic/convex solvers whose computational cost scales linearly in $N_t$ (Bøhn et al., 2021).

7. Significance, Generality, and Empirical Results

The integrated MPC–RL scheme establishes a flexible and computationally scalable paradigm for algorithmic tuning of predictive controllers. It:

  • Automatically selects the sequence of OCP solves (frequency, horizon, and associated cost weights) for each plant state, trading off performance and computation.
  • Recovers significant gains over both naive trial-and-error MPC tuning and fixed-parameter deployments, as shown by substantial cost reduction and compute savings.
  • Is readily extensible: the mixture-policy and meta-parameter RL setup apply to any prediction-based controller with tunable horizon, weights, and event logic.

The approach has been demonstrated to reduce total closed-loop cost by 21.5% and MPC computation time by 36% on well-established control benchmarks, offering a blueprint for future adaptive and resource-aware controllers in embedded and fast real-time environments (Bøhn et al., 2021).


References:

  • “Optimization of the Model Predictive Control Meta-Parameters Through Reinforcement Learning” (Bøhn et al., 2021)