
Integrated MPC-RL Framework

Updated 27 March 2026
  • The paper presents an integrated control framework that fuses MPC's constraint handling with RL's adaptive policy search for event-triggered decision making.
  • It employs a hierarchical policy to jointly tune meta-parameters such as prediction horizon, cost weights, and event triggers, balancing performance and computation.
  • Empirical results report a 36% reduction in MPC computation time and an 18.4% improvement in control performance, underscoring its practical efficacy.

An integrated Model Predictive Control–Reinforcement Learning (MPC–RL) framework is an advanced closed-loop control architecture that combines classic MPC—known for constraint handling and model-based prediction—with reinforcement learning, which enables adaptive policy search, end-to-end performance optimization, and tuning of parameters that influence control structure, operation, and computational efficiency. Such hybrid approaches aim to simultaneously optimize control performance, resource utilization, and computational cost by leveraging both model-based and data-driven learning principles in a unified framework (Bøhn et al., 2021).

1. Architecture and Control Workflow

The integrated MPC–RL framework exposes meta-parameters of the MPC controller (such as prediction horizon, cost-function weights, and event-triggering thresholds) as decision variables for the RL agent. At each plant step $t$, the RL-based control loop operates as a hierarchical, event-triggered decision process:

  1. The agent observes an augmented state $s_t = [\bar x_i, \hat p_i, N_i, \bar x_t, \hat p_t, t-i]^\top$, encoding the last OCP solve, current state and input, and the “age” since the last solve.
  2. It samples a binary decision $c_t \sim \mathrm{Bernoulli}(w_t)$ (“recompute MPC?”), where $w_t=\sigma\bigl(\pi^{c}_\theta(s_t)\bigr)$ and $\sigma$ is the logistic sigmoid.
  3. If $c_t=1$, it samples a prediction horizon $N_t\in\{N_{\min},\ldots,N_{\max}\}$ (from policy $\pi^{N}_\theta$), solves the MPC open-loop optimal control problem (OCP), and applies the first MPC input $u_t = u_0^{\mathrm{MPC}}$, updating all predicted trajectories and LQR gains.
  4. If $c_t=0$, it applies $u_t = u_{t-i}^{\mathrm{MPC}} + K_{t-i}(\hat x_{t-i} - \bar x_t)$, using previously planned MPC actions with corrective LQR gains computed by linearizing around the last MPC trajectory.
  5. All four policy components $\{\pi^c, \pi^N, \pi^M, \pi^{ML}\}$ share parameters $\theta$ and are jointly optimized via PPO.

This structure enables the RL policy to govern both discrete (event-trigger and horizon) and continuous (input selection) meta-parameters, with the event-trigger logic adaptively trading off re-optimization frequency and control performance (Bøhn et al., 2021).
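The event-triggered decision process above can be sketched as a single control step. This is an illustrative skeleton, not the paper's implementation: `policy_logit`, `horizon_probs`, `solve_mpc`, and `lqr_fallback` are hypothetical stand-ins for the learned policy heads and the OCP solver.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def control_step(s_t, policy_logit, horizon_probs, solve_mpc, lqr_fallback):
    """One event-triggered decision: recompute the MPC plan or reuse it."""
    w_t = sigmoid(policy_logit(s_t))          # P(recompute | s_t)
    c_t = rng.random() < w_t                  # c_t ~ Bernoulli(w_t)
    if c_t:
        N_t = rng.choice(len(horizon_probs), p=horizon_probs)  # sample horizon
        u_t = solve_mpc(s_t, N_t)             # re-solve the OCP, apply u_0^MPC
    else:
        N_t = None
        u_t = lqr_fallback(s_t)               # shifted MPC plan + LQR correction
    return c_t, N_t, u_t
```

The two branches correspond exactly to steps 3 and 4 of the decision process: an OCP re-solve with a freshly sampled horizon, or the cheap fallback built from the previous plan.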

2. Meta-Parameter Optimization and Policy Representation

Key meta-parameters treated as RL actions include:

  • Prediction Horizon $N_t\in\{N_{\min},\ldots,N_{\max}\}$, controlling the look-ahead window and impacting both MPC performance and computational load.
  • MPC Cost-function Weights $(Q,R)$, shaping stage cost and control priorities; these also define the LQR feedback law between solves.
  • Event-trigger Threshold $\delta$, mapped to the Bernoulli parameter of $\pi^c$.

A hierarchical mixture-distribution policy factorizes as

$$\pi_\theta(a \mid s) = P^c(c \mid s) \times P^N(N \mid s) \times P^\mu(u \mid s, N)$$

where $c$ and $N$ select the structure of the OCP, and the continuous control $u$ is sampled from a Gaussian whose mean is determined by either the new MPC solution (if $c=1$) or the composite MPC–LQR feedback (if $c=0$). This policy parametrization ensures efficient exploration in both discrete and continuous spaces (Bøhn et al., 2021).
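The factorized log-probability of a hierarchical action follows directly from this product form. A minimal sketch, where `w` (trigger probability), `p_N` (horizon pmf), and `mu`, `sigma` (Gaussian head) are assumed stand-ins for network outputs:

```python
import numpy as np

def joint_log_prob(w, p_N, mu, sigma, c, N, u):
    """Log-probability of action (c, N, u) under the factored policy
    pi_theta(a|s) = P^c(c|s) * P^N(N|s) * P^mu(u|s,N)."""
    log_pc = np.log(w) if c == 1 else np.log(1.0 - w)       # Bernoulli trigger
    log_pu = (-0.5 * ((u - mu) / sigma) ** 2
              - np.log(sigma * np.sqrt(2.0 * np.pi)))       # Gaussian control
    if c == 1:                                              # horizon only on re-solve
        return log_pc + np.log(p_N[N]) + log_pu
    return log_pc + log_pu
```

Summing per-head log-probabilities like this is what lets a single PPO objective train the discrete and continuous components jointly.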

3. Optimal Control Problem and Dual-Mode Operation

At each event-triggered recomputation ($c_t=1$), the MPC OCP has the generic form:

$$\begin{aligned} &\min_{x_{0:N_t},\,u_{0:N_t-1}} \; \sum_{k=0}^{N_t-1}\rho^k\,\ell_{\theta^M}(x_k,u_k,\hat p_{t+k}) + \rho^{N_t}\,m_{\theta^M}(x_{N_t}) \\ &\text{s.t.} \quad x_0 = \bar x_t,\quad x_{k+1} = \hat f_{\theta^M}(x_k, u_k, \hat p_{t+k}), \\ &\qquad\quad\; h_{\theta^M}(x_k, u_k) \le 0, \quad u_k \in \mathcal U,\; x_k \in \mathcal X \end{aligned}$$

For linear-quadratic cases, the stage and terminal costs reduce to $\ell(x,u)=x^\top Q x + u^\top R u$ and $m(x)=x^\top P x$.

In between OCP solutions, the control law switches to a dual-mode architecture, applying $u_t = u_{t-i}^{\mathrm{MPC}} + K_{t-i}(\hat x_{t-i} - \bar x_t)$, where $K_{t-i}$ is computed by linearizing the MPC dynamics along the predicted trajectory (Bøhn et al., 2021).
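One standard way to obtain such time-varying gains is a backward Riccati recursion over the linearizations $(A_k, B_k)$ of the dynamics along the last predicted trajectory. The sketch below shows this recursion under that assumption; the paper's exact procedure may differ.

```python
import numpy as np

def lqr_gains_along_trajectory(A_list, B_list, Q, R, P_terminal):
    """Backward Riccati recursion yielding time-varying LQR gains K_k for
    linearized dynamics x_{k+1} = A_k x_k + B_k u_k along an MPC trajectory."""
    P = P_terminal
    gains = []
    for A, B in zip(reversed(A_list), reversed(B_list)):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # gain K_k
        P = Q + A.T @ P @ A - A.T @ P @ B @ K               # cost-to-go update
        gains.append(K)
    return gains[::-1]                                       # reorder to k = 0..N-1
```

Between solves, the stored gain for the relevant plan index supplies the corrective feedback term in the dual-mode control law.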

4. Reinforcement Learning Formulation

The RL Markov Decision Process is defined as:

  • State: $s_t$ includes all plant states and OCP “history” variables.
  • Action: $a_t=(c_t, N_t, u_t^{\mathrm{MPC}}, u_t^{\mathrm{ML}})$ comprises both the discrete recomputation/horizon decisions and the continuous MPC or LQR control perturbations.
  • Reward:

$$r_t = -\ell(\bar x_{t+1}, u_t) - \lambda_h\,[\#\text{ constraint violations at } t+1] - \lambda_c\, c_t N_t$$

with $\lambda_h > 0$ weighting the episode-ending constraint-violation penalty and $\lambda_c > 0$ penalizing time and effort spent on MPC computation.
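The reward can be computed directly from the formula above; the coefficient values below are illustrative defaults, not the paper's.

```python
def reward(stage_cost, n_violations, c_t, N_t, lam_h=100.0, lam_c=0.01):
    """Reward r_t = -l(x,u) - lam_h * [# violations] - lam_c * c_t * N_t.
    The compute penalty scales with the horizon, and is paid only when
    the OCP is actually re-solved (c_t = 1)."""
    compute_penalty = lam_c * N_t if c_t else 0.0
    return -stage_cost - lam_h * n_violations - compute_penalty
```

Tying the compute penalty to $c_t N_t$ is what gives the agent an incentive to skip solves and to choose short horizons whenever performance permits.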

End-to-end training is performed with Proximal Policy Optimization (PPO), updating all policy parameters through advantage-weighted policy gradients (Bøhn et al., 2021).

5. Event-Triggered Computation and Computational Efficiency

The event-triggered mechanism, governed by the Bernoulli policy’s logit, determines when to recompute the MPC solution. When $c_t=1$, the OCP is re-solved at a chosen horizon; when $c_t=0$, the controller relies on the shifted MPC plan and associated LQR gain. The RL agent thus learns to:

  • Invoke long horizons when necessary (e.g., instability, risky regions).
  • Avoid unnecessary computation when the system is close to nominal or can be stabilized cheaply by LQR.

Empirical evidence on the inverted pendulum task showed a 36% reduction in total MPC computation time (fewer OCP solves, longer intervals between solves) and an 18.4% improvement in control performance compared to the best fixed-horizon, always-recompute MPC baseline, validating the computational and performance synergy of the approach (Bøhn et al., 2021).

6. Training Procedure and Practical Implementation

The control framework is implemented with multiple parallel actors operating under PPO, with episodes terminated on constraint violation or after $T$ steps. Training involves the following loop:

  • Form state $s_t$, sample decision $c_t$.
  • If $c_t=1$, sample $N_t$, solve MPC, apply $u_t = u_0^{\mathrm{MPC}}$.
  • If $c_t=0$, apply the next input from the last MPC plan plus LQR correction.
  • Store transitions $(s_t, c_t, N_t, u_t, r_t, s_{t+1})$.
  • After $Z$ steps, batch PPO updates are performed on the collected trajectories.

Key hyperparameters: $\gamma=0.99$, learning rate $3\times 10^{-4}$, minibatch size 256, PPO clip $\epsilon=0.25$, value-loss coefficient 0.5, no entropy bonus.
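The batched updates use the standard PPO clipped surrogate; a minimal numpy sketch with the stated clip $\epsilon = 0.25$:

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.25):
    """Clipped PPO surrogate (to be minimized) over a batch of transitions.
    ratio = pi_theta(a|s) / pi_theta_old(a|s), per transition."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.minimum(unclipped, clipped).mean()   # pessimistic of the two
```

Because the hierarchical policy's log-probability is a sum over heads, the same ratio-based objective updates the trigger, horizon, and control components in one gradient step.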

Efficient real-time application is ensured by embedding the RL policy evaluation (for the event trigger and horizon) immediately before each control/optimization step, leveraging fast warm-started quadratic/convex solvers whose computational cost scales linearly in $N_t$ (Bøhn et al., 2021).

7. Significance, Generality, and Empirical Results

The integrated MPC–RL scheme establishes a flexible and computationally scalable paradigm for algorithmic tuning of predictive controllers. It:

  • Automatically selects the sequence of OCP solves (frequency, horizon, and associated cost weights) for each plant state, trading off performance and computation.
  • Recovers significant gains over both naive trial-and-error MPC tuning and fixed-parameter deployments, as shown by substantial cost reduction and compute savings.
  • Is readily extensible: the mixture-policy and meta-parameter RL setup apply to any prediction-based controller with tunable horizon, weights, and event logic.

The approach has been demonstrated to reduce total closed-loop cost by 21.5% and MPC computation time by 36% on well-established control benchmarks, offering a blueprint for future adaptive and resource-aware controllers in embedded and fast real-time environments (Bøhn et al., 2021).


References:

  • “Optimization of the Model Predictive Control Meta-Parameters Through Reinforcement Learning” (Bøhn et al., 2021)