Integrated MPC-RL Framework
- The paper presents an integrated control framework that fuses MPC's constraint handling with RL's adaptive policy search for event-triggered decision making.
- It employs a hierarchical policy to jointly tune meta-parameters such as prediction horizon, cost weights, and event triggers, balancing performance and computation.
- Empirical results report a 36% reduction in MPC computation time and an 18.4% improvement in control performance, underscoring its practical efficacy.
An integrated Model Predictive Control–Reinforcement Learning (MPC–RL) framework is an advanced closed-loop control architecture that combines classic MPC—known for constraint handling and model-based prediction—with reinforcement learning, which enables adaptive policy search, end-to-end performance optimization, and tuning of parameters that influence control structure, operation, and computational efficiency. Such hybrid approaches aim to simultaneously optimize control performance, resource utilization, and computational cost by leveraging both model-based and data-driven learning principles in a unified framework (Bøhn et al., 2021).
1. Architecture and Control Workflow
The integrated MPC–RL framework exposes meta-parameters of the MPC controller (such as prediction horizon, cost-function weights, and event-triggering thresholds) as decision variables for the RL agent. At each plant step $t$, the RL-based control loop operates as a hierarchical, event-triggered decision process:
- The agent observes an augmented state $s_t$, encoding the last OCP solve, current state and input, and the “age” since the last solve.
- It samples a binary decision $d_t \sim \mathrm{Bernoulli}(\sigma(z_t))$ (“recompute MPC?”), where $z_t$ is the policy’s trigger logit and $\sigma$ is the logistic sigmoid.
- If $d_t = 1$, it samples a prediction horizon $N_t$ from the horizon policy, solves the MPC open-loop optimal control problem (OCP), and applies the first MPC input $u_t^{\mathrm{MPC}}$, updating all predicted trajectories and LQR gains.
- If $d_t = 0$, it applies the next input from the previously planned MPC sequence, with corrective LQR gains computed by linearizing around the last MPC trajectory.
- All four policy components share parameters and are jointly optimized via PPO.
This structure enables the RL policy to govern both discrete (event-trigger and horizon) and continuous (input selection) meta-parameters, with the event-trigger logic adaptively trading off re-optimization frequency and control performance (Bøhn et al., 2021).
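The event-triggered decision step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names (`control_step`, `solve_ocp`, `shifted_plan_input`) and the interface between policy outputs and the MPC solver are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def control_step(s_aug, trigger_logit, horizon_probs, solve_ocp, shifted_plan_input):
    """One event-triggered decision step (illustrative sketch).

    s_aug              -- augmented state (plant state, last OCP solution, age)
    trigger_logit      -- policy logit for the recompute decision
    horizon_probs      -- categorical probabilities over candidate horizons
    solve_ocp          -- callable (state, horizon) -> planned input sequence
    shifted_plan_input -- callable (state) -> LQR-corrected input from last plan
    """
    p = sigmoid(trigger_logit)
    recompute = rng.random() < p               # d_t ~ Bernoulli(p)
    if recompute:
        # Pick a horizon, re-solve the OCP, apply the first planned input.
        horizon = rng.choice(len(horizon_probs), p=horizon_probs)
        plan = solve_ocp(s_aug, horizon + 1)
        u = plan[0]
    else:
        # Reuse the shifted plan with an LQR correction; no OCP solve.
        u = shifted_plan_input(s_aug)
    return u, recompute
```

In a real deployment the two callables would wrap a warm-started OCP solver and the stored MPC trajectory plus its LQR gains.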
2. Meta-Parameter Optimization and Policy Representation
Key meta-parameters treated as RL actions include:
- Prediction Horizon $N$, controlling the look-ahead window and impacting both MPC performance and computational load.
- MPC Cost-function Weights $Q, R$, shaping stage cost and control priorities; these also define the LQR feedback law between solves.
- Event-trigger Threshold, mapped to the Bernoulli parameter $p_t = \sigma(z_t)$ of the recompute decision $d_t$.
A hierarchical mixture-distribution policy factorizes as
$$\pi_\theta(d_t, N_t, u_t \mid s_t) = \pi_\theta(d_t \mid s_t)\,\pi_\theta(N_t \mid s_t, d_t)\,\pi_\theta(u_t \mid s_t, d_t, N_t),$$
where $\pi_\theta(d_t \mid s_t)$ and $\pi_\theta(N_t \mid s_t, d_t)$ select the structure of the OCP, and the continuous control $u_t$ is sampled from a Gaussian whose mean is determined by either the new MPC solution (if $d_t = 1$) or the composite MPC–LQR feedback (if $d_t = 0$). This policy parametrization ensures efficient exploration in both discrete and continuous spaces (Bøhn et al., 2021).
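A factorized policy of this kind can be sampled sequentially, accumulating the joint log-probability term by term. The sketch below is illustrative: the function name, the fixed Gaussian standard deviation, and the softmax horizon head are assumptions, not the paper's exact parametrization.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_action(trigger_logit, horizon_logits, mpc_mean, lqr_mean, std=0.1):
    """Sample a_t = (d, N, u) from the factorized policy
    pi(d|s) * pi(N|s,d) * pi(u|s,d,N) and return its log-probability."""
    p = 1.0 / (1.0 + np.exp(-trigger_logit))
    d = int(rng.random() < p)                     # Bernoulli trigger
    logp = np.log(p if d else 1.0 - p)
    if d:
        # Recompute: categorical horizon, Gaussian centred on the new MPC input.
        probs = np.exp(horizon_logits - horizon_logits.max())
        probs /= probs.sum()
        N = int(rng.choice(len(probs), p=probs))
        logp += np.log(probs[N])
        mean = mpc_mean
    else:
        # No recompute: Gaussian centred on the MPC+LQR feedback input.
        N, mean = None, lqr_mean
    u = rng.normal(mean, std)
    logp += -0.5 * ((u - mean) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi))
    return (d, N, u), logp
```

The returned log-probability is exactly what PPO needs for its importance ratios, so the discrete and continuous components are trained jointly.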
3. Optimal Control Problem and Dual-Mode Operation
At each event-triggered recomputation ($d_t = 1$), the MPC OCP has the generic form:
$$\min_{u_{0:N-1}} \sum_{k=0}^{N-1} \ell(x_k, u_k) + m(x_N) \quad \text{s.t.}\quad x_{k+1} = f(x_k, u_k),\; x_0 = x_t,\; x_k \in \mathcal{X},\; u_k \in \mathcal{U}.$$
For linear-quadratic cases, stage and terminal costs reduce to $\ell(x, u) = x^\top Q x + u^\top R u$ and $m(x) = x^\top P x$.
In between OCP solutions, the control law switches to a dual-mode architecture, applying $u_t = u_t^{\mathrm{MPC}} - K_t (x_t - x_t^{\mathrm{MPC}})$, where the gain $K_t$ is computed by linearizing the MPC dynamics along the predicted trajectory (Bøhn et al., 2021).
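The time-varying LQR gains used between solves can be obtained by a backward Riccati recursion along the linearized predicted trajectory. This is a generic sketch of that standard recursion, assuming the linearizations $A_k, B_k$ are already available; function names are illustrative.

```python
import numpy as np

def lqr_gains(A_seq, B_seq, Q, R, P_term):
    """Finite-horizon time-varying LQR gains via backward Riccati recursion.

    A_seq, B_seq -- linearized dynamics x_{k+1} = A_k x + B_k u along the
                    last predicted MPC trajectory (lists of matrices)
    Q, R, P_term -- stage-cost and terminal-cost weight matrices
    Returns gains K_k such that u_k = u_k^MPC - K_k (x_k - x_k^MPC)."""
    P = P_term
    gains = []
    for A, B in zip(reversed(A_seq), reversed(B_seq)):
        # K = (R + B' P B)^{-1} B' P A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati update: P = Q + A' P (A - B K)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]

def dual_mode_input(u_mpc, x_mpc, x, K):
    """Dual-mode control between solves: planned input plus LQR correction."""
    return u_mpc - K @ (x - x_mpc)
```

Because the recursion reuses the stored trajectory, applying these gains costs only a matrix-vector product per step, versus a full OCP solve.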
4. Reinforcement Learning Formulation
The RL Markov Decision Process is defined as:
- State: $s_t$ includes all plant states and OCP “history” variables.
- Action: $a_t = (d_t, N_t, u_t)$ draws both discrete recomputation/horizon decisions and continuous MPC or LQR perturbations.
- Reward: $r_t = -\ell(x_t, u_t) - r_t^{\mathrm{viol}} - r_t^{\mathrm{comp}}$,
with $r_t^{\mathrm{viol}}$ imposing episode-ending penalties for constraint violations and $r_t^{\mathrm{comp}}$ penalizing time and effort spent in MPC computation.
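A reward of this shape can be sketched as below. The coefficient names and values (`c_comp`, `c_viol`) are illustrative placeholders, not the paper's settings; they only show how the stage cost, computation penalty, and violation penalty combine.

```python
import numpy as np

def reward(x, u, recomputed, violated, Q, R, c_comp=0.01, c_viol=100.0):
    """Per-step reward: negative LQ stage cost, minus a computation penalty
    whenever the OCP was re-solved, minus a large penalty on constraint
    violation (after which the episode terminates).
    c_comp and c_viol are illustrative, not the paper's coefficients."""
    r = -(x @ Q @ x + u @ R @ u)   # negative stage cost
    if recomputed:
        r -= c_comp                # charge for the OCP solve
    if violated:
        r -= c_viol                # episode-ending constraint penalty
    return r
```

The computation penalty is what drives the trigger policy toward sparse re-optimization: recomputing only pays off when the stage-cost improvement exceeds `c_comp`.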
End-to-end training is performed with Proximal Policy Optimization (PPO), updating all policy parameters through advantage-weighted policy gradients (Bøhn et al., 2021).
5. Event-Triggered Computation and Computational Efficiency
The event-triggered mechanism, governed by the Bernoulli policy’s logit, determines when to recompute the MPC solution. When $d_t = 1$, the OCP is re-solved at a chosen horizon; when $d_t = 0$, the controller relies on the shifted MPC plan and associated LQR gain. The RL agent thus learns to:
- Invoke long horizons when necessary (e.g., instability, risky regions).
- Avoid unnecessary computation when the system is close to nominal or can be stabilized cheaply by LQR.
Empirical evidence on the inverted pendulum task showed a 36% reduction in total MPC computation time (fewer OCP solves, longer intervals between solves) and an 18.4% improvement in control performance compared to the best fixed-horizon, always-recompute MPC baseline, validating the computational and performance synergy of the approach (Bøhn et al., 2021).
6. Training Procedure and Practical Implementation
The control framework is implemented with multiple parallel actors operating under PPO, with episodes terminated on constraint violation or after a fixed maximum number of steps. Training involves the following loop:
- Form the augmented state $s_t$ and sample the recompute decision $d_t$.
- If $d_t = 1$, sample a horizon $N_t$, solve the MPC OCP, and apply the first input $u_t$.
- If $d_t = 0$, apply the next input from the last MPC plan plus LQR correction.
- Store transitions $(s_t, a_t, r_t, s_{t+1})$.
- After a fixed number of rollout steps, batch PPO updates are performed on the collected trajectories.
Key hyperparameters: minibatch size $256$, value-loss coefficient $0.5$, no entropy bonus; the discount factor, learning rate, and PPO clip range follow the original paper.
Efficient real-time application is ensured by embedding the RL policy evaluation (for the event trigger and horizon) immediately before each control/optimization step, leveraging fast warm-started quadratic/convex solvers whose computational cost scales linearly in the horizon $N$ (Bøhn et al., 2021).
7. Significance, Generality, and Empirical Results
The integrated MPC–RL scheme establishes a flexible and computationally scalable paradigm for algorithmic tuning of predictive controllers. It:
- Automatically selects the sequence of OCP solves (frequency, horizon, and associated cost weights) for each plant state, trading off performance and computation.
- Recovers significant gains over both naive trial-and-error MPC tuning and fixed-parameter deployments, as shown by substantial cost reduction and compute savings.
- Is readily extensible: the mixture-policy and meta-parameter RL setup apply to any prediction-based controller with tunable horizon, weights, and event logic.
The approach has been demonstrated to improve control performance by 18.4% and reduce MPC computation time by 36% on well-established control benchmarks, offering a blueprint for future adaptive and resource-aware controllers in embedded and fast real-time environments (Bøhn et al., 2021).
References:
- “Optimization of the Model Predictive Control Meta-Parameters Through Reinforcement Learning” (Bøhn et al., 2021)