Parameterized MPC as Policy Actor

Updated 9 May 2026

Parameterized MPC is defined as a differentiable actor that reparameterizes the conventional finite-horizon optimal control law, linking MPC with policy-gradient methods.
It utilizes explicit neural surrogates, hybrid architectures, and differentiable solver unrollings to enable end-to-end gradient-based optimization for safe and efficient control.
These approaches target constraint satisfaction and real-time performance in advanced applications such as robotics, automotive systems, and embedded control.

Treating a parameterized Model Predictive Control (MPC) policy as a differentiable actor formalizes the link between model-based control and modern policy-based learning and inference. In this paradigm, the MPC law, typically defined as the solution to a finite-horizon constrained optimal control problem, is rendered as a parametric function $\pi_\theta$ via explicit or partially implicit parameterization. Such parameterized actors bridge high-performance, safety-critical, structure-preserving MPC schemes and the flexibility of function approximators or policy-gradient-based learning frameworks. Contemporary approaches span explicit neural surrogates of the MPC map, hybrid stacked architectures, differentiable solver unrollings, and actor-critic frameworks in reinforcement learning and batch learning settings.

1. Formalization: MPC as a Parameterized Policy

The central object is the receding-horizon optimal control law for a discrete-time system,

$x_{t+1} = f(x_t, u_t),\qquad x_t \in \mathbb{R}^{n_x},\; u_t \in \mathbb{R}^{n_u},$

with finite-horizon, possibly parameterized, cumulative cost

$J_\theta(x_0, u_{0:H-1}) = \sum_{t=0}^{H-1} \ell_\theta(x_t,u_t) + V_{f,\theta}(x_H)$

and constraints coupling $u_t,x_t$ and possibly time-varying, stochastic, or uncertain model information.

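The classical (implicit) MPC law, with $f_\theta$ and $g_\theta$ denoting the possibly parameterized prediction model and constraint functions, is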

$U^*(x_0; \theta, H) = \arg\min_{u_{0:H-1}} J_\theta(x_0,u_{0:H-1}) \quad \text{s.t.} \quad x_{t+1} = f_\theta(x_t, u_t),\; g_\theta(x_t,u_t) \le 0.$

The policy (actor) is then

$\pi_\theta(x_0,H) = U^*(x_0;\theta,H)[0] = u_0^*(x_0; \theta, H).$

Parameterization enters through $\theta$, which may collect cost weights, constraints, horizon, reference signals, learned model parameters, embedding weights (in explicit surrogates), or any other degrees of freedom in the control law. The policy map $\pi_\theta$ may be deployed directly as a closed-form controller, or serve as a differentiable policy within a gradient-based learning loop, typically an actor-critic or imitation-learning algorithm (Wu et al., 9 Sep 2025, Vahidi-Moghaddam et al., 5 Oct 2025, Zanon et al., 2021, Lu et al., 2023).
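To make the map $\theta \mapsto \pi_\theta$ concrete, the following is a minimal sketch under strong simplifying assumptions: a scalar linear system $x_{t+1} = a x_t + b u_t$, quadratic stage and terminal costs, and no inequality constraints, so that the finite-horizon problem can be solved by a backward Riccati recursion instead of a numerical solver. The names `Theta` and `mpc_policy` and the default values of `a` and `b` are illustrative and do not come from the cited works.

```python
from dataclasses import dataclass


@dataclass
class Theta:
    """Hypothetical parameter vector: stage weights and terminal weight."""
    q: float  # state weight in the stage cost l_theta
    r: float  # input weight in the stage cost l_theta
    p: float  # terminal weight in V_{f,theta}


def mpc_policy(x0: float, theta: Theta, H: int, a: float = 1.0, b: float = 1.0) -> float:
    """Return u_0^*(x0; theta, H) for x_{t+1} = a*x_t + b*u_t without inequality constraints."""
    P = theta.p  # cost-to-go weight at the end of the horizon
    K = 0.0
    for _ in range(H):  # backward Riccati recursion, from t = H-1 down to t = 0
        K = a * b * P / (theta.r + b * b * P)
        P = theta.q + a * a * P - a * b * P * K
    return -K * x0  # only the first input of the optimal sequence is applied


theta = Theta(q=1.0, r=1.0, p=1.0)
u_first = mpc_policy(2.0, theta, H=1)  # one-step horizon, gain K = 0.5
```

Changing `theta` changes the feedback gain and therefore the closed-loop behavior, which is the sense in which the MPC law acts as a parameterized actor. With constraints present, the recursion would be replaced by a QP or NLP solve.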

2. Differentiable and Surrogate Parameterizations

Explicit neural or transformer-based parameterizations replace the on-line MPC optimization by a trainable mapping, typically:

A deep neural network $U_\theta(P)$, learned to approximate the solution map $U^*(P)$ over a set of scenario parameters $P$ (state, references), generally using supervised or adversarial losses (Zhang et al., 2019, Burnwal et al., 2023).
Sequence-to-sequence structures, such as encoder-only Transformers, map state and planning horizon to a vectorized open-loop control sequence in parallel, affording single-pass inference at any horizon $H$, as in TransMPC (Wu et al., 9 Sep 2025):

$U_\theta(x_0, H) = [\hat{u}_0, \hat{u}_1, \ldots, \hat{u}_{H-1}]$

Hybrid surrogates pair neural networks with lower-dimensional high-level parameters that then drive a conventional MPC, as in hyMPC’s two-level architecture (Feng et al., 2024). The higher-level neural policy predicts explicit scenario parameters (e.g., task priorities, reference blending weights, plan times) which are injected into an MPC subproblem executed as-is downstream.
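The supervised-surrogate idea in the first item above can be illustrated with a deliberately small stand-in. The sketch below assumes an unconstrained scalar linear-quadratic problem, for which the expert MPC law is linear in the state, and fits a one-parameter surrogate $u = w x$ by least squares. A deep network and a constrained expert would replace both pieces in practice; `expert_gain` and `fit_linear_surrogate` are hypothetical names.

```python
def expert_gain(q, r, p, H, a=1.0, b=1.0):
    """First-step feedback gain of the unconstrained finite-horizon LQ problem."""
    P, K = p, 0.0
    for _ in range(H):  # backward Riccati recursion
        K = a * b * P / (r + b * b * P)
        P = q + a * a * P - a * b * P * K
    return K


def fit_linear_surrogate(states, actions):
    """Least-squares fit of the surrogate u = w * x to expert (state, action) pairs."""
    num = sum(x * u for x, u in zip(states, actions))
    den = sum(x * x for x in states)
    return num / den


K = expert_gain(q=1.0, r=0.5, p=1.0, H=10)
states = [-2.0 + 0.25 * i for i in range(17)]  # sampled scenario parameters (here only x0)
actions = [-K * x for x in states]             # expert labels from the MPC law
w = fit_linear_surrogate(states, actions)      # surrogate weight, recovers -K here
```

Because the expert is exactly linear in this toy setting, the surrogate reproduces it without error. With constraints the solution map becomes piecewise affine or nonlinear, which is where richer function classes become necessary.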

Automatic differentiation through these architectures enables direct minimization of the true finite-horizon cost, with gradients passing through both the neural decoder and the system dynamics or constraint Jacobians (Wu et al., 9 Sep 2025, Carius et al., 2019). For surrogate or unrolled QP solvers, back-propagation may be implemented through a fixed number of primal-dual iterations (e.g., PDHG/RNN) or via implicit KKT sensitivities (Lu et al., 2023).
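The two differentiation routes mentioned here, unrolling a fixed number of solver iterations versus using implicit sensitivities at the optimum, can be compared on a toy problem. The sketch below assumes a one-step, unconstrained scalar problem $\min_u \tfrac12 r u^2 + \tfrac12 q (a x + b u)^2$, so the KKT conditions reduce to plain stationarity, and uses gradient descent as a stand-in for a primal-dual method. The derivative with respect to the input weight $r$ is propagated by hand in forward mode. All names and numerical values are illustrative.

```python
def unrolled_solve(r, x, n_iters=200, alpha=0.1, a=1.0, b=1.0, q=1.0):
    """Gradient descent on 0.5*r*u^2 + 0.5*q*(a*x + b*u)^2, carrying du/dr along."""
    u, du_dr = 0.0, 0.0
    for _ in range(n_iters):
        grad = r * u + q * b * (a * x + b * u)   # derivative of the objective w.r.t. u
        dgrad_dr = u + (r + q * b * b) * du_dr   # total derivative of grad w.r.t. r
        u, du_dr = u - alpha * grad, du_dr - alpha * dgrad_dr
    return u, du_dr


def implicit_sensitivity(r, x, a=1.0, b=1.0, q=1.0):
    """Closed-form optimum and du*/dr obtained from the implicit function theorem."""
    hess = r + q * b * b                 # second derivative of the objective w.r.t. u
    u_star = -q * a * b * x / hess       # stationarity: r*u + q*b*(a*x + b*u) = 0
    return u_star, -u_star / hess        # du*/dr = -(d grad / dr) / hess


u_unrolled, s_unrolled = unrolled_solve(r=1.0, x=2.0)
u_implicit, s_implicit = implicit_sensitivity(r=1.0, x=2.0)
```

With enough iterations the unrolled derivative converges to the implicit one. With a small `n_iters` the unrolled value is the exact derivative of the truncated solver output, which generally differs from the sensitivity of the true optimum.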

3. Training and Policy Optimization Paradigms

Training methodologies for parameterized MPC actors span direct policy optimization, supervised imitation, adversarial imitation, actor-critic RL, and hybrid techniques:

Direct minimization of finite-horizon cost: Given a differentiable policy $\pi_\theta$, compute

$\nabla_\theta J(x_0, u_{0:H-1}), \qquad u_{0:H-1} = U_\theta(x_0, H),$

and update $\theta$ via stochastic gradient descent, with gradients backpropagated through both the parameterized policy map and the discrete-time system rollout (Wu et al., 9 Sep 2025).

Imitation learning: Supervised regression or Hamiltonian loss-based imitation from MPC-expert trajectories, often using mixture-of-expert policies to capture multimodality. Losses may be defined via metric distances to teacher actions or through optimality conditions, e.g. minimizing the control Hamiltonian or primal-dual value gaps (Carius et al., 2019, Zhang et al., 2019).
Adversarial learning: Matching the distribution of state or trajectory features between expert (possibly non-identical) and policy-induced rollouts, using GAN-based objectives that minimize Jensen-Shannon divergence (Burnwal et al., 2023).
Actor-critic/batch RL frameworks: Embedding the parameterized MPC as the actor in a deterministic or stochastic policy-gradient setting, with gradients computed via KKT-based or unrolled sensitivity analysis and explicit constraint-aware exploration strategies (Gros et al., 2020, Kordabad et al., 2021, Vahidi-Moghaddam et al., 5 Oct 2025, Zanon et al., 2021, Mallick et al., 2023).
Hybrid surrogate training: For computationally intensive MPCs, employing faster surrogates (small neural controllers or truncated rollouts) for gradient-based training, while deferring to the full MPC solution at deployment so that all constraints and stability guarantees are retained (Lawrence et al., 1 Apr 2025).
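The first paradigm in the list above, direct minimization of the finite-horizon cost by differentiating through the rollout, can be sketched in a few lines. The example assumes a scalar linear system, a one-parameter linear feedback policy $u_t = -k x_t$ in place of a neural network, a single fixed initial state, and hand-written forward-mode derivatives instead of an autodiff library. The function names and all numerical values are hypothetical.

```python
def cost_and_grad(k, x0=1.0, H=10, a=1.2, b=1.0, q=1.0, r=0.1):
    """Finite-horizon cost of u_t = -k*x_t and its derivative w.r.t. k."""
    x, dx = x0, 0.0          # state and its sensitivity dx/dk
    J, dJ = 0.0, 0.0
    for _ in range(H):
        u = -k * x
        du = -x - k * dx     # chain rule through the policy
        J += q * x * x + r * u * u
        dJ += 2.0 * q * x * dx + 2.0 * r * u * du
        x, dx = a * x + b * u, a * dx + b * du  # chain rule through the dynamics
    J += q * x * x           # terminal cost
    dJ += 2.0 * q * x * dx
    return J, dJ


def train(k=0.5, lr=0.01, steps=200):
    """Plain gradient descent on the rollout cost."""
    for _ in range(steps):
        _, g = cost_and_grad(k)
        k -= lr * g
    return k


J_before, _ = cost_and_grad(0.5)
k_trained = train()
J_after, _ = cost_and_grad(k_trained)
```

A stochastic variant would sample initial states per step and average the gradients. The structure of the backward (or here forward) pass through policy and dynamics stays the same.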

4. Constraint Satisfaction, Feasibility, and Safety

Parameterizing MPC policies as actors opens the challenge of maintaining feasibility, constraint satisfaction, and robustness under learning:

Constraint-preserving parameterizations: By encoding all safety constraints inside the MPC optimization (or directly in the neural surrogate), policies are constrained to remain inside the feasible set at all times. In supervised settings, explicit feasibility constraints are enforced sample-wise; in RL, parameter updates may require projection onto certified feasible domains or addition of data-driven safety constraints at each iteration (Zhang et al., 2019, Gros et al., 2020, Gros et al., 2020).
Lyapunov and MPC stability guarantees: By parameterizing terminal cost, terminal set, and constraint margins and enforcing value-function decrease or joint Lyapunov conditions at each parameter update, stability is systematically certified alongside recursive feasibility across RL-driven $\theta$ adaptation (Gros et al., 2020, Zanon et al., 2021).
Exploration strategies: For safe exploration during RL, various strategies ensure that noise added to policy actions never violates constraints, including robustification via constraint tightening, one-step projection back to feasibility, and Gaussian-perturbed cost augmentations (Kordabad et al., 2021, Gros et al., 2019).
High-probability guarantees: Scenario-based or VC-dimension theory provides certificates of feasibility and near-optimality for supervised-learned surrogates, bounding the probability that a surrogate actor produces an infeasible or highly suboptimal control input (Zhang et al., 2019).
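One of the exploration strategies listed above, one-step projection back to feasibility, admits a compact sketch. It assumes a scalar linear system with symmetric box constraints on the input and on the next state, so that the feasible input set is an interval and the projection is a clamp. The functions `feasible_input_interval` and `explore_safely` and all limits are illustrative assumptions, not taken from the cited works.

```python
import random


def feasible_input_interval(x, a, b, u_lim, x_lim):
    """Inputs u with |u| <= u_lim and |a*x + b*u| <= x_lim, or None if the set is empty."""
    lo, hi = -u_lim, u_lim
    if b != 0.0:
        s_lo, s_hi = sorted(((-x_lim - a * x) / b, (x_lim - a * x) / b))
        lo, hi = max(lo, s_lo), min(hi, s_hi)
    elif abs(a * x) > x_lim:
        return None
    return (lo, hi) if lo <= hi else None


def explore_safely(u_nominal, x, sigma, rng, a=1.0, b=1.0, u_lim=1.0, x_lim=2.0):
    """Add Gaussian exploration noise, then project onto the one-step feasible interval."""
    interval = feasible_input_interval(x, a, b, u_lim, x_lim)
    if interval is None:
        raise ValueError("no input keeps the next state inside the constraint set")
    lo, hi = interval
    noisy = u_nominal + rng.gauss(0.0, sigma)
    return min(max(noisy, lo), hi)  # Euclidean projection onto [lo, hi]


rng = random.Random(0)
samples = [explore_safely(0.0, 1.8, sigma=1.0, rng=rng) for _ in range(1000)]
```

The projection guarantees only one-step feasibility. Recursive feasibility over the whole horizon requires the tightening or robust formulations discussed in the list.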

5. Computational Properties and Practical Implementations

Parameterized MPC actors address real-time deployment constraints by trading off between accuracy, latency, and memory footprint:

Explicit surrogate policies: Architectures such as TransMPC evaluate the whole prediction horizon in a single parallel pass and are reported to outperform classical neural surrogates and RNN/MLP baselines in both speed and accuracy, with errors remaining close to the true MPC solution across variable horizon lengths (Wu et al., 9 Sep 2025).
Implementation on embedded hardware: Surrogate policies are reported to achieve substantial speedups over classical online QP-based MPC on automotive-grade ECUs and similar platforms, with minimal compromise on closed-loop costs and constraint violation rates (Zhang et al., 2019, Lu et al., 2023).
Two-level hybridization: By learning only a low-dimensional set of scenario parameters or decision variables via neural function approximators, as in hyMPC, the system achieves the robustness and adaptivity of iterative trajectory optimization while maintaining computation budgets amenable to high-rate flight control (Feng et al., 2024).
Distributed and multi-agent contexts: Distributed MPC-based RL schemes enable decentralized learning and policy execution, with fully distributed computation of value function gradients, ADMM-based policy evaluation, and local parameter update protocols—all while preserving global constraint satisfaction and inter-agent coupling (Mallick et al., 2023).

6. Empirical Benchmarks and Case Studies

Recent work systematically benchmarks parameterized MPC actors against standard online MPC, generic neural-actor RL agents, and alternative optimal control surrogates:

Vehicle/robotics control: TransMPC demonstrates low control-sequence error against online MPC on nonlinear vehicle dynamics, with runtime 27% faster than a comparable MLP and a further speedup over RNN baselines. Closed-loop trajectory tracking, lateral error, and cumulative cost closely match or outperform traditional MPC for moderate horizons (Wu et al., 9 Sep 2025).
Drone navigation: hyMPC achieves 100% success in swinging-gate crossing under unknown dynamics, generalizes well across initial conditions, and retains high robustness under actuator degradation tests (Feng et al., 2024).
Embedded real-time applications: DNN surrogate policies meet stringent real-time processing deadlines at the required control rates, with constraint-violation and suboptimality levels strictly bounded; backup controllers are rarely invoked (Zhang et al., 2019).
Imitation learning and distributional transfer: GAN-MPC learns explicit cost parameterizations that enable behavioral imitation from non-identical experts with only state-trajectory observation, robust to modeling errors and partial state overlap (Burnwal et al., 2023).

These results consistently indicate that parameterized MPC actors can achieve the joint objectives of interpretability, safety, hardware efficiency, and RL-compatibility in a wide range of complex continuous-control applications.
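The backup-controller pattern mentioned for embedded deployments can be sketched as a thin guard around the surrogate: the surrogate's action is used when it passes a feasibility check, and a conservative fallback is used otherwise. The surrogate, the backup law, and the constraint limits below are hypothetical placeholders chosen only so the example runs. They are not the controllers evaluated in the cited studies.

```python
def guarded_action(x, surrogate, backup, is_feasible):
    """Return (action, used_backup); fall back when the surrogate action is infeasible."""
    u = surrogate(x)
    if is_feasible(x, u):
        return u, False
    return backup(x), True


def surrogate(x):
    """Placeholder for a learned policy (hypothetical linear law)."""
    return -0.9 * x


def backup(x):
    """Placeholder conservative controller: saturated deadbeat input."""
    return max(-1.0, min(1.0, -x))


def is_feasible(x, u, u_lim=1.0, x_lim=2.0):
    """Input bound and one-step state bound for x_next = x + u (assumed model)."""
    return abs(u) <= u_lim and abs(x + u) <= x_lim


grid = [-3.0 + 0.5 * i for i in range(13)]
results = [guarded_action(x, surrogate, backup, is_feasible) for x in grid]
fallback_rate = sum(used for _, used in results) / len(results)
```

Logging `fallback_rate` during closed-loop tests gives a simple measure of how often the surrogate leaves the certified region.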

7. Architectural Variations and Future Directions

Current and future avenues in parameterized MPC actors include:

Broader policy classes: Extension beyond transformer and DNN surrogates to incorporate memory-augmented networks, hybrid symbolic-learned MPCs, and architectures leveraging attention over exogenous signals such as scenario probabilities or environment maps.
Rich parameterization: End-to-end learning of not only cost weights and references but also constraint sets, terminal-ingredient shaping, tube radii, or even the plant model (in ML-based MPC), with rigorous Lyapunov and safety certificates enforced via constrained or projection-based gradient steps (Gros et al., 2020).
Sample efficiency and robustness: Advances in batch RL, adversarial regularization, and advantage-weighted distillation aim to minimize the sample complexity and variance associated with model-driven versus data-driven parameter updates, including robustification protocols for high-variance or partially-observable domains (Amiri et al., 9 Apr 2026).
Hybrid and multi-level controllers: Two-level or hierarchical formulations (e.g., high-level task policies modulating low-level MPCs) permit decomposition of long-horizon planning, task transfer, or adaptation to previously unseen scenarios in real time (Feng et al., 2024).
Stability and safety in nonstationary and distributed settings: Systematic backtracking, feasibility filtering, and joint Lyapunov arguments ensure that online learning and networked-agent settings preserve core MPC guarantees at every parameter iterate (Gros et al., 2020, Mallick et al., 2023).

The unification of explicit and implicit MPC-based policies with differentiable programmatic actors has thus become a cornerstone of interpretable, safe, and real-time-compliant reinforcement learning and control (Wu et al., 9 Sep 2025, Vahidi-Moghaddam et al., 5 Oct 2025, Zhang et al., 2019, Feng et al., 2024).