APAC: Arbitrated Predictive Actor-Critic

Updated 2 April 2026
  • APAC is a modular control framework that fuses actor–critic RL with MPC to ensure robust and safe performance in nonlinear, constrained systems.
  • It employs parallel MPC solves with actor roll-out and a shifted warm start, effectively mitigating local minima and suboptimal trajectories.
  • The framework provides theoretical performance bounds based on critic accuracy and MPC horizon while demonstrating practical improvements in tasks like autonomous driving.

The Arbitrated Predictive Actor–Critic (APAC) algorithm is a modular control strategy that synthesizes actor–critic reinforcement learning (RL) with nonlinear model predictive control (MPC). APAC exploits the complementary strengths of RL and MPC by using a trained actor–critic pair to initialize and evaluate multiple parallel MPC solves at every timestep. The online arbitration between trajectories ensures robust performance and substantial improvement over standalone RL or MPC policies for nonlinear, constrained systems. APAC introduces performance guarantees based on critic accuracy and MPC horizon, enabling safe and efficient control without global optimality requirements (Reiter et al., 2024).

1. Problem Setting and Cost Structure

APAC is formulated for discrete-time nonlinear systems of the form

$$x_{k+1} = f(x_k, u_k), \quad x_k \in \mathbb{R}^n, \; u_k \in \mathcal{U} \subseteq \mathbb{R}^m,$$

where $f$ is smooth and $\mathcal{U}$ is compact. The infinite-horizon cost, for stage cost $\ell(x, u) \geq 0$ and discount factor $\gamma \in (0,1]$, is

$$J^{\infty}(x_0, \{ u_k \}) = \sum_{k=0}^{\infty} \gamma^k \ell(x_k, u_k),$$

with $x_{k+1}=f(x_k,u_k)$ and $x_0$ given. For a stationary policy $u_k=\pi(x_k)$, the cost is $J_\pi(x_0)$, and the optimal value function is $J^*(x)=\inf_\pi\, J_\pi(x)$.
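As a minimal sketch of the cost definition above, the following Python snippet approximates the policy cost by truncating the infinite sum after a finite number of steps. The scalar dynamics `f`, the quadratic `stage_cost`, and the linear `policy` are illustrative assumptions and do not come from the source.

```python
def discounted_cost(x0, policy, f, stage_cost, gamma, steps):
    """Truncated approximation of sum_k gamma^k * stage_cost(x_k, policy(x_k))."""
    x, total, weight = x0, 0.0, 1.0
    for _ in range(steps):
        u = policy(x)
        total += weight * stage_cost(x, u)
        x = f(x, u)          # advance the system one step
        weight *= gamma      # weight equals gamma^k at step k
    return total


# Hypothetical scalar example (assumed for illustration only).
f = lambda x, u: 0.9 * x + u
stage_cost = lambda x, u: x * x + 0.1 * u * u
policy = lambda x: -0.5 * x

cost = discounted_cost(1.0, policy, f, stage_cost, gamma=0.95, steps=200)
```

For this assumed example the closed loop is $x_{k+1} = 0.4\,x_k$, so the sum is geometric and equals $1.025 / (1 - 0.95 \cdot 0.16)$, which the truncated loop reproduces to numerical precision.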

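MPC computes an approximate solution using a finite horizon $N$ and a terminal cost $V_f$:

$$J_N(x_0, u_{0:N-1}) = \sum_{k=0}^{N-1} \gamma^k \ell(x_k, u_k) + \gamma^N V_f(x_N),$$

with the optimization $\min_{u_{0:N-1} \in \mathcal{U}^N} J_N(x_0, u_{0:N-1})$ subject to $x_{k+1} = f(x_k, u_k)$.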
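The finite-horizon objective with a terminal cost can be sketched in Python as below. This is an illustration under stated assumptions: the function name `mpc_objective`, the discounting of the terminal cost by the accumulated discount, and the toy dynamics and costs are all choices made here, not details confirmed by the source.

```python
def mpc_objective(x0, inputs, f, stage_cost, terminal_cost, gamma):
    """Discounted stage costs along the predicted trajectory plus a terminal cost."""
    x, total, weight = x0, 0.0, 1.0
    for u in inputs:
        total += weight * stage_cost(x, u)
        x = f(x, u)
        weight *= gamma
    # Assumption: the terminal cost is discounted by gamma^N.
    return total + weight * terminal_cost(x)


# Hypothetical toy problem (assumed for illustration only).
f = lambda x, u: x + u
stage_cost = lambda x, u: x * x + u * u
terminal_cost = lambda x: 2.0 * x * x

value = mpc_objective(1.0, [-0.5, -0.5], f, stage_cost, terminal_cost, gamma=0.5)
```

An MPC solver would minimize this function over the input sequence subject to the input constraints; here it is only evaluated for one fixed sequence.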

This structure generalizes stochastic and deterministic optimal control, where the limitations of a short MPC horizon or approximate critic induce suboptimality.

2. Actor–Critic Representation and Training

APAC assumes offline availability of:

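  • Critic $\hat{J}_\phi(x)$, trained to minimize the Bellman residual:

$$\min_\phi \; \mathbb{E}_{(x,u,x^+) \sim \mathcal{D}} \Big[ \big( \hat{J}_\phi(x) - \ell(x,u) - \gamma \hat{J}_\phi(x^+) \big)^2 \Big],$$

where $\mathcal{D}$ is a replay buffer.

  • Actor $\pi_\theta(x)$, trained to minimize the expected Q-value or value objective

$$\min_\theta \; \mathbb{E}_{x \sim \mathcal{D}} \Big[ \ell(x, \pi_\theta(x)) + \gamma \hat{J}_\phi\big(f(x, \pi_\theta(x))\big) \Big].$$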

The actor is used offline during training (e.g., with Soft Actor–Critic (SAC) or Proximal Policy Optimization (PPO)), and online for MPC initialization and trajectory roll-outs.
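To make the critic's training objective concrete, the sketch below evaluates a mean squared one-step Bellman residual over a small replay buffer. It assumes a fixed-policy (policy evaluation) form of the residual and a buffer of `(x, u, cost, x_next)` tuples; both the tuple layout and the two example critics are hypothetical, and practical SAC or PPO implementations differ in detail.

```python
def bellman_residual_loss(critic, batch, gamma):
    """Mean squared residual critic(x) - (cost + gamma * critic(x_next))."""
    residuals = [
        critic(x) - (cost + gamma * critic(x_next))
        for (x, _u, cost, x_next) in batch
    ]
    return sum(r * r for r in residuals) / len(residuals)


# Hypothetical buffer: policy u = -0.5 x on x+ = x + u, stage cost x^2.
gamma = 0.8
batch = [(x, -0.5 * x, x * x, 0.5 * x) for x in (-1.0, 0.5, 2.0)]

exact_critic = lambda x: 1.25 * x * x    # true discounted cost of this policy
biased_critic = lambda x: 1.0 * x * x    # deliberately inaccurate critic

loss_exact = bellman_residual_loss(exact_critic, batch, gamma)
loss_biased = bellman_residual_loss(biased_critic, batch, gamma)
```

The exact critic gives a residual of zero up to rounding, while the biased critic gives a strictly positive loss, which is the quantity a training procedure would drive down.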

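An actor roll-out for $N$ steps (the MPC horizon) from state $x$ constructs candidate actions

$$\tilde{u}_i = \pi_\theta(\tilde{x}_i), \quad \tilde{x}_{i+1} = f(\tilde{x}_i, \tilde{u}_i), \quad \tilde{x}_0 = x, \quad i = 0, \dots, N-1,$$

producing the trajectory $(\tilde{x}_{0:N}, \tilde{u}_{0:N-1})$.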
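The roll-out described above amounts to simulating the model under the actor. A minimal sketch follows; the function name `actor_rollout` and the scalar `actor` and `f` are assumptions made for illustration.

```python
def actor_rollout(x0, actor, f, horizon):
    """Simulate the actor for `horizon` steps.

    Returns horizon + 1 states (including x0) and horizon inputs.
    """
    states, inputs = [x0], []
    for _ in range(horizon):
        u = actor(states[-1])
        inputs.append(u)
        states.append(f(states[-1], u))
    return states, inputs


# Hypothetical scalar example (assumed for illustration only).
f = lambda x, u: x + u
actor = lambda x: -0.5 * x

states, inputs = actor_rollout(4.0, actor, f, horizon=3)
```

The returned sequences can be passed directly to an MPC solver as an initial guess.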

3. Parallel MPC Architecture and Arbitration

At each timestep $k$, APAC executes two parallel MPC optimizations:

  1. Actor-warm-start MPC: Initialized by the actor roll-out from the current state $x_k$.
  2. Shifted-warm-start MPC: Initialized by shifting the previous solution by one step and appending an actor step at the end of the horizon.
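The shifted warm start in item 2 can be sketched as follows. The helper name `shifted_warm_start` is hypothetical, and the sketch assumes the previous solution is stored as a state list of length horizon + 1 and an input list of length horizon.

```python
def shifted_warm_start(prev_states, prev_inputs, actor, f):
    """Drop the first stage of the previous solution and extend it by one actor step."""
    new_input = actor(prev_states[-1])              # actor evaluated at the old terminal state
    inputs = prev_inputs[1:] + [new_input]
    states = prev_states[1:] + [f(prev_states[-1], new_input)]
    return states, inputs


# Hypothetical scalar example (assumed for illustration only).
f = lambda x, u: x + u
actor = lambda x: -0.5 * x

states, inputs = shifted_warm_start(
    [4.0, 2.0, 1.0, 0.5], [-2.0, -1.0, -0.5], actor, f
)
```

The shifted sequences keep the same length as before, so they can initialize the next solve without resizing.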

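Both solves use the RL critic $\hat{J}_\phi$ as terminal cost. After solving, two open-loop trajectories are obtained:

  • $(x^{\mathrm{a}}, u^{\mathrm{a}})$ (actor-warm-start)
  • $(x^{\mathrm{s}}, u^{\mathrm{s}})$ (shifted-warm-start)

The infinite-horizon cost for each trajectory is approximated over the horizon $N$ as

$$\hat{J}^{\mathrm{a}} = \sum_{i=0}^{N-1} \gamma^i \ell(x^{\mathrm{a}}_i, u^{\mathrm{a}}_i) + \gamma^N \hat{J}_\phi(x^{\mathrm{a}}_N),$$

$$\hat{J}^{\mathrm{s}} = \sum_{i=0}^{N-1} \gamma^i \ell(x^{\mathrm{s}}_i, u^{\mathrm{s}}_i) + \gamma^N \hat{J}_\phi(x^{\mathrm{s}}_N).$$

Arbitration: The first input of the lower-cost trajectory (by $\hat{J}^{\mathrm{a}}$ vs. $\hat{J}^{\mathrm{s}}$) is applied to the plant, and the full trajectory is stored as the next step’s shifted-warm-start candidate.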
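The arbitration rule can be illustrated with a short Python sketch. It assumes each candidate is a `(states, inputs)` pair and that the approximate infinite-horizon cost is the discounted stage cost plus a discounted critic value at the final state; the helper names and the numbers are hypothetical.

```python
def trajectory_cost(states, inputs, stage_cost, critic, gamma):
    """Discounted stage costs plus the discounted critic value at the last state."""
    total, weight = 0.0, 1.0
    for x, u in zip(states, inputs):     # zip stops after the last input
        total += weight * stage_cost(x, u)
        weight *= gamma
    return total + weight * critic(states[-1])


def arbitrate(candidates, stage_cost, critic, gamma):
    """Return the (states, inputs) candidate with the lowest approximate cost."""
    return min(
        candidates,
        key=lambda c: trajectory_cost(c[0], c[1], stage_cost, critic, gamma),
    )


# Hypothetical candidates from two MPC solves (assumed for illustration only).
stage_cost = lambda x, u: x * x + u * u
critic = lambda x: x * x
candidate_a = ([1.0, 0.5, 0.25], [-0.5, -0.25])
candidate_b = ([1.0, 1.0, 1.0], [0.0, 0.0])

best_states, best_inputs = arbitrate([candidate_a, candidate_b], stage_cost, critic, gamma=0.5)
applied_input = best_inputs[0]
```

Only `applied_input` is sent to the plant; the rest of the selected trajectory is kept for the next shifted warm start.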

This parallel architecture roughly doubles the computational effort per timestep (factor ≈2), but it significantly mitigates local minima in the MPC optimization and exploits the complementary behavior of actor–critic and classical solvers (Reiter et al., 2024).
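The effect of solving from two warm starts can be seen on a toy nonconvex problem. In the sketch below, a scalar `objective` with two local minima and a plain gradient-descent `local_solve` stand in for the MPC problem and solver; both are assumptions made for illustration, as is the labeling of which initial guess plays the role of the actor or the shifted warm start.

```python
from concurrent.futures import ThreadPoolExecutor


def objective(u):
    # Hypothetical nonconvex cost with two local minima (assumed for illustration).
    return (u * u - 1.0) ** 2 + 0.3 * u


def local_solve(u_init, steps=500, lr=0.01, eps=1e-6):
    """Stand-in for a local solver: gradient descent with a central-difference gradient."""
    u = u_init
    for _ in range(steps):
        grad = (objective(u + eps) - objective(u - eps)) / (2 * eps)
        u -= lr * grad
    return u


warm_starts = {"actor": -0.8, "shifted": 0.8}

# Dispatch both solves concurrently, then collect the results.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {name: pool.submit(local_solve, u0) for name, u0 in warm_starts.items()}
    solutions = {name: fut.result() for name, fut in futures.items()}

best = min(solutions, key=lambda name: objective(solutions[name]))
```

Each warm start converges to a different local minimum, and comparing the two costs selects the better one. In CPython, threads running pure-Python code do not execute simultaneously, so this shows only the dispatch pattern; a native solver that releases the interpreter lock, or a process pool, would be needed for an actual speed-up.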

4. Performance Guarantees

APAC provides theoretical guarantees under a uniform Bellman error bound for the critic:

[...]

With this bound, the closed-loop cost of APAC is related to the closed-loop cost of the pure actor by:

[...]

Relative to the optimal value $J^*$, the following suboptimality bound holds for the APAC controller:

[...]

If the critic's Bellman error tends to zero and the horizon tends to infinity, APAC attains optimality. These bounds hold even when MPC optimizations are not globally optimal, as long as the parallel candidates are solved and evaluated as described.

5. Online Execution Flow

APAC’s online controller proceeds as follows:

  1. Initialization: At $k = 0$, initialize the shifted-warm-start guess (e.g., zeros or an actor roll-out).
  2. Measurement: Observe the current state $x_k$.
  3. Actor Roll-out: Simulate the actor over the MPC horizon to form the state and input sequences of the actor-warm-start guess.
  4. Shifted Guess: Shift the previous step's optimal input sequence and append a new actor step for the final stage.
  5. Parallel MPC Solves: Solve two MPC instances (one from the actor warm start, one from the shifted warm start).
  6. Cost Evaluation: Compute the approximate infinite-horizon costs of the two resulting trajectories.
  7. Arbitration: Apply the first control of the trajectory with lower cost; use the full selected trajectory as next step’s shifted guess.
  8. Iteration: Repeat at next timestep.

This control policy is modular, allowing the use of any actor–critic RL architecture and any MPC solver.
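The steps above can be put together in a compact closed-loop sketch. Everything specific here is an assumption for illustration: the scalar dynamics, the hand-written `actor` and `critic` that stand in for trained networks, the input bounds, and `local_solve`, a projected gradient descent that stands in for a real NLP solver such as an RTI scheme.

```python
GAMMA, HORIZON = 0.95, 5

def f(x, u):                       # assumed scalar dynamics
    return x + 0.1 * u

def stage_cost(x, u):
    return x * x + 0.01 * u * u

def actor(x):                      # stand-in for a trained actor, inputs in [-1, 1]
    return max(-1.0, min(1.0, -2.0 * x))

def critic(x):                     # stand-in for a trained critic
    return 5.0 * x * x

def cost(x0, inputs):
    """Finite-horizon cost with the critic as terminal cost."""
    x, total, w = x0, 0.0, 1.0
    for u in inputs:
        total += w * stage_cost(x, u)
        x, w = f(x, u), w * GAMMA
    return total + w * critic(x)

def rollout(x0, n):
    """Input sequence obtained by simulating the actor for n steps."""
    x, inputs = x0, []
    for _ in range(n):
        inputs.append(actor(x))
        x = f(x, inputs[-1])
    return inputs

def local_solve(x0, guess, iters=50, lr=0.5, eps=1e-6):
    """Projected gradient descent with finite-difference gradients."""
    u = list(guess)
    for _ in range(iters):
        grad = []
        for i in range(len(u)):
            up, dn = list(u), list(u)
            up[i] += eps
            dn[i] -= eps
            grad.append((cost(x0, up) - cost(x0, dn)) / (2 * eps))
        u = [max(-1.0, min(1.0, ui - lr * gi)) for ui, gi in zip(u, grad)]
    return u

def apac_step(x, prev_inputs):
    actor_guess = rollout(x, HORIZON)                 # step 3
    x_end = x
    for u in prev_inputs[1:]:                         # predict the shifted terminal state
        x_end = f(x_end, u)
    shifted_guess = prev_inputs[1:] + [actor(x_end)]  # step 4
    candidates = [local_solve(x, actor_guess),        # step 5 (sequential here)
                  local_solve(x, shifted_guess)]
    best = min(candidates, key=lambda u: cost(x, u))  # steps 6 and 7
    return best[0], best

x, prev = 1.0, [0.0] * HORIZON                        # step 1
trajectory = [x]
for k in range(30):                                   # steps 2 and 8
    u0, prev = apac_step(x, prev)
    x = f(x, u0)
    trajectory.append(x)
```

In this convex toy problem both warm starts lead to similar solutions, so the sketch mainly shows the control flow; the benefit of arbitration appears when the MPC problem has several local minima.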

6. Implementation, Hyperparameters, and Case Studies

Key APAC hyperparameters include the prediction horizon $N$ (20–60 in cited applications), the discount factor $\gamma$ (0.95–0.99), the Bellman error $\epsilon$ (tuned via RL training), and the MPC solver configuration (tolerances, iterations).
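One way to keep these settings together is a small validated configuration object. The class name `ApacConfig`, the field names, and the default solver tolerance and iteration limit are assumptions made here; only the horizon and discount ranges come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApacConfig:
    horizon: int = 40             # text cites 20-60 in applications
    gamma: float = 0.99           # text cites 0.95-0.99
    solver_tol: float = 1e-6      # assumed default, not from the source
    solver_max_iter: int = 50     # assumed default, not from the source

    def __post_init__(self):
        if self.horizon < 1:
            raise ValueError("horizon must be at least 1")
        if not 0.0 < self.gamma <= 1.0:
            raise ValueError("gamma must lie in (0, 1]")
        if self.solver_tol <= 0.0 or self.solver_max_iter < 1:
            raise ValueError("solver settings must be positive")


config = ApacConfig(horizon=20, gamma=0.95)
```

Validating at construction time catches an out-of-range discount factor or horizon before any solver is set up.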

Toy Problem ("Snow Hill"):

  • Double integrator dynamics with position-dependent deceleration.
  • Stage cost: [...] with [...].
  • Compared pure SAC, pure MPC, and APAC.
  • APAC reached the goal from all initial states, whereas MPC alone was often trapped and SAC was suboptimal. Closed-loop cost reduction over SAC was approximately [...].

Autonomous Driving Overtaking:

  • Five-state single-track Frenet dynamics, constraints on velocity, acceleration, and collision avoidance.
  • RL actors from SAC and PPO; MPC horizon [...]; RTI solver in acados.
  • APAC–RTI with actor warm start every [...] steps.
  • APAC–RTI outperformed pure MPC ([...]–[...] cost reduction for [...]) and both SAC and PPO, in particular by escaping local MPC minima.

Computation times (mean/max, ms) for [...]:

  • RL–SAC policy evaluation: [...] ([...])
  • MPC: [...] ([...])
  • APAC–RTI (SAC): [...] ([...])

APAC supports arbitrary actor–critic and MPC solver combinations, trading CPU cost for improved local and global control performance. Longer horizons and an improved critic fit (smaller Bellman error) tighten the suboptimality bounds but increase the computational burden.

7. Modularity and Practical Remarks

The APAC framework is architecturally modular: any actor–critic pair (e.g., SAC, PPO) and any MPC solver (e.g., real-time iteration, sequential quadratic programming) can be used without modification. The method does not require globally optimal MPC solves; local solutions suffice for the performance guarantee, as arbitration over multiple candidate trajectories mitigates poor local minima. In real-time settings, limited-iteration RTI-based APAC controllers balance solution quality and computational cost.

APAC addresses the limitations of both RL and MPC: the actor–critic augments MPC with global information, while MPC leverages constraints and system models. This integration has been demonstrated to yield superior empirical and theoretical performance on a range of nonlinear, constrained control problems (Reiter et al., 2024).
