Information Theoretic MPC
- Information Theoretic MPC is a control methodology that integrates information theory concepts, specifically KL divergence and free-energy minimization, to balance cost and uncertainty.
- It employs sampling-based algorithms such as MPPI and MPQ that iteratively update control actions within a receding-horizon framework, keeping the optimization tractable under real-time constraints.
- Empirical results demonstrate that IT-MPC enhances sample efficiency, mitigates model bias, and adapts dynamically to nonlinear, stochastic dynamical systems.
Information Theoretic Model Predictive Control (IT-MPC) constitutes a class of model predictive control algorithms that explicitly embed information-theoretic regularization—most commonly Kullback-Leibler (KL) divergence terms or free-energy objectives—within the control optimization, facilitating robust, data-efficient closed-loop policies for complex and uncertain dynamical systems. IT-MPC provides a principled bridge between stochastic optimal control, entropy-regularized reinforcement learning (RL), and sampling-based algorithmic implementations, and has been empirically validated across high-performance robotic, navigation, and system identification benchmarks.
1. Mathematical Foundation: Free Energy, KL Divergence, and Policy Distribution
The core mathematical structure of IT-MPC is the minimization of a composite objective over distributions of action (or control) sequences, balancing expected cumulative cost against distributional divergence from a prescribed reference policy. For a dynamical system with current state $x_t$ and open-loop action sequence $A = (a_t, \dots, a_{t+H-1})$ drawn from a distribution $\pi$, the generic IT-MPC finite-horizon objective is

$$
J(\pi) = \mathbb{E}_{A \sim \pi}\!\left[\sum_{h=0}^{H-1} c(x_{t+h}, a_{t+h}) + c_f(x_{t+H})\right] + \lambda\, \mathrm{KL}\left(\pi \,\|\, \bar{\pi}\right),
$$

with dynamics $x_{t+h+1} = f(x_{t+h}, a_{t+h})$ (possibly stochastic), running cost $c$, terminal cost $c_f$, reference distribution $\bar{\pi}$, and temperature $\lambda > 0$. The optimal distribution minimizing this functional, obtained via Lagrangian duality and the calculus of variations, can be written in Boltzmann (exponentially tilted) form

$$
\pi^{*}(A) = \frac{1}{\eta}\, \bar{\pi}(A)\, \exp\!\left(-\tfrac{1}{\lambda} C(A)\right),
\qquad
C(A) = \sum_{h=0}^{H-1} c(x_{t+h}, a_{t+h}) + c_f(x_{t+H}),
$$

where $\eta = \mathbb{E}_{A \sim \bar{\pi}}\!\left[\exp\!\left(-\tfrac{1}{\lambda} C(A)\right)\right]$ is the normalization constant ("partition function"). This structure can be interpreted as performing path integral control, with the temperature parameter $\lambda$ controlling the cost-to-uncertainty trade-off (Williams et al., 2017, Bhardwaj et al., 2019).
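The variational argument behind this result fits in two lines; the following is a sketch of the standard derivation in the notation above, not a full proof.

```latex
% Sketch: completing the KL divergence shows why the exponentially tilted
% distribution is optimal. For any \pi absolutely continuous w.r.t. \bar{\pi}:
\begin{aligned}
J(\pi) &= \mathbb{E}_{A\sim\pi}\big[C(A)\big] + \lambda\,\mathrm{KL}(\pi \,\|\, \bar{\pi}) \\
       &= \lambda\,\mathrm{KL}\!\Big(\pi \,\Big\|\, \tfrac{1}{\eta}\,\bar{\pi}\, e^{-C/\lambda}\Big) - \lambda \log \eta .
\end{aligned}
% The KL term is nonnegative and vanishes exactly at \pi^{*} \propto \bar{\pi}\, e^{-C/\lambda},
% so \pi^{*} is the minimizer and the optimal value is the free energy
% F(x_t) = -\lambda \log \eta = -\lambda \log \mathbb{E}_{A\sim\bar{\pi}}\big[e^{-C(A)/\lambda}\big].
```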
2. Relationship to Entropy-Regularized Reinforcement Learning
A central theoretical contribution is the equivalence between KL-regularized finite-horizon control and entropy-regularized RL in infinite-horizon Markov decision processes (MDPs). In this RL formulation, the control objective becomes

$$
\pi^{*} = \arg\min_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\Big( c(x_t, a_t) + \lambda\, \mathrm{KL}\big(\pi(\cdot \mid x_t) \,\|\, \bar{\pi}(\cdot \mid x_t)\big)\Big)\right],
$$

where $\lambda$ is an entropy (KL) temperature and $\gamma \in (0,1)$ a discount factor. The optimal per-step action distribution is

$$
\pi^{*}(a \mid x) = \frac{\bar{\pi}(a \mid x)\, \exp\!\left(-\tfrac{1}{\lambda} Q^{*}(x,a)\right)}{\mathbb{E}_{a' \sim \bar{\pi}(\cdot \mid x)}\!\left[\exp\!\left(-\tfrac{1}{\lambda} Q^{*}(x,a')\right)\right]},
$$

with $Q^{*}$ the solution to the soft Bellman optimality operator

$$
Q^{*}(x,a) = c(x,a) + \gamma\, \mathbb{E}_{x' \sim P(\cdot \mid x,a)}\!\left[-\lambda \log \mathbb{E}_{a' \sim \bar{\pi}(\cdot \mid x')}\!\left[\exp\!\left(-\tfrac{1}{\lambda} Q^{*}(x',a')\right)\right]\right].
$$

Implementation via soft Q-learning or fitted Q-iteration employs this structure to drive off-policy updates from sampled or simulated experience (Bhardwaj et al., 2019).
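As a concrete, deliberately minimal illustration of the soft Bellman operator, the sketch below runs tabular soft Q-iteration on a small synthetic MDP with a uniform prior policy; the cost matrix, transition tensor, and hyperparameter values are placeholders, not a benchmark from the cited work.

```python
import numpy as np

def soft_q_iteration(costs, P, lam=1.0, gamma=0.95, iters=500):
    """Tabular soft (KL-regularized) Q-iteration with a uniform prior policy.

    costs: (S, A) per-step cost c(x, a)
    P:     (S, A, S) transition probabilities P(x' | x, a)
    lam:   temperature lambda; gamma: discount factor
    """
    S, A = costs.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        # Soft value: V(x) = -lam * log E_{a ~ uniform}[exp(-Q(x, a) / lam)]
        V = -lam * np.log(np.mean(np.exp(-Q / lam), axis=1))
        # Soft Bellman backup: Q(x, a) = c(x, a) + gamma * E_{x'}[V(x')]
        Q = costs + gamma * P @ V
    # Optimal per-step policy: pi*(a | x) ∝ prior(a | x) * exp(-Q(x, a) / lam)
    logits = -Q / lam
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    return Q, pi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S_n, A_n = 5, 3                                    # toy problem sizes (placeholders)
    costs = rng.uniform(0, 1, (S_n, A_n))
    P = rng.dirichlet(np.ones(S_n), size=(S_n, A_n))   # random stochastic transitions
    Q, pi = soft_q_iteration(costs, P)
    print(Q.round(2), pi.round(2), sep="\n")
```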
3. Sampling-Based Iterative Control Algorithms
In practical large-scale systems, the optimal distribution $\pi^{*}$ and its associated expectations are intractable, necessitating sampling-based approximations. The Model Predictive Path Integral (MPPI) algorithm, a canonical IT-MPC instance, uses a parametric Gaussian proposal over open-loop action sequences, with iterative importance weighting to update the control mean $\mu = (\mu_0, \dots, \mu_{H-1})$. The prototypical update step is

$$
\mu_h \leftarrow \mu_h + \sum_{i=1}^{N} w_i\, \varepsilon_h^{i}, \qquad h = 0, \dots, H-1,
$$

where $\varepsilon^{i} = (\varepsilon_0^{i}, \dots, \varepsilon_{H-1}^{i})$ with $\varepsilon_h^{i} \sim \mathcal{N}(0, \Sigma)$ are the sampled perturbations of the $i$-th rollout, and

$$
w_i = \frac{\exp\!\left(-\tfrac{1}{\lambda}\big(C(A^{i}) - \beta\big)\right)}{\sum_{j=1}^{N} \exp\!\left(-\tfrac{1}{\lambda}\big(C(A^{j}) - \beta\big)\right)},
\qquad \beta = \min_{j} C(A^{j}),
$$

with $C(A^{i})$ the simulated cumulative (plus terminal) cost of the perturbed sequence $A^{i} = \mu + \varepsilon^{i}$; the baseline $\beta$ is subtracted for numerical stability.
Only the first component of the updated mean sequence is applied to the real system; the process is repeated at each timestep, constituting the receding-horizon MPC structure (Bhardwaj et al., 2019, Williams et al., 2017).
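A minimal NumPy sketch of one such MPPI iteration is given below, assuming user-supplied `dynamics`, `cost`, and `terminal_cost` callables that operate on batches of states; these callables and all hyperparameter values are illustrative placeholders rather than the configuration used in the cited papers.

```python
import numpy as np

def mppi_step(mu, dynamics, cost, terminal_cost, x0,
              n_samples=256, sigma=0.3, lam=1.0, rng=None):
    """One importance-weighted MPPI update of the mean action sequence mu of shape (H, dim_a)."""
    rng = rng or np.random.default_rng()
    H, dim_a = mu.shape
    eps = rng.normal(0.0, sigma, size=(n_samples, H, dim_a))   # Gaussian perturbations
    actions = mu[None] + eps                                    # perturbed open-loop sequences

    # Roll out each perturbed sequence through the (possibly biased) model.
    costs = np.zeros(n_samples)
    x = np.repeat(x0[None], n_samples, axis=0)
    for h in range(H):
        costs += cost(x, actions[:, h])
        x = dynamics(x, actions[:, h])
    costs += terminal_cost(x)

    # Boltzmann importance weights with a min-cost baseline for numerical stability.
    beta = costs.min()
    w = np.exp(-(costs - beta) / lam)
    w /= w.sum()

    # Weighted update of the mean, then receding-horizon shift.
    mu_new = mu + np.einsum("i,ihd->hd", w, eps)
    a_exec = mu_new[0]                       # only the first action is executed
    mu_shifted = np.roll(mu_new, -1, axis=0)
    mu_shifted[-1] = 0.0                     # simple re-initialization of the tail
    return a_exec, mu_shifted
```

A complete controller would call `mppi_step` once per control cycle, executing `a_exec` on the system and feeding back the measured state as the next `x0`.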
4. Model Predictive Q-Learning and Bias Mitigation
The Model Predictive Q-Learning (MPQ) algorithm addresses the core limitations of short-horizon, biased-model MPC by integrating online planning with a model-based simulator and offline, model-free Q-learning from real system interactions. The procedure is summarized by the following sequence:
- At time $t$, run MPPI iterations with the current soft Q-function $Q_\theta$ as the terminal value at horizon $H$.
- Execute the first resulting action on the real system and store the transition in a replay buffer.
- Periodically, sample minibatches from the replay buffer, form soft Bellman targets from the stored real transitions (rerunning MPPI for the required free-energy estimates), and update $\theta$ by minimizing the mean-squared Bellman residual.
This approach overcomes compounding errors of short-horizon, biased-model planning by incorporating a learned global Q-function as terminal value, effectively extending the open-loop planning horizon to infinity and providing model error correction (Bhardwaj et al., 2019).
The full algorithmic pseudocode is:
Algorithm 1: Model Predictive Q-Learning (MPQ)
Input: biased simulator P, initial Q-network Q_θ, replay buffer D
Hyperparameters: episodes N, episode length T, planning horizon H, update interval U, batch size B
for episode = 1 to N do
    for t = 1 to T do
        Use MPPI with terminal Q = Q_θ to plan H-step actions (aₜ, …, aₜ₊H₋₁)
        Execute aₜ on the real system, observe cost cₜ and next state xₜ₊₁
        Store (xₜ, aₜ, cₜ, xₜ₊₁) in D
    if episode mod U == 0 then
        Sample M minibatches of size B from D
        For each (x, a, c, x′), compute the free-energy target using MPPI from x′ and the soft Bellman equation
        Perform a gradient step on L(θ) = E[(y − Q_θ(x, a))²]
return final θ
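To make the Q-update concrete, the sketch below computes soft Bellman targets for a minibatch of stored transitions using a Monte Carlo free-energy estimate at the next state; `simulate_cost` is a hypothetical helper standing in for the MPPI rollouts in the pseudocode above (H-step simulated costs with Q_θ as terminal value), and the temperature and discount values are placeholders.

```python
import numpy as np

def mppi_free_energy(x_next, simulate_cost, lam=1.0, n_samples=64):
    """Monte Carlo estimate of the free energy F(x') = -lam * log E[exp(-C / lam)].

    `simulate_cost(x, n)` is assumed to return n sampled H-step rollout costs from x
    under the (biased) simulator, each including Q_theta at the horizon as terminal value.
    """
    C = np.asarray(simulate_cost(x_next, n_samples))       # (n_samples,) rollout costs
    m = (-C / lam).max()                                    # log-sum-exp stabilization
    return -lam * (m + np.log(np.mean(np.exp(-C / lam - m))))

def soft_bellman_targets(batch, simulate_cost, lam=1.0, gamma=0.99):
    """Targets y = c + gamma * F(x') for a minibatch of real transitions (x, a, c, x')."""
    return np.array([c + gamma * mppi_free_energy(x_next, simulate_cost, lam)
                     for (_x, _a, c, x_next) in batch])

# The Q-network update then takes a gradient step on the mean-squared Bellman residual,
# L(theta) = mean((y - Q_theta(x, a)) ** 2), over the sampled minibatch.
```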
5. Empirical Validation and Numerical Performance
The MPQ framework has been validated on control tasks including Pendulum Swing-up, Ball-in-Cup with sparse rewards, FetchPush, and Franka Drawer Opening. Demonstrated benefits include:
- Strong sample efficiency: MPQ attains effective policies with far fewer real-system interactions than model-free soft Q-learning.
- Improved robustness: MPQ outperforms stand-alone MPPI even when MPPI has access to the true dynamics, indicating effective correction of model bias and robustness to sparse reward signals.
- Automatic horizon adaptation: The learned terminal Q-function dynamically extends the effective planning horizon based on task requirements and system complexity (Bhardwaj et al., 2019).
6. Theoretical Implications and Algorithmic Properties
Key theoretical contributions consolidated by IT-MPC research include:
- Formal equivalence between information-theoretic MPC (free-energy minimization) and entropy-regularized RL (soft or KL-regularized value iteration).
- Derivation of the H-step Boltzmann optimal controller and its infinite-horizon extension via the MPPI update rule.
- Empirical evidence that information-regularized planning with learned soft Q-functions yields robustness to model-bias and improved data efficiency.
Entropy regularization (the KL term) is crucial in practice: it limits overcommitment to flawed models and prevents the planner from catastrophically overfitting to simulator discrepancies.
7. Context within Broader Literature
The IT-MPC methodology subsumes various earlier approaches, including path integral control, soft Q-learning, and the Model Predictive Path Integral (MPPI) algorithm. It provides a unified perspective on control under uncertainty by combining sampling-based and entropy-regularized formulations, systematically leveraging both (1) limited-horizon, model-based planning and (2) model-free, value-based correction from real transitions (Williams et al., 2017, Bhardwaj et al., 2019).
The approach is distinguished by its capability to handle both continuous and discrete actions, arbitrarily nonlinear dynamics, and catastrophic model mismatch, operating under real-time computational constraints via massive parallelization (e.g., GPU-based rollouts). Its applications include aggressive driving, robotic manipulation, and complex stochastic control with compound Poisson noise and non-Gaussian disturbances.
In summary, Information Theoretic Model Predictive Control unites concepts from control theory, reinforcement learning, and statistical inference to yield robust, scalable, and sample-efficient closed-loop control for complex systems. The paradigm's theoretical underpinning and empirical success have positioned IT-MPC as a fundamental architecture in modern learning-based optimal control (Williams et al., 2017, Bhardwaj et al., 2019).