
Information Theoretic MPC

Updated 11 November 2025
  • Information Theoretic MPC is a control methodology that integrates information theory concepts, specifically KL divergence and free-energy minimization, to balance cost and uncertainty.
  • It employs sampling-based algorithms like MPPI and MPQ to iteratively update control actions within a receding-horizon framework, ensuring robustness and efficiency.
  • Empirical results demonstrate that IT-MPC enhances sample efficiency, mitigates model bias, and adapts dynamically to nonlinear, stochastic dynamical systems.

Information Theoretic Model Predictive Control (IT-MPC) constitutes a class of model predictive control algorithms that explicitly embed information-theoretic regularization—most commonly Kullback-Leibler (KL) divergence terms or free-energy objectives—within the control optimization, facilitating robust, data-efficient closed-loop policies for complex and uncertain dynamical systems. IT-MPC provides a principled bridge between stochastic optimal control, entropy-regularized reinforcement learning (RL), and sampling-based algorithmic implementations, and has been empirically validated across high-performance robotic, navigation, and system identification benchmarks.

1. Mathematical Foundation: Free Energy, KL Divergence, and Policy Distribution

The core mathematical structure of IT-MPC is the minimization of a composite objective function over action (or control) sequences, balancing expected cumulative cost and distributional divergence from a prescribed reference policy. For a dynamical system with current state $x_0$ and open-loop action sequence $U = (u_0, \ldots, u_{H-1})$, the generic IT-MPC finite-horizon objective is

$$J(\pi) = \mathbb{E}_{U\sim\pi(\cdot|x_0)}\left[\sum_{t=0}^{H-1} \ell(x_t, u_t) \right] + \frac{1}{\beta}\, \mathrm{KL}\left( \pi(U|x_0) \,\Vert\, \pi_{ref}(U|x_0) \right),$$

with $x_{t+1} = f(x_t, u_t)$ (possibly stochastic). The optimal distribution minimizing this functional, obtained via Lagrangian duality and the calculus of variations, can be written in Boltzmann (exponentially tilted) form:

$$\pi^*(U|x_0) = \frac{1}{\eta}\, \pi_{ref}(U|x_0) \exp\left(-\beta \sum_{t=0}^{H-1} \ell(x_t, u_t)\right),$$

where $\eta$ is the normalization constant ("partition function"). This structure can be interpreted as performing path integral control with a temperature parameter $\beta$ controlling the cost-to-uncertainty trade-off (Williams et al., 2017, Bhardwaj et al., 2019).
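To make the weighting concrete, the following minimal NumPy sketch (illustrative only; the function and variable names are not taken from the cited papers) computes the exponentially tilted weights and a Monte Carlo estimate of the associated free energy from trajectory costs sampled under the reference policy.

import numpy as np

def tilted_weights(costs, beta):
    # costs: shape (N,), total trajectory costs sum_t l(x_t, u_t) for N samples
    #        drawn from the reference policy pi_ref; beta: inverse temperature.
    shifted = -beta * (costs - costs.min())      # shift for numerical stability
    unnormalized = np.exp(shifted)
    eta = unnormalized.sum()                     # estimate of the partition function
    weights = unnormalized / eta                 # Boltzmann weights w^(n)
    # Monte Carlo estimate of the free energy -(1/beta) * log E_pi_ref[exp(-beta * cost)]
    free_energy = costs.min() - np.log(eta / len(costs)) / beta
    return weights, free_energy

# Example usage with 1000 random trajectory costs and beta = 2.0
rng = np.random.default_rng(0)
w, F = tilted_weights(rng.uniform(0.0, 10.0, size=1000), beta=2.0)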

2. Relationship to Entropy-Regularized Reinforcement Learning

A central theoretical contribution is the equivalence between KL-regularized finite-horizon control and entropy-regularized RL in infinite-horizon Markov decision processes (MDPs). In this RL formulation, the control objective becomes

$$\min_{\pi}\, \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \big( c(x_t, u_t) + \lambda\, \mathrm{KL}(\pi(\cdot|x_t)\,\Vert\, \pi_{ref}(\cdot|x_t)) \big) \right],$$

where $\lambda$ is an entropy temperature. The optimal per-step action distribution is

$$\pi^*(u|x) \propto \pi_{ref}(u|x) \exp\left( -Q^*(x, u)/\lambda \right),$$

with $Q^*$ the fixed point of the soft Bellman optimality operator:

$$Q^*(x, u) = c(x, u) + \gamma\, \mathbb{E}_{x'\mid x, u} \left[ -\lambda \log \int \pi_{ref}(u'|x') \exp\left( -Q^*(x', u')/\lambda \right) du' \right].$$

Implementation via soft Q-learning or fitted Q-iteration employs this structure to drive off-policy updates from sampled or simulated experience (Bhardwaj et al., 2019).
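As a hedged illustration of this backup, the sketch below estimates the soft value $-\lambda \log \int \pi_{ref}(u'|x') \exp(-Q^*(x',u')/\lambda)\,du'$ by Monte Carlo sampling from the reference policy; q_function and ref_sampler are placeholder callables assumed for the example, not any specific library API.

import numpy as np

def soft_value(q_function, x_next, ref_sampler, lam, n_samples=64):
    # Monte Carlo estimate of V(x') = -lam * log integral pi_ref(u'|x') exp(-Q(x',u')/lam) du'
    # using actions drawn from the reference policy. Assumed callables:
    #   q_function(state, action) -> scalar Q estimate
    #   ref_sampler(state, n) -> array of n candidate actions
    actions = ref_sampler(x_next, n_samples)
    q_vals = np.array([q_function(x_next, a) for a in actions])
    m = (-q_vals / lam).max()                              # shift for numerical stability
    log_mean_exp = m + np.log(np.mean(np.exp(-q_vals / lam - m)))
    return -lam * log_mean_exp

def soft_bellman_target(c, x_next, q_function, ref_sampler, lam, gamma):
    # Fitted soft Q-iteration target y = c(x, u) + gamma * V_soft(x').
    return c + gamma * soft_value(q_function, x_next, ref_sampler, lam)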

3. Sampling-Based Iterative Control Algorithms

In practical large-scale systems, $\pi^*$ and the associated expectations are intractable, necessitating sampling-based approximations. The Model Predictive Path Integral (MPPI) algorithm, a canonical IT-MPC instance, uses a parametric Gaussian proposal over open-loop action sequences, with iterative importance weighting to update the control mean. The prototypical update step is

$$\bar{u}_t^{\mathrm{new}} = \bar{u}_t + \alpha \sum_{n=1}^{N} w^{(n)} \epsilon_t^{(n)},$$

where $\epsilon^{(n)} \sim \mathcal{N}(0, \Sigma)$, and

$$w^{(n)} \propto \exp\left\{ -\frac{1}{\lambda} \left[ \sum_{l=0}^{H-2} c(x_{t+l}, \bar{u}_{t+l} + \epsilon_{t+l}^{(n)}) + \lambda \cdot \mathrm{penalty} + Q^*(x_{t+H-1}, \bar{u}_{t+H-1} + \epsilon_{t+H-1}^{(n)}) \right] \right\}.$$

Only the first component of the updated mean sequence is applied to the real system; the process is repeated at each timestep, constituting the receding-horizon MPC structure (Bhardwaj et al., 2019, Williams et al., 2017).
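A minimal sketch of one receding-horizon MPPI step is given below; dynamics, cost, and terminal_value are assumed user-supplied callables (the terminal value could, for example, be a learned soft Q-function), and the rollout loop is written serially for clarity rather than in the GPU-parallel form used in practice.

import numpy as np

def mppi_step(x0, u_bar, dynamics, cost, terminal_value, lam, sigma,
              n_samples=512, alpha=1.0, rng=None):
    # One MPPI update of the mean action sequence u_bar (shape H x du), followed by
    # receding-horizon execution: return the first action and the shifted warm start.
    # Assumed callables: dynamics(x, u) -> x_next, cost(x, u) -> scalar,
    # terminal_value(x) -> scalar terminal estimate.
    rng = np.random.default_rng() if rng is None else rng
    H, du = u_bar.shape
    eps = rng.normal(0.0, sigma, size=(n_samples, H, du))   # Gaussian perturbations
    total = np.zeros(n_samples)
    for n in range(n_samples):                               # serial rollouts for clarity
        x = x0
        for t in range(H):
            u = u_bar[t] + eps[n, t]
            total[n] += cost(x, u)
            x = dynamics(x, u)
        total[n] += terminal_value(x)                        # terminal value at the horizon
    w = np.exp(-(total - total.min()) / lam)                 # exponentially tilted weights
    w /= w.sum()
    u_new = u_bar + alpha * np.einsum('n,nhd->hd', w, eps)   # importance-weighted mean update
    u_exec = u_new[0]                                        # apply only the first action
    u_shifted = np.vstack([u_new[1:], u_new[-1:]])           # warm start for the next timestep
    return u_exec, u_shifted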

4. Model Predictive Q-Learning and Bias Mitigation

The Model Predictive Q-Learning (MPQ) algorithm addresses the core limitations of short-horizon, biased-model MPC by integrating online planning with a model-based simulator and offline, model-free Q-learning from real system interactions. The procedure is summarized by the following sequence:

  1. At time $t$, run $K$ MPPI iterations with the current soft Q-function $Q_\theta$ as the terminal value at horizon $H-1$.
  2. Execute the first resulting action on the real system, storing the transition.
  3. Periodically, sample minibatches from the replay buffer, form soft Bellman targets from the stored real transitions (rerunning MPPI from the sampled next states to obtain the required free-energy estimates), and update $Q_\theta$ by minimizing the mean-squared Bellman residual.

This approach overcomes compounding errors of short-horizon, biased-model planning by incorporating a learned global Q-function as terminal value, effectively extending the open-loop planning horizon to infinity and providing model error correction (Bhardwaj et al., 2019).

The full algorithmic pseudocode is:

Algorithm 1: Model Predictive Q-Learning (MPQ)
Input:  biased simulator P, initial Q-network Q_θ, replay buffer D
Hyperparameters: episodes N, episode length T, planning horizon H, update interval U, number of minibatches M, batch size B
for episode = 1 to N do
    for t = 1 to T do
        Use MPPI with terminal value Q_θ to plan the H-step action sequence (aₜ,…,aₜ₊H₋₁)
        Execute aₜ on the real system, observe cost cₜ and next state xₜ₊₁
        Store (xₜ, aₜ, cₜ, xₜ₊₁) in D
    if episode mod U == 0 then
        Sample M minibatches of size B from D
        For each (x, a, c, x′), compute the soft Bellman target y using a free-energy estimate at x′ obtained by rerunning MPPI (soft Bellman equation of Section 2)
        Perform a gradient step on L(θ) = E[(y − Q_θ(x, a))²]
return final θ
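For the gradient step at the end of Algorithm 1, the minimal sketch below performs one mean-squared Bellman residual update; a linear-in-features Q is used purely as a stand-in for the Q-network $Q_\theta$, and the features callable and batch structure are illustrative assumptions.

import numpy as np

def q_update(theta, features, batch, targets, lr=1e-3):
    # One gradient step on L(theta) = E[(y - Q_theta(x, a))^2], where
    # Q_theta(x, a) = theta . features(x, a) is a linear stand-in for the Q-network.
    # `features(x, a) -> feature vector` and the (x, a) batch entries are assumptions.
    grad = np.zeros_like(theta)
    for (x, a), y in zip(batch, targets):
        phi = features(x, a)
        residual = y - theta @ phi
        grad += -2.0 * residual * phi          # gradient of the squared Bellman residual
    grad /= len(batch)
    return theta - lr * grad                   # gradient-descent update of theta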

5. Empirical Validation and Numerical Performance

The MPQ framework has been validated on control tasks including Pendulum Swing-up, Ball-in-Cup with sparse rewards, FetchPush, and Franka Drawer Opening. Demonstrated benefits include:

  • Strong sample efficiency: MPQ attains effective policies with far fewer real-system interactions than model-free soft Q-learning.
  • Improved robustness: MPQ outperforms stand-alone MPPI even when the latter has access to the true dynamics, indicating effective correction of model bias and robustness to sparse reward signals.
  • Automatic horizon adaptation: The learned terminal Q-function dynamically extends the effective planning horizon based on task requirements and system complexity (Bhardwaj et al., 2019).

6. Theoretical Implications and Algorithmic Properties

Key theoretical contributions consolidated by IT-MPC research include:

  • Formal equivalence between information-theoretic MPC (free-energy minimization) and entropy-regularized RL (soft or KL-regularized value iteration).
  • Derivation of the H-step Boltzmann optimal controller and its infinite-horizon extension via the MPPI update rule.
  • Empirical evidence that information-regularized planning with learned soft Q-functions yields robustness to model-bias and improved data efficiency.

Entropy regularization (the KL term) is crucial: it limits overcommitment to a flawed model and prevents the planner from overfitting to discrepancies between the simulator and the real system.

7. Context within Broader Literature

The IT-MPC methodology subsumes several earlier approaches, including path integral control, soft Q-learning, and the model predictive path integral (MPPI) algorithm. It provides a unified perspective on control under uncertainty, connecting sampling-based and entropy-regularized formulations and systematically leveraging both (1) limited-horizon, model-based planning and (2) model-free, value-based correction from real transitions (Williams et al., 2017, Bhardwaj et al., 2019).

The approach is distinguished by its capability to handle both continuous and discrete actions, arbitrarily nonlinear dynamics, and catastrophic model mismatch, operating under real-time computational constraints via massive parallelization (e.g., GPU-based rollouts). Its applications include aggressive driving, robotic manipulation, and complex stochastic control with compound Poisson noise and non-Gaussian disturbances.


In summary, Information Theoretic Model Predictive Control unites concepts from control theory, reinforcement learning, and statistical inference to yield robust, scalable, and sample-efficient closed-loop control for complex systems. The paradigm's theoretical underpinning and empirical success have positioned IT-MPC as a fundamental architecture in modern learning-based optimal control (Williams et al., 2017, Bhardwaj et al., 2019).
