
Bootstrapped MPC (BMPC)

Updated 18 March 2026
  • BMPC is a control framework that bootstraps traditional MPC with offline-trained generative models and policy refinement to optimize high-dimensional continuous control tasks.
  • It leverages expert planning outputs to train neural policies and value functions, significantly reducing required rollouts and computational cost.
  • The approach enhances robustness and adaptability in real-world applications by combining multi-modal exploration with efficient planning and lazy reanalysis.

Bootstrapped Model Predictive Control (BMPC) is a class of algorithms that leverages the complementary strengths of Model Predictive Control (MPC) and learned neural policies. BMPC encompasses a family of techniques that "bootstrap" planning and learning—either by amortizing or imitating the outputs of online sampling-based MPC with offline generative or policy models, or by iterating between expert planning and policy refinement. The objectives are improved data efficiency, planning robustness, and adaptability for high-dimensional, continuous control, including contact-rich and real-world tasks (Brudermüller et al., 16 Oct 2025, Wang et al., 24 Mar 2025).

1. Theoretical Foundations

MPC solves a finite-horizon optimal control problem by repeatedly optimizing control sequences under a learned or known model. With state $\mathbf{x}_0 \in \mathbb{R}^n$, the objective is to minimize the cumulative stage cost plus a terminal cost,

$$\min_{\mathbf{u}_{0:T}} \; \phi(\mathbf{x}_{T+1}) + \sum_{\tau=0}^{T} \ell(\mathbf{x}_\tau, \mathbf{u}_\tau)$$

subject to $\mathbf{x}_{\tau+1} = f(\mathbf{x}_\tau, \mathbf{u}_\tau)$ and $\mathbf{x}_0 = \mathbf{x}_{\mathrm{init}}$.
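As a concrete reference for the objective above, here is a minimal sketch of evaluating this cost for one candidate action sequence; `f`, `ell`, and `phi` are assumed callables standing in for the (learned or known) model, stage cost, and terminal cost.

```python
def rollout_cost(f, ell, phi, x_init, U):
    """Evaluate the finite-horizon MPC objective for one action sequence U.

    f:   dynamics model, x_{t+1} = f(x_t, u_t)
    ell: stage cost, phi: terminal cost (assumed callables).
    """
    x, total = x_init, 0.0
    for u in U:                 # accumulate stage costs over the horizon
        total += ell(x, u)
        x = f(x, u)
    return total + phi(x)       # terminal cost at x_{T+1}
```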

Sampling-based MPC (SPC) methods, such as the Cross-Entropy Method (CEM) and Model Predictive Path Integral (MPPI), optimize a parameterized proposal distribution $\pi_\phi(U)$ over candidate action sequences, iteratively refitting it to elite samples at every replanning step.
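As a concrete instance of such a replanning step, here is a minimal CEM sketch reusing the `rollout_cost` helper above; sample counts and the Gaussian refit are illustrative.

```python
import numpy as np

def cem_step(mu, sigma, score, n_samples=64, n_elite=8):
    """One CEM replanning iteration: sample, score, refit to elites.

    mu, sigma: Gaussian proposal parameters over action sequences, shape (T, d).
    score:     callable returning the rollout cost of one sequence.
    """
    U = mu + sigma * np.random.randn(n_samples, *mu.shape)  # candidates
    costs = np.array([score(u) for u in U])
    elite = U[np.argsort(costs)[:n_elite]]                  # lowest-cost set
    return elite.mean(axis=0), elite.std(axis=0)            # refit proposal
```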

BMPC introduces a bootstrapping loop:

  • In one variant, successful sequences from SPC are used to train an offline generative model (e.g., a conditional flow model), which then informs online sampling in SPC by providing high-quality proposals.
  • In another variant, neural policies (and value functions) are trained by imitating MPC expert rollouts, and then injected back as proposals and terminal values into the planner, iteratively improving both planning and learning (Wang et al., 24 Mar 2025).

2. Algorithmic Structure and Pseudocode

BMPC can be instantiated via two distinct algorithmic paths.

Path A: offline generative-proposal bootstrapping (Brudermüller et al., 16 Oct 2025)

  1. Offline Phase: Collect a dataset $\mathcal{D} = \{(\mathbf{x}_\tau^{(i)}, \mathbf{h}_\tau^{(i)}, U_\tau^{*\,(i)})\}$ of states, short histories, and SPC-optimized action sequences from simulation.
  2. Model Training: Train a conditional flow-matching generative model $v_\theta(U, t \mid \mathbf{x}_\tau, \mathbf{h}_\tau)$ to learn the transport from a simple prior $p_0(U)$ to the empirical SPC distribution $p_1(U \mid \mathbf{x}_\tau, \mathbf{h}_\tau)$ via the flow-matching loss

$$\min_\theta \; \mathbb{E}\left[ \left\| v_\theta(U_t, t \mid \mathbf{x}_\tau, \mathbf{h}_\tau) - (U_1 - U_0) \right\|^2 \right]$$

where $U_t = t U_1 + (1-t) U_0$.

  3. Online Phase: At every time step, draw a mixture of samples from CEM’s Gaussian proposal and the generative model; combine these for rollouts, refit to the elites, and execute (see the flow-matching training sketch below).
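A minimal sketch of the flow-matching update in step 2, assuming a `velocity_net` module that takes the interpolated sequence, the time, and conditioning features; all names and shapes are illustrative.

```python
import torch

def flow_matching_loss(velocity_net, U1, cond):
    """Conditional flow-matching loss on SPC-optimized action sequences.

    U1:   (B, T, d) expert action sequences (samples from p1).
    cond: (B, c)    conditioning features (state + short history).
    """
    U0 = torch.randn_like(U1)              # samples from the simple prior p0
    t = torch.rand(U1.size(0), 1, 1)       # interpolation times t ~ U(0, 1)
    Ut = t * U1 + (1.0 - t) * U0           # linear interpolant U_t
    target = U1 - U0                       # transport velocity to regress
    pred = velocity_net(Ut, t.view(-1), cond)
    return ((pred - target) ** 2).mean()
```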
Path B: expert-imitation bootstrapping (Wang et al., 24 Mar 2025)

  1. Data Collection: Roll out MPC, guided by the current policy $\pi_\theta$ and terminal value $V_\phi$, recording $(s_t, a_t, r_t, s_{t+1}, \pi^{\mathrm{expert}}_t)$ in the replay buffer.
  2. Network Updates (a sketch of both updates follows this list):

    • Policy: Minimize the imitation loss

$$\mathcal{L}_{\rm imit}(\theta) = \mathbb{E}\left[ \mathrm{KL}\!\left(\pi^{\mathrm{MPC}}(\cdot \mid x_t) \,\|\, \pi_\theta(\cdot \mid x_t)\right) / S - \beta\, \mathcal{H}\!\left(\pi_\theta(\cdot \mid x_t)\right) \right]$$

    • Value: Minimize the $N$-step model-based TD loss

$$\mathcal{L}_{\rm TD}(\phi) = \mathbb{E}\left[ \sum_{i=0}^{H} \lambda^i\, \mathrm{CE}\!\left(V_\phi(z_{t+i}), \hat V_{t+i}\right) \right]$$

  3. Lazy Reanalyze: Efficiently refresh small batches of replay-buffer expert targets by rerunning MPC, amortizing planning cost (detailed in Section 4).
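A minimal sketch of both network updates, assuming torch `Normal` policies; the value loss uses MSE in place of the cross-entropy over a discretized value (a labeled simplification), and names such as `pi_theta`, `V_phi`, `S`, and `beta` are illustrative.

```python
import torch
import torch.distributions as D

def imitation_loss(pi_theta, pi_expert, states, S, beta):
    """KL(pi_MPC || pi_theta) / S minus an entropy bonus (Path B, step 2)."""
    policy = pi_theta(states)                        # Normal over actions
    kl = D.kl_divergence(pi_expert, policy).sum(-1)  # KL(expert || policy)
    return (kl / S - beta * policy.entropy().sum(-1)).mean()

def td_loss(V_phi, latents, td_targets, lam):
    """Lambda-weighted N-step value loss; MSE stands in for the paper's
    discrete-regression cross-entropy (a labeled simplification)."""
    return sum((lam ** i) * ((V_phi(z) - v_hat) ** 2).mean()
               for i, (z, v_hat) in enumerate(zip(latents, td_targets)))
```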

3. Integration of Generative Models with Sampling-Based MPC

Mixing samples from offline-trained generative models with traditional Gaussian proposals achieves the following:

  • The offline flow-matching model localizes proposals to successful regions, greatly reducing wasted sampling in SPC.
  • The online sampling loop maintains adaptability, as Gaussian samples provide coverage for out-of-distribution states, preventing sample impoverishment and mode collapse.

After training, generative models (flow-matching or policy) amortize much of the exploration cost. The resulting mixture approach enables significant reductions in rollout count per iteration or planning horizon length, enhancing practicality for both simulated and real robotic settings (Brudermüller et al., 16 Oct 2025).
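A minimal sketch of the mixed proposal step, assuming a `flow_sampler` callable as the interface to the offline generative model; the mixture ratio is illustrative, and refitting proceeds as in `cem_step` above.

```python
import numpy as np

def mixed_proposals(mu, sigma, flow_sampler, state, n_total, flow_ratio=0.5):
    """Draw candidates from both the CEM Gaussian and the offline model.

    mu, sigma:    current Gaussian proposal parameters, shape (T, d).
    flow_sampler: returns (n, T, d) action sequences conditioned on state
                  (assumed interface for the offline generative model).
    """
    n_flow = int(flow_ratio * n_total)
    gauss = mu + sigma * np.random.randn(n_total - n_flow, *mu.shape)
    flow = flow_sampler(state, n_flow)
    # Gaussian samples keep coverage for out-of-distribution states;
    # flow samples localize search to previously successful regions.
    return np.concatenate([gauss, flow], axis=0)
```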

4. Computational Efficiency: Lazy Reanalyze Mechanism

BMPC with policy imitation can suffer from high compute cost if all buffer states are continually replanned by MPC. The "lazy reanalyze" mechanism sidesteps this:

  • Only a small fraction of buffer samples are selectively re-analyzed by MPC at fixed intervals, replacing their stored expert distributions.
  • Empirically, this achieves an effective reanalyze ratio of $\sim 0.8\%$, yielding a $\sim 100\times$ planning-cost reduction versus full-ratio methods such as EfficientZero (Wang et al., 24 Mar 2025).
  • The majority of policy updates are performed using the most recently available expert targets, benefiting from the combined effect of data amortization and targeted expert refresh.
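A minimal sketch of the mechanism under these assumptions; the buffer layout and `mpc_expert` interface are illustrative.

```python
import random

def lazy_reanalyze(buffer, mpc_expert, batch_size, interval, step):
    """Periodically rerun MPC on a small random batch of buffer states,
    overwriting their stored expert action distributions in place."""
    if step % interval != 0:
        return
    for i in random.sample(range(len(buffer)), min(batch_size, len(buffer))):
        buffer[i]["pi_expert"] = mpc_expert(buffer[i]["state"])
```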

5. Empirical Performance and Benchmarking

Extensive experimental analysis demonstrates the practical benefits of BMPC variants:

| Method   | Succ. Rate (Push-T) | Steps (Push-T) | Succ. Rate (Spot) | Steps (Spot) | CEM Ratio    |
|----------|---------------------|----------------|-------------------|--------------|--------------|
| CEM      | 0.85                | $1040\pm530$   | 0.33              | $1452\pm448$ | --           |
| MPPI     | 0.62                | $1634\pm492$   | 0.57              | $1096\pm360$ | --           |
| Dial-MPC | 0.86                | $1197\pm461$   | --                | --           | --           |
| GPC-CEM  | 0.998               | $591\pm267$    | 0.83              | $1126\pm430$ | 0.33--0.69   |

  • Sample efficiency: GPC-CEM reduces the required rollouts per iteration by up to 67% on Push-T.
  • Reduced horizon: Maintains $>95\%$ success with 1 s planning horizons where CEM drops below 80%.
  • Generalization: A single flow model can generalize to unseen task variants (e.g., K-block in Push-T) without retraining (Brudermüller et al., 16 Oct 2025).
  • Real-world deployment: Achieves $60\%$ success on Spot quadruped manipulation hardware at 5 Hz, whereas CEM alone yields only $10\%$ success.

BMPC with expert imitation and value learning achieves $3\times$ data-efficiency gains over TD-MPC2 in high-dimensional locomotion tasks, robust convergence, and reduced wall time (e.g., Dog Walk solve time from $2.03$ h to $0.87$ h) (Wang et al., 24 Mar 2025). In action spaces up to $\mathbb{R}^{61}$, BMPC outperforms SAC, DreamerV3, and TD-MPC2 while using $2\times$ smaller network capacity.

6. Strengths, Limitations, and Practical Considerations

Strengths

  • Sample Efficiency: Both generative proposal and policy-imitation BMPC reduce rollout counts and increase learning speed.
  • Multi-modal Exploration: Flow-based generative models capture diverse, non-myopic strategies, enabling robust performance on tasks with multiple solution modes.
  • Adaptability and Robustness: BMPC generalizes robustly to task variations and achieves consistent convergence in high-dimensional settings.
  • Computational Efficiency: The lazy reanalyze approach reduces planning computation by orders of magnitude relative to continuous expert-based amortization.

Limitations

  • Sim-to-Real Transfer: Performance may degrade when sim dynamics diverge from hardware; domain randomization is suggested as a future solution.
  • Goal Generalization: BMPC offline datasets are typically collected for fixed goal configurations; extension to dynamic or conditional goal representations is nontrivial.
  • Observation Constraints: Current instantiations operate on state-based inputs, without vision.
  • Simulation Bottleneck: Forward rollout simulation remains the dominant cost ($>90\%$ of runtime). GPU or learned-model acceleration is not yet integrated.
  • Policy/Value Horizon Mismatch: For learned models, short-horizon planning ($H=3$) and $N=1$ TD targets are typically used due to world-model limitations (Wang et al., 24 Mar 2025).

7. Relationship to Prior Methods and Implications

BMPC establishes a framework for closing the gap between high-quality online planners and neural policies via iterative bootstrapping. Unlike methods relying exclusively on model-free learning or pure sampling, BMPC leverages the expert knowledge encoded in planning to guide policy and value updates, while using learned models to amortize and accelerate future planning. The result is improved efficiency and stability for continuous-control tasks across simulation and real hardware (Brudermüller et al., 16 Oct 2025, Wang et al., 24 Mar 2025).

A plausible implication is that further advances marrying generative modeling and planning could address outstanding limitations such as sim-to-real shift and perception-action integration, pushing the applicability of BMPC-type methods to broader classes of real-world autonomous systems.

References (2)
