CoRL-MPPI Framework

Updated 7 January 2026
  • The paper introduces a decentralized multi-robot planning method that integrates cooperative reinforcement learning with MPPI sampling to enhance collision avoidance.
  • It leverages an offline-trained neural policy to bias trajectory sampling, ensuring efficient and cooperative behaviors while enforcing safety with SOCP constraints.
  • Empirical results show high success rates and reduced makespan in dense dynamic environments compared to traditional MPPI-based approaches.

The CoRL-MPPI framework is a class of decentralized multi-robot motion planning controllers that fuse Cooperative Reinforcement Learning (CoRL) with Model Predictive Path Integral (MPPI) sampling to achieve efficient, provably safe, and cooperative collision avoidance in dense, dynamic environments. This paradigm is motivated by the limitations of pure MPPI, namely its ignorance of multi-agent intent and its inefficient random sampling, and addresses them by embedding the learned behavioral priors of a deep neural policy, trained via multi-agent reinforcement learning, within the stochastic optimal control architecture of MPPI. All claims, equations, and metrics provided are directly grounded in the cited works.

1. Motivation and Theoretical Basis

Classical MPPI is a sampling-based Stochastic Model Predictive Control method suitable for nonlinear robotic systems and enjoys strong theoretical optimality and safety guarantees under appropriate conditions. However, it suffers from reliance on uninformed Gaussian sampling centered at a nominal control input. In dense multi-robot settings, most rollouts generated by such uninformed proposal distributions lead to collisions or deadlocks. MPPI also lacks any mechanism for cooperative intent prediction—each agent effectively plans in isolation, ignoring the dynamic strategies of its neighbors, often resulting in suboptimal or unsafe emergent behavior.

To address these limitations, CoRL-MPPI introduces an offline-learned decentralized policy π, trained via deep reinforcement learning with a reward structure designed to incentivize collision avoidance and cooperative progress. Embedding π as a proposal policy within the MPPI sampler biases the trajectory rollouts toward more sophisticated and implicitly cooperative behaviors, while preserving all stochastic optimal control guarantees of the underlying MPPI solver as long as the sampling distribution is maintained within the Gaussian class and safety constraints are enforced by convex optimization over the proposal parameters (Dergachev et al., 12 Nov 2025).

2. Core Algorithmic Components

2.1 MPPI Sampling and Update Law

At each planning cycle, agents solve the finite-horizon stochastic control problem:

$$u^* = \arg\min_{u \in \mathcal{U}^H} \mathbb{E}\left[ \varphi(x_H) + \sum_{t=0}^{H-1} \left( q(x_t) + \frac{\gamma}{2}\, u_t^\top \Sigma^{-1} u_t \right) \right]$$

with the controlled system evolving as $x_{t+1} = F(x_t) + G(x_t)\,\nu_t$, $\nu_t \sim \mathcal{N}(u_t, \Sigma)$.

MPPI samples $K$ perturbed control sequences $u^k_t = u^{\mathrm{init}}_t + \epsilon^k_t$, $\epsilon^k_t \sim \mathcal{N}(0, \Sigma)$, propagates the dynamics, computes trajectory costs $S(x^k, u^k)$ including running, terminal, and control-effort components (see the itemized cost in Dergachev et al., 12 Nov 2025), and forms normalized exponential weights:

$$\omega^k = \frac{\exp\left[-\left(S(x^k,u^k)-\min_l S(x^l,u^l)\right)/\lambda\right]}{\sum_{m=1}^{K} \exp\left[-\left(S(x^m,u^m)-\min_l S(x^l,u^l)\right)/\lambda\right]}$$

The updated control at each timestep is the weighted average:

$$u^*_t = \sum_{k=1}^{K} \omega^k\, u^k_t$$
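
As a concrete illustration, a minimal NumPy sketch of this sampling, weighting, and update step is given below. The dynamics function `step`, the stage cost `q`, and the terminal cost `phi` are placeholders standing in for the paper's model and cost terms, not the authors' implementation.

```python
import numpy as np

def mppi_update(u_init, x0, step, q, phi, Sigma, lam=1.0, gamma=1.0, K=1500):
    """One MPPI iteration: sample K perturbed control sequences around u_init
    (shape (H, m)), roll out the dynamics, and return the importance-weighted
    control sequence.  step, q, and phi are user-supplied stand-ins for the
    paper's dynamics and cost terms."""
    H, m = u_init.shape
    L = np.linalg.cholesky(Sigma)
    eps = np.random.randn(K, H, m) @ L.T           # eps_t^k ~ N(0, Sigma)
    u = u_init[None] + eps                         # K perturbed control sequences

    Sigma_inv = np.linalg.inv(Sigma)
    S = np.zeros(K)
    for k in range(K):
        x = x0.copy()
        for t in range(H):
            S[k] += q(x) + 0.5 * gamma * u[k, t] @ Sigma_inv @ u[k, t]
            x = step(x, u[k, t])                   # x_{t+1} = F(x_t) + G(x_t) nu_t
        S[k] += phi(x)                             # terminal cost

    w = np.exp(-(S - S.min()) / lam)               # exponential (softmin) weights
    w /= w.sum()
    return np.einsum("k,ktm->tm", w, u)            # u*_t = sum_k w^k u^k_t
```

The temperature $\lambda$ controls how sharply the weighting concentrates on the lowest-cost rollouts.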

2.2 Cooperative Policy Integration

The key innovation is to divide the rollouts into two branches:

  • MPPI branch: rollouts centered around the prior solution $u^{\mathrm{init}}$
  • RL branch: rollouts centered around the policy mean $u^\pi$ computed from $\pi(o)$, where the observation $o$ encodes the normalized goal direction and nearest-neighbor agent state information

Both Gaussians (means and covariances) are adaptively constrained, for the first $H_{\mathrm{safe}}$ steps, by solving a convex SOCP enforcing probabilistic (e.g., ORCA-style) collision-avoidance constraints with violation probability below $1-\delta_c$.

Let $\hat{u}_t^\pi, \hat{\Sigma}_t^\pi$ (RL) and $\hat{u}_t^{\mathrm{mppi}}, \hat{\Sigma}_t^{\mathrm{mppi}}$ (MPPI) be the constrained means/covariances for each branch after the SOCP. Sample $K_\pi$ rollouts using the RL branch and $K - K_\pi$ using the MPPI branch; form the mixture of proposals and proceed with standard MPPI weighting and update (Dergachev et al., 12 Nov 2025). This design allows direct inheritance of MPPI's convergence and safety guarantees.
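
A minimal sketch of this two-branch proposal, assuming the constrained means and per-step covariances from the SOCP are already available (the helper name and array shapes are illustrative, not taken from the paper):

```python
import numpy as np

def sample_mixture_rollouts(u_mppi, Sig_mppi, u_pi, Sig_pi, K=1500, rl_frac=0.3):
    """Draw K control-sequence samples: a fraction rl_frac around the constrained
    RL policy means and the remainder around the constrained MPPI prior means.
    Means have shape (H, m); per-step covariances have shape (H, m, m)."""
    H, m = u_mppi.shape
    K_pi = int(rl_frac * K)

    def draw(mean, cov, n):
        # per-timestep Gaussian perturbations with the branch's own covariance
        out = np.empty((n, H, m))
        for t in range(H):
            L = np.linalg.cholesky(cov[t])
            out[:, t] = mean[t] + np.random.randn(n, m) @ L.T
        return out

    rl_branch   = draw(u_pi,   Sig_pi,   K_pi)       # RL-guided proposals
    mppi_branch = draw(u_mppi, Sig_mppi, K - K_pi)   # classical MPPI proposals
    return np.concatenate([rl_branch, mppi_branch], axis=0)
```

All $K$ samples are then scored and weighted with the same exponential rule as in Section 2.1, which is why the update law itself does not change.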

3. Neural Policy Training and Network Structure

The decentralized learned policy π is trained offline via Independent Proximal Policy Optimization (IPPO) over a set of simulated multi-robot scenarios (circular rings, mesh grids) intended to foster cooperative behaviors. Each agent's observation $o_t^i$ is formed by concatenating goal-relative features and local neighbor kinematics. The policy outputs both the mean and covariance of the Gaussian action distribution for continuous velocity commands, bounded to the system limits.
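
The summary above fixes the policy's inputs and outputs but not its architecture; the following PyTorch module is an assumed illustration of a Gaussian velocity policy with a bounded mean and a learned diagonal covariance (layer sizes and the diagonal parameterization are assumptions, not the authors' network).

```python
import torch
import torch.nn as nn

class GaussianVelocityPolicy(nn.Module):
    """Illustrative decentralized policy: maps a local observation to the mean
    and (diagonal) covariance of a Gaussian over velocity commands, with the
    mean squashed to the robot's actuation limits.  Sizes are assumptions."""
    def __init__(self, obs_dim, act_dim, act_limit, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent std
        self.act_limit = act_limit

    def forward(self, obs):
        h = self.body(obs)
        mean = torch.tanh(self.mean_head(h)) * self.act_limit  # bounded mean
        cov = torch.diag_embed(self.log_std.exp() ** 2)        # diagonal covariance
        return mean, cov
```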

The multi-agent training is formulated as a Decentralized POMDP with the following per-agent reward function:

  • $+0.2$ on reaching the goal
  • $-1.0$ on collision
  • $+0.5 \cdot \left(\lVert x_t^i - \tau_i \rVert - \lVert x_{t+1}^i - \tau_i \rVert\right)$ for progress toward the goal
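
A minimal Python sketch of this reward, assuming that the collision and goal-arrival flags are computed elsewhere in the simulator:

```python
import numpy as np

def per_agent_reward(x_t, x_t1, goal, collided, reached_goal):
    """Reward terms listed above: +0.2 on reaching the goal, -1.0 on collision,
    and a dense term rewarding the per-step reduction in distance to the goal."""
    r = 0.5 * (np.linalg.norm(x_t - goal) - np.linalg.norm(x_t1 - goal))
    if reached_goal:
        r += 0.2
    if collided:
        r -= 1.0
    return r
```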

Training uses large-scale distributed simulation (32 agents, 60M environment steps, under 2 h of wall-clock time on an H100 GPU). This produces a learned policy that implicitly encodes local negotiation and cooperative avoidance patterns but, in isolation, would not offer formal safety or out-of-distribution guarantees (Dergachev et al., 12 Nov 2025).

4. Safety and Theoretical Properties

Safety is enforced at rollout generation via per-step second-order cone programming, which imposes ORCA-style linearized collision constraints at each stage up to horizon $H_{\mathrm{safe}}$. The control distributions (both the prior and the RL proposal) are projected onto the feasible set, ensuring with prescribed probability $\delta_c$ that samples are collision-free given the local agent state estimates. This mechanism is a direct extension of the MPPI-ORCA approach and retains the rigorous probabilistic safety guarantees of the underlying framework. The subsequent importance weighting and control update are unaffected by the choice of prior or proposal mean as long as the overall sample set remains a mixture of Gaussians (Dergachev et al., 12 Nov 2025).
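
To illustrate the chance-constraint reformulation, the sketch below projects a proposal mean onto half-plane constraints $a_j^\top v \le b_j$ that must hold with probability at least $\delta_c$ for $v \sim \mathcal{N}(u, \Sigma)$, using cvxpy. The half-plane data and the Gaussian-quantile tightening are generic assumptions; the paper's full SOCP also adapts the covariance, which is omitted here for brevity.

```python
import numpy as np
import cvxpy as cp
from scipy.stats import norm

def project_mean(u_nom, Sigma, halfplanes, delta_c=0.95):
    """Project the proposal mean u_nom onto the set where each linearized
    (ORCA-style) constraint a^T v <= b holds with probability >= delta_c for
    v ~ N(u, Sigma).  With Sigma held fixed, the tightening term is a constant
    and the projection reduces to a small convex QP; the paper's SOCP also
    treats the covariance as a decision variable."""
    m = u_nom.shape[0]
    u = cp.Variable(m)
    kappa = norm.ppf(delta_c)                       # Gaussian quantile factor
    Sigma_sqrt = np.linalg.cholesky(Sigma)
    constraints = [
        a @ u + kappa * np.linalg.norm(Sigma_sqrt.T @ a) <= b
        for (a, b) in halfplanes
    ]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(u - u_nom)), constraints)
    prob.solve()
    return u.value
```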

5. Empirical Evaluation and Performance

CoRL-MPPI was benchmarked on three scenario classes:

  • Circle: up to 50 agents on a ring with antipodal targets
  • Mesh: sparse (6×6 trained) and dense (5×5 evaluated) grid layouts
  • Random: agents with random placements/goals in a large field, representing out-of-distribution settings

The approach was evaluated against:

  • ORCA-DD (differential drive)
  • B-UAVC (Buffered Voronoi Cells with uncertainty)
  • MPPI-ORCA (standard MPPI with safety projection)

Key metrics:

  • Success Rate (SR): fraction of runs with all agents reaching their goal
  • Collision rate
  • Makespan: time until all agents complete tasks

Results for CoRL-MPPI:

  • 100% SR on Random and Circle, 99.25% on Mesh (Dense)
  • Collisions: 0% on Random/Circle, 0.75% on Mesh (Dense)
  • Up to a 2× reduction in makespan compared to MPPI-ORCA in dense cases
  • Matched baseline generalization on Random, indicating no overfitting to training layouts

MPPI-ORCA exhibited lower SR and nonzero collision rates in the dense regime; the classical baselines underperformed severely in both SR and makespan (Dergachev et al., 12 Nov 2025).

6. Algorithmic Pseudocode and Workflow

Below is a concise workflow paraphrased from Algorithm 1 (Dergachev et al., 12 Nov 2025):

  1. For each agent, initialize predictions for both RL-guided and MPPI branches.
  2. For each step $t$ in the horizon $H$:
    • Predict neighbor states
    • Assemble the observation vector $o_{t-1}$
    • Query the policy π for $(u^\pi_{t-1}, \Sigma^\pi_{t-1})$
    • Apply the SOCP to obtain constrained means/covariances for both branches as needed
    • Propagate dynamics for both branches with the constrained means
  3. Draw $K_\pi$ rollouts from the RL proposal and $K - K_\pi$ from the MPPI proposal, generating perturbed trajectories according to the constrained covariances.
  4. Compute costs and weights for all rollouts, perform importance-sampling update of control.
  5. Apply first control in sequence, shift window, repeat.

Typical operating parameters: $H = 10$ (3 s horizon), $K = 1500$ rollouts (30% RL-guided), differential-drive robots with state/action limits.
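
To make the workflow concrete, the following self-contained toy follows the five steps above for a single 2D point-mass agent. The differential-drive dynamics, the learned policy, the SOCP projection, and the neighbor handling are replaced by trivial stand-ins, so this is a structural sketch only, not the authors' algorithm.

```python
import numpy as np

H, K, lam, dt = 10, 256, 1.0, 0.3          # toy values; the paper uses K = 1500
goal = np.array([5.0, 0.0])
Sigma = 0.2 * np.eye(2)
L = np.linalg.cholesky(Sigma)

def step(x, u):            # stand-in dynamics (point mass instead of diff-drive)
    return x + dt * u

def stage_cost(x):         # stand-in running cost: distance to goal
    return np.linalg.norm(x - goal)

def policy_mean(x):        # stand-in for the learned policy pi(o): head to goal
    d = goal - x
    return d / (np.linalg.norm(d) + 1e-6)

x = np.zeros(2)
u_seq = np.zeros((H, 2))                   # MPPI prior (previous solution)
for cycle in range(40):
    # Steps 1-2: build the RL-guided branch by rolling out the policy
    # (the SOCP projection of each mean is omitted in this toy).
    u_pi, xp = np.empty((H, 2)), x.copy()
    for t in range(H):
        u_pi[t] = policy_mean(xp)
        xp = step(xp, u_pi[t])

    # Step 3: mixture of proposals, 30% around the policy, 70% around the prior.
    K_pi = int(0.3 * K)
    means = np.concatenate([np.repeat(u_pi[None], K_pi, 0),
                            np.repeat(u_seq[None], K - K_pi, 0)])
    u = means + np.random.randn(K, H, 2) @ L.T

    # Step 4: score and importance-weight exactly as in plain MPPI.
    S = np.zeros(K)
    for k in range(K):
        xs = x.copy()
        for t in range(H):
            S[k] += stage_cost(xs)
            xs = step(xs, u[k, t])
    w = np.exp(-(S - S.min()) / lam)
    w /= w.sum()
    u_seq = np.einsum("k,ktm->tm", w, u)

    # Step 5: apply the first control, shift the window, repeat.
    x = step(x, u_seq[0])
    u_seq = np.roll(u_seq, -1, axis=0)
    u_seq[-1] = 0.0
```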

7. Relationship to Unified Control-Learning Formulations

CoRL-MPPI can be interpreted within a broader thermodynamic optimization framework that encompasses MPPI, policy-gradient RL, and diffusion model reverse sampling as variants of gradient ascent over energy-smoothed (Gibbs measure) control distributions (Li et al., 27 Feb 2025). In this view, RL policy and MPPI sampling both perform score-based updates, with statistical weighting via exponential transforms of cost/reward, and the practical fusion of learned and model-based rollouts in the CoRL-MPPI architecture is a particular instantiation of this unified paradigm.
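
To make this connection explicit, recall the standard path-integral view of MPPI (a general property of the method, not a result specific to the cited works): the target is the Gibbs-smoothed distribution over sampled control sequences $V = (\nu_0, \dots, \nu_{H-1})$,

$$q^*(V) \;\propto\; \exp\!\left(-\tfrac{1}{\lambda} S(V)\right) p(V), \qquad u^*_t = \mathbb{E}_{q^*}[\nu_t] \;\approx\; \sum_{k=1}^{K} \omega^k \nu^k_t,$$

so the importance weights of Section 2.1 are Monte Carlo estimates of this cost-exponentiated reweighting. Replacing the uninformed prior $p(V)$ with the learned RL proposal changes the sampling distribution while the exponential weighting of trajectory costs is retained, which is the sense in which CoRL-MPPI instantiates the unified paradigm.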

8. Outlook and Limitations

The primary contribution of CoRL-MPPI is in closing the gap between learned cooperative behavior (which yields multi-agent coordination but lacks guarantees) and model-based optimal planning (which is safe but individually myopic in dynamic multi-agent settings). Biasing MPPI with a learned RL-guided branch produces substantial performance gains in dense and adversarial layouts while preserving safety and optimality proofs. Noted limitations include potential sim-to-real transfer challenges, static nature of the offline-trained policy, and the expressivity limitation of the policy network architecture. Future directions include online adaptation of the policy, graph neural net architectures for global awareness, and further integration of learning-based and safety-enforcing layers (Dergachev et al., 12 Nov 2025).
