Probabilistic Ensemble with Trajectory Sampling (PETS)

Updated 29 August 2025
  • PETS is a model-based reinforcement learning framework that explicitly quantifies epistemic and aleatoric uncertainties using an ensemble of probabilistic neural networks.
  • It employs particle-based trajectory sampling combined with model predictive control (MPC) to plan robust actions with high sample efficiency.
  • Empirical results demonstrate PETS achieves competitive performance on benchmarks like half-cheetah with significantly fewer samples compared to model-free methods.

Probabilistic Ensemble with Trajectory Sampling (PETS) is a model-based reinforcement learning (RL) framework that leverages ensembles of probabilistic neural network dynamics models and sampling-based uncertainty propagation to achieve high sample efficiency and robust planning. PETS is designed to bridge the performance gap between data-efficient model-based approaches and high-capacity, deep model-free RL algorithms by explicitly quantifying both epistemic (model) and aleatoric (inherent) uncertainties and integrating them into planning through particle-based trajectory sampling and model predictive control.

1. Framework Definition and Core Principles

PETS involves learning a forward dynamics model as an ensemble of neural networks, each parameterizing a conditional probability distribution (most commonly Gaussian) over the next state given the current state and action. This ensemble-based construction serves two purposes:

  • Aleatoric uncertainty: Each network outputs a predictive mean and covariance, modeling inherent stochasticity in the environment (e.g., sensor noise).
  • Epistemic uncertainty: Disagreement among ensemble members captures model uncertainty arising from limited or non-diverse training data.
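
A minimal sketch of this construction is shown below, assuming a PyTorch implementation; the class names (ProbabilisticDynamics, Ensemble), layer sizes, and activation choices are illustrative rather than the reference implementation. Each member outputs a Gaussian mean and log-variance (aleatoric), and the spread across independently initialized members supplies the epistemic signal.

```python
# Illustrative probabilistic ensemble dynamics model (not the original PETS code).
import torch
import torch.nn as nn

class ProbabilisticDynamics(nn.Module):
    """One ensemble member: predicts a diagonal Gaussian over the next state."""
    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.mean = nn.Linear(hidden, state_dim)     # predictive mean
        self.log_var = nn.Linear(hidden, state_dim)  # predictive log-variance (aleatoric)

    def forward(self, s, a):
        h = self.body(torch.cat([s, a], dim=-1))
        # Clamping the log-variance is a common stabilization trick (assumed here).
        return self.mean(h), self.log_var(h).clamp(-10.0, 2.0)

class Ensemble(nn.Module):
    """B independently initialized members; their disagreement reflects epistemic uncertainty."""
    def __init__(self, state_dim, action_dim, B=5):
        super().__init__()
        self.members = nn.ModuleList(
            ProbabilisticDynamics(state_dim, action_dim) for _ in range(B))

    def forward(self, s, a):
        outs = [m(s, a) for m in self.members]
        means = torch.stack([mu for mu, _ in outs])        # (B, batch, state_dim)
        log_vars = torch.stack([lv for _, lv in outs])     # (B, batch, state_dim)
        return means, log_vars
```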

Planning in PETS is performed via Model Predictive Control (MPC). At every time step, candidate action sequences are generated with the cross-entropy method (CEM), rather than random shooting, and evaluated by propagating multiple “particles” (samples) through the probabilistic dynamics model. Only the first action of the best sequence is executed, and planning is repeated at the next step. Two distinct trajectory-sampling variants are introduced (a propagation sketch follows the list):

  • TS1 ("Resample at every step"): The ensemble member governing prediction can change for each particle at each time step.
  • TS∞ ("Fixed model per particle"): Each trajectory particle is tied to a single model for its whole horizon, allowing isolation of epistemic and aleatoric sources of uncertainty.

2. Formal Mathematical Description

The key mathematical components governing PETS can be summarized as follows:

  • Learning the dynamics model:

s_{t+1} \sim p(s_{t+1} | s_t, a_t; \theta)

The probabilistic network is trained to minimize the negative log-likelihood loss:

\mathrm{loss}_P = -\sum_{n=1}^N \log p_\theta(s_{n+1} | s_n, a_n)

For a Gaussian output:

\mathrm{loss}_{\text{Gauss}} = \sum_{n=1}^N \Big\{ [\mu_\theta(s_n, a_n) - s_{n+1}]^\top \Sigma_\theta(s_n, a_n)^{-1} [\mu_\theta(s_n, a_n) - s_{n+1}] + \log\det\Sigma_\theta(s_n, a_n) \Big\}

  • Ensemble averaging:

p(s_{t+1} | s_t, a_t) = \frac{1}{B} \sum_{b=1}^B p_{\theta_b}(s_{t+1} | s_t, a_t)

  • Planning objective via MPC:

a_{t:t+T}^{*} = \underset{a_{t:t+T}}{\arg\max} \sum_{\tau=t}^{t+T} \mathbb{E}\left[ r(s_\tau, a_\tau) \right]

Subject to state propagation by the learned model and uncertainty-aware sampling using TS1/TS∞ to approximate the trajectory distribution p(s_{t:t+T} | s_t, a_{t:t+T}).
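
As a small worked sketch of the training loss above, the Gaussian case with a diagonal covariance (the common PETS parameterization) can be computed as follows; constants independent of \theta are dropped, and `mean`, `log_var`, `next_state` are assumed to be tensors produced by one ensemble member for a batch of N transitions.

```python
# Illustrative Gaussian negative log-likelihood loss (diagonal covariance).
import torch

def gaussian_nll(mean, log_var, next_state):
    inv_var = torch.exp(-log_var)                              # Sigma^{-1} (diagonal)
    mahalanobis = ((mean - next_state) ** 2 * inv_var).sum(dim=-1)  # squared error term
    log_det = log_var.sum(dim=-1)                              # log det Sigma
    return (mahalanobis + log_det).sum()                       # summed over the N transitions
```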

3. Sample Efficiency and Asymptotic Performance

Empirical evaluation demonstrates that PETS achieves competitive or superior asymptotic performance compared to model-free deep RL methods such as Soft Actor Critic (SAC) and Proximal Policy Optimization (PPO), but with drastically reduced sample requirements. For instance, on the half-cheetah benchmark, PETS attains comparable performance with 8× fewer samples than SAC and 125× fewer than PPO. This sample efficiency is attributed to:

  • Expressive uncertainty-aware dynamics modeling
  • Effective propagation and aggregation of uncertainty through ensembles
  • Focused trajectory planning via sampling-based action-sequence optimizers (CEM)
  • Use of particle-based prediction to mitigate compounding model error over longer horizons

4. Benchmark Applications and Extensions

PETS has been evaluated across various continuous control benchmarks in MuJoCo, such as cartpole swing-up, pendulum, PR2 robot manipulator (pusher and reacher), and high-dimensional locomotion tasks (half-cheetah). On these benchmarks:

  • PETS matches or exceeds the performance of previous model-based algorithms, even in high-dimensional, contact-rich environments.
  • It robustly attains high reward in fewer than 100 trials across diverse domains.

The framework has since been extended and specialized for domains such as vehicle trajectory control with PETS-MPPI (Frauenknecht et al., 2023), decentralized multi-agent decision-making (MA-PETS for connected autonomous vehicles) (Wen et al., 2023), and robust adversarial planning (DR-PETS, using Wasserstein ambiguity sets for worst-case guarantees) (Jesawada et al., 26 Mar 2025).

5. Uncertainty Quantification and Propagation

A distinctive contribution of PETS is its explicit handling of both epistemic and aleatoric uncertainty:

  • Epistemic: Quantified by ensemble disagreement, crucial for directing exploration in regions of model uncertainty.
  • Aleatoric: Captured via predictive variances produced by each network, enabling the planner to avoid overconfident predictions in inherently noisy regions.

PETS separates these uncertainties through trajectory sampling schemes. This separation is critical for avoiding exploration of high-variance (non-learnable) regions and improving planning reliability.
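
For a single (state, action) query, the two sources can be read off the ensemble outputs via the law of total variance: the average of the members' predictive variances is the aleatoric part, while the variance of the members' means is the epistemic part. A minimal sketch, assuming `means` and `variances` are arrays of shape (B, state_dim) collected from the B members:

```python
# Illustrative decomposition of predictive uncertainty from ensemble outputs.
import numpy as np

def decompose_uncertainty(means, variances):
    aleatoric = variances.mean(axis=0)   # average within-member predictive variance
    epistemic = means.var(axis=0)        # disagreement between member means
    total = aleatoric + epistemic        # law of total variance
    return aleatoric, epistemic, total
```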

Subsequent work (e.g., DPETS (Huang et al., 2023)) has refined uncertainty propagation by integrating restrictive Monte-Carlo Dropout, stabilizing the estimation and correction of fitting errors, and filtering aleatoric uncertainty during planning for improved control capability.

6. Planning via Model Predictive Control and Optimization

In PETS, MPC is the planning backbone, where action sequences are optimized by simulating multiple trajectories under the stochastic dynamics model. The cross-entropy method (CEM) is employed for sampling and refining candidate action sequences, which ensures the planner efficiently explores promising solutions rather than relying on exhaustive search.
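
A hedged sketch of this CEM loop is given below. It assumes a `rollout_return` function that scores an action sequence by simulating particles through the learned ensemble (e.g., with TS1/TS∞ sampling) and averaging the predicted returns; the candidate counts, iteration budget, and normalized action bounds are illustrative defaults, not the paper's settings.

```python
# Illustrative CEM-based MPC planner over action sequences.
import numpy as np

def cem_plan(rollout_return, horizon, action_dim, n_candidates=400, n_elites=40,
             n_iters=5, rng=np.random.default_rng(0)):
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        candidates = mean + std * rng.standard_normal((n_candidates, horizon, action_dim))
        candidates = np.clip(candidates, -1.0, 1.0)            # assumed normalized action bounds
        returns = np.array([rollout_return(seq) for seq in candidates])
        elites = candidates[np.argsort(returns)[-n_elites:]]   # keep the top-scoring sequences
        mean, std = elites.mean(axis=0), elites.std(axis=0)     # refit the sampling distribution
    return mean[0]  # MPC: execute only the first action, then replan at the next step
```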

Extensions have introduced Bayesian and variational approaches to the MPC stage (e.g., PaETS (Okada et al., 2019)), reformulating trajectory distribution optimization as a variational inference problem and allowing multimodal solutions via Gaussian mixture models over trajectories, further enhancing exploration and sample efficiency.

7. Implications, Limitations, and Future Directions

PETS marks a key advance in model-based RL. The framework demonstrates that rigorous uncertainty quantification and propagation, together with flexible particle-based planning, resolve the sample efficiency–performance tradeoff that previously constrained model-based methods. Because it relies exclusively on learned dynamics models and online planning rather than a learned policy, PETS shows that planning-centric control remains viable even in complex, high-dimensional tasks.

However, limitations exist:

  • PETS can incur high computational overhead due to repeated online planning via MPC, especially in time-critical applications.
  • Planning is sensitive to the accuracy and calibration of uncertainty estimates; erroneous uncertainty can lead to conservative or risky actions.
  • In adversarial or worst-case regime shifts, standard PETS offers no robust guarantees; advances in distributionally robust planning (e.g., DR-PETS (Jesawada et al., 26 Mar 2025)) address this.

Notable future possibilities include amortized planning (policy learning on top of PETS), tighter integration with structural constraints and conformal calibration (Li et al., 18 Aug 2025), and adaptation to domains with multimodal, highly nonstationary behavior (e.g., autonomous driving, multi-agent negotiation, and safety-critical robotics).

Summary Table: PETS vs. Key Extensions and Benchmarks

| Algorithm | Uncertainty Treatment | Planning Approach | Sample Efficiency | Robustness Guarantees |
|---|---|---|---|---|
| PETS | Ensemble (epistemic) + Gaussian outputs (aleatoric) | MPC + CEM trajectory sampling | High (8–125× fewer samples than SAC/PPO) | No explicit adversarial defense |
| PaETS (Okada et al., 2019) | Multimodal Gaussian mixtures | Variational inference MPC | Improved over PETS | Expanded multimodal capture |
| DPETS (Huang et al., 2023) | Dropout-based ensemble | MPC, filters aleatoric uncertainty | Higher than PETS / model-free RL | Robust to external disturbances |
| DR-PETS (Jesawada et al., 26 Mar 2025) | Wasserstein ambiguity set (robust) | MPC + min-max optimization | Comparable to PETS | Certified adversarial robustness |

PETS and its descendants provide a rigorous, flexible foundation for modern model-based RL and uncertainty-aware planning, drawing a direct connection between probabilistic prediction, ensemble learning, and high-performance control strategies in data-limited environments.