Vehicle Dynamics Embedded Dreamer (VDD)
- VDD is a model-based reinforcement learning framework that decouples ego-vehicle dynamics from environmental transitions to improve policy robustness.
- It employs a latent state split into ego and environmental components with hierarchical context-aware modeling and latent space planning.
- Robust strategies like PAD and PAT enable zero-shot adaptation to varied vehicle parameters, outperforming standard RL approaches in autonomous driving.
The Vehicle Dynamics Embedded Dreamer (VDD) is an extension of model-based reinforcement learning (RL) for autonomous driving, designed to improve policy robustness and generalization across vehicles with varying physical parameters. VDD achieves this by explicitly decoupling the modeling of ego-vehicle dynamics from environmental transition dynamics in the world model, while introducing targeted mechanisms for handling shifts in vehicle physical parameters. This framework outperforms standard RL and world-model-based approaches, particularly in robustness to dynamic variations and smoothness of control, demonstrating enhanced applicability to real-world autonomous driving scenarios (Li et al., 2 Dec 2025).
1. Model Architecture and Latent Dynamics
VDD builds upon the Dreamer family by organizing the latent state into ego-vehicle and environmental components. High-dimensional sensory observations (e.g., BEV images, LiDAR) and control actions are mapped as follows:
- $s^{\mathrm{ego}}_t$: Low-dimensional ego-vehicle state
- $s^{\mathrm{env}}_t$: Latent environmental state, parameterized by $(h_t, z_t)$, where $h_t$ is the RNN hidden state and $z_t$ is a stochastic latent (categorical or Gaussian)
- $\beta$: Ego-vehicle dynamics parameters (e.g., mass, wheelbase)
The hierarchical context-aware RSSM (hcRSSM) transitions at each time step as follows:
- Ego-dynamics update: $s^{\mathrm{ego}}_t = f_\psi\!\left(s^{\mathrm{ego}}_{t-1}, a_{t-1}; \beta\right)$
- Environmental deterministic state: $h_t = f_\phi\!\left(h_{t-1}, z_{t-1}, s^{\mathrm{ego}}_{t-1}\right)$
- Prior over $z_t$: $\hat{z}_t \sim p_\phi\!\left(\hat{z}_t \mid h_t\right)$
- Posterior inference: $z_t \sim q_\phi\!\left(z_t \mid h_t, o_t\right)$
Observation, reward, and continue-flag decoders reconstruct $\hat{o}_t$, $\hat{r}_t$, and $\hat{c}_t$ from $(h_t, z_t)$. The total environmental loss combines prediction, prior consistency (KL), and representation losses. The vehicle dynamics model is optimized independently via MSE: $\mathcal{L}_{\mathrm{veh}} = \big\lVert f_\psi(s^{\mathrm{ego}}_{t-1}, a_{t-1}; \beta) - s^{\mathrm{ego}}_t \big\rVert^2$.
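A minimal PyTorch sketch of one hcRSSM step is given below. The class names, layer sizes, use of Gaussian latents, and the learned MLP vehicle model are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.distributions as D

class VehicleModel(nn.Module):
    """Learned stand-in for the ego dynamics f_psi: (s_ego, a; beta) -> next s_ego."""
    def __init__(self, ego_dim=4, act_dim=2, beta_dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(ego_dim + act_dim + beta_dim, hidden),
                                 nn.ELU(), nn.Linear(hidden, ego_dim))

    def forward(self, s_ego, action, beta):
        return self.net(torch.cat([s_ego, action, beta], -1))

class EnvRSSMCell(nn.Module):
    """One environmental hcRSSM step with Gaussian latents (illustrative sizes)."""
    def __init__(self, ego_dim=4, hid_dim=256, z_dim=32, obs_emb_dim=128):
        super().__init__()
        self.gru = nn.GRUCell(z_dim + ego_dim, hid_dim)           # h_t = f_phi(h, z, s_ego)
        self.prior = nn.Linear(hid_dim, 2 * z_dim)                # p_phi(z_t | h_t)
        self.post = nn.Linear(hid_dim + obs_emb_dim, 2 * z_dim)   # q_phi(z_t | h_t, o_t)

    def forward(self, h, z, s_ego, obs_emb):
        h_next = self.gru(torch.cat([z, s_ego], -1), h)           # deterministic env state
        pm, ps = self.prior(h_next).chunk(2, -1)
        prior = D.Normal(pm, nn.functional.softplus(ps) + 1e-3)
        qm, qs = self.post(torch.cat([h_next, obs_emb], -1)).chunk(2, -1)
        post = D.Normal(qm, nn.functional.softplus(qs) + 1e-3)
        z_next = post.rsample()                                   # posterior sample
        kl = D.kl_divergence(post, prior).sum(-1).mean()          # prior-consistency term
        return h_next, z_next, kl
```

Decoders for $\hat{o}_t$, $\hat{r}_t$, and $\hat{c}_t$ (not shown) would take $(h_t, z_t)$ as input, while the vehicle model is trained separately with the MSE objective above.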
2. Decoupling of Ego-Vehicle and Environmental Dynamics
VDD factorizes world transitions, reflecting the insight that environmental transitions depend on the ego-vehicle state but not directly on the action, while ego-vehicle transitions depend on the action and intrinsic vehicle parameters but not on the environment:

$$p\!\left(s^{\mathrm{ego}}_{t+1}, s^{\mathrm{env}}_{t+1} \mid s^{\mathrm{ego}}_t, s^{\mathrm{env}}_t, a_t\right) = p_\psi\!\left(s^{\mathrm{ego}}_{t+1} \mid s^{\mathrm{ego}}_t, a_t; \beta\right)\, p_\phi\!\left(s^{\mathrm{env}}_{t+1} \mid s^{\mathrm{env}}_t, s^{\mathrm{ego}}_t\right)$$

The environmental model (RSSM, parameterized by $\phi$) and the vehicle model (either physics-based or learned, parameterized by $\psi$) are optimized independently, leading to the additive objective:

$$\mathcal{L}_{\mathrm{world}}(\phi, \psi) = \mathcal{L}_{\mathrm{env}}(\phi) + \mathcal{L}_{\mathrm{veh}}(\psi)$$
This separation allows the world model to generalize across vehicles with varied physical dynamics.
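Assuming the `EnvRSSMCell` and `VehicleModel` sketches above, independent optimization amounts to two disjoint parameter groups and two optimizer steps; the environmental loss is reduced to its KL term here purely for brevity:

```python
import torch

env_model = EnvRSSMCell()   # parameters phi
veh_model = VehicleModel()  # parameters psi
opt_env = torch.optim.Adam(env_model.parameters(), lr=3e-4)
opt_veh = torch.optim.Adam(veh_model.parameters(), lr=3e-4)

def world_model_update(batch):
    # L_env: prediction + KL + representation terms; only the KL is shown here.
    _, _, kl = env_model(batch["h"], batch["z"], batch["s_ego"], batch["obs_emb"])
    loss_env = kl
    # L_veh: MSE between predicted and observed next ego-state.
    pred = veh_model(batch["s_ego"], batch["action"], batch["beta"])
    loss_veh = ((pred - batch["s_ego_next"]) ** 2).sum(-1).mean()
    # L_world = L_env + L_veh: the two terms share no parameters, so each
    # model is stepped independently.
    opt_env.zero_grad(); loss_env.backward(); opt_env.step()
    opt_veh.zero_grad(); loss_veh.backward(); opt_veh.step()
    return loss_env.item(), loss_veh.item()
```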
3. Policy Learning and Latent Space Planning
VDD adopts Dreamer-style actor-critic reinforcement learning with policy learning occurring entirely in latent space. The components are:
- Goal Generator (Actor): $\pi_\theta(g_t \mid s^{\mathrm{env}}_t)$ defines a distribution over goals conditioned on the environmental state.
- Controller: $a_t = C(s^{\mathrm{ego}}_t, g_t)$ is a fixed feedback controller (e.g., PID) that tracks the generated goal given the current ego-vehicle state.
- Critic: $v_\xi(s^{\mathrm{env}}_t)$ estimates the return in the latent environment.
Imagined rollouts in latent space simulate futures; at each step, $g_t \sim \pi_\theta(\cdot \mid s^{\mathrm{env}}_t)$, $a_t = C(s^{\mathrm{ego}}_t, g_t)$, $s^{\mathrm{ego}}_{t+1} = f_\psi(s^{\mathrm{ego}}_t, a_t; \beta)$, and $s^{\mathrm{env}}_{t+1} \sim p_\phi(\cdot \mid s^{\mathrm{env}}_t, s^{\mathrm{ego}}_t)$. Returns are computed as $\lambda$-returns. Critic and actor objectives include two-hot regression, entropy regularization, reward maximization, and a reachability penalty that encourages generating goals the controller can reach within a fixed number of steps.
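A sketch of latent-space imagination and $\lambda$-return computation is shown below; `actor`, `critic`, `env_step`, and `veh_step` are assumed callables (the actor returns a reparameterizable goal distribution, `env_step` advances the environmental prior and predicts reward and continue flag), and the proportional controller is a simplified stand-in for the PID controller:

```python
import torch

def imagine(actor, critic, env_step, veh_step, h, z, s_ego,
            horizon=15, gamma=0.99, lam=0.95, k_p=2.0):
    """Imagined latent rollout followed by lambda-return computation."""
    rewards, conts, values = [], [], []
    for _ in range(horizon):
        s_env = torch.cat([h, z], -1)
        goal = actor(s_env).rsample()                 # g_t ~ pi_theta(. | s_env_t)
        # Fixed feedback controller (P-control toward the goal as a PID stand-in);
        # assumes the goal targets the leading dimensions of the ego state.
        action = k_p * (goal - s_ego[..., :goal.shape[-1]])
        s_ego = veh_step(s_ego, action)               # ego dynamics f_psi
        h, z, reward, cont = env_step(h, z, s_ego)    # environmental prior + heads
        rewards.append(reward); conts.append(cont)
        values.append(critic(torch.cat([h, z], -1)))
    # Lambda-returns, computed backward from the bootstrap value:
    # R_t = r_t + gamma * c_t * ((1 - lam) * V_{t+1} + lam * R_{t+1}), with R_H = V_H.
    returns = [values[-1]]
    for t in reversed(range(horizon - 1)):
        R = rewards[t] + gamma * conts[t] * ((1 - lam) * values[t + 1] + lam * returns[0])
        returns.insert(0, R)
    return torch.stack(returns), torch.stack(values)
```

The actor is then updated to maximize these returns (with entropy regularization and the reachability penalty), while the critic regresses them via two-hot targets, as described above.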
4. Robustness Strategies: PAD and PAT
VDD introduces two targeted mechanisms to improve robustness to changing vehicle parameters:
- Policy Adjustment during Deployment (PAD): For a vehicle with dynamics parameters $\beta'$, PAD maps each action $a_t$ learned under the training parameters $\beta$ to a corresponding $a'_t$ such that the next ego-state under $\beta'$ matches the planned next state under $\beta$: $a'_t = f^{-1}_{\beta'}\!\left(s^{\mathrm{ego}}_t,\, f_\beta(s^{\mathrm{ego}}_t, a_t)\right)$. Here, $f^{-1}_{\beta'}$ is the inverse dynamics mapping of the new vehicle. This preserves policy intent despite parameter shifts (a numerical sketch follows this list).
- Policy Augmentation during Training (PAT): During training, imagined rollouts are augmented by sampling additional vehicle parameter sets alongside the nominal one. The RL losses are averaged over the nominal and augmented rollouts, and the latent state is conditioned on a context vector encoding the sampled vehicle parameters. This mechanism exposes the policy to a distribution over vehicle dynamics during training, enhancing policy generalization to unseen vehicles (a sketch also follows the list).
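The PAD mapping can be illustrated numerically. The kinematic bicycle model, the parameter names, and the grid-search approximation of the inverse dynamics $f^{-1}_{\beta'}$ below are assumptions made for the sketch, not the paper's exact formulation:

```python
import numpy as np

def bicycle_step(state, action, params, dt=0.1):
    """Kinematic bicycle stand-in for f_beta: state = (x, y, yaw, v), action = (accel, steer)."""
    x, y, yaw, v = state
    accel, steer = action
    steer = np.clip(steer, -params["max_steer"], params["max_steer"])
    x += v * np.cos(yaw) * dt
    y += v * np.sin(yaw) * dt
    yaw += v / params["wheelbase"] * np.tan(steer) * dt
    v += accel * params["engine_force"] / params["mass"] * dt   # crude longitudinal model
    return np.array([x, y, yaw, v])

def pad_adjust(state, action, train_params, deploy_params, n_grid=41):
    """Map an action planned under train_params to one whose next ego-state under
    deploy_params best matches the planned next state (inverse dynamics approximated
    by a grid search over the action space)."""
    target = bicycle_step(state, action, train_params)          # planned next ego-state
    accels = np.linspace(-1.0, 1.0, n_grid)
    steers = np.linspace(-deploy_params["max_steer"], deploy_params["max_steer"], n_grid)
    best, best_err = np.asarray(action, dtype=float), np.inf
    for a in accels:
        for s in steers:
            err = np.sum((bicycle_step(state, (a, s), deploy_params) - target) ** 2)
            if err < best_err:
                best, best_err = np.array([a, s]), err
    return best

# Example: heavier vehicle with a reduced steering range at deployment.
train = {"mass": 1.0, "wheelbase": 2.5, "max_steer": 0.5, "engine_force": 5.0}
deploy = {"mass": 1.4, "wheelbase": 2.5, "max_steer": 0.3, "engine_force": 5.0}
adjusted = pad_adjust(np.array([0.0, 0.0, 0.1, 8.0]), np.array([0.6, 0.2]), train, deploy)
```

In this toy setting the adjusted action applies roughly 40% more throttle to compensate for the added mass, while the steering command is left essentially unchanged because it still lies within the reduced steering range.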
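PAT can be sketched analogously: imagined rollouts are repeated under sampled vehicle parameters, the RL losses are averaged, and a context vector encoding the sampled parameters conditions the policy. The sampling ranges, the context encoding, and the `rollout_loss_fn` callable below are assumptions:

```python
import numpy as np

def sample_vehicle_params(rng, nominal):
    """Perturb the nominal vehicle parameters (perturbation ranges are illustrative)."""
    params = dict(nominal)
    params["mass"] = nominal["mass"] * rng.uniform(0.7, 1.3)
    params["max_steer"] = nominal["max_steer"] * rng.uniform(0.6, 1.2)
    return params

def params_to_context(params, nominal):
    """Context vector that conditions the policy on the sampled dynamics."""
    return np.array([params["mass"] / nominal["mass"],
                     params["max_steer"] / nominal["max_steer"]])

def pat_loss(rollout_loss_fn, nominal, rng, n_aug=1):
    """Average the imagination loss over the nominal and augmented parameter sets;
    rollout_loss_fn(params, context) is assumed to run an imagined rollout under
    the given dynamics and return the scalar RL loss."""
    param_sets = [nominal] + [sample_vehicle_params(rng, nominal) for _ in range(n_aug)]
    losses = [rollout_loss_fn(p, params_to_context(p, nominal)) for p in param_sets]
    return float(np.mean(losses))
```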
5. Experimental Evaluation and Ablation
VDD was evaluated on MetaDrive environments (T-intersection, roundabout) using episodic reward (RW), route completion (RC), success rate (SR), and collision rate (CR); the table below reports results on the roundabout task:
| Method | RW | RC | SR | CR |
|---|---|---|---|---|
| IDM | 82.2 | 0.54 | 0.21 | 0.59 |
| SAC | 96.1 | 0.62 | 0.30 | 0.62 |
| PPO | 73.4 | 0.53 | 0.01 | 0.64 |
| DreamerV3 | 150.6 | 0.84 | 0.53 | 0.40 |
| VDD | 163.8 | 0.87 | 0.70 | 0.26 |
VDD achieves the highest episodic reward and success rate, notably outperforming DreamerV3 on the roundabout task (0.70 vs. 0.53 SR, a 17-percentage-point gain). It also demonstrates improved control smoothness (lower action-change variance).
Ablation studies indicate that the full ego-state input, an explicit feedback controller (PID rather than a learned MLP), the reachability penalty in the actor loss, and PAT each contribute to robust policy performance.
For robustness under parameter shifts, a grid sweep over mass and steering-angle scaling factors indicates:
- DreamerV3 is not robust to mass or steering changes.
- DreamerV3+PAD shows some mass robustness but is sensitive to steering reduction.
- VDD maintains stability with mass variation and benefits from increased steering range.
- VDD+PAT preserves near-optimal performance under severe steering reduction.
6. Significance and Implications
VDD introduces principled separation between ego-vehicle and environmental dynamics, permitting improved transfer and robustness in autonomous driving world models. PAD enables zero-shot correction to known dynamics shifts during deployment, while PAT gives the policy context-awareness and adaptability to unknown or variable parameter regimes. These innovations directly address the common failure modes of conventional world models, namely overfitting to training parameters and brittleness under hardware changes.
Comprehensive experiments demonstrate that VDD not only matches or surpasses state-of-the-art baselines in standard settings, but remains consistently performant under varied and adverse conditions. The modularized training objective and policy structure suggest broad applicability to other physical domains where agent-environment dynamics are separable or partially known (Li et al., 2 Dec 2025).