
Vehicle Dynamics Embedded Dreamer (VDD)

Updated 9 December 2025
  • VDD is a model-based reinforcement learning framework that decouples ego-vehicle dynamics from environmental transitions to improve policy robustness.
  • It employs a latent state split into ego and environmental components with hierarchical context-aware modeling and latent space planning.
  • Robust strategies like PAD and PAT enable zero-shot adaptation to varied vehicle parameters, outperforming standard RL approaches in autonomous driving.

The Vehicle Dynamics Embedded Dreamer (VDD) is an extension of model-based reinforcement learning (RL) for autonomous driving, designed to improve policy robustness and generalization across vehicles with varying physical parameters. VDD achieves this by explicitly decoupling the modeling of ego-vehicle dynamics from environmental transition dynamics in the world model, while introducing targeted mechanisms for handling shifts in vehicle physical parameters. The framework outperforms standard RL and world-model-based approaches, particularly in robustness to dynamics variations and smoothness of control, demonstrating enhanced applicability to real-world autonomous driving scenarios (Li et al., 2 Dec 2025).

1. Model Architecture and Latent Dynamics

VDD builds upon the Dreamer family by organizing the latent state into ego-vehicle and environmental components. High-dimensional sensory observations $o_t \in \mathcal{O}$ (e.g., BEV images, LiDAR) and control actions $a_t \in \mathcal{A}$ are mapped as follows:

  • $s^i_t \in \mathbb{R}^d$: Low-dimensional ego-vehicle state (e.g., $\Delta x, \Delta y, \Delta\psi, v, \dot{\psi}, \dot{v}$)
  • $s^e_t$: Latent environmental state, parameterized by $(h_t, z_t)$, where $h_t$ is the RNN hidden state and $z_t$ is a stochastic latent (categorical or Gaussian)
  • $\theta$: Ego-vehicle dynamics parameters (e.g., mass, wheelbase)

The hierarchical context-aware RSSM (hcRSSM) transitions at each time step as follows:

  1. Ego-dynamics update: $s^i_t = f_{\theta}(s^i_{t-1}, a_t)$
  2. Environmental deterministic state: $h_t = \mathrm{GRU}_{\phi}(h_{t-1}, z_{t-1}, s^i_{t-1})$
  3. Prior over $z_t$: $p_{\phi}(z_t \mid h_t) = \mathrm{Categorical}(\mathrm{MLP}_{\phi}(h_t))$
  4. Posterior inference: $q_{\phi}(z_t \mid h_t, o_t) = \mathrm{Categorical}(\mathrm{MLP}_{\phi}(h_t, o_t))$
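
The following is a minimal sketch of one hcRSSM transition step in PyTorch. The layer sizes, the categorical latent shape (32 groups of 32 classes), the MLP ego-dynamics model, and the observation-embedding dimension are illustrative assumptions, not the published architecture.

```python
# Minimal hcRSSM transition-step sketch (PyTorch); sizes and layer choices are
# illustrative assumptions rather than the published architecture.
import torch
import torch.nn as nn
import torch.distributions as td

class HCRSSMStep(nn.Module):
    def __init__(self, ego_dim=6, act_dim=2, hid_dim=256,
                 z_groups=32, z_classes=32, obs_embed_dim=1024):
        super().__init__()
        self.z_shape = (z_groups, z_classes)
        z_flat = z_groups * z_classes
        # Learned ego-dynamics model s^i_t = f_theta(s^i_{t-1}, a_t)
        self.f_theta = nn.Sequential(nn.Linear(ego_dim + act_dim, 128),
                                     nn.ELU(), nn.Linear(128, ego_dim))
        # Deterministic environmental state h_t = GRU_phi(h_{t-1}, z_{t-1}, s^i_{t-1})
        self.gru = nn.GRUCell(z_flat + ego_dim, hid_dim)
        self.prior_mlp = nn.Linear(hid_dim, z_flat)                  # p_phi(z_t | h_t)
        self.post_mlp = nn.Linear(hid_dim + obs_embed_dim, z_flat)   # q_phi(z_t | h_t, o_t)

    def forward(self, s_i_prev, a_t, h_prev, z_prev, o_embed=None):
        # 1) Ego-dynamics update (independent of the environmental latent).
        s_i = self.f_theta(torch.cat([s_i_prev, a_t], dim=-1))
        # 2) Environmental deterministic state, conditioned on the previous ego state.
        h = self.gru(torch.cat([z_prev.flatten(1), s_i_prev], dim=-1), h_prev)
        # 3) Prior over the categorical latent z_t.
        prior = td.OneHotCategorical(logits=self.prior_mlp(h).view(-1, *self.z_shape))
        # 4) Posterior, available only when an encoded observation o_t is given.
        if o_embed is None:
            return s_i, h, prior.sample(), prior, None
        post = td.OneHotCategorical(
            logits=self.post_mlp(torch.cat([h, o_embed], dim=-1)).view(-1, *self.z_shape))
        return s_i, h, post.sample(), prior, post
```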

Observation, reward, and continue-flag decoders reconstruct $\hat{o}_t$, $\hat{r}_t$, and $\hat{c}_t$ from $(h_t, z_t)$. The total environmental loss $\mathcal{L}_{\mathrm{env}}(\phi)$ combines prediction, prior consistency (KL), and representation losses. The vehicle dynamics model is optimized independently via MSE: $\mathcal{L}_{\mathrm{veh}}(\theta) = \sum_{t=1}^{T} \frac{1}{2}\,\| f_{\theta}(s^i_{t-1}, a_t) - s^i_t \|^2$.
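
A hedged sketch of this vehicle-dynamics objective, assuming `f_theta` is the ego-dynamics network from the sketch above and that ego states and actions are stored as stacked tensors; averaging over the batch dimension is an implementation choice, not specified in the source.

```python
# Sketch of L_veh(theta): one-step ego-state prediction error, summed over the
# horizon T and averaged over the batch dimension (an assumed convention).
import torch

def vehicle_dynamics_loss(f_theta, ego_states, actions):
    """ego_states: (T+1, B, ego_dim); actions: (T, B, act_dim)."""
    inputs = torch.cat([ego_states[:-1], actions], dim=-1)   # (s^i_{t-1}, a_t) pairs
    preds = f_theta(inputs)                                   # predicted s^i_t for t = 1..T
    sq_err = 0.5 * ((preds - ego_states[1:]) ** 2).sum(dim=-1)
    return sq_err.sum(dim=0).mean()                           # sum over time, mean over batch
```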

2. Decoupling of Ego-Vehicle and Environmental Dynamics

VDD factorizes world transitions, reflecting the insight that environmental transitions $p(s^e_{t+1} \mid s^e_t, s^i_t)$ depend on the ego-vehicle state but not directly on the action, while ego-vehicle transitions $p(s^i_{t+1} \mid s^i_t, a_t; \theta)$ depend only on intrinsic vehicle parameters:

$$p(s_{t+1} \mid s_t, a_t) = p(s^e_{t+1} \mid s^e_t, s^i_t) \times p(s^i_{t+1} \mid s^i_t, a_t; \theta)$$

The environmental model (RSSM, parameterized by $\phi$) and the vehicle model (either physics-based or learned, parameterized by $\theta$) are optimized independently, leading to the additive objective:

$$\mathcal{L}(\phi, \theta) = \mathcal{L}_{\mathrm{env}}(\phi) + \mathcal{L}_{\mathrm{veh}}(\theta)$$

This separation allows the world model to generalize across vehicles with varied physical dynamics.
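
Because the objective is additive and the parameter sets $\phi$ and $\theta$ are disjoint, the two models can be trained with separate optimizers. The sketch below assumes helper functions `environment_loss` and `vehicle_dynamics_loss` (the latter sketched above) and PyTorch optimizers; it is an illustration, not the paper's training script.

```python
# Decoupled world-model update: gradients for phi (environment) and theta
# (vehicle) never mix, matching L(phi, theta) = L_env(phi) + L_veh(theta).
import torch

def build_optimizers(env_model, veh_model, lr=3e-4):
    return (torch.optim.Adam(env_model.parameters(), lr=lr),
            torch.optim.Adam(veh_model.parameters(), lr=lr))

def world_model_update(env_model, veh_model, env_opt, veh_opt, batch,
                       environment_loss, vehicle_dynamics_loss):
    loss_env = environment_loss(env_model, batch)             # prediction + KL + representation
    env_opt.zero_grad(); loss_env.backward(); env_opt.step()

    loss_veh = vehicle_dynamics_loss(veh_model, batch["ego_states"], batch["actions"])
    veh_opt.zero_grad(); loss_veh.backward(); veh_opt.step()
    return loss_env.item(), loss_veh.item()
```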

3. Policy Learning and Latent Space Planning

VDD adopts Dreamer-style actor-critic reinforcement learning with policy learning occurring entirely in latent space. The components are:

  • Goal Generator (Actor): $\pi_\epsilon(g_t \mid s^e_t)$ defines a distribution over goals $g_t$ conditioned on the environmental state.
  • Controller: $a_t = u_\kappa(g_t, s^i_t)$ is a fixed feedback controller (e.g., PID), tracking the generated goal given the current ego-vehicle state.
  • Critic: $v_\xi(s^e_t) \approx \mathbb{E}\!\left[\sum_{\tau \geq 0} \gamma^{\tau} r_{t+\tau} \mid s^e_t\right]$ estimates the return in the latent environment.
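
As a concrete illustration of the fixed feedback controller $u_\kappa$, the sketch below implements a simple PD-style tracker over an assumed goal layout (target heading offset and target speed); the gains, the goal parameterization, and the action clipping are illustrative assumptions rather than the paper's controller.

```python
# Illustrative fixed feedback controller a_t = u_kappa(g_t, s^i_t).
# Goal layout (target heading offset, target speed) and gains are assumptions.
import numpy as np

def u_kappa(goal, ego_state, k_psi=1.5, k_dpsi=0.3, k_v=0.8):
    target_heading, target_speed = goal
    # ego_state follows the order (dx, dy, dpsi, v, yaw_rate, accel) used above
    _, _, dpsi, v, yaw_rate, _ = ego_state
    steer = k_psi * (target_heading - dpsi) - k_dpsi * yaw_rate  # PD steering
    throttle = k_v * (target_speed - v)                          # P speed tracking
    return np.clip(np.array([steer, throttle]), -1.0, 1.0)       # normalized action
```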

Imagined rollouts in latent space simulate $K$ futures; at each step, $g_{t+1} \sim \pi_\epsilon(\cdot \mid \hat{s}^e_{t+1})$, $a_{t+1} = u_\kappa(g_{t+1}, \hat{s}^i_{t+1})$, $\hat{r}_{t+1} \sim p_\phi(r \mid \hat{s}^e_{t+1})$, and $\hat{c}_{t+1} \sim p_\phi(c \mid \hat{s}^e_{t+1})$. Returns are computed as $\lambda$-returns. Critic and actor objectives include two-hot regression, entropy regularization, reward maximization, and a reachability penalty that encourages generating goals reachable in $h$ steps.
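
The sketch below illustrates the imagined rollout and $\lambda$-return computation described above. All components (actor, controller, ego and environment step functions, reward/continue/value heads) are passed in as stand-ins, and the indexing and bootstrapping convention follows the usual Dreamer recipe as an assumption rather than the paper's exact formulation.

```python
# Hedged sketch of latent imagination and lambda-returns; the recursion
# R_t = r_t + gamma * c_t * ((1 - lam) * V_{t+1} + lam * R_{t+1}), bootstrapped
# with the final predicted value, follows the Dreamer convention (assumed).
def imagine_and_lambda_returns(s_e, s_i, actor, controller, step_ego, step_env,
                               reward_head, cont_head, critic,
                               K=15, gamma=0.997, lam=0.95):
    rewards, conts, values = [], [], []
    for _ in range(K):
        g = actor(s_e).sample()          # goal from pi_eps(. | s^e)
        a = controller(g, s_i)           # fixed feedback controller u_kappa
        s_i = step_ego(s_i, a)           # ego rollout via f_theta
        s_e = step_env(s_e, s_i)         # environmental rollout via the hcRSSM prior
        rewards.append(reward_head(s_e))
        conts.append(cont_head(s_e))
        values.append(critic(s_e))
    returns = [values[-1]]               # bootstrap with the last predicted value
    for t in reversed(range(K - 1)):
        R = rewards[t] + gamma * conts[t] * ((1 - lam) * values[t + 1] + lam * returns[0])
        returns.insert(0, R)
    return returns                       # K lambda-return targets for the critic
```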

4. Robustness Strategies: PAD and PAT

VDD introduces two targeted mechanisms to improve robustness to changing vehicle parameters:

  • Policy Adjustment during Deployment (PAD): For vehicles with dynamics $\theta'$, PAD maps each action $a_t$ learned under training parameters $\theta$ to a corresponding $a'_t$ such that the next ego-state under $\theta'$ matches the planned next state under $\theta$:

$$s^i_t = f_{\theta}(s^i_{t-1}, a_t), \qquad a'_t = f_{\theta'}^{-1}(s^i_{t-1}, s^i_t)$$

Here, $f_{\theta'}^{-1}$ is the inverse dynamics mapping of the new vehicle. This preserves policy intent despite parameter shifts (a minimal sketch of the remapping follows this list).

  • Policy Augmentation during Training (PAT): During training, imagined rollouts are augmented by sampling multiple vehicle parameter sets $(\theta, \theta')$. The RL losses are averaged over both, and the latent state is conditioned on a context vector $\mathrm{ctx}_{\theta'} = \theta' / \theta$. This mechanism exposes the policy to distributions over vehicle dynamics during training, enhancing policy generalization to unseen vehicles.
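
Below is a hedged sketch of both mechanisms. Because the source defines PAD through an inverse dynamics mapping $f_{\theta'}^{-1}$ without prescribing how it is computed, the sketch approximates the inverse by searching a discretized action set, which is a swapped-in simplification; the PAT context vector is the element-wise parameter ratio described above. `f_train` and `f_deploy` are assumed forward dynamics models for the training and deployed vehicles.

```python
# PAD: remap a training-policy action to the deployed vehicle's dynamics.
# PAT: build the context vector ctx_{theta'} = theta' / theta.
# The grid-search inverse dynamics is an illustrative approximation, not the
# paper's construction; f_train / f_deploy are assumed forward dynamics models.
import numpy as np

def pad_remap(a_t, s_i_prev, f_train, f_deploy, candidate_actions):
    """Choose a'_t so that f_deploy(s^i_{t-1}, a'_t) matches f_train(s^i_{t-1}, a_t)."""
    target = f_train(s_i_prev, a_t)                      # planned next ego state under theta
    errors = [np.linalg.norm(f_deploy(s_i_prev, a) - target) for a in candidate_actions]
    return candidate_actions[int(np.argmin(errors))]     # approximate f_{theta'}^{-1}

def pat_context(theta_train, theta_sampled):
    """Element-wise ratio of sampled to nominal vehicle parameters."""
    return np.asarray(theta_sampled, dtype=float) / np.asarray(theta_train, dtype=float)
```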

5. Experimental Evaluation and Ablation

VDD was evaluated on MetaDrive environments (T-intersection, roundabout) using episodic reward (RW), route completion (RC), success rate (SR), and collision rate (CR):

Method       RW (Roundabout)   RC     SR     CR
IDM           82.2             0.54   0.21   0.59
SAC           96.1             0.62   0.30   0.62
PPO           73.4             0.53   0.01   0.64
DreamerV3    150.6             0.84   0.53   0.40
VDD          163.8             0.87   0.70   0.26

VDD achieves the highest episodic reward and success rate, outperforming DreamerV3 on the roundabout task by 17 percentage points in success rate (0.70 vs. 0.53). It also demonstrates improved control smoothness (lower action-change variance).

Ablation studies indicate that the complete ego-state input, an explicit feedback controller (PID rather than a learned MLP), the reachability penalty in the actor loss, and PAT all contribute to robust policy performance.

For robustness under parameter shifts, a $5 \times 5$ grid of mass and steering-angle scaling factors $\{0.5, 0.75, 1.0, 1.25, 1.5\}$ indicates:

  • DreamerV3 is not robust to mass or steering changes.
  • DreamerV3+PAD shows some mass robustness but is sensitive to steering reduction.
  • VDD maintains stability with mass variation and benefits from increased steering range.
  • VDD+PAT preserves near-optimal performance under severe steering reduction.
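
A short sketch of this robustness-grid protocol: each cell scales the nominal mass and steering range and evaluates a frozen policy. The helpers `make_env` and `evaluate_policy` are hypothetical stand-ins for a MetaDrive environment constructor and an evaluation loop, not part of VDD itself.

```python
# Illustrative 5x5 robustness sweep over mass and steering scaling factors.
SCALES = [0.5, 0.75, 1.0, 1.25, 1.5]

def robustness_grid(policy, make_env, evaluate_policy, episodes=20):
    results = {}
    for mass_scale in SCALES:
        for steer_scale in SCALES:
            env = make_env(mass_scale=mass_scale, steering_scale=steer_scale)
            results[(mass_scale, steer_scale)] = evaluate_policy(policy, env, episodes)
    return results   # e.g., success rate per (mass, steering) cell
```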

6. Significance and Implications

VDD introduces principled separation between ego-vehicle and environmental dynamics, permitting improved transfer and robustness in autonomous driving world models. PAD enables zero-shot correction to known dynamics shifts during deployment, while PAT gives the policy context-awareness and adaptability to unknown or variable parameter regimes. These innovations directly address the common failure modes of conventional world models, namely overfitting to training parameters and brittleness under hardware changes.

Comprehensive experiments demonstrate that VDD not only matches or surpasses state-of-the-art baselines in standard settings, but remains consistently performant under varied and adverse conditions. The modularized training objective and policy structure suggest broad applicability to other physical domains where agent-environment dynamics are separable or partially known (Li et al., 2 Dec 2025).
