Cause of degraded action mapping when using latent policy outputs

Determine whether the observed degradation in downstream action mapping performance, when per-particle latent actions are mapped to global actions in the Latent Particle World Model (LPWM), is caused by a train/inference distribution mismatch or by higher noise in the CONTEXT module’s latent policy head relative to the latent inverse dynamics head, and characterize the underlying mechanism of this discrepancy.

Background

In LPWM’s post-hoc imitation learning setup, per-particle latent actions are mapped to environment actions using a compact attention-pooling network. Empirically, the authors observe that training the mapping on latent actions inferred by the inverse dynamics head yields better performance than using latent policy samples.
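The attention-pooling mapping described above can be sketched in a few lines. This is a minimal numpy illustration, not the authors' implementation: the function name, weight matrices, and shapes are assumptions, and a learned query is reduced to a single score projection per particle.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool_map(latent_actions, W_score, W_out):
    """Map per-particle latent actions (N, d) to one global action.

    Hypothetical sketch: each particle's latent is scored, scores are
    softmax-normalised into attention weights over particles, the
    weighted latents are pooled, and the pooled vector is projected
    to the environment action space.
    """
    scores = latent_actions @ W_score                  # (N, 1) per-particle scores
    weights = softmax(scores, axis=0)                  # attention over particles
    pooled = (weights * latent_actions).sum(axis=0)    # (d,) pooled latent
    return pooled @ W_out                              # (action_dim,) global action

# Illustrative dimensions only.
rng = np.random.default_rng(0)
N, d, action_dim = 8, 16, 4
z = rng.normal(size=(N, d))          # per-particle latent actions
W_score = rng.normal(size=(d, 1))
W_out = rng.normal(size=(d, action_dim))
a = attention_pool_map(z, W_score, W_out)
```

In the paper's setup this network would be trained on latent actions inferred by the inverse dynamics head; the question is what happens when `z` instead comes from the latent policy head at inference time.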

They hypothesize that this performance gap may stem from a distribution mismatch between training and inference signals or higher noise in the latent policy predictor, but the precise reason is left unresolved and identified as a question for future investigation.
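The two hypotheses make different first-order predictions about the latent action samples themselves, which suggests a simple diagnostic. The sketch below is a hypothetical starting point, not a procedure from the paper: a systematic mean shift between inverse-dynamics latents and policy latents would point toward distribution mismatch, while an inflated variance at a similar mean would point toward a noisier policy predictor. All names and the synthetic data are assumptions.

```python
import numpy as np

def compare_latent_sources(z_inv, z_pol):
    """Crude per-dimension diagnostic for the two hypotheses.

    z_inv: latents from the inverse dynamics head, shape (M, d).
    z_pol: latents from the policy head, shape (M, d).
    Returns the average absolute mean shift (mismatch signal) and the
    average policy-to-inverse variance ratio (noise signal).
    """
    mean_shift = np.abs(z_inv.mean(axis=0) - z_pol.mean(axis=0)).mean()
    var_ratio = (z_pol.var(axis=0) / (z_inv.var(axis=0) + 1e-8)).mean()
    return {"mean_shift": float(mean_shift), "var_ratio": float(var_ratio)}

# Synthetic illustration: policy latents share the mean but are noisier (2x std).
rng = np.random.default_rng(1)
z_inv = rng.normal(0.0, 1.0, size=(5000, 16))
z_pol = rng.normal(0.0, 2.0, size=(5000, 16))
stats = compare_latent_sources(z_inv, z_pol)
```

Here the mean shift stays near zero while the variance ratio is close to 4, the pattern the "higher noise" hypothesis would predict; a real mismatch would instead show up in the mean (or in higher moments, for which a distributional distance such as KL would be needed).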

References

Notably, we empirically found that directly using the latent policy outputs for mapping degrades downstream performance; the mapping network performs best when evaluated on the outputs of the latent inverse module, as this matches the distribution seen during training. The difference may be due to distribution mismatch or higher noise from the latent policy predictor—a question we leave for future investigation.

Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling (2603.04553 - Daniel et al., 4 Mar 2026), Appendix A.5, Policy Learning with Latent Particle World Models