Sensorimotor World Models: Perception for Action via Inverse Dynamics

Published 18 Jun 2026 in cs.LG and cs.AI | (2606.20104v1)

Abstract: Perception for action suggests that representations of the world should be shaped not by visual fidelity alone, but by their relevance for actions. At the same time, latent JEPA-style world models advocate learning compact predictive states from high-dimensional observations to facilitate the prediction of future states, but end-to-end training of these models is nontrivial because representations may collapse if our only goal is to construct a latent state that is easy to predict. We introduce a sensorimotor world model (SMWM): a latent world model trained end-to-end with inverse dynamics regularization. This single regularizer addresses both issues: it prevents representation collapse and induces action-aligned representations. By forcing latent states to preserve information about the action underlying a transition, it biases the model toward the controllable degrees of freedom of the environment while discarding uncontrollable distractors. This yields stable latent world models trained from offline, reward-free trajectories, without frozen encoders, exponential moving averages, or complex latent regularizers. Empirically, SMWM learns compact, interpretable latent spaces and enables competitive planning performance across simple 2D and 3D control tasks.


Authors (3)

Summary

The paper demonstrates that integrating inverse dynamics in JEPA-style latent models prevents encoder collapse and yields action-aligned representations.
Empirical results show that SMWM recovers intrinsic state dimensions by filtering distractors, reaching up to 84% planning success on the OGBench-Cube 3D manipulation task.
The method provides a minimal anti-collapse mechanism without reconstruction losses, enabling effective control in diverse 2D and 3D environments.

Motivation and Problem Statement

Traditional world models in RL and control predict future observations or states given actions, often relying on pixel-level reconstruction or reward signals. This approach must model high-dimensional observations and retains action-irrelevant detail. JEPA-style latent models instead move prediction into the embedding space, but they are prone to representation collapse: the encoder may degenerate to mapping all observations to a constant embedding, which trivializes dynamics prediction and renders the model useless for control. Existing remedies are elaborate: frozen encoders, auxiliary distributional regularizers (e.g., SIGReg, VICReg), or exponential-moving-average targets. The paper proposes sensorimotor world models (SMWM), a principled approach rooted in perception-for-action: world representations should encode only controllable degrees of freedom and discard action-irrelevant distractors.

Methodology: Inverse Dynamics Regularization

SMWM comprises an encoder $f_\theta$, a forward dynamics model $g_\phi$, and an inverse dynamics model $h_\psi$, all jointly trained from offline, reward-free trajectories containing $(o_t, a_t, o_{t+1})$ tuples. The forward model is supervised with mean-squared error between predicted and true next embeddings, while the inverse model predicts executed actions from consecutive embeddings, also under mean-squared error. The joint objective

$\mathcal{L} = \mathcal{L}_{\text{fwd}} + \lambda\,\mathcal{L}_{\text{inv}}$

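sends both forward and inverse gradients to the encoder, with $\lambda$ weighting the inverse term. A constant embedding would drive $\mathcal{L}_{\text{fwd}}$ to zero but would leave $h_\psi$ unable to tell actions apart, so the inverse term penalizes collapse and induces action-aligned representations. The inverse loss anchors the representation to action-relevant structure without imposing geometric priors on the embedding space.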
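The objective above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the toy data, the hand-written encoder, forward, and inverse functions, and the name `smwm_loss` are all assumptions made for illustration. It only shows why a collapsed encoder scores perfectly on the forward term yet is penalized by the inverse term.

```python
import numpy as np

def smwm_loss(encode, forward, inverse, obs, act, next_obs, lam=1.0):
    """Return (total, L_fwd, L_inv) for a batch of (o_t, a_t, o_{t+1}) tuples."""
    z, z_next = encode(obs), encode(next_obs)
    l_fwd = float(np.mean((forward(z, act) - z_next) ** 2))  # predicted vs. true next embedding
    l_inv = float(np.mean((inverse(z, z_next) - act) ** 2))  # predicted vs. executed action
    return l_fwd + lam * l_inv, l_fwd, l_inv

# Toy data (assumption): a 2-D point observed directly and moved by a small displacement.
rng = np.random.default_rng(0)
obs = rng.uniform(-1.0, 1.0, size=(256, 2))
act = rng.uniform(-0.1, 0.1, size=(256, 2))
next_obs = obs + act

# Hand-written inverse head (assumption): read the action off the latent difference.
inverse = lambda z, z_next: z_next - z

# Informative encoder: identity embedding with a translation forward model.
good = smwm_loss(lambda o: o, lambda z, a: z + a, inverse, obs, act, next_obs)

# Collapsed encoder: every observation maps to the same embedding, so the
# forward model is trivially right, but no inverse head can recover the action.
collapsed = smwm_loss(lambda o: np.zeros_like(o), lambda z, a: z, inverse, obs, act, next_obs)

print("informative (total, fwd, inv):", good)
print("collapsed   (total, fwd, inv):", collapsed)
```

In the collapsed case the forward loss is exactly zero while the inverse loss stays at the mean squared action magnitude, which is the gradient signal that pushes the encoder away from a constant embedding.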

Figure 1: SMWM architecture, showing encoder, forward, and inverse dynamics models, each interacting with the offline dataset.

Empirical Analysis: Latent Structure and Controllability

In controlled experiments using dot-world environments (single or multiple dots with varying numbers of controllable and distractor degrees of freedom), SMWM reliably recovers the intrinsic state dimension:

PCA spectra of learned embeddings drop sharply past the true controllable dimension.
Controllable states are faithfully encoded; uncontrollable distractors are filtered out.
The learned latent space mirrors spatial geometry and maintains neighborhood structure.

Figure 2: Dot-world PCA spectrum and spatially faithful latent encodings; the spectrum drops off sharply after the true intrinsic dimension.

Without the inverse loss, the encoder collapses entirely, confirming that action prediction is what prevents collapse in this setup.
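The PCA-spectrum diagnostic described above can be sketched as follows. This is an illustrative reconstruction rather than the paper's analysis code: the helper names `pca_spectrum` and `effective_dim`, the 99% variance threshold, and the synthetic embeddings (a 2-D controllable state linearly embedded in 16 dimensions with small noise) are assumptions.

```python
import numpy as np

def pca_spectrum(embeddings):
    """Fraction of variance explained by each principal component, descending.

    Assumes the embeddings are not all identical (non-zero total variance).
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def effective_dim(spectrum, threshold=0.99):
    """Smallest number of components whose cumulative variance reaches the threshold."""
    return int(np.searchsorted(np.cumsum(spectrum), threshold) + 1)

# Synthetic stand-in for learned embeddings (assumption): a 2-D controllable
# state mapped into 16 latent dimensions by an orthonormal projection, plus noise.
rng = np.random.default_rng(0)
state = rng.uniform(-1.0, 1.0, size=(1000, 2))
projection = np.linalg.qr(rng.normal(size=(16, 2)))[0].T   # shape (2, 16), orthonormal rows
embeddings = state @ projection + 1e-3 * rng.normal(size=(1000, 16))

spectrum = pca_spectrum(embeddings)
print("leading variance fractions:", np.round(spectrum[:4], 4))
print("effective dimension:", effective_dim(spectrum))
```

A sharp drop after the second component is the kind of signature the dot-world experiments report when the true controllable dimension is two.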

Commutativity and Latent Action Geometry

The encoder and forward model approximately commute: encoding the observation that results from an action approximately coincides with applying the forward model to the current embedding, i.e., $f_\theta(o_{t+1}) \approx g_\phi(f_\theta(o_t), a_t)$. This establishes approximate equivariance, and empirical analysis shows that forward rollouts in latent space track the encoded ground-truth trajectories. Actions act as translations in latent space, as visualized by autoregressive rollouts plotted against the corresponding embeddings.

Figure 3: The encoder and forward model approximately commute; rollouts in latent space track ground-truth embeddings.
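One way to quantify the commutation property is to compare the embedding of the next observation with the forward model's prediction, and to compare an autoregressive latent rollout with the encoded ground-truth trajectory. The sketch below does this on a toy system; the function names, the rotation-based encoder, and the translation-based forward model are assumptions chosen so that the two maps commute exactly, and are not taken from the paper.

```python
import numpy as np

def commutation_error(encode, forward, obs, act, next_obs):
    """Mean squared gap between encode(o_{t+1}) and forward(encode(o_t), a_t)."""
    return float(np.mean((encode(next_obs) - forward(encode(obs), act)) ** 2))

def latent_rollout(forward, z0, actions):
    """Apply the forward model autoregressively; returns an array of shape (T + 1, d)."""
    latents = [z0]
    for a in actions:
        latents.append(forward(latents[-1], a))
    return np.stack(latents)

# Toy system (assumption): observations are 2-D positions, actions are displacements.
angle = 0.7
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
encode = lambda o: o @ rotation              # encoder: a fixed rotation of the position
forward = lambda z, a: z + a @ rotation      # forward model: each action is a latent translation

rng = np.random.default_rng(0)
obs = rng.uniform(-1.0, 1.0, size=(128, 2))
act = rng.uniform(-0.1, 0.1, size=(128, 2))
print("one-step commutation error:", commutation_error(encode, forward, obs, act, obs + act))

# Multi-step check: latent rollout vs. encoding of the true trajectory.
o0 = np.zeros(2)
actions = rng.uniform(-0.1, 0.1, size=(20, 2))
true_traj = np.vstack([o0, o0 + np.cumsum(actions, axis=0)])
rollout = latent_rollout(forward, encode(o0), actions)
print("max rollout deviation:", float(np.max(np.abs(rollout - encode(true_traj)))))
```

With a learned model the error would be small rather than numerically zero, and its growth along the rollout gives a rough picture of compounding error.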

Filtering Uncontrollable Distractors

Across environment variants, the effective latent dimension consistently matches the number of controllable degrees of freedom, regardless of how many distractors are present. The encoder allocates variance to the subspace spanned by the actions and disregards stochastic distractors.

Figure 4: PCA spectra for environments with multiple dots and distractors; latent dimension matches controllable structure.

A further visualization using asymmetric sprite agents with varied action interfaces shows the encoded representation systematically preserves only controlled pose variables, averaging over uncontrolled ones.

Figure 5: Sprite reconstructions with different action interfaces; controlled variables are preserved, uncontrolled ones averaged.

Goal-Conditioned Planning and Downstream Control

Planning is performed by encoding start and goal observations, rolling out the learned forward dynamics model in latent space, and optimizing action sequences with the cross-entropy method (CEM). SMWM matches or outperforms a SIGReg-regularized baseline on diverse environments: 2D navigation, contact-rich manipulation, continuous control, and high-dimensional 3D tabletop manipulation. The largest margin emerges in OGBench-Cube, a complex 3D manipulation task, where SMWM achieves an 84% planning success rate versus SIGReg's 59%.

Figure 6: Planning success rates across environments; SMWM matches or surpasses SIGReg, particularly in 3D tasks.
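The planning loop described above can be sketched with a basic CEM over action sequences. This is a generic textbook variant, not the paper's planner: the squared latent distance to the goal as cost, the diagonal Gaussian sampling distribution, the hyperparameter values, and the toy translation forward model are all assumptions.

```python
import numpy as np

def cem_plan(forward, z_start, z_goal, horizon, action_dim,
             n_samples=512, n_elites=32, n_iters=20, seed=0):
    """Return a (horizon, action_dim) action sequence whose latent rollout ends near z_goal."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences from the current diagonal Gaussian.
        noise = rng.standard_normal((n_samples, horizon, action_dim))
        candidates = mean + std * noise
        # Roll every candidate out in latent space with the forward model.
        z = np.repeat(z_start[None, :], n_samples, axis=0)
        for t in range(horizon):
            z = forward(z, candidates[:, t])
        # Cost (assumption): squared latent distance between final state and goal.
        cost = np.sum((z - z_goal) ** 2, axis=1)
        # Refit the sampling distribution to the lowest-cost candidates.
        elites = candidates[np.argsort(cost)[:n_elites]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6
    return mean

# Toy forward model (assumption): each action translates the latent state.
forward = lambda z, a: z + a
z_start = np.zeros(2)
z_goal = np.array([1.0, -0.5])

plan = cem_plan(forward, z_start, z_goal, horizon=5, action_dim=2)
z_final = z_start + plan.sum(axis=0)
print("final latent distance to goal:", float(np.linalg.norm(z_final - z_goal)))
```

In practice `z_start` and `z_goal` would come from encoding the start and goal observations, and `forward` would be the learned dynamics model applied to a batch of latents.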

Performance remains stable as the goal offset increases: SMWM maintains its success rate across longer rollouts in TwoRoom and OGBench-Cube.

Figure 7: Goal-conditioned planning robustness against horizon.

Latent Geometry and Physical State Probing

Analyzing held-out embeddings and their PCA spectra confirms SMWM's compactness: dominant axes align with controllable state variables (e.g., Cartesian coordinates, angles). Ground-truth physical quantities can be recovered from the embeddings with linear or nonlinear probes. SIGReg baselines, in contrast, spread variance across many principal components and are less interpretable.

Figure 8: SMWM embeddings' PCA spectra and subspace projections; geometry mirrors environment state topology.
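A linear probe of the kind mentioned above can be sketched with ordinary least squares. The probe design, the function name `linear_probe_r2`, and the synthetic embeddings are assumptions for illustration. The toy embedding stores a position linearly and an angle through its cosine and sine, so the position should be linearly recoverable while the raw angle should not.

```python
import numpy as np

def linear_probe_r2(z_train, y_train, z_test, y_test):
    """Fit a least-squares probe with bias and return held-out R^2 per target column."""
    add_bias = lambda z: np.hstack([z, np.ones((z.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(add_bias(z_train), y_train, rcond=None)
    pred = add_bias(z_test) @ weights
    ss_res = np.sum((y_test - pred) ** 2, axis=0)
    ss_tot = np.sum((y_test - y_test.mean(axis=0)) ** 2, axis=0)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for embeddings (assumption): position (x, y) and an angle theta
# stored as (cos theta, sin theta), mixed into 8 latent dimensions by a random matrix.
rng = np.random.default_rng(0)
n = 1000
xy = rng.uniform(-1.0, 1.0, size=(n, 2))
theta = rng.uniform(-np.pi, np.pi, size=(n, 1))
features = np.hstack([xy, np.cos(theta), np.sin(theta)])
embeddings = features @ rng.normal(size=(4, 8))
targets = np.hstack([xy, theta])             # columns: x, y, theta

r2 = linear_probe_r2(embeddings[:500], targets[:500], embeddings[500:], targets[500:])
print("held-out R^2 for x, y, theta:", np.round(r2, 3))
```

The low score for the raw angle illustrates why some quantities call for a nonlinear probe (or a probe on cosine and sine) even when the embedding fully encodes them.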

Practical Implications and Theoretical Insights

SMWM provides a minimal yet effective anti-collapse mechanism with a single hyperparameter (the inverse-loss weight $\lambda$), no distributional priors, and no reconstruction loss. Empirical results indicate better interpretability, compactness, and downstream control performance than the SIGReg baseline in both 2D and 3D environments. The model aligns with causal representation learning and enactive perception principles, organizing latent spaces by controllable structure. Approximate equivariance and latent interventions invite future theoretical investigation into group-structured representations, compositionality, and causal abstraction.

Limitations and Future Work

SMWM assumes actions are recoverable from consecutive observations; failures may arise when action-induced changes are invisible or ambiguous. The current model does not encode velocity or history-dependent quantities; incorporating observation histories and multi-step inverse objectives is a potential extension. Biased behavioral policies may produce distractors that correlate with actions, which the encoder may then retain because they help predict the action. Long-horizon planning remains susceptible to compounding error; more robust data regimes and integration of inverse heads into hierarchical planners are natural future directions.

Conclusion

SMWM demonstrates that inverse dynamics regularization suffices to stabilize and organize JEPA-style latent world models. The learned representations are compact and interpretable; they track controllable degrees of freedom, filter out distractors, and yield competitive planning performance across varied tasks. The work indicates that perception-for-action principles can be effectively instantiated in latent representation learning, advancing both practical world modeling and theoretical understanding of sensorimotor contingencies in intelligent agency.