JEPA-WMs: Joint-Embedding Predictive Models
- JEPA-WMs are self-supervised, reconstruction-free architectures that learn latent representations of dynamic environments via predictive embedding objectives.
- They pair a lightweight predictor with a momentum target encoder; Koopman operator theory explains their ability to discover and cluster dynamical regimes.
- JEPA-WMs extend to continuous-time, multimodal, and spatial domains, offering robust low-dimensional state estimation and effective planning capabilities.
Joint-Embedding Predictive World Models (JEPA-WMs) are a class of self-supervised, reconstruction-free architectures for learning latent representations of dynamic environments through predictive embedding objectives. JEPA-WMs operate by encoding partial or sequential observations into a shared latent space and training a lightweight predictor to map these representations forward in time or across masked/spatial partitions, bypassing pixel-level reconstruction. The formalism enables unsupervised segmentation of regimes, continuous-time state-space learning, multimodal fusion, and robust planning, tying together perspectives from Koopman operator theory, embedding regularization, and dynamical systems.
1. Core Architecture and Predictive Objective
A canonical JEPA-WM comprises three main components: an online encoder $f_\theta$, a predictor $P_\phi$ (often linear), and a target encoder $f_{\bar\theta}$ (a momentum-averaged copy of $f_\theta$). Given observation windows $(x_t, x_{t+k})$, the model computes $z_t = f_\theta(x_t)$, predicts $\hat{z}_{t+k} = P_\phi(z_t)$, and compares against the target embedding $\bar{z}_{t+k} = f_{\bar\theta}(x_{t+k})$. The loss minimized is
$$\mathcal{L}(\theta, \phi) = \mathbb{E}\big[\,\|P_\phi(f_\theta(x_t)) - f_{\bar\theta}(x_{t+k})\|_2^2\,\big],$$
where $\bar\theta \approx \theta$ under the EMA-tracking approximation. This structure is agnostic to the observation modality: pixel frames, LiDAR point clouds, aerial imagery, multimodal geospatial tokens, or stacked image/action pairs (Ruiz-Morales et al., 12 Nov 2025, Zhu et al., 9 Jan 2025, Ulmen et al., 14 Aug 2025, Kenneweg et al., 23 Apr 2025, Lundqvist et al., 25 Feb 2025).
There is no explicit decoder; all learning is confined to the evolution of latent codes, focusing the model capacity on features that are dynamically predictive rather than reconstructively detailed.
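The encoder/predictor/target-encoder loop above can be sketched in a few lines. This is a minimal NumPy illustration, not any paper's implementation: the linear encoder, toy dimensions, and function names are all assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W):
    """Online or target encoder: a single linear map, purely for illustration."""
    return x @ W

# Toy dimensions: observations in R^8, latents in R^3 (hypothetical sizes).
W_online = rng.normal(size=(8, 3)) * 0.1   # online weights (theta)
W_target = W_online.copy()                 # target weights (theta_bar), EMA copy
P = np.eye(3)                              # linear predictor, near identity

def jepa_loss(x_t, x_tk):
    """Predictive embedding loss: compare latents directly, no decoder,
    no pixel-level reconstruction."""
    z_pred = encoder(x_t, W_online) @ P    # P(f_theta(x_t))
    z_targ = encoder(x_tk, W_target)       # f_{theta_bar}(x_{t+k}), held fixed
    return np.mean((z_pred - z_targ) ** 2)

def ema_update(tau=0.99):
    """Momentum target update: theta_bar <- tau*theta_bar + (1-tau)*theta."""
    global W_target
    W_target = tau * W_target + (1 - tau) * W_online

x_t, x_tk = rng.normal(size=(2, 8))
loss = jepa_loss(x_t, x_tk)
ema_update()
```

In a real system the encoder would be a deep network trained by gradient descent on this loss, with gradients blocked through the target branch; only the EMA update touches the target weights.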
2. Theoretical Foundations: Koopman Operator and Invariant Subspaces
The emergent clustering of time-series regimes in JEPA-WMs can be explained by dynamical systems theory, particularly the Koopman operator framework. For a discrete set of $m$ ergodic regimes (mixture distribution $\mu = \sum_{i=1}^{m} w_i \mu_i$, each supported on a set $\Omega_i$), under a linear predictor $P$ and latent dimension $d \ge m$ (where $m$ is the number of regimes), the global minimum of the JEPA loss is achieved when the encoder $f_\theta$ spans the invariant subspace of the $k$-step Koopman operator $\mathcal{K}^k$, namely the span of the regime-indicator eigenfunctions $\{\mathbf{1}_{\Omega_i}\}_{i=1}^{m}$. The principal theorem is:
$$\mathcal{L}(\theta, \phi) = 0$$
if and only if (i) the image of $f_\theta$ lies in $\mathrm{span}\{\mathbf{1}_{\Omega_1}, \dots, \mathbf{1}_{\Omega_m}\}$ and (ii) $P$ acts as the identity on this subspace. Thus, the encoder recovers regime-indicator projections, yielding interpretable clustering of latent representations according to dynamical regime membership (Ruiz-Morales et al., 12 Nov 2025).
This phenomenon is robust to invertible linear transformations of the regime subspace, but a near-identity initialization and light $\ell_2$-regularization on the predictor bias the optimizer towards a disentangled, interpretable basis aligned with the actual regimes.
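The theorem can be checked numerically on a toy two-regime system: if the dynamics never cross regimes, indicator functions of the regime supports are Koopman eigenfunctions with eigenvalue 1, and an identity predictor attains exactly zero JEPA loss. The within-regime map and encoder below are illustrative assumptions, not taken from the cited work.

```python
import numpy as np

# Two ergodic regimes on the real line: regime 0 lives in [0, 1),
# regime 1 in [2, 3). The dynamics rotate within a regime and never cross,
# so the regime indicators are invariant under the Koopman operator.
def step(x):
    base = np.floor(x / 2) * 2               # 0 for regime 0, 2 for regime 1
    return base + (x - base + 0.37) % 1.0    # rotation within the regime

def encoder(x):
    """Regime-indicator encoder: spans the Koopman-invariant subspace."""
    return np.stack([(x < 1.0).astype(float), (x >= 2.0).astype(float)], axis=-1)

P = np.eye(2)  # identity predictor on the invariant subspace

rng = np.random.default_rng(1)
x_t = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
x_tk = step(x_t)

# Regime membership is preserved exactly by the dynamics, so the loss is 0.
loss = np.mean((encoder(x_t) @ P - encoder(x_tk)) ** 2)
```

Any encoder outside this invariant span would incur nonzero loss, which is the mechanism behind the unsupervised regime clustering described above.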
3. Practical Design Principles and Loss Variants
A well-structured JEPA-WM leverages several core recipes:
- Predictor parameterization: Use $P = I + \Delta$, initialize $\Delta = 0$, and optionally add a Frobenius-norm regularizer $\lambda \|\Delta\|_F^2$ to keep $P$ near the identity (Ruiz-Morales et al., 12 Nov 2025).
- Latent-space capacity: Ensure $d \ge m$ to represent all discovered regimes.
- Momentum target encoder: Use a target encoder $f_{\bar\theta}$, updated as $\bar\theta \leftarrow \tau\bar\theta + (1-\tau)\theta$ with momentum $\tau$ close to 1, to stabilize and decorrelate update targets.
- Loss structure: For continuous systems, combine a predictive loss $\mathcal{L}_{\mathrm{pred}}$ with contractive ($\mathcal{L}_{\mathrm{contr}}$) and Lipschitz ($\mathcal{L}_{\mathrm{Lip}}$) regularizers:
$$\mathcal{L} = \mathcal{L}_{\mathrm{pred}} + \lambda_{\mathrm{contr}}\, \mathcal{L}_{\mathrm{contr}} + \lambda_{\mathrm{Lip}}\, \mathcal{L}_{\mathrm{Lip}}.$$
The contractive loss penalizes the Jacobian of the encoder, promoting locally isometric embeddings; the Lipschitz penalty bounds the local Jacobian norm of the transition network (Ulmen et al., 14 Aug 2025).
- Auxiliary tasks: Supplementing the predictive loss with a supervised or task-relevant regression head ($\mathcal{L}_{\mathrm{aux}}$) anchors the representation, enriching it with distinctions that dynamics alone may not encode. The No Unhealthy Representation Collapse theorem guarantees that, if both the transition and auxiliary losses are minimized, no pair of non-equivalent states collapses in the latent space (Yu et al., 12 Sep 2025).
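Two of these recipes are easy to make concrete: the near-identity predictor with its Frobenius penalty, and a contractive penalty estimated from the encoder's Jacobian. The sketch below is a NumPy toy under stated assumptions (tiny dimensions, a finite-difference Jacobian, a tanh encoder); the actual papers use autodiff and deep networks.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy latent dimension

# Near-identity predictor: P = I + Delta, Delta initialized at zero, with a
# Frobenius penalty lam * ||Delta||_F^2 biasing P toward the identity.
Delta = np.zeros((d, d))
lam = 1e-2

def predictor(z):
    return z @ (np.eye(d) + Delta)

def frobenius_penalty():
    return lam * np.sum(Delta ** 2)

def enc(x):
    """Toy nonlinear encoder standing in for a deep network."""
    return np.tanh(0.5 * x[:d])

def contractive_penalty(f, x, eps=1e-4):
    """Finite-difference estimate of ||J_f(x)||_F^2, penalizing the
    encoder Jacobian as in the contractive regularizer."""
    J = np.stack([(f(x + eps * e) - f(x)) / eps for e in np.eye(len(x))])
    return np.sum(J ** 2)

x = rng.normal(size=d)
total_reg = frobenius_penalty() + contractive_penalty(enc, x)
```

At initialization the Frobenius penalty is exactly zero, so only deviations of the predictor from the identity are ever penalized; the contractive term, by contrast, is active from the start.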
4. Extensions to Continuous-Time, Multimodal, and Spatial Domains
JEPA-WMs generalize seamlessly to continuous-time dynamics, multimodal data, and spatial prediction:
- Continuous-time ODEs: The embedding $z(t)$ evolves under a learned neural ODE $\dot{z} = g_\psi(z)$, with the predictor integrating this ODE forward to match future encoded observations. Losses enforce both local contractivity and global Lipschitz constraints, structuring the latent space for robust downstream control (Ulmen et al., 14 Aug 2025).
- Multimodal/masked world modeling: JEPA masking strategies (e.g., BEV grids in LiDAR, tokens for geospatial tiles) remove the need for hand-crafted positives/negatives. Predicting masked targets directly in embedding space both enables uncertain region modeling and prevents augmentation or pretext bias, as in GeoJEPA (Lundqvist et al., 25 Feb 2025) and AD-L-JEPA (Zhu et al., 9 Jan 2025).
- Spatial and temporal planning: Action-conditioned predictors (transformer or MLP) enable sequence-level rollouts for planning, with optimization over latent trajectories (using CEM, NGOpt, or gradient descent), leveraging compact, abstract representations for goal-directed control (Terver et al., 30 Dec 2025, Assran et al., 11 Jun 2025).
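The planning loop in the last bullet can be sketched end-to-end: freeze the encoder and action-conditioned predictor, then optimize an action sequence by the cross-entropy method (CEM) so that the latent rollout lands near an encoded goal. The toy encoder, dynamics, and hyperparameters below are illustrative assumptions, not any paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x):
    """Frozen JEPA encoder (toy stand-in: keep the first two coordinates)."""
    return x[:2]

def predictor(z, a):
    """Action-conditioned latent dynamics (toy linear stand-in)."""
    return z + 0.1 * a

def plan_cem(z0, z_goal, horizon=5, pop=64, elites=8, iters=20):
    """Cross-entropy method over action sequences, scored entirely in
    latent space: roll out, measure distance to goal, refit to elites."""
    mu, sigma = np.zeros((horizon, 2)), np.ones((horizon, 2))
    for _ in range(iters):
        actions = mu + sigma * rng.normal(size=(pop, horizon, 2))
        costs = np.empty(pop)
        for i in range(pop):
            z = z0
            for a in actions[i]:
                z = predictor(z, a)          # latent rollout, no decoding
            costs[i] = np.sum((z - z_goal) ** 2)
        elite = actions[np.argsort(costs)[:elites]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu

x0 = np.array([0.0, 0.0, 1.0])
z_goal = np.array([0.5, -0.3])
plan = plan_cem(encoder(x0), z_goal)
```

Gradient descent or NGOpt can replace CEM in the inner loop; the key property is that planning costs are computed on compact latent states, never on reconstructed observations.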
5. Empirical Properties and Applications
JEPA-WMs exhibit several empirically validated properties:
- Unsupervised regime clustering: Encoders recover regime-indicator coordinates, segmenting time series by underlying dynamics without supervision (Ruiz-Morales et al., 12 Nov 2025).
- Latent disentanglement: Near-identity constraint on the predictor selects an interpretable, axis-aligned latent regime basis, reducing degeneracy in discovered representations.
- Robust low-dimensional state estimation: In classic physical systems (e.g., a pendulum), only a small subset of latent dimensions correlate with the physically meaningful state variables, with the rest absorbing high-order or regularization-induced noise (Ulmen et al., 14 Aug 2025).
- Efficient label transfer and detection: In LiDAR-based tasks, JEPA-WMs pretraining yields higher average precision and better label efficiency compared to generative or contrastive baselines (Zhu et al., 9 Jan 2025).
- Superior planning performance: In complex navigation and manipulation tasks, JEPA-WMs outperform pixel-based and other latent-planner baselines, provided rollout losses and input context are tuned appropriately (Terver et al., 30 Dec 2025, Assran et al., 11 Jun 2025).
6. Limitations, Failure Modes, and Design Considerations
The predictive alignment in JEPA-WMs biases the encoder to learn slow features—variables that remain most stable over prediction intervals. As a consequence, fixed background distractors can dominate, causing the model to ignore more informative but rapidly changing object features (Sobal et al., 2022). Techniques to avoid this include:
- Temporal differencing or feeding optical flow to eliminate stationary nuisance modes.
- Aggressive data augmentation to disrupt static correlations.
- Including auxiliary losses or carefully designed regularizers to anchor task-relevant fast features (Yu et al., 12 Sep 2025, Sobal et al., 2022).
Architectural pitfalls include insufficient encoder capacity ($d < m$, preventing regime separation), non-identity initialization of the predictor (causing basis entanglement), and the absence of joint auxiliary tasks (leading to unwanted collapse).
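The temporal-differencing mitigation from the list above is simple to demonstrate: a static background is a perfect slow feature, but it vanishes identically under frame differencing, leaving only the fast-moving object signal. The frame construction below is a toy assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "video": a static background distractor plus a small object
# that moves one pixel per frame.
T, H = 6, 16
background = rng.normal(size=H)          # constant across time (slow feature)
frames = np.tile(background, (T, 1))
for t in range(T):
    frames[t, t % H] += 5.0              # fast-moving object

# Temporal differencing cancels the stationary nuisance mode exactly,
# so only the object's motion survives in the model's input.
diffs = np.diff(frames, axis=0)
```

Pixels the object never visits are exactly zero after differencing, while the object leaves a clean +5/-5 motion signature, which is why this preprocessing counteracts the slow-feature bias.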
7. Broader Impact and Future Directions
JEPA-WMs unify concepts from dynamical systems, representation learning, and model-based control:
- The link to Koopman operator invariants provides theoretical grounding for unsupervised phase segmentation.
- Action-conditioned and continuous-time variants place JEPA-WMs as central tools for model-based reinforcement learning and robotics.
- Multimodal instantiations extend the paradigm's reach to geospatial, spatial, and sensory-rich environments, with state-of-the-art performance on real-world planning and detection tasks (Terver et al., 30 Dec 2025, Zhu et al., 9 Jan 2025, Lundqvist et al., 25 Feb 2025).
- Future work aims to generalize Koopman-driven clustering to systems without hard regime segmentation, devise scalable architectures for natural video and web-scale data, and robustly disentangle informative features in high-noise, high-heterogeneity domains (Terver et al., 30 Dec 2025, Assran et al., 11 Jun 2025).
In summary, Joint-Embedding Predictive World Models provide a principled, flexible, and theoretically grounded approach for learning low-dimensional, dynamically meaningful representations directly from high-dimensional or multimodal sensory data. Their design leverages self-supervised prediction in embedding space, naturally discovers dynamical regimes, and enables integrable planning and control across diverse tasks and domains (Ruiz-Morales et al., 12 Nov 2025, Ulmen et al., 14 Aug 2025, Zhu et al., 9 Jan 2025, Terver et al., 30 Dec 2025, Sobal et al., 2022).