Joint-Embedding Predictive Architecture

Updated 20 February 2026

JEPA is a self-supervised learning framework that predicts target embeddings from context representations using energy-based modeling.
It pairs a context encoder with a momentum-updated target encoder and a predictor, and uses a divergence measure to sculpt an energy landscape that, in intrinsic-energy variants, is quasimetric.
The approach supports multiple modalities, including images, audio, video, and graphs, and has been applied to tasks such as reinforcement learning and generative modeling.

Joint-Embedding Predictive Architecture (JEPA) refers to a class of self-supervised learning frameworks that eschew reconstruction losses in data space in favor of predictive coding within learned abstract representations. JEPAs are designed to learn energy-based models or compatibility functions between context and target embeddings, with the aim of semantic, compositional, and computationally efficient representation learning. The paradigm has been instantiated across a spectrum of modalities (images, video, audio, graphs, trajectories, vision–language), and further extended with intrinsic energy formulations, hierarchical and contrastive regularization, and auxiliary tasks for structure-preserving encoding.

1. Architectural Foundations and Core Principles

At their essence, JEPAs operate by splitting an object (or state) into two “views” or partitions: a context portion and a target portion. The architecture employs:

A context encoder $f_c$ to obtain latent representation $z_x=f_c(x)$ for the context;
A target encoder $f_a$ (often an EMA or momentum copy of $f_c$ ) for the target $z_y=f_a(y)$ ;
A predictor $p_\theta$ mapping $z_x$ (and possibly side-information $c$ ) to a predicted target embedding $\hat z_y = p_\theta(z_x; c)$ ;
A comparator/divergence $D(\cdot, \cdot)$ (typically $\ell_2$ or SmoothL1) comparing $\hat z_y$ and $z_y$ .

The architecture induces an energy function: $E(x, y) = D(\hat z_y, z_y)$ where low energy signals that $x$ “predicts” $y$ well. Training sculpts an energy landscape $E:X\times X \to \mathbb{R}_+$ over the underlying state or input space. This formulation is modality-agnostic: it has been instantiated for images (I-JEPA (Assran et al., 2023)), audio (A-JEPA (Fei et al., 2023), Audio-JEPA (Tuncay et al., 25 Jun 2025)), video, graphs (Graph-JEPA (Skenderi et al., 2023)), trajectories (T-JEPA (Li et al., 2024)), and multimodal problems (TI-JEPA (Vo et al., 9 Mar 2025), VL-JEPA (Chen et al., 11 Dec 2025)).
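The components above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation of any cited model: the one-layer `tanh` encoder, the linear predictor, the squared $\ell_2$ divergence, the momentum value `tau=0.99`, and all function names are hypothetical choices made here for brevity, and no gradient step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_encoder(d_in, d_lat):
    # Hypothetical one-layer encoder: a weight matrix plus a bias.
    return {"W": rng.normal(0.0, 0.1, size=(d_in, d_lat)), "b": np.zeros(d_lat)}

def encode(params, x):
    # tanh keeps the toy embeddings bounded.
    return np.tanh(x @ params["W"] + params["b"])

def predict(pred_params, z_x):
    # Linear predictor p_theta acting on the context embedding.
    return z_x @ pred_params["W"] + pred_params["b"]

def energy(ctx_params, tgt_params, pred_params, x, y):
    # E(x, y) = D(z_hat_y, z_y), with D the squared l2 distance (one value per pair).
    z_hat_y = predict(pred_params, encode(ctx_params, x))
    z_y = encode(tgt_params, y)  # target branch: treated as a constant during training
    return np.sum((z_hat_y - z_y) ** 2, axis=-1)

def ema_update(tgt_params, ctx_params, tau=0.99):
    # Momentum (EMA) update of the target encoder toward the context encoder.
    return {k: tau * tgt_params[k] + (1.0 - tau) * ctx_params[k] for k in tgt_params}

d_in, d_lat = 8, 4
f_c = init_encoder(d_in, d_lat)
f_a = {k: v.copy() for k, v in f_c.items()}  # target starts as a copy of f_c
p_theta = {"W": np.eye(d_lat), "b": np.zeros(d_lat)}

x = rng.normal(size=(5, d_in))  # context views
y = rng.normal(size=(5, d_in))  # target views
E = energy(f_c, f_a, p_theta, x, y)  # shape (5,), non-negative

# After each optimizer step on f_c and p_theta (omitted), the target would be refreshed:
f_a = ema_update(f_a, f_c)
```

In a real training loop only `f_c` and `p_theta` would receive gradients from `E`; the target encoder changes solely through `ema_update`, which is one of the asymmetries used to discourage collapse.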

For certain JEPA classes, particularly those using intrinsic (least-action) energies, the energy function is not arbitrary but arises from minimizing an accumulated local cost along continuous (or graph-theoretic) trajectories between states. This yields an energy with compositional and often asymmetric structure (Kobanda et al., 12 Feb 2026).

2. Mathematical Underpinnings: Intrinsic Energies and Quasimetric Structure

An important theoretical advance is the connection of JEPA-induced energies to the geometry of quasimetric spaces. Attention is restricted to intrinsic energy functions, those which correspond to the infimum of accumulated local effort along admissible trajectories between $x$ and $y$:

$E(x, y) = \inf_{\gamma \in \Gamma(x, y)} \int_0^1 L(\gamma(t), \dot\gamma(t))\,dt$

where $\Gamma(x, y)$ is the set of admissible trajectories from $x$ to $y$ and $L$ is a non-negative, coercive local effort function (Kobanda et al., 12 Feb 2026, Section 2). Under mild closure and additivity assumptions (concatenation of paths and additive actions), such $E(x,y)$ satisfies the axioms of a quasimetric:

Non-negativity and $E(x, x)=0$ (reflexivity)
Identity of indiscernibles ( $E(x,y)=0\implies x=y$ ) given coercivity
Triangle inequality: $E(x, z) \leq E(x, y) + E(y, z)$
Permitted asymmetry: symmetry is not required, so $E(x, y)\neq E(y, x)$ can hold when the task or domain is direction-sensitive.

This construction is critical in domains with irreversible or one-way reachability. Symmetric energy functions cannot distinguish directionality in reachability graphs or dynamic processes. For example, in navigation with obstacles, $E(a, b)<\infty$ may hold while $E(b, a)$ is infinite, encoding directed accessibility (Kobanda et al., 12 Feb 2026).
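A graph-theoretic toy case makes the quasimetric properties concrete. The sketch below assumes a small directed graph with hypothetical edge costs and computes the intrinsic energy as the all-pairs least accumulated cost using the Floyd-Warshall algorithm; the function name `intrinsic_energy` and the example graph are illustrative only.

```python
import numpy as np

def intrinsic_energy(cost):
    """All-pairs least accumulated cost (Floyd-Warshall) on a directed graph.

    cost[i, j] is the non-negative local effort of the edge i -> j,
    or np.inf when that edge does not exist.
    """
    E = cost.astype(float).copy()
    np.fill_diagonal(E, 0.0)  # staying in place costs nothing
    for k in range(E.shape[0]):
        # Allow node k as an intermediate waypoint on every path.
        E = np.minimum(E, E[:, [k]] + E[[k], :])
    return E

inf = np.inf
# Hypothetical graph: cycle 0 -> 1 -> 2 -> 0, plus a one-way edge 0 -> 3 (no way out of 3).
cost = np.array([
    [inf, 1.0, inf, 1.0],
    [inf, inf, 2.0, inf],
    [5.0, inf, inf, inf],
    [inf, inf, inf, inf],
])
E = intrinsic_energy(cost)

print(E[0, 2], E[2, 0])  # 3.0 5.0  (finite both ways, but asymmetric)
print(E[0, 3], E[3, 0])  # 1.0 inf  (one-way reachability)

# Triangle inequality E(x, z) <= E(x, y) + E(y, z) for all triples.
triangle_ok = np.all(E[:, None, :] <= E[:, :, None] + E[None, :, :])
```

A symmetric distance would have to assign the same value to `E[0, 3]` and `E[3, 0]`, which erases the fact that node 3 is a dead end.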

3. Connections to Reinforcement Learning and Value Geometry

The intrinsic-energy JEPA bridges self-supervised representation learning and goal-conditioned control by aligning with the value-theoretic structures of reinforcement learning. In goal-conditioned RL, the cost-to-go

$V^*(x, g) = \inf_{\gamma:\;x \rightarrow g} \int_0^1 c(\gamma(t), u(t))\,dt$

is precisely an intrinsic energy. Quasimetric RL (QRL) observes that the cost-to-go $d^*(x, g)=V^*(x, g)$ is a quasimetric, supporting propagation of value constraints across long horizons. Training a JEPA to model such intrinsic energies positions it within the quasimetric hypothesis class, leading to representations directly aligned with the cost-to-go geometry needed for planning and control (Kobanda et al., 12 Feb 2026).
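The triangle inequality for the cost-to-go can be sketched in one line, under the same assumptions as above (admissible trajectories can be concatenated and cost accumulates additively). The intermediate state $w$, the sub-trajectories $\gamma_1$, $\gamma_2$, and the shorthand $C(\gamma)$ for the accumulated cost of a trajectory are notation introduced here for illustration. Because trajectories from $x$ to $g$ that pass through $w$ form a subset of all trajectories from $x$ to $g$, the infimum over the full set can only be smaller:

$V^*(x, g) \;\le\; \inf_{\gamma_1:\,x \to w} C(\gamma_1) + \inf_{\gamma_2:\,w \to g} C(\gamma_2) \;=\; V^*(x, w) + V^*(w, g)$

This is the property that lets local value constraints compose into long-horizon ones: a bound on each leg of a route yields a bound on the whole route.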

4. Practical Instantiations Across Domains

Vision (I-JEPA, C-JEPA, EC-IJEPA): In image-based models, context and target blocks are sampled from images, and prediction is performed in representation space. The architecture typically combines a ViT backbone, masking strategies that tune the scale and spatial distribution of context and target regions, and EMA-updated target encoders for stability (Assran et al., 2023, Mo et al., 2024, Littwin et al., 2024). Regularization (variance, invariance, covariance) prevents collapse and promotes diverse, decorrelated features (Mo et al., 2024).
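To illustrate how variance and covariance regularization discourage collapse, the following is a minimal VICReg-style sketch. The hinge target `gamma=1.0`, the stabilizer `eps`, the function names, and the synthetic batches are assumptions made here; the weighting and exact form used in the cited work are not reproduced.

```python
import numpy as np

def variance_term(z, gamma=1.0, eps=1e-4):
    # Hinge on the per-dimension standard deviation:
    # dimensions whose std falls below gamma are penalized.
    std = np.sqrt(z.var(axis=0) + eps)
    return np.mean(np.maximum(0.0, gamma - std))

def covariance_term(z):
    # Sum of squared off-diagonal covariance entries, scaled by the embedding dimension.
    n, d = z.shape
    zc = z - z.mean(axis=0)
    cov = zc.T @ zc / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return np.sum(off_diag ** 2) / d

rng = np.random.default_rng(0)
healthy = rng.normal(size=(256, 16))                 # spread-out, roughly decorrelated batch
collapsed = np.ones((256, 16))                       # every sample maps to the same point
redundant = np.repeat(healthy[:, :1], 16, axis=1)    # all dimensions carry the same feature

var_healthy, var_collapsed = variance_term(healthy), variance_term(collapsed)
cov_healthy, cov_redundant = covariance_term(healthy), covariance_term(redundant)
```

The variance term is large for the fully collapsed batch and near zero for the healthy one, while the covariance term flags the redundant batch whose dimensions all encode the same feature. The invariance term is simply the prediction loss between the two branches and is omitted here.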

Graphs (Graph-JEPA): Context and target subgraphs are encoded and predicted within a hyperbolic coordinate system to induce implicit hierarchical structure, enhancing expressiveness for graph-level tasks (Skenderi et al., 2023).

Multimodal and Sequential Data: The paradigm supports cross-modal energy-based alignment (TI-JEPA (Vo et al., 9 Mar 2025), VL-JEPA (Chen et al., 11 Dec 2025)) and can be extended to trajectories (T-JEPA (Li et al., 2024)), where masked prediction is performed over variable-length sub-trajectories, and to policies with joint prediction of future observation and action sequences (ACT-JEPA (Vujinovic et al., 24 Jan 2025)).

Generative Modeling (D-JEPA): JEPA has been unified with diffusion and flow-matching losses to produce state-of-the-art generative models for continuous data. The key procedure couples standard latent prediction with a per-token denoising (diffusion or flow-matching) loss, enabling autoregressive, set-wise generation in pixel, audio, or video space (Chen et al., 2024).

5. Theoretical and Empirical Properties

Implicit Bias to Predictive Features: JEPA, especially in deep linear regimes, exhibits a strong implicit bias toward “high-influence” features—those with large regression coefficients—rather than mere high variance as in generative autoencoding. The learning dynamics systematically prioritize features that are both predictable and discriminative, providing resistance to noise and uninterpretable variability (Littwin et al., 2024).
Collapse Prevention and Regularization: An asymmetric architecture (context-predictor-target division), momentum encoders, and explicit regularizers (variance, invariance, covariance) are critical for averting representational collapse and ensuring meaningful, non-trivial equilibria (Mo et al., 2024, Littwin et al., 2024).
Auxiliary Tasks and Disentanglement: Coupling JEPA objectives with auxiliary heads (e.g., reward regression, external phenomena) anchors the latent space, ensuring the encoding of precisely those distinctions essential for downstream utility. Theoretical results show that such auxiliary supervision guarantees no “unhealthy representation collapse” (distinct states forced together) in deterministic settings (Yu et al., 12 Sep 2025).
Connection to Koopman-Invariant Subspaces: In time-series and dynamical regimes, JEPA's objective aligns with the invariant subspace of the Koopman operator, with a linear predictor regularized toward identity inducing regime indicator functions and interpretable clustering by dynamical type (Ruiz-Morales et al., 12 Nov 2025).
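The idea of a linear predictor regularized toward the identity can be illustrated with a closed-form toy problem. This is a simplified sketch and not the method of the cited paper: the embeddings are held fixed (in a JEPA the encoder is trained jointly), and the ridge-style objective, the weight `lam`, and the function name are assumptions introduced here.

```python
import numpy as np

def fit_near_identity_predictor(z_t, z_next, lam=1.0):
    """Closed-form minimizer of ||z_next - z_t W^T||_F^2 + lam * ||W - I||_F^2.

    Rows of z_t and z_next are embeddings at consecutive time steps.
    Setting the gradient to zero gives W (Z^T Z + lam I) = Z'^T Z + lam I.
    """
    d = z_t.shape[1]
    eye = np.eye(d)
    a = z_next.T @ z_t + lam * eye
    b = z_t.T @ z_t + lam * eye  # symmetric positive definite for lam > 0
    return np.linalg.solve(b, a.T).T  # equals a @ inv(b)

rng = np.random.default_rng(0)
z_t = rng.normal(size=(100, 2))

theta = 0.3
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # hypothetical latent dynamics (a rotation)
z_next = z_t @ A.T

W_weak = fit_near_identity_predictor(z_t, z_next, lam=1e-9)    # recovers the dynamics A
W_strong = fit_near_identity_predictor(z_t, z_next, lam=1e9)   # pinned close to identity
W_static = fit_near_identity_predictor(z_t, z_t, lam=1.0)      # unchanged embeddings: identity
```

With weak regularization the predictor recovers the latent dynamics; with strong regularization it stays near the identity, so in a jointly trained model the burden shifts to the encoder, which must find coordinates that barely change from one step to the next. That is the sense in which such a constraint favors slowly varying, regime-like features.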

6. Application Domains and Empirical Results

JEPAs have demonstrated broad and robust empirical performance, including:

ImageNet transfer: I-JEPA, C-JEPA, and EC-IJEPA perform on par with or better than MAE and contrastive baselines, with additional benefits in sample efficiency, OOD transfer, and insensitivity to masking hyperparameters (Assran et al., 2023, Mo et al., 2024, Littwin et al., 2024).
Graph-level learning: Graph-JEPA outperforms contrastive and generative graph SSL frameworks in accuracy, regression, and efficiency (Skenderi et al., 2023).
Trajectory similarity: T-JEPA and HiT-JEPA match or exceed state-of-the-art performance on trajectory search and robustness benchmarks through hierarchical, multi-scale masking and joint-layer prediction (Li et al., 2024, Li et al., 17 Jun 2025).
Audio and Multimodal: A-JEPA, Audio-JEPA, TI-JEPA, and VL-JEPA report competitive or superior results vs. supervised and contrastive baselines on sound event, sentiment, and vision-language benchmarks (Fei et al., 2023, Tuncay et al., 25 Jun 2025, Vo et al., 9 Mar 2025, Chen et al., 11 Dec 2025).
Generative models: D-JEPA achieves best-in-class FID on ImageNet conditional generation and scales efficiently in video, text-to-audio, and multimodal settings (Chen et al., 2024).

7. Limitations, Controversies, and Future Directions

Representation Collapse: Despite regularization, JEPAs may collapse in certain pathological regimes (e.g., fixed background distractors) (Sobal et al., 2022). Theoretical analyses highlight that without architectural or loss-based remedies (hierarchical prediction, auxiliary tasks), JEPA can discard all variable content in favor of constant, predictable features.
Symmetry vs Asymmetry: As evidenced theoretically and practically, symmetric energy functions preclude modeling one-way reachability or irreversibility. This has implications for the design of world models in control and planning: asymmetric, quasimetric compatibility scores are not merely a modeling choice but a necessity in directed environments (Kobanda et al., 12 Feb 2026).
Role of Predictor Architecture: The choice and constraints on the predictor (e.g., linearity, proximity to identity) directly impact interpretability, as in regime-indicator learning, and disentanglement of latent variables (Ruiz-Morales et al., 12 Nov 2025).
Further Extensions: Open problems include extending the framework to stochastic/partially observable environments, developing principled hierarchical JEPAs, integrating JEPA with end-to-end reinforcement learning, and leveraging multimodal or multi-task setups for more abstracted, modular representations.

The Joint-Embedding Predictive Architecture paradigm defines a flexible and compositional approach to self-supervised representation learning, distinguished by its latent predictive coding, compatibility energy landscapes, and theoretical alignment with directed distance geometry and control. As the field advances, JEPAs continue to unify energy-based learning, structured prediction, and geometric inductive biases across increasingly diverse modalities and applications (Kobanda et al., 12 Feb 2026, Assran et al., 2023, Mo et al., 2024, Littwin et al., 2024, Vo et al., 9 Mar 2025, Chen et al., 11 Dec 2025).