Energy-Aware JEPA: Joint Embedding Predictive Architecture

Updated 22 May 2026

Energy-Aware JEPA is a self-supervised framework that learns structured, modular representations by predicting target embeddings and minimizing an energy-based loss.
It employs dual encoders and a predictor to capture abstract, transferable features without reconstructing every low-level detail across diverse data modalities.
The framework enforces physically meaningful constraints, linking to optimal control and quasimetric structures, which enhances its effectiveness in goal-conditioned prediction and transfer learning.

An Energy-Aware Joint Embedding Predictive Architecture (JEPA) is a self-supervised framework that learns structured, modular representations in a latent space by predicting target embeddings from context embeddings and minimizing an energy-based compatibility function. The architecture is applicable across diverse domains—including images, text, video, action-conditioned agents, and physics—by exploiting explicit or implicit energy signals for alignment, regularization, and goal-conditioned prediction. This paradigm departs from pixel- or instance-level generative modeling, enabling the capture of abstract, transferable features without reconstructing every low-level detail. The energy-based viewpoint endows JEPA with theoretical connections to optimal control, quasimetric structure, and goal-reaching via least-action or intrinsic energy minimization.

1. Core Principles and Formalism

The fundamental elements of a JEPA are two parametric encoders (context and target), a predictor, and an explicit latent-space distance (energy) function. Given data points $x, y \in X$, the context encoder $f_o: X \to \mathbb{R}^d$ and target encoder $f_a: X \to \mathbb{R}^d$ map inputs to latent embeddings $z_c, z_t$, typically in a high-dimensional space. The predictor $p: \mathbb{R}^d \to \mathbb{R}^d$ maps the context embedding to a prediction of the target embedding:

$z_c = f_o(x), \quad z_t = f_a(y), \quad \hat{z}_t = p(z_c)$

The induced scalar energy is

$E(x, y) = D(\hat{z}_t, z_t)$

where $D$ is a latent-space comparator; standard choices include the squared Euclidean norm or SmoothL1 loss.

The JEPA is trained to minimize the expected energy for compatible pairs $(x, y)$, analogously to minimizing a self-supervised prediction error:

$\mathcal{L}_\text{JEPA} = \mathbb{E}_{(x, y) \sim \mathrm{data}}\; D(p(f_o(x)), f_a(y))$

This structure allows JEPAs to operate without decoders, contrastive negatives, or pixel-wise metrics, making them scalable and robust across modalities (Terver et al., 3 Feb 2026, Vo et al., 9 Mar 2025, Bardhan et al., 6 Feb 2025).
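The energy and loss defined above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation of any cited model: the encoders and predictor are random linear maps standing in for trained networks, the dimensions `d_in` and `d` are arbitrary, and the SmoothL1 threshold `beta` follows the common piecewise definition.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d = 8, 4  # hypothetical input and latent dimensions

# Random stand-ins for the context encoder f_o, target encoder f_a, predictor p.
W_o = rng.normal(size=(d_in, d))
W_a = rng.normal(size=(d_in, d))
W_p = rng.normal(size=(d, d))

def f_o(x):
    return np.tanh(x @ W_o)

def f_a(y):
    return np.tanh(y @ W_a)

def p(z_c):
    return z_c @ W_p

def squared_l2(z_hat, z):
    """Squared Euclidean comparator, summed over latent dimensions."""
    return np.sum((z_hat - z) ** 2, axis=-1)

def smooth_l1(z_hat, z, beta=1.0):
    """SmoothL1 comparator: quadratic below beta, linear above (summed here)."""
    diff = np.abs(z_hat - z)
    per_dim = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return np.sum(per_dim, axis=-1)

def energy(x, y, D=squared_l2):
    """E(x, y) = D(p(f_o(x)), f_a(y)), one scalar per pair."""
    return D(p(f_o(x)), f_a(y))

def jepa_loss(xs, ys, D=squared_l2):
    """Monte Carlo estimate of the expected energy over a batch of pairs."""
    return float(np.mean(energy(xs, ys, D)))

# Compatible pairs: y is a lightly perturbed view of x (an assumption).
xs = rng.normal(size=(16, d_in))
ys = xs + 0.1 * rng.normal(size=(16, d_in))
loss = jepa_loss(xs, ys)
```

In a real system the three maps would be trained jointly to lower `loss`, with a regularizer or a slowly updated target encoder to avoid collapse.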

2. Energy Awareness and Intrinsic Energy

JEPA can be rendered "energy-aware" by explicitly incorporating domain-specific energy signals, so that latent representations reflect physically or semantically meaningful quantities. For example, in high-energy physics, the HEP-JEPA model incorporates per-particle energies and normalized energy ratios into both the input features and the transformer attention biases. This normalization keeps features at O(1) and robust in high-energy collider domains, while attention is biased by kinematic quantities such as pairwise mass (Bardhan et al., 6 Feb 2025). In multimodal tasks, the energy function is the squared L2 error between predicted and true target embeddings for each masked region, which plays the role of a cross-modal compatibility metric (Vo et al., 9 Mar 2025).

A principled extension is the intrinsic (least-action) energy: for a general state space $\mathcal{S}$, the energy between two points is the infimum of the accumulated effort over all admissible trajectories (paths) connecting them:

$E(x, y) = \inf_{\gamma \in \Gamma(x, y)} \int L(\gamma(t), \dot{\gamma}(t))\, dt$

where $L$ is a continuous, coercive local cost or Lagrangian, and $\Gamma(x, y)$ denotes all admissible paths from $x$ to $y$. This intrinsic energy forms a quasimetric on $\mathcal{S}$, under closure and additivity assumptions (Kobanda et al., 12 Feb 2026).
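A discrete analogue may help make the least-action definition concrete. In the sketch below the state space is a small directed graph, the local cost is an edge weight, and the intrinsic energy is the cheapest accumulated cost over all directed paths, computed with Dijkstra's algorithm. The graph, its node names, and the function `intrinsic_energy` are hypothetical and chosen only for illustration. The continuous formulation in the cited work is not reproduced here.

```python
import heapq
import math

def intrinsic_energy(edges, source, goal):
    """Least accumulated cost over all directed paths from source to goal.

    edges maps a node to a list of (neighbor, nonnegative local cost).
    Returns math.inf when the goal cannot be reached.
    """
    best = {source: 0.0}
    frontier = [(0.0, source)]
    while frontier:
        cost, node = heapq.heappop(frontier)
        if node == goal:
            return cost
        if cost > best.get(node, math.inf):
            continue  # stale queue entry, a cheaper route was already found
        for nxt, step in edges.get(node, []):
            new_cost = cost + step
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                heapq.heappush(frontier, (new_cost, nxt))
    return math.inf

# Hypothetical three-state system: moving "downhill" is cheap, "uphill" costly.
edges = {
    "top": [("mid", 1.0)],
    "mid": [("bottom", 1.0), ("top", 5.0)],
    "bottom": [("mid", 5.0)],
}
down = intrinsic_energy(edges, "top", "bottom")  # 1.0 + 1.0
up = intrinsic_energy(edges, "bottom", "top")    # 5.0 + 5.0
```

Because `down` and `up` differ, the resulting energy is asymmetric, which is the property that separates a quasimetric from a metric.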

3. Architectural Instantiations Across Domains

JEPAs have been applied in a variety of domains, each leveraging different modalities and regularization strategies:

Model	Input Modalities	Prediction Target	Energy Function
TI-JEPA	Image (ViT), Text (BERT++)	Masked image patches	L2 error
EB-JEPA	Image, Video, Control	Augmented, temporal, or action-predicted latents	L2 error + reg.
HEP-JEPA	Collimated jets (particles)	Masked/sampled jet patches	SmoothL1 error

TI-JEPA uses a dual encoder (ViT-H for images, Transformer++ for text) and stacked cross-attention modules (Small/Medium/Large) to align image-text pairs. Target masking selects random blocks, and the squared error between predicted and target embeddings forms the energy minimized in pretraining. The pretrained model achieves state-of-the-art results on sentiment benchmarks, outperforming CLIP-CA-CG and similar architectures (Vo et al., 9 Mar 2025).
EB-JEPA generalizes JEPA to static images (view invariance), videos (temporal prediction), and control (action-conditioned state prediction) using modular PyTorch components. Key regularization techniques include VICReg and SIGReg to prevent representation collapse, and the architecture captures view-invariance, temporal coherence, and action-responsivity in learned representations (Terver et al., 3 Feb 2026).
HEP-JEPA adapts the framework to jet physics, utilizing energy-aware features, transformer encoders with physics-informed attention, and SmoothL1 energy losses on masked jet patches. Register tokens and attention biases specifically encode physics priors, yielding robust performance on JetClass and other collider benchmarks (Bardhan et al., 6 Feb 2025).

4. Self-Supervised Training, Masking, and Regularization

JEPA training is typically self-supervised: for each data point, a subset of the signal is selected (“context”), and a disjoint subset (“target”) is masked and predicted. In TI-JEPA and HEP-JEPA, images or jets are divided into patches, with blocks masked at empirically optimized scales (context: 85–100%, target: 15–20%). The predictor is supplied with context embeddings and mask tokens and outputs predictions for the masked target embeddings.
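One possible reading of this masking scheme is sketched below for a square grid of patches. The grid size, the use of square blocks, and the helper names `sample_block` and `sample_masks` are assumptions made for illustration. The cited models may sample several blocks with varying aspect ratios. The scale ranges follow the figures quoted above.

```python
import math
import numpy as np

def sample_block(grid, scale, rng):
    """Return flat indices of a random square block covering ~scale of the grid."""
    side = max(1, int(round(grid * math.sqrt(scale))))
    top = int(rng.integers(0, grid - side + 1))
    left = int(rng.integers(0, grid - side + 1))
    rows = np.arange(top, top + side)
    cols = np.arange(left, left + side)
    flat = (rows[:, None] * grid + cols[None, :]).ravel()
    return set(flat.tolist())

def sample_masks(grid=14, rng=None):
    """Sample a target block, then a context block with the target removed."""
    if rng is None:
        rng = np.random.default_rng()
    target = sample_block(grid, rng.uniform(0.15, 0.20), rng)
    # Removing target patches keeps context and target disjoint.
    context = sample_block(grid, rng.uniform(0.85, 1.00), rng) - target
    return sorted(context), sorted(target)

context, target = sample_masks(grid=14, rng=np.random.default_rng(0))
```

The context indices select the patches fed to the context encoder, while the target indices mark the positions where mask tokens are placed and where the predictor's outputs are compared against target embeddings.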

Unlike contrastive learning, JEPAs avoid negative sample mining or explicit contrastive loss. To prevent trivial (collapsed) solutions, various regularizers are used:

Variance and covariance regularization (VICReg): encourage feature spread and decorrelation in latent space (Terver et al., 3 Feb 2026).
Gaussianity-based regularization (SIGReg): test for non-degenerate representations along random projections (Terver et al., 3 Feb 2026).
Temporal smoothness and inverse dynamics for video/control domains (Terver et al., 3 Feb 2026).
Momentum encoder averaging and separation of context/target weights (HEP-JEPA) to maintain representation diversity (Bardhan et al., 6 Feb 2025).

Empirically, ablations confirm that regularization is essential: removing regularizers can decrease linear probing accuracy by 3–4 points or even cause collapse in action-conditioned tasks.
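The variance and covariance terms named in the list above can be written compactly. The sketch below follows the general form of the published VICReg regularizer (a hinge on per-dimension standard deviation, plus a penalty on off-diagonal covariance). The threshold `gamma`, the constant `eps`, and the test batches are illustrative assumptions, not values taken from the cited JEPA implementations.

```python
import numpy as np

def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge penalty that grows when a latent dimension's std falls below gamma."""
    std = np.sqrt(z.var(axis=0, ddof=1) + eps)
    return float(np.mean(np.maximum(0.0, gamma - std)))

def covariance_term(z):
    """Sum of squared off-diagonal covariances, scaled by the latent dimension."""
    n, d = z.shape
    centered = z - z.mean(axis=0)
    cov = centered.T @ centered / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2) / d)

rng = np.random.default_rng(0)
spread = rng.normal(size=(256, 8))                  # well-spread embeddings
collapsed = np.ones((256, 8))                       # every sample identical
redundant = np.repeat(spread[:, :1], 8, axis=1)     # all dimensions identical

v_spread, v_collapsed = variance_term(spread), variance_term(collapsed)
c_spread, c_redundant = covariance_term(spread), covariance_term(redundant)
```

A collapsed batch is penalized heavily by the variance term, and a batch whose dimensions all carry the same signal is penalized by the covariance term. Adding both to the prediction energy discourages the trivial solutions described above.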

5. Theoretical Structure: Quasimetrics and Goal-Conditioned Control

JEPA's induced energy landscape has a rigorous structure. For intrinsic-energy JEPAs, the learned energy function satisfies the axioms of a quasimetric:

Reflexivity: $E(x, x) = 0$.
Nonnegativity: $E(x, y) \geq 0$.
Identity of indiscernibles: $E(x, y) = 0 \Rightarrow x = y$.
Triangle inequality: $E(x, z) \leq E(x, y) + E(y, z)$.

If the energy is derived from least-action principles (sum/integral of local costs), this structure corresponds exactly to the optimal cost-to-go in goal-conditioned reinforcement learning. No symmetric (metric) energy function can model fundamentally one-way reachability: a finite symmetric energy implies that the reachability relation is itself symmetric. Asymmetric (quasimetric) energies are therefore required in domains with irreversibility or directionality (e.g., control with friction, dissipative dynamics) (Kobanda et al., 12 Feb 2026).
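The quasimetric axioms and the reachability argument can be checked numerically on a finite set of states. The sketch below assumes the energy is given as a square matrix `E[i][j]`, with `math.inf` marking an unreachable goal. The function name `is_quasimetric` and the two-state "one-way door" example are hypothetical.

```python
import itertools
import math

def is_quasimetric(E, tol=1e-9):
    """Check zero diagonal, positivity off the diagonal, and the triangle inequality."""
    n = len(E)
    for i in range(n):
        if abs(E[i][i]) > tol:
            return False  # E(x, x) must be 0
        for j in range(n):
            if i != j and E[i][j] <= tol:
                return False  # distinct states must have positive energy
    for i, j, k in itertools.product(range(n), repeat=3):
        if E[i][k] > E[i][j] + E[j][k] + tol:
            return False  # a detour through j may never be cheaper
    return True

# One-way door: state 0 reaches state 1 at cost 1, but there is no way back.
one_way = [[0.0, 1.0],
           [math.inf, 0.0]]

# Symmetrizing by the minimum invents a return path that does not exist.
sym_min = [[min(one_way[i][j], one_way[j][i]) for j in range(2)] for i in range(2)]
# Symmetrizing by the maximum erases the forward path that does exist.
sym_max = [[max(one_way[i][j], one_way[j][i]) for j in range(2)] for i in range(2)]

valid = is_quasimetric(one_way)
fake_return = math.isfinite(sym_min[1][0])
lost_forward = not math.isfinite(sym_max[0][1])
```

Either symmetrization misreports reachability in one direction, which illustrates why a symmetric energy cannot represent this system.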

6. Downstream Performance and Task Transfer

JEPA models pretrained on unlabelled data yield strong, sample-efficient transferable representations, as measured by linear probing or fine-tuning on task-specific heads:

TI-JEPA: Achieves SOTA on MVSA-Single and MVSA-Multi sentiment analysis, with TI-JEPA-Large attaining 76.75%/74.62% Acc/F1 (Single) and 77.55%/75.02% (Multi), surpassing CLIP-CA-CG and ITIN (Vo et al., 9 Mar 2025).
HEP-JEPA: In JetClass, fine-tuned accuracy at 0.05% labels improves from 0.505 (from scratch) to 0.564 with JEPA pretraining; in top quark tagging and quark/gluon discrimination, performance is competitive with best-in-class models (ParticleNet, ParT). Key gains are large few-shot improvements and faster convergence (Bardhan et al., 6 Feb 2025).
EB-JEPA: Linear accuracy of 91% on CIFAR-10 with SIGReg regularization, multi-step video rollouts on Moving MNIST, and 97% planning success in world model navigation, with ablations underscoring the necessity of each regularizer component (Terver et al., 3 Feb 2026).

This evidence suggests that energy-based JEPA pretraining induces generalizable, robust features suited to diverse tasks, with transfer performance that is competitive with specialized supervised models on the reported benchmarks.
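Linear probing, the evaluation protocol referred to above, fits only a linear classifier on top of frozen embeddings. The sketch below is a simplified stand-in: it uses a closed-form ridge regression onto one-hot labels rather than the logistic-regression probes more commonly used in practice, and the "frozen features" are synthetic Gaussian clusters, not outputs of any cited model.

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes, ridge=1e-3):
    """Ridge-regress one-hot labels on frozen features (bias column appended)."""
    X = np.hstack([features, np.ones((len(features), 1))])
    Y = np.eye(num_classes)[labels]
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

def probe_accuracy(W, features, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    X = np.hstack([features, np.ones((len(features), 1))])
    return float(np.mean(np.argmax(X @ W, axis=1) == labels))

# Synthetic stand-in for frozen encoder outputs: two separated clusters.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
features = rng.normal(size=(200, 4)) + 3.0 * labels[:, None]

W = fit_linear_probe(features[:150], labels[:150], num_classes=2)
acc = probe_accuracy(W, features[150:], labels[150:])
```

Since the encoder is never updated, the resulting accuracy reflects how linearly separable the pretrained representation already is.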

7. Practical Considerations, Limitations, and Future Directions

State-of-the-art JEPAs feature large transformer backbones and cross-modal attention, and rely on strategies such as freezing pretrained encoders, employing token register/tracker schemes, and tuning mask ratios to balance prediction difficulty. Batch sizes, optimizer schedules (AdamW, momentum), and masking blocks are selected by cross-validation. Modular implementations, as in EB-JEPA, enable training on research hardware in modest timescales (2–12 hours per run on a 16 GB GPU) (Terver et al., 3 Feb 2026).

Recognized limitations include modality restriction (e.g., TI-JEPA is bimodal only), potential domain-specific overfitting (COCO-only pretraining), lack of modality-isolated ablations, and untested applications beyond classification (e.g., VQA for TI-JEPA). Future directions include extension to more general or higher-order modalities (audio, video), inclusion of explicit contrastive or MCMC-driven energy terms, and systematic study of symmetric versus asymmetric energies in diverse dynamical regimes (Vo et al., 9 Mar 2025, Kobanda et al., 12 Feb 2026).

A plausible implication is that as the understanding of energy-induced quasimetrics in latent spaces grows, JEPAs are likely to serve as unifying architectures connecting self-supervised learning, optimal planning, and structured prediction across domains characterized by directionality, cost, or physical constraints.