Latent-Euclidean JEPA (LeJEPA)
- LeJEPA is a self-supervised learning paradigm that leverages a theoretically anchored JEPA framework with a novel isotropic Gaussian regularizer to optimize embedding distributions.
- Its regularizer, Sketched Isotropic Gaussian Regularization (SIGReg), enforces the desired embedding distribution through statistical tests on random one-dimensional projections, eliminating reliance on common heuristics such as stop-gradients or EMA teachers.
- Empirical evaluations show that LeJEPA achieves higher accuracy with fewer epochs and demonstrates robust scalability across various architectures and datasets.
Latent-Euclidean JEPA (LeJEPA) is a self-supervised learning paradigm based on a theoretically anchored instantiation of the Joint-Embedding Predictive Architectures (JEPAs) framework. LeJEPA addresses the challenge of learning manipulable representations of data by optimizing embedding distributions for minimal downstream prediction risk, eliminating widespread heuristics in self-supervised learning, and providing robust performance and scaling properties.
1. Joint-Embedding Predictive Architectures and the LeJEPA Formulation
Joint-Embedding Predictive Architectures (JEPAs) are predicated on learning an encoder $f_\theta$ such that embeddings of multiple “views” of the same sample are predictive of each other while avoiding representational collapse. Formally, for augmented views $v_1, \dots, v_V$ of an instance $x$, JEPAs minimize a prediction objective of the form
$$\mathcal{L}_{\text{JEPA}} = \sum_{i \neq j} \big\| \mathrm{Pred}\big(f_\theta(v_i)\big) - f_\theta(v_j) \big\|_2^2,$$
subject to the embeddings not collapsing to a degenerate (e.g., constant) solution. Traditional JEPA methods typically address non-degeneracy through heuristics such as stop-gradients, exponential moving average (EMA) teacher encoders, whitening, negative sample mining, or custom schedulers.
LeJEPA departs from these conventions by introducing a regularization mechanism that enforces a specific embedding distribution, obviating the need for ad hoc interventions. It employs (a) a squared-distance prediction loss between each view's embedding and the centroid of the "global" view embeddings, and (b) a novel distribution-matching regularizer that enforces isotropic Gaussianity of the learned embeddings.
2. Theoretical Optimality of Isotropic Gaussian Embeddings
The LeJEPA framework is grounded in the insight that the isotropic Gaussian is the unique optimal embedding distribution for minimizing worst-case prediction risk across downstream tasks, under a fixed total variance constraint.
Linear Probing Analysis
Let $Z \in \mathbb{R}^{N \times K}$ denote the embeddings with covariance $\Sigma$, and consider the ridge regression solution $\hat{\beta}_\gamma = (Z^\top Z + \gamma I)^{-1} Z^\top y$. If $\Sigma$ is anisotropic, there exist target vectors for which the bias of $\hat{\beta}_\gamma$ exceeds that obtained under an isotropic $\Sigma$ of equal trace. Additionally, the OLS variance $\sigma^2 \sum_{k=1}^{K} \lambda_k^{-1}$ (where $\lambda_1, \dots, \lambda_K$ are the eigenvalues of $\Sigma$) is minimized, for a fixed total variance $\sum_k \lambda_k$, when all eigenvalues are equal, i.e., when $\Sigma$ is isotropic.
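To make the variance step explicit: under the fixed total-variance constraint $\sum_{k=1}^{K} \lambda_k = c$ with $\lambda_k > 0$, the Cauchy–Schwarz inequality gives
$$K^2 = \Big(\sum_{k=1}^{K} \sqrt{\lambda_k} \cdot \tfrac{1}{\sqrt{\lambda_k}}\Big)^{2} \le \Big(\sum_{k=1}^{K} \lambda_k\Big)\Big(\sum_{k=1}^{K} \tfrac{1}{\lambda_k}\Big) = c \sum_{k=1}^{K} \frac{1}{\lambda_k},$$
so $\sum_{k} \lambda_k^{-1} \ge K^2 / c$, with equality if and only if $\lambda_1 = \dots = \lambda_K = c/K$, i.e., precisely when $\Sigma$ is isotropic.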
Nonlinear Probing Analysis
For k-NN and kernel regression, minimizing integrated squared bias similarly singles out the isotropic Gaussian:
- For radius-based k-NN regression, the bias term's dependence on the shape of the embedding distribution implies that the isotropic Gaussian uniquely minimizes worst-case risk.
- Analogous arguments hold for Nadaraya–Watson kernel regression.
Theorem: Over all embedding distributions with equal total variance, the isotropic Gaussian uniquely minimizes both worst-case linear and nonlinear probing prediction risk.
3. Sketched Isotropic Gaussian Regularization (SIGReg)
Having established that isotropic Gaussian embeddings are optimal, LeJEPA introduces Sketched Isotropic Gaussian Regularization (SIGReg) to enforce this distributional structure. SIGReg is constructed as follows:
- Let $\mathcal{A} = \{a_1, \dots, a_M\} \subset \mathbb{S}^{K-1}$ be randomly sampled unit directions in embedding space.
- Each batch of embeddings $\{z_n\}_{n=1}^{N} \subset \mathbb{R}^K$ is projected onto each direction, yielding 1D samples $\{\langle a_m, z_n \rangle\}_{n=1}^{N}$ for $m = 1, \dots, M$.
- For each direction, a univariate statistical test (e.g., Epps–Pulley) measures goodness-of-fit of the projected samples to $\mathcal{N}(0, 1)$.
- The per-batch SIGReg statistic averages the test statistics over directions:
$$\mathrm{SIGReg}\big(\{z_n\}_{n=1}^{N}\big) = \frac{1}{M} \sum_{m=1}^{M} T\big(\{\langle a_m, z_n \rangle\}_{n=1}^{N}\big),$$
where $T$ denotes the chosen univariate test statistic.
The Epps–Pulley (EP) test is recommended due to its differentiability, bounded derivatives, and suitability for distributed settings (the empirical characteristic function can be all-reduced across devices). SIGReg is computationally efficient: for batch size $N$, direction count $M$, and $Q$ quadrature points, the complexity is $\mathcal{O}(NM(K + Q))$ ($\mathcal{O}(NMK)$ for the projections plus $\mathcal{O}(NMQ)$ for the test statistics), scaling linearly with batch size.
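As a concrete illustration, the following PyTorch sketch implements a SIGReg-style penalty along the lines described above. It is a minimal sketch rather than the reference implementation: the function names (`sigreg_loss`, `epps_pulley_statistic`), the quadrature grid over $[-5, 5]$, and the standard-normal weighting are illustrative assumptions.

```python
import math
import torch

def epps_pulley_statistic(x, t, w):
    """Differentiable EP-style statistic for 1D samples `x` (shape [M, N]): weighted
    squared distance between the empirical characteristic function and that of
    N(0, 1), approximated on quadrature points `t` with weights `w`."""
    tx = x.unsqueeze(-1) * t                      # [M, N, Q]
    ecf_real = torch.cos(tx).mean(dim=1)          # [M, Q] real part of empirical CF
    ecf_imag = torch.sin(tx).mean(dim=1)          # [M, Q] imaginary part
    target = torch.exp(-0.5 * t ** 2)             # CF of N(0, 1) is real-valued
    sq_diff = (ecf_real - target) ** 2 + ecf_imag ** 2
    return (sq_diff * w).sum(dim=-1)              # [M] one statistic per direction

def sigreg_loss(z, num_directions=256, num_points=17, generator=None):
    """SIGReg-style penalty: average EP statistic over random 1D projections of the
    [N, K] embeddings `z`, measuring deviation from an isotropic standard Gaussian."""
    n, k = z.shape
    # Random unit directions on the sphere S^{K-1}; a shared seed keeps devices in sync.
    a = torch.randn(num_directions, k, device=z.device, dtype=z.dtype, generator=generator)
    a = a / a.norm(dim=-1, keepdim=True)
    proj = a @ z.T                                # [M, N] 1D samples per direction
    # Quadrature grid with simple Riemann-sum weights under a standard-normal density
    # (both choices are illustrative).
    t = torch.linspace(-5.0, 5.0, num_points, device=z.device, dtype=z.dtype)
    w = torch.exp(-0.5 * t ** 2) / math.sqrt(2 * math.pi) * (t[1] - t[0])
    return epps_pulley_statistic(proj, t, w).mean()
```

Because the statistic is built from means of bounded `cos`/`sin` terms, its gradients remain bounded, consistent with the properties cited above for the EP test.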
4. Training Objective and Loss Construction
LeJEPA’s loss function couples a predictive term with SIGReg:
- The prediction loss computes squared distances between each view's embedding and the centroid of the "global" views:
$$\mathcal{L}_{\text{pred}} = \frac{1}{NV} \sum_{n=1}^{N} \sum_{v=1}^{V} \big\| z_{n,v} - \bar{z}_{n} \big\|_2^2, \qquad \bar{z}_{n} = \frac{1}{|\mathcal{G}|} \sum_{g \in \mathcal{G}} z_{n,g},$$
where $z_{n,v}$ is the embedding of view $v$ of sample $n$ and $\mathcal{G}$ indexes the global views.
- The regularization loss applies SIGReg to the embeddings of each view:
$$\mathcal{L}_{\text{reg}} = \frac{1}{V} \sum_{v=1}^{V} \mathrm{SIGReg}\big(\{z_{n,v}\}_{n=1}^{N}\big).$$
- The combined loss is:
$$\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda\, \mathcal{L}_{\text{reg}},$$
where $\lambda > 0$ is the sole trade-off hyperparameter.
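A minimal PyTorch sketch of this objective is given below; the tensor layout, the helper name `lejepa_loss`, and passing the SIGReg penalty in as a precomputed value (e.g., from a function like the `sigreg_loss` sketch in Section 3) are assumptions for illustration.

```python
import torch

def lejepa_loss(view_embeddings, global_indices, sigreg_value, lam):
    """Sketch of the LeJEPA objective for one batch.

    view_embeddings: tensor [V, N, K], one [N, K] embedding matrix per view.
    global_indices:  indices selecting the "global" views among the V views.
    sigreg_value:    precomputed SIGReg penalty (e.g., averaged over the V views).
    lam:             the single trade-off hyperparameter lambda.
    """
    # Per-sample centroid of the global views; no stop-gradient is applied anywhere.
    centroid = view_embeddings[global_indices].mean(dim=0)            # [N, K]
    # Squared distance from every view's embedding to its sample's global centroid.
    pred = ((view_embeddings - centroid) ** 2).sum(dim=-1).mean()     # scalar
    return pred + lam * sigreg_value
```

For instance, `sigreg_value` could be obtained as the mean of `sigreg_loss(view_embeddings[v])` over the views, matching the per-view averaging in $\mathcal{L}_{\text{reg}}$ above.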
The main hyperparameters are the number of global and local views per sample and, for SIGReg, the number of random directions $M$ and quadrature points $Q$. Performance is stable over a broad range of $\lambda$ values across architectures, datasets, and batch sizes down to 128.
5. Algorithmic Efficiency and Distributed Training
Each training batch comprises $N$ samples and $V$ views, resulting in $NV$ embeddings per batch. The prediction term operates with complexity $\mathcal{O}(NVK)$. SIGReg, as implemented, involves:
- Sampling random directions (synchronized across devices via shared seed or global step)
- Projecting embeddings to scalars
- Evaluating the EP statistic for each direction—requiring all-reduce operations to maintain statistical properties across data-parallel replicas
Example timings demonstrate practical efficiency: for representative settings of $N$, $M$, and $Q$, the SIGReg forward+backward pass takes approximately 0.46 ms on a V100 GPU. Both time and memory requirements scale linearly with batch size. SIGReg requires only standard PyTorch DDP primitives and can be implemented in under 50 lines of code. The SIGReg and prediction terms are optimized jointly by gradient descent, with cosine learning-rate annealing and without specialized warmup or regularization scheduling.
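To illustrate the all-reduce step, the sketch below averages the empirical characteristic function across data-parallel replicas; the helper name `distributed_ecf`, the tensor layout, and the equal-local-batch assumption are illustrative rather than the reference implementation.

```python
import torch
import torch.distributed as dist

def distributed_ecf(proj, t):
    """Empirical characteristic function of 1D projections `proj` ([M, N_local]),
    evaluated at quadrature points `t` ([Q]) and averaged over all replicas so that
    the EP statistic reflects the global batch rather than per-device shards."""
    tx = proj.unsqueeze(-1) * t                         # [M, N_local, Q]
    ecf = torch.stack([torch.cos(tx).mean(dim=1),       # real part, [M, Q]
                       torch.sin(tx).mean(dim=1)])      # imaginary part, [M, Q]
    if dist.is_available() and dist.is_initialized():
        # Averaging per-device ECFs equals the ECF of the full global batch when
        # local batch sizes are equal. For gradients to flow through the reduction
        # during training, an autograd-aware collective (e.g.,
        # torch.distributed.nn.all_reduce) would replace this plain all_reduce.
        dist.all_reduce(ecf, op=dist.ReduceOp.SUM)
        ecf = ecf / dist.get_world_size()
    return ecf[0], ecf[1]
```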
LeJEPA’s algorithmic simplicity is further highlighted by the absence of auxiliary mechanisms commonly seen in other JEPA variants, such as stop-gradients, teacher EMA, whitening layers, negative sampling, or explicit covariance-tracking modules.
6. Empirical Validation and Performance
Empirical studies validate LeJEPA across more than 10 datasets and 60 architectures, spanning scales from small to large models and diverse domains.
- ImageNet-1K pretraining (100 epochs):
- ViT-H/14 (650M parameters), linear probing with frozen backbone: 79.0% top-1
- ConvNeXtV2-Huge (660M): 78.5%
- Comparative efficiency: With 3× fewer epochs, LeJEPA achieves 1–2% higher accuracy than I-JEPA on comparable backbone scales.
- Domain specialization: On Galaxy10 with 11k samples, LeJEPA outperforms DINOv2/v3 transfer by 8–10 points in both few-shot and full finetune regimes.
- Generalization: Out-of-the-box performance on 60+ “timm” models gives >90% top-1 on ImageNet-10 and 60–80% on ImageNet-100.
- Semantic properties: emergent structure is observed. PCA of last-layer features, visualized via color-coding, yields clear object/background separation, and simple thresholding of [CLS] self-attention produces unsupervised video object segmentation.
- Model selection: Training loss shows ≈99% Spearman correlation with linear-probe accuracy, supporting label-free model selection.
7. Design Principles and Implications
LeJEPA defines a new paradigm in self-supervised joint-embedding learning by combining two loss components—view prediction and rigorously designed distributional regularization—without recourse to empirical heuristics or tuning schedules. A single, theoretically motivated regularizer guarantees collapse avoidance and the optimality of learned representations.
All architectural and optimization choices—hyperparameterization, batch-wise SIGReg implementation, distributed synchronization—are validated empirically for stability and generality. LeJEPA operates robustly across convolutional, residual, and transformer-based architectures, as well as classical and high-dimensional domains, under diverse resource constraints. The approach supports seamless scaling to large distributed systems (8–64 GPUs) without algorithmic modification.
A plausible implication is that the introduction of a single, provably correct regularizer—rather than layers of heuristics—may simplify future work on self-supervised learning, facilitate reproducibility, and support more efficient, theoretically analyzable advances within the JEPA family of methods.