
Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density (2510.05949v1)

Published 7 Oct 2025 in cs.LG, cs.AI, cs.CV, and stat.ML

Abstract: Joint Embedding Predictive Architectures (JEPAs) learn representations able to solve numerous downstream tasks out-of-the-box. JEPAs combine two objectives: (i) a latent-space prediction term, i.e., the representation of a slightly perturbed sample must be predictable from the original sample's representation, and (ii) an anti-collapse term, i.e., not all samples should have the same representation. While (ii) is often considered as an obvious remedy to representation collapse, we uncover that JEPAs' anti-collapse term does much more--it provably estimates the data density. In short, any successfully trained JEPA can be used to get sample probabilities, e.g., for data curation, outlier detection, or simply for density estimation. Our theoretical finding is agnostic of the dataset and architecture used--in any case one can compute the learned probabilities of sample $x$ efficiently and in closed-form using the model's Jacobian matrix at $x$. Our findings are empirically validated across datasets (synthetic, controlled, and Imagenet) and across different Self Supervised Learning methods falling under the JEPA family (I-JEPA and DINOv2) and on multimodal models, such as MetaCLIP. We denote the method extracting the JEPA learned density as {\bf JEPA-SCORE}.

Summary

  • The paper establishes that JEPA anti-collapse regularization forces encoders to learn and estimate data densities through the Jacobian.
  • The JEPA-SCORE, computed as the sum of the log singular values of the encoder's Jacobian, correlates with true log-densities and supports outlier detection and data curation.
  • Empirical results across synthetic, ImageNet, and OOD datasets confirm that JEPA-trained models provide consistent, robust density estimation.

Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density

Introduction

This paper establishes a theoretical and empirical connection between Joint Embedding Predictive Architectures (JEPAs) and nonparametric data density estimation. The central claim is that the anti-collapse (diversity) term in JEPA objectives, which is typically viewed as a mechanism to prevent representational collapse, in fact compels the model to implicitly estimate the data density. The authors introduce the JEPA-SCORE, a closed-form, efficient estimator of sample likelihood derived from the Jacobian of the trained encoder, and demonstrate its utility for outlier detection, data curation, and density estimation. The analysis is agnostic to the specific dataset or architecture and is validated across synthetic, controlled, and large-scale real-world datasets using state-of-the-art JEPA models.

Theoretical Foundations

The paper begins by revisiting the geometry of high-dimensional Gaussian embeddings. It is shown that as the embedding dimension $K$ increases, normalized $K$-dimensional standard Gaussian vectors concentrate on the hypersphere and converge to a uniform distribution over its surface. This geometric property underpins the behavior of JEPA-trained encoders, which are explicitly or implicitly regularized to produce Gaussian-distributed embeddings.
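
As a quick numerical illustration of this concentration phenomenon, the following minimal sketch (the dimensions and sample count are arbitrary illustrative choices, not taken from the paper) checks that the rescaled norms of standard Gaussian vectors tighten around 1 as $K$ grows:

import torch

# Norms of K-dimensional standard Gaussians concentrate around sqrt(K), so the
# rescaled vectors land near the unit hypersphere; z / ||z|| is exactly uniform
# on the sphere by rotational symmetry of the Gaussian.
torch.manual_seed(0)
for K in (16, 256, 4096):
    z = torch.randn(10_000, K)
    rescaled_norms = z.norm(dim=1) / K ** 0.5
    print(f"K={K:5d}  mean={rescaled_norms.mean().item():.4f}  std={rescaled_norms.std().item():.4f}")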

The key theoretical result leverages the change-of-variable formula for densities under differentiable mappings. For a deep network $f$ mapping input $x$ to embedding $f(x)$, the density of the embedding is given by:

$$p_{f(X)}(f(x)) = \int_{\{z \mid f(z) = f(x)\}} \frac{p_X(z)}{\prod_{k=1}^{\text{rank}(J_f(z))} \sigma_k(J_f(z))} \, d\mathcal{H}^r(z)$$

where $J_f(z)$ is the Jacobian of $f$ at $z$, $\sigma_k$ are its singular values, and $\mathcal{H}^r$ is the $r$-dimensional Hausdorff measure over the level set. The implication is that for $f(X)$ to be distributed as a standard Gaussian, $f$ must learn to modulate the input density $p_X$ via its Jacobian, effectively internalizing the data density.
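
For intuition, consider the standard special case (stated here for clarity; the general statement above is the one used in the paper) where $f$ is injective with a full-rank Jacobian: the level set reduces to the single point $z = x$, and taking logarithms gives

$$\log p_X(x) = \log p_{f(X)}(f(x)) + \sum_{k=1}^{\text{rank}(J_f(x))} \log \sigma_k(J_f(x))$$

so whenever the embedding density $p_{f(X)}$ is pinned to a fixed target (e.g., near-uniform on the hypersphere), the sum of log singular values tracks $\log p_X(x)$ up to an additive constant, which is exactly the quantity extracted below as the JEPA-SCORE.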

JEPA-SCORE: Closed-Form Density Estimation

The main practical contribution is the derivation of the JEPA-SCORE, a sample-wise estimator of the data density learned by a JEPA-trained encoder. For an input $x$, the JEPA-SCORE is defined as:

$$\text{JEPA-SCORE}(x) = \sum_{k=1}^{\text{rank}(J_f(x))} \log \sigma_k(J_f(x))$$

This score is proportional to the log-likelihood of $x$ under the implicit density learned by the encoder. The estimator is efficient to compute, requiring only the singular values of the Jacobian, and is robust to the choice of numerical stabilization parameters.

The following PyTorch code snippet implements the JEPA-SCORE:

import torch
from torch.autograd.functional import jacobian

# `model` is the trained JEPA encoder, `images` a batch of inputs, and `eps`
# a small numerical floor for the singular values (all assumed to be defined).
# Summing the embeddings over the batch dimension lets a single jacobian call
# recover every per-sample Jacobian: sample b only influences slice [:, b].
J = jacobian(lambda x: model(x).sum(0), inputs=images)
with torch.inference_mode():
    J = J.flatten(2).permute(1, 0, 2)              # (batch, embed_dim, input_dim)
    svdvals = torch.linalg.svdvals(J)              # singular values per sample
    jepa_score = svdvals.clip_(eps).log_().sum(1)  # sum of log singular values per sample
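
As a usage sketch for the snippet above, one way to obtain an encoder from the JEPA family studied in the paper is the public DINOv2 torch.hub entry point. The hub identifier, input size, and eps value below are assumptions about a typical setup rather than the paper's exact configuration, and materializing the full Jacobian of a ViT with respect to 224x224 inputs is expensive, so small batches are advisable:

import torch

# Load a small DINOv2 backbone from the public torch.hub entry point
# (network access and a compatible PyTorch install are assumed).
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
model.eval()

# Placeholder inputs; in practice use real images with DINOv2's preprocessing.
images = torch.randn(2, 3, 224, 224)
eps = 1e-6  # numerical floor for the singular values (assumed value)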

Empirical Validation

The authors validate the theoretical claims through a series of experiments:

  • Synthetic Data: On Gaussian mixture models, the JEPA-SCORE correlates strongly with the true log-density, and Langevin dynamics using the score can recover the data distribution (Figure 1).

    Figure 1: Depiction of JEPA-SCORE on synthetic data, showing that the encoder's Jacobian modulates the density to match the hypersphere uniformity constraint.

  • ImageNet and Out-of-Distribution Data: On ImageNet-1k, samples with high and low JEPA-SCOREs correspond to semantically prototypical and atypical images, respectively. Out-of-distribution datasets (e.g., MNIST, Galaxy) yield systematically lower scores, indicating the model's ability to detect outliers (Figure 2).

Figure 2: Depiction of the 5 least (left) and 5 most (right) likely samples of class 21 from ImageNet as per JEPA-SCORE, consistent across multiple JEPA models.

Figure 3: Random samples from ImageNet-1k class 21, for comparison with the extremal JEPA-SCORE samples.

  • Class-wise Consistency: Across different JEPA models (DINOv2, I-JEPA, MetaCLIP), the ordering of samples by JEPA-SCORE is highly consistent, and the same images are identified as high- or low-probability within a class.
  • Outlier Detection: The distribution of JEPA-SCOREs for in-distribution and out-of-distribution samples is well-separated, supporting its use for data curation and model assessment (Figure 4); a minimal thresholding sketch follows the figure caption below.

Figure 4: Histogram of JEPA-SCOREs for 5,000 samples from various datasets, showing clear separation between in-distribution and out-of-distribution data.
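
To make the data-curation use concrete, below is a minimal, self-contained thresholding sketch. The toy encoder, random stand-in data, and the 5% cutoff are illustrative assumptions rather than the paper's experimental setup; in practice `model` would be a pretrained JEPA encoder and the inputs real images:

import torch
from torch.autograd.functional import jacobian

def compute_jepa_scores(model, images, eps=1e-6):
    # Same computation as the snippet in the previous section, wrapped as a helper.
    J = jacobian(lambda x: model(x).sum(0), inputs=images)
    J = J.flatten(2).permute(1, 0, 2)              # (batch, embed_dim, input_dim)
    return torch.linalg.svdvals(J).clamp(min=eps).log().sum(dim=1)

# Toy encoder and stand-in data, for illustration only.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64), torch.nn.Tanh())
in_dist = torch.randn(64, 3, 32, 32)            # stand-in for in-distribution samples
candidates = 5.0 * torch.randn(16, 3, 32, 32)   # stand-in for shifted / suspect samples

scores_in = compute_jepa_scores(model, in_dist)
threshold = torch.quantile(scores_in, 0.05)     # flag anything below the 5th percentile
is_outlier = compute_jepa_scores(model, candidates) < threshold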

Architectural and Implementation Considerations

The JEPA-SCORE is model-agnostic and applies to any encoder trained with a JEPA objective, including moment-matching, non-parametric, and teacher-student variants. The only requirement is access to the encoder's Jacobian, which is tractable for moderate batch sizes and embedding dimensions. For large-scale models, efficient Jacobian-vector product techniques or randomized SVD approximations may be necessary.
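
For example, per-sample Jacobians can be computed with PyTorch's torch.func transforms instead of one large torch.autograd.functional.jacobian call; the helper below is a sketch of that option (the `encoder` is assumed to be a vmap-compatible module in eval mode, and this is not the paper's implementation):

import torch
from torch.func import jacrev, vmap

def batched_jepa_score(encoder, images, eps=1e-6):
    # jacrev differentiates the embedding of a single (unbatched) image;
    # vmap maps that computation over the batch dimension.
    per_sample_jacobian = vmap(jacrev(lambda x: encoder(x.unsqueeze(0)).squeeze(0)))
    J = per_sample_jacobian(images)              # (batch, embed_dim, *image_shape)
    J = J.flatten(2)                             # (batch, embed_dim, input_dim)
    s = torch.linalg.svdvals(J)                  # singular values per sample
    return s.clamp(min=eps).log().sum(dim=1)     # JEPA-SCORE per sample

When even exact per-sample SVDs are too costly, a randomized or truncated SVD of each Jacobian can approximate the leading singular values, in line with the approximations mentioned above.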

The method is robust to the choice of data augmentations and is not sensitive to the specific form of the predictive invariance term, as the anti-collapse regularizer is the primary driver of density learning. The approach does not require explicit generative modeling or reconstruction, distinguishing it from traditional score-based or likelihood-based generative models.

Implications and Future Directions

The findings have several important implications:

  • Nonparametric Density Estimation: JEPA-trained encoders provide a closed-form, nonparametric estimator of the data density, sidestepping the need for explicit generative modeling or input space reconstruction.
  • Outlier and OOD Detection: The JEPA-SCORE can be used for robust outlier detection, data curation, and model readiness assessment for downstream tasks.
  • Theoretical Unification: The work bridges the gap between self-supervised representation learning and score-based generative modeling, suggesting that representation learning objectives can yield implicit generative capabilities.
  • Practical Utility: The method is simple to implement, computationally efficient, and applicable to a wide range of JEPA-based models and data modalities.

Potential future directions include scaling the approach to very high-dimensional inputs, integrating JEPA-SCORE into active learning or data selection pipelines, and exploring its use for generative modeling via score-based sampling.

Conclusion

This paper demonstrates that JEPAs, through their anti-collapse regularization, necessarily internalize the data density and enable closed-form, efficient density estimation via the JEPA-SCORE. The theoretical analysis is supported by strong empirical evidence across synthetic and real-world datasets. The results challenge the prevailing view that JEPA-based self-supervised learning is orthogonal to generative modeling and open new avenues for leveraging pretrained encoders in density estimation, outlier detection, and beyond.
