Contrastive-JEPA: Enhanced Visual SSL
- Contrastive-JEPA is a self-supervised visual representation learning framework that extends JEPA with VICReg regularization to prevent collapse and enforce mean invariance.
- It combines masked predictive modeling with contrastive-style regularization to achieve faster convergence and superior downstream performance on benchmarks like ImageNet-1K.
- Empirical results demonstrate improved linear-probing and fine-tuning accuracy on ImageNet-1K, and stronger transfer to object detection and segmentation, compared to other state-of-the-art self-supervised methods.
Contrastive-JEPA (C-JEPA) is a self-supervised visual representation learning framework that augments the Joint-Embedding Predictive Architecture (JEPA) with Variance-Invariance-Covariance Regularization (VICReg) to enhance learning stability and prevent representational collapse. By combining predictive masked modeling and contrastive-style regularization, C-JEPA achieves superior training dynamics and downstream performance compared to prior JEPA-based approaches and other state-of-the-art self-supervised methods (Mo et al., 2024).
1. Foundations: JEPA and Its Limitations
The JEPA framework aims to learn high-quality image representations by predicting masked regions of an input image from its unmasked regions in a latent embedding space. In the canonical Image-based JEPA (I-JEPA), an image $x$ is transformed into a context view $x_c$, obtained by masking out some patch blocks, and a target view $x_t$ containing only the masked blocks. The context encoder $f_\theta$ operates on $x_c$, providing latent embeddings for visible patches and mask tokens for masked ones. The target encoder $f_{\bar{\theta}}$, an Exponential Moving Average (EMA) copy of $f_\theta$, processes $x_t$, yielding embeddings for the masked patches. A predictor $g_\phi$ attempts to reconstruct the target patch embeddings from the associated mask tokens within the context embedding. The key JEPA loss is a mean squared error between these predicted and actual masked-patch embeddings:

$$\mathcal{L}_{\text{JEPA}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \left\| \hat{s}_i - \operatorname{sg}(s_i) \right\|_2^2,$$

where $\hat{s}_i$ is the prediction for masked patch $i$, $s_i$ is the corresponding target embedding, $\mathcal{M}$ is the set of masked patches, and $\operatorname{sg}(\cdot)$ denotes stop-gradient.
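As a concrete illustration, the prediction loss above can be sketched in a few lines. This is a toy numpy sketch, not the authors' code: embeddings are random vectors, the predictor is a plain linear map, and the target branch is treated as a constant (stop-gradient).

```python
# Toy sketch of the I-JEPA prediction loss (shapes and names are assumptions).
import numpy as np

rng = np.random.default_rng(0)
num_masked, dim = 4, 8

context_tokens = rng.normal(size=(num_masked, dim))  # mask-token outputs of the context encoder
targets = rng.normal(size=(num_masked, dim))         # target-encoder embeddings (stop-gradient: constants)

W = rng.normal(size=(dim, dim)) * 0.1                # stand-in linear predictor g_phi

def jepa_loss(W, context_tokens, targets):
    """Mean squared error between predicted and target masked-patch embeddings."""
    preds = context_tokens @ W
    return np.mean((preds - targets) ** 2)

loss = jepa_loss(W, context_tokens, targets)
```

In the real model the gradient of this loss updates the context encoder and predictor only; the target encoder is updated solely through its EMA rule.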
I-JEPA exhibits two critical limitations:
- The EMA target update alone does not prevent "entire collapse", where all learned representations converge to a constant vector, drastically reducing their utility.
- The predictor does not guarantee agreement between the mean vectors of patch embeddings across views, leading to insufficient mean invariance.
These issues also appear in non-contrastive SSL methods such as SimSiam, for which similar failures of the EMA dynamic have been documented.
2. C-JEPA Architecture and Loss Composition
C-JEPA extends I-JEPA by integrating VICReg's regularization strategies into the model's pipeline. The pipeline comprises:
- Input Augmentation and Masking: Two augmentations $x^{(1)}, x^{(2)}$ of the input image $x$ are produced. Each is split into a context view (with some patch blocks masked out) and a target view (containing only the masked blocks).
- Context Encoder: A Vision Transformer (ViT, e.g., ViT-B/16) $f_\theta$ encodes the context view into patch embeddings $s_c$.
- Target Encoder: The EMA copy $f_{\bar{\theta}}$ encodes the target view into target embeddings $s_t$.
- Predictor Head: A Transformer or MLP of moderate depth (6–12 layers) predicts $\hat{s}_t$ at the masked positions.
- VICReg Projector: A 2–3 layer MLP maps mean context embeddings at masked regions to projected vectors used for regularization.
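The wiring of these components can be sketched as below. This is a hedged stand-in, not the paper's implementation: the encoders, predictor, and projector are toy linear maps rather than ViTs/MLPs, and all names are assumptions.

```python
# Hedged sketch of the C-JEPA forward pass (toy linear maps, not ViTs).
import numpy as np

rng = np.random.default_rng(1)
dim = 16

W_ctx = rng.normal(size=(dim, dim)) * 0.1   # context encoder f_theta (stand-in)
W_tgt = W_ctx.copy()                        # target encoder: EMA copy, no gradient
W_pred = rng.normal(size=(dim, dim)) * 0.1  # predictor g_phi (stand-in)
W_proj = rng.normal(size=(dim, dim)) * 0.1  # VICReg projector (stand-in, 1 layer)

def forward(patches_context, patches_target):
    s_c = patches_context @ W_ctx    # context-view patch embeddings
    s_t = patches_target @ W_tgt     # target embeddings (stop-gradient branch)
    preds = s_c @ W_pred             # predicted embeddings for masked positions
    mean_ctx = s_c.mean(axis=0)      # mean context embedding at masked regions
    projected = mean_ctx @ W_proj    # projected vector fed to VICReg regularizers
    return preds, s_t, projected

preds, s_t, projected = forward(rng.normal(size=(4, dim)),
                                rng.normal(size=(4, dim)))
```

In practice the projector output from both augmented views is what the variance, invariance, and covariance terms operate on.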
The C-JEPA total loss is a weighted combination

$$\mathcal{L}_{\text{C-JEPA}} = \mathcal{L}_{\text{JEPA}} + \lambda \, \mathcal{L}_{\text{var}} + \mu \, \mathcal{L}_{\text{inv}} + \nu \, \mathcal{L}_{\text{cov}},$$

with:
- $\mathcal{L}_{\text{JEPA}}$: patch-prediction MSE with stop-gradient on the target,
- $\mathcal{L}_{\text{var}}$: encourages sufficient per-dimension variance,
- $\mathcal{L}_{\text{inv}}$: enforces mean invariance between views,
- $\mathcal{L}_{\text{cov}}$: suppresses inter-feature correlation.

Standard VICReg weights are $\lambda = \mu = 25$, $\nu = 1$, with the regularization block optionally downscaled by a small coefficient relative to $\mathcal{L}_{\text{JEPA}}$ to balance its influence.
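The three regularizers can be written down compactly. The sketch below follows the standard VICReg formulation (variance hinge at 1, off-diagonal covariance penalty); how exactly C-JEPA weights and applies them to the projected mean embeddings is an assumption here.

```python
# Minimal numpy sketch of the three VICReg terms.
import numpy as np

def vicreg_terms(z1, z2, eps=1e-4):
    """z1, z2: (batch, dim) projected embeddings from the two views."""
    # Invariance: mean squared distance between the two views.
    inv = np.mean((z1 - z2) ** 2)
    # Variance: hinge pushing each dimension's std above 1.
    var = 0.0
    for z in (z1, z2):
        std = np.sqrt(z.var(axis=0) + eps)
        var += np.mean(np.maximum(0.0, 1.0 - std))
    # Covariance: squared off-diagonal entries of the covariance matrix.
    cov = 0.0
    for z in (z1, z2):
        zc = z - z.mean(axis=0)
        c = (zc.T @ zc) / (len(z) - 1)
        off_diag = c - np.diag(np.diag(c))
        cov += (off_diag ** 2).sum() / z.shape[1]
    return inv, var, cov

rng = np.random.default_rng(0)
z1 = rng.normal(size=(32, 8))
z2 = z1 + 0.1 * rng.normal(size=(32, 8))
inv, var, cov = vicreg_terms(z1, z2)
total = 25.0 * inv + 25.0 * var + 1.0 * cov  # standard VICReg weighting
```

The variance and covariance terms act on each view independently, while the invariance term couples the two views.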
3. Training Procedures and Implementation Specifics
C-JEPA is pretrained on unlabeled ImageNet-1K, with different regimes depending on model size: 600 epochs for the larger ViTs (B/L) and 100 epochs for the smaller ones (T/S). The optimizer is AdamW with weight decay increased from 0.04 to 0.4 and batch size 2048; the learning rate is warmed up linearly over the first 15 epochs and then cosine-decayed for the remainder of training.
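The warmup-plus-cosine schedule can be sketched as follows. The peak and final learning-rate values below are placeholders, not the paper's numbers.

```python
# Sketch of linear warmup followed by cosine decay (placeholder rate values).
import math

def lr_at(epoch, warmup_epochs=15, total_epochs=600,
          base_lr=1e-3, final_lr=1e-6):
    """Linear warmup to base_lr, then cosine decay to final_lr."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / max(total_epochs - warmup_epochs, 1)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * t))
```

The schedule peaks at the end of warmup (epoch 14 here) and reaches the final rate exactly at the last epoch.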
The EMA momentum parameter is annealed from 0.996 to 1.0 over the course of training. Each view applies four (possibly overlapping) block masks. Predictor depth is 6 for ViT-T/S/B and 12 for ViT-L; the predictor embedding dimension is 384 (192 for ViT-T), and the MLP projector's output dimension matches that of the encoder. Layer normalization and a small $\epsilon$ inside the square root stabilize the variance term.
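The annealed EMA update can be sketched as below; the linear shape of the anneal is an assumption (the paper states only the endpoints 0.996 and 1.0).

```python
# Sketch of the annealed EMA target update (linear anneal assumed).
def ema_momentum(step, total_steps, start=0.996, end=1.0):
    """Interpolate the EMA coefficient from start to end over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

def ema_update(target_params, online_params, m):
    """target <- m * target + (1 - m) * online, element-wise."""
    return [m * t + (1.0 - m) * o for t, o in zip(target_params, online_params)]

m0 = ema_momentum(0, 1000)        # 0.996 at the start of training
m_half = ema_momentum(500, 1000)  # 0.998 midway
m_end = ema_momentum(1000, 1000)  # 1.0 at the end (target frozen)
```

As the momentum approaches 1.0 the target encoder effectively freezes, which slows the drift of the prediction targets late in training.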
4. Theoretical Guarantees: Preventing Collapse and Mean Alignment
By modeling the JEPA predictor as a linear map and decomposing the training dynamics into eigen-modes (invoking Neural Tangent Kernel theory), one can show that the stop-gradient predictor loss admits stable, non-collapsed solutions for each eigen-mode; without the predictor, or for unstable eigen-modes, collapse may occur. Removing the stop-gradient changes the dynamics so that every eigen-mode decays to zero, guaranteeing collapse. The inclusion of VICReg's variance and covariance regularizers enforces per-dimension variance and zero off-diagonal covariance, stabilizing the training dynamics. The invariance regularizer aligns mean embeddings, directly improving the model's ability to capture consistent latent semantics between augmented views.
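The stabilizing effect of the variance term can be seen in a toy experiment (an illustration only, not the paper's NTK analysis): starting from nearly collapsed embeddings, gradient descent on the VICReg variance hinge alone restores per-dimension spread.

```python
# Toy illustration: the variance hinge pushes collapsed embeddings apart.
import numpy as np

rng = np.random.default_rng(0)
z = 0.01 * rng.normal(size=(64, 4))   # nearly collapsed: tiny per-dim variance

def variance_hinge_grad(z, eps=1e-4):
    """Gradient of mean_d max(0, 1 - std_d) with respect to z."""
    mu = z.mean(axis=0)
    std = np.sqrt(z.var(axis=0) + eps)
    active = (std < 1.0).astype(float)             # hinge active where std < 1
    # d std_d / d z_id = (z_id - mu_d) / (n * std_d)
    return -active * (z - mu) / (len(z) * std) / z.shape[1]

std_before = np.sqrt(z.var(axis=0)).mean()
for _ in range(500):
    z -= 0.5 * variance_hinge_grad(z)             # plain gradient descent
std_after = np.sqrt(z.var(axis=0)).mean()
```

Descent on this term alone drives the average per-dimension standard deviation from roughly 0.01 toward the hinge threshold of 1, which is the mechanism the stability argument relies on.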
5. Empirical Performance and Comparative Evaluation
C-JEPA demonstrates improved representation quality relative to I-JEPA and other masked or non-contrastive baselines (MAE, BEiT, iBOT, data2vec, VICReg) across multiple tasks.
Key Results (ImageNet-1K classification, COCO detection, ADE20K segmentation)
| Model | Linear Probing | Fine-tune | COCO Box AP | ADE20K mIoU |
|---|---|---|---|---|
| I-JEPA B/16 | 72.9% | 83.5% | 49.9 | 47.6 |
| C-JEPA B/16 | 73.7% | 84.5% | 50.7 | 48.7 |
| I-JEPA L/16 | 77.5% | 85.3% | — | — |
| C-JEPA L/16 | 78.1% | 86.2% | — | — |
Performance gains are systematic across COCO box/mask, ADE20K segmentation, and low-level video/vision tasks such as DAVIS and CLEVR, indicating broader downstream utility (Mo et al., 2024).
6. Ablation Analyses
Ablations confirm the individual and joint importance of the three VICReg components. On ViT-B/16 (100 epochs), base JEPA achieves 63.7% linear probing accuracy; adding variance and covariance yields 68.3%, invariance alone 67.6%, and all three together delivers 69.5%. With extended pretraining (600 epochs), the corresponding gains persist: base 72.9%, variance+covariance 73.5%, invariance 73.2%, all terms 73.7%. Varying the VICReg regularization strength reveals an optimal intermediate value: too small a weight leads to suboptimal performance, while excessively high weighting reintroduces collapse. Adjusting the invariance weight likewise shows the best results at intermediate values (e.g., 15).
7. Significance, Limitations, and Prospects
C-JEPA advances the field of predictive joint-embedding self-supervision by providing a tractable, empirically robust mechanism for preventing representational collapse and enforcing meaningful invariance in patch-level means. It achieves faster convergence and superior performance on large-scale vision benchmarks compared to both I-JEPA and related baselines. Limitations include the need for careful tuning of VICReg weights—over-regularization can reintroduce collapse—and open questions regarding extension to multimodal data, alternative masking strategies, and further scaling. Directions for improvement include adaptive or learned weighting of the regularization terms and extension to larger or multimodal architectures (Mo et al., 2024).