Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (2301.08243v3)
Abstract: This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
Summary
- The paper introduces I-JEPA, a self-supervised framework that learns semantic image features by predicting abstract representations from large context blocks.
- It eliminates the need for hand-crafted data augmentations by using a multi-block masking strategy that focuses on high-level semantic prediction over pixel details.
- Experimental results show competitive ImageNet performance and low-shot accuracy while achieving improved computational efficiency.
The Image-based Joint-Embedding Predictive Architecture (I-JEPA) is a self-supervised learning (SSL) method designed to learn semantic image representations without relying on hand-crafted data augmentations, unlike prevalent contrastive or invariance-based methods (2301.08243). It also diverges from generative masking approaches like Masked Autoencoders (MAE) by performing prediction in an abstract representation space rather than reconstructing raw pixels or discrete tokens. The core principle involves predicting the representations of multiple target image blocks using the representation derived from a single, larger context block from the same image. This predictive task, performed entirely within the embedding space, guides the model towards capturing semantic information while potentially discarding low-level pixel details.
Methodology
The I-JEPA framework employs three main components, typically implemented using Vision Transformers (ViTs):
- Context Encoder (fθ): This network processes a partial view of the image, designated as the context block. It takes the visible patches from the context block as input and outputs patch-level representations.
- Target Encoder (fθˉ): This network generates the prediction targets. Crucially, its parameters (θˉ) are updated as an exponential moving average (EMA) of the context encoder parameters (θ); this momentum-based update helps prevent representational collapse, a common failure mode of joint-embedding architectures. The target encoder processes the entire image to produce representations for all patches, and the target representations for specific blocks are then extracted from this full-image output.
- Predictor (gϕ): This network takes the output of the context encoder (fθ) for the visible context patches, together with learnable mask tokens augmented with positional embeddings that indicate the spatial locations of the target blocks. It predicts the representations of those target blocks as produced by the target encoder. The predictor is typically narrower and shallower than the encoders.
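As a concrete illustration, a minimal PyTorch sketch of how these three components might be wired together is shown below. The generic `encoder` and `predictor` modules, the container class name, and the EMA coefficient are illustrative assumptions, not the authors' implementation.

```python
import copy
import torch
import torch.nn as nn

class IJEPAModules(nn.Module):
    """Illustrative container for the three I-JEPA components (not the official code)."""

    def __init__(self, encoder: nn.Module, predictor: nn.Module):
        super().__init__()
        self.context_encoder = encoder                  # f_theta: trained by gradient descent
        self.target_encoder = copy.deepcopy(encoder)    # f_theta_bar: EMA copy of the context encoder
        for p in self.target_encoder.parameters():
            p.requires_grad = False                     # gradients never reach the target encoder
        self.predictor = predictor                      # g_phi: narrower/shallower than the encoders

    @torch.no_grad()
    def update_target_encoder(self, tau: float = 0.996):
        """EMA update: theta_bar <- tau * theta_bar + (1 - tau) * theta."""
        for p_bar, p in zip(self.target_encoder.parameters(),
                            self.context_encoder.parameters()):
            p_bar.mul_(tau).add_(p, alpha=1.0 - tau)
```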
The training process unfolds as follows:
- Target Sampling: Multiple target blocks (e.g., 4 blocks, each covering 15-20% of the image area) are randomly sampled from an image x.
- Target Representation Generation: The full image x is passed through the target encoder fθˉ to obtain representations sy=fθˉ(x). The representations corresponding to the patches within the sampled target blocks, denoted as yi for target i, are selected. These serve as the ground truth targets for the prediction task.
- Context Sampling: A single, larger context block (e.g., covering 85-100% of the image area) is sampled from the same image x. To ensure a non-trivial prediction task, any patches within the context block that spatially overlap with any of the target blocks are removed (masked out).
- Context Representation Generation: Only the remaining visible patches of the context block are fed into the context encoder fθ, producing representations sx.
- Prediction: For each target block i, the predictor gϕ takes the context representations sx together with learnable mask tokens carrying the positional information of target i, and outputs predicted representations ŷ_i = gϕ(sx, mask_i).
- Loss Calculation: The loss is the average squared L2 distance between the predicted and target representations across all M target blocks:
$$\mathcal{L}(\theta, \phi) = \frac{1}{M} \sum_{i=1}^{M} \left\lVert \hat{y}_i - y_i \right\rVert_2^2$$
- Optimization: The parameters of the context encoder (θ) and the predictor (ϕ) are updated via stochastic gradient descent to minimize the loss L. The target encoder parameters (θˉ) are updated using the EMA rule: θˉ←τθˉ+(1−τ)θ, where τ is a momentum coefficient typically close to 1.
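Putting the steps together, a simplified training iteration might look like the sketch below, reusing the `IJEPAModules` container from the earlier sketch. The helpers `sample_target_blocks` and `sample_context_indices`, the predictor call signature, and all tensor shapes are illustrative assumptions rather than the reference implementation; a sketch of the block sampling itself appears after the masking discussion below.

```python
import torch
import torch.nn.functional as F

def ijepa_train_step(model, optimizer, patches, pos_embed,
                     sample_target_blocks, sample_context_indices, tau=0.996):
    """One simplified I-JEPA iteration (illustrative sketch, not the reference code).

    patches:   (B, N, D) patchified image tokens for one batch
    pos_embed: (1, N, D) positional embeddings for the N patch locations
    """
    # Steps 1-2: encode the full image with the EMA target encoder and gather
    # the representations of the M sampled target blocks (ground-truth targets).
    with torch.no_grad():
        s_y = model.target_encoder(patches + pos_embed)             # (B, N, D)
    target_blocks = sample_target_blocks()                          # list of M patch-index lists

    # Steps 3-4: sample a large context block, drop every patch that overlaps a
    # target block, and encode only the remaining visible patches.
    visible = sample_context_indices(target_blocks)                 # visible patch indices
    s_x = model.context_encoder((patches + pos_embed)[:, visible])  # (B, K, D)

    # Steps 5-6: predict each target block from the context representation plus
    # mask tokens carrying the target positions; average the squared L2 loss.
    loss = 0.0
    for idx in target_blocks:
        y_i = s_y[:, idx]                                           # (B, T_i, D) targets
        y_hat_i = model.predictor(s_x, pos_embed[:, idx])           # (B, T_i, D) predictions
        loss = loss + F.mse_loss(y_hat_i, y_i)
    loss = loss / len(target_blocks)

    # Step 7: update context encoder and predictor by gradient descent,
    # then EMA-update the target encoder.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    model.update_target_encoder(tau)
    return loss.item()
```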
A critical aspect of I-JEPA is the masking strategy. The paper emphasizes that selecting relatively large target blocks encourages the prediction of semantic concepts rather than fine-grained details. Concurrently, using a sufficiently large and informative context block (spatially distributed) provides the necessary information for successful prediction. The proposed "multi-block" strategy (multiple targets predicted from one large context block, with target-context overlap removed) proved effective.
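The multi-block sampling itself can be sketched as follows; the patch-grid size, scale ranges, and aspect-ratio ranges are illustrative values in the spirit of those quoted above, not the exact published configuration.

```python
import random

def sample_block(grid_h, grid_w, scale_range, aspect_range):
    """Sample a rectangular block of patch indices on a grid_h x grid_w patch grid."""
    scale = random.uniform(*scale_range)                  # fraction of total image area
    aspect = random.uniform(*aspect_range)                # block height / width ratio
    area = scale * grid_h * grid_w
    h = max(1, min(grid_h, round((area * aspect) ** 0.5)))
    w = max(1, min(grid_w, round((area / aspect) ** 0.5)))
    top, left = random.randint(0, grid_h - h), random.randint(0, grid_w - w)
    return {r * grid_w + c for r in range(top, top + h) for c in range(left, left + w)}

def sample_multiblock_masks(grid_h=14, grid_w=14, num_targets=4):
    """Multi-block strategy: several semantic-scale targets plus one large context
    block with all target patches removed, keeping the prediction task non-trivial."""
    targets = [sample_block(grid_h, grid_w, (0.15, 0.2), (0.75, 1.5))
               for _ in range(num_targets)]
    context = sample_block(grid_h, grid_w, (0.85, 1.0), (1.0, 1.0))
    visible = sorted(context - set().union(*targets))     # context minus all target patches
    return targets, visible
```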
Key Innovations and Contributions
I-JEPA introduces several key contributions to the field of self-supervised learning from images:
- Augmentation-Free Semantic Learning: It demonstrates the feasibility of learning high-quality semantic representations without recourse to engineered data augmentations common in invariance-based methods (e.g., random cropping, color jittering, blur). This potentially reduces the inductive biases imposed by specific augmentation choices, possibly leading to more generalizable representations.
- Prediction in Abstract Representation Space: Shifting the predictive objective from the pixel or token level (as in MAE, BEiT) to an abstract representation space (output of the target encoder) is a core innovation. This encourages the model to focus on higher-level, semantic information by abstracting away pixel-level redundancy and noise. The ablation studies presented strongly support the superiority of representation prediction over pixel prediction within this architecture.
- Computational Efficiency: I-JEPA is presented as computationally efficient compared with both generative and invariance-based methods. Training a ViT-H/14 on ImageNet reportedly requires fewer than 1,200 A100 GPU hours, which the paper reports as over 10x more efficient than MAE with the same backbone and over 2.5x faster than multi-view methods such as iBOT even when those use smaller models, facilitating scaling to larger models (ViT-Huge, ViT-Giant) and datasets.
- Versatile Representations: The learned representations demonstrate strong performance across a diverse set of downstream tasks, including high-level semantic tasks (linear classification, low-shot classification) and low-level spatial reasoning tasks (object counting, depth prediction). This contrasts with some methods that excel primarily on one category of tasks.
Experimental Results and Evaluation
The empirical evaluation demonstrates I-JEPA's effectiveness across standard benchmarks (a minimal sketch of the linear-probing protocol used in several of these evaluations appears after the list):
- ImageNet Performance:
- Linear Probing: With a ViT-H/14 backbone, I-JEPA achieved 79.3% top-1 accuracy, outperforming MAE ViT-H/14 (77.2%). Scaling up to a ViT-H/16 trained at 448×448 resolution reached 81.1%, competitive with leading augmentation-based methods.
- Low-Shot (1% ImageNet): I-JEPA ViT-H/14 achieved 73.3% top-1, surpassing MAE ViT-H/14 (71.5%). The ViT-H/16 model at 448×448 resolution achieved 77.3%, exceeding results reported for MSN and DINO.
- Transfer Learning: On linear probing evaluations across CIFAR100, Places205, and iNat18, I-JEPA consistently outperformed MAE and data2vec. It achieved performance comparable to or exceeding augmentation-based methods like DINO on several datasets (e.g., CIFAR100, Places205).
- Low-Level Vision Tasks (Clevr): I-JEPA demonstrated a notable advantage on tasks requiring spatial understanding. With a ViT-H/14 backbone, it achieved state-of-the-art results on Clevr/Count (90.0%) and Clevr/Dist (74.6%), significantly outperforming invariance-based methods like DINO and iBOT, while remaining competitive with MAE. This suggests I-JEPA retains more local spatial information compared to methods relying solely on enforcing invariance across augmented views.
- Efficiency: The training time for ViT-H/14 was reported as under 72 hours on 16 A100 GPUs (16 GPUs × 72 hours ≈ 1,150 GPU hours, consistent with the < 1200 GPU-hour figure), substantiating the claims of improved computational efficiency.
- Ablation Studies: Key findings include:
- Predicting raw RGB pixel values instead of target encoder representations leads to a drastic drop in performance, highlighting the benefit of prediction in the abstract space.
- The proposed multi-block masking strategy (multiple targets, one large context minus targets) outperformed simpler alternatives like predicting a single block or using random masking patterns.
- Visualizations indicated the predictor learns to infer high-level object properties like shape and pose, even under positional uncertainty.
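For reference, the linear-probing protocol behind several of the numbers above freezes the pretrained encoder and trains only a linear classifier on top of its features. A minimal sketch follows, with average pooling over patch representations as an assumed feature-aggregation choice:

```python
import torch
import torch.nn as nn

def linear_probe_logits(frozen_encoder: nn.Module, head: nn.Linear,
                        patches: torch.Tensor, pos_embed: torch.Tensor) -> torch.Tensor:
    """Linear probing: the encoder stays frozen; only the linear head is trained."""
    with torch.no_grad():
        feats = frozen_encoder(patches + pos_embed)   # (B, N, D) patch-level representations
    pooled = feats.mean(dim=1)                        # (B, D) average-pooled image feature (assumed)
    return head(pooled)                               # (B, num_classes) class logits
```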
Position within Self-Supervised Learning
I-JEPA represents a distinct direction within SSL, positioned between purely generative reconstruction methods and augmentation-driven invariance methods. It leverages the concept of prediction, similar to generative models, but applies it within a joint-embedding architecture using an abstract target space, akin to invariance methods but without enforcing strict invariance.
By eschewing hand-crafted augmentations, I-JEPA avoids the potential limitations and biases introduced by augmentation pipelines. By predicting abstract representations rather than pixels/tokens, it avoids expending model capacity on reconstructing fine-grained, potentially irrelevant details, enabling more efficient learning of semantic features. The strong performance on both high-level semantic tasks and low-level spatial tasks suggests that the architecture effectively captures a rich hierarchy of features. Its computational efficiency further enhances its appeal, particularly for scaling to very large models and datasets. I-JEPA thus offers a compelling framework that combines advantages often associated separately with generative and invariance-based approaches.
In conclusion, the I-JEPA framework provides a method for learning semantic image representations through prediction in an abstract embedding space, removing the reliance on data augmentations. Its demonstrated efficiency, scalability, and strong performance across diverse downstream tasks position it as a significant development in self-supervised learning.
Related Papers
- Learning and Leveraging World Models in Visual Representation Learning (2024)
- Revisiting Feature Prediction for Learning Visual Representations from Video (2024)
- CNN-JEPA: Self-Supervised Pretraining Convolutional Neural Networks Using Joint Embedding Predictive Architecture (2024)
- Video Representation Learning with Joint-Embedding Predictive Architectures (2024)
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning (2025)