Revisiting Feature Prediction for Learning Visual Representations from Video (2404.08471v1)

Published 15 Feb 2024 in cs.CV, cs.AI, and cs.LG

Abstract: This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaptation of the model's parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.

Summary

  • The paper demonstrates that feature prediction on raw video data yields robust visual representations without relying on pre-trained image encoders.
  • It details the V-JEPA methodology, adapting a joint embedding architecture for video by masking spatio-temporal patches to predict target features.
  • Evaluations on Kinetics-400, Something-Something-v2, and ImageNet-1K indicate competitive accuracies using a frozen backbone.

This paper investigates the efficacy of feature prediction as a primary objective for unsupervised visual representation learning directly from video data (2404.08471). The work introduces V-JEPA (Video Joint Embedding Predictive Architecture), a framework trained exclusively via a feature prediction loss. Notably, this approach eschews common techniques such as pre-trained image encoders (like CLIP or ImageNet pre-training), the use of text data, negative sampling strategies prevalent in contrastive learning, or pixel-level reconstruction objectives.

V-JEPA Methodology

The core idea of V-JEPA adapts the Joint Embedding Predictive Architecture (JEPA) paradigm, originally proposed for static images, to the video domain. The architecture comprises three main components (a minimal code sketch follows the list):

  1. Context Encoder: Processes a spatio-temporal context block (a subset of video patches) and outputs a representation summarizing this visible context.
  2. Predictor: Takes the context representation as input and predicts the representations of target blocks (masked-out portions of the video). The predictor is typically a lighter-weight network (e.g., a shallow transformer) compared to the encoder.
  3. Target Encoder: Computes the target representations for the masked blocks. Crucially, the target encoder shares weights with the context encoder, but its gradients are stopped during backpropagation. This ensures the target representations remain stable within an optimization step, providing a consistent prediction objective.
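
A minimal PyTorch sketch of these interfaces is given below. It is not the authors' implementation: the dimensions, the use of nn.TransformerEncoder as a stand-in for the video ViT blocks, and the way target positions are fed to the predictor are illustrative assumptions. Per the description above, there is no separate target-encoder module in this sketch; target features are produced by the same encoder with gradients stopped (see the training-step sketch later in this section).

```python
# Illustrative sketch only: sizes and the nn.TransformerEncoder stand-in
# are assumptions, not the paper's architecture.
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Maps (already embedded) spatio-temporal patch tokens to features."""
    def __init__(self, dim=768, depth=12, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens):                    # tokens: (B, N, dim)
        return self.blocks(tokens)


class Predictor(nn.Module):
    """Lighter-weight transformer that predicts target-block features from
    the context representation plus tokens marking the target positions."""
    def __init__(self, dim=768, pred_dim=384, depth=6, heads=6):
        super().__init__()
        self.proj_in = nn.Linear(dim, pred_dim)
        layer = nn.TransformerEncoderLayer(d_model=pred_dim, nhead=heads,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj_out = nn.Linear(pred_dim, dim)

    def forward(self, context_repr, target_pos_tokens):
        # Jointly process context features and target-position tokens,
        # then keep only the slots corresponding to the masked targets.
        x = torch.cat([self.proj_in(context_repr),
                       self.proj_in(target_pos_tokens)], dim=1)
        x = self.blocks(x)
        preds = x[:, context_repr.shape[1]:]
        return self.proj_out(preds)
```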

The learning process involves the following steps:

  • A video clip is sampled and divided into spatio-temporal patches.
  • Multiple non-overlapping target blocks are masked out. The remaining patches form the context block.
  • The context encoder processes the visible context patches.
  • The target encoder processes the masked target patches to compute the target features.
  • The predictor takes the context representation and the positions of the target blocks as input and generates predicted features for each target block.
  • The loss function minimizes the L2 distance between the predicted features and the target features, aggregated over all target blocks.

L = \sum_{i \in \text{masked blocks}} \left\| \text{Predictor}\big(\text{Encoder}(\text{Context}),\ \text{Pos}_i\big) - \text{StopGrad}\big(\text{Encoder}(\text{Target}_i)\big) \right\|_2^2

This self-supervised objective forces the model to learn internal representations that capture the underlying structure and dynamics within the video, enabling the prediction of missing spatio-temporal content at the feature level. The masking strategy encourages the model to develop high-level, semantic understanding rather than relying on low-level pixel correlations for reconstruction.
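
To make the objective concrete, the sketch below runs one hypothetical training step in PyTorch. The tiny MLP encoder and predictor, the random 50% mask, and the zeroing of masked tokens in the context path are simplifications chosen only to keep the example self-contained and runnable; they are not the paper's masking strategy or architecture. The key points it mirrors are the shared encoder weights, the stop-gradient on the target path, and the L2 loss restricted to masked positions.

```python
# Hypothetical training step for the feature-prediction objective above.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 256
encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

tokens = torch.randn(8, 196, dim)      # (batch, patches, dim): embedded video patches
mask = torch.rand(8, 196) < 0.5        # True = masked target, False = visible context

# Context path: encode only the visible patches (masked ones zeroed for simplicity).
context_repr = encoder(tokens * (~mask).unsqueeze(-1))

# Target path: same encoder weights, gradients stopped.
with torch.no_grad():
    target_repr = encoder(tokens)

# Predict features at every position, then penalize only the masked ones.
pred = predictor(context_repr)
loss = F.mse_loss(pred[mask], target_repr[mask])   # L2 distance on target blocks

opt.zero_grad()
loss.backward()
opt.step()
print(f"loss: {loss.item():.4f}")
```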

Training and Architecture Details

The V-JEPA models were trained on a large-scale dataset comprising 2 million unlabeled videos sourced from publicly available datasets. The paper utilized Vision Transformer (ViT) architectures as the backbone for the encoders. The largest model reported employed a ViT-H/16 architecture. The training exclusively relied on the feature prediction objective described above, without incorporating any external supervision or pre-trained weights. This "video-only" training regime is a key aspect of the work, aiming to demonstrate the power of learning directly from temporal dynamics and spatial context inherent in video data.
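
For a rough sense of scale, the dictionary below pairs the standard ViT-Huge/16 dimensions with assumed video-specific settings; the clip length, frame resolution, and temporal patch size are illustrative guesses, not figures quoted from the paper.

```python
# ViT-H/16 backbone dimensions (standard ViT-Huge); video-specific entries
# are assumptions for illustration, not values taken from the paper.
vit_h16_video_config = {
    "img_size": 224,       # assumed per-frame resolution
    "num_frames": 16,      # assumed clip length
    "patch_size": 16,      # the "/16": 16x16 spatial patches
    "tubelet_size": 2,     # assumed temporal extent of each 3D patch
    "embed_dim": 1280,     # ViT-Huge hidden size
    "depth": 32,           # ViT-Huge transformer blocks
    "num_heads": 16,       # ViT-Huge attention heads
}

# Token count per clip under these assumptions: 14*14 spatial x 8 temporal.
tokens_per_clip = (224 // 16) ** 2 * (16 // 2)   # = 1568
```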

Evaluation and Performance

The efficacy of the learned representations was evaluated on a diverse set of downstream tasks spanning both image and video domains. A significant aspect of the evaluation protocol was the use of a frozen backbone. This means the pre-trained V-JEPA encoder weights were kept fixed, and only lightweight linear classifiers or adapters were trained on top for each specific downstream task. This evaluation methodology specifically probes the quality and generalizability of the learned representations themselves, independent of task-specific fine-tuning of the entire network.
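
A minimal sketch of this frozen-backbone protocol is shown below, assuming a generic pretrained encoder that returns per-patch features; the average pooling and plain linear head are simplified placeholders for the lightweight probes described above, and the data loader is left abstract.

```python
# Frozen-backbone probing sketch: the encoder stays fixed, only the head trains.
import torch
import torch.nn as nn

def train_frozen_probe(encoder: nn.Module, train_loader, feat_dim: int, num_classes: int):
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad_(False)                  # freeze the pretrained backbone

    probe = nn.Linear(feat_dim, num_classes)     # only this classifier is trained
    opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for clips, labels in train_loader:
        with torch.no_grad():
            feats = encoder(clips)               # (B, N, feat_dim) patch features
        feats = feats.mean(dim=1)                # simple average pooling over tokens
        loss = loss_fn(probe(feats), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe
```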

The results demonstrate strong performance across tasks demanding different capabilities:

  • Action Recognition (Kinetics-400): The ViT-H/16 V-JEPA model achieved 81.9% top-1 accuracy. This indicates the learned representations effectively capture motion patterns and appearance cues relevant for classifying human actions.
  • Temporal Reasoning (Something-Something-v2): The same model obtained 72.2% top-1 accuracy. Success on SSv2 is particularly noteworthy as it heavily relies on understanding temporal relationships and object interactions, suggesting the feature prediction objective successfully internalized motion dynamics.
  • Image Classification (ImageNet-1K): The model achieved 77.9% top-1 accuracy under the linear probing protocol. This result is compelling because the model was trained exclusively on videos without any explicit image-based pre-training, yet it yields strong performance on a standard image benchmark, highlighting the versatility of the learned visual features.

These results collectively suggest that learning solely by predicting spatio-temporal features in video leads to robust and versatile representations applicable to both motion-centric and appearance-based tasks without requiring parameter adaptation.

Contributions and Significance

The primary contribution of this work is the empirical demonstration that a pure feature prediction objective, implemented within the V-JEPA framework, is sufficient for learning high-quality visual representations from large-scale video data. It challenges the necessity of prevalent techniques like contrastive learning (which requires careful negative sampling), generative reconstruction (which can focus on low-level details), or reliance on pre-trained image encoders or multi-modal data (like text).

By achieving strong performance on diverse benchmarks using a frozen backbone trained only on video feature prediction, the paper underscores the potential of self-supervised learning methods that focus on understanding and predicting the inherent structure of the visual world as presented in video sequences. The results indicate that motion and temporal consistency provide a powerful supervisory signal that can be effectively leveraged through predictive objectives at the feature level.

Conclusion

"Revisiting Feature Prediction for Learning Visual Representations from Video" (2404.08471) presents V-JEPA, a self-supervised learning approach based solely on feature prediction in videos. Trained without pre-trained encoders, negative samples, or reconstruction losses, V-JEPA demonstrates that predicting masked spatio-temporal features is a highly effective method for acquiring versatile visual representations. The strong performance achieved with frozen backbones on Kinetics-400, Something-Something-v2, and even ImageNet-1K validates the approach and suggests that feature prediction is a potent standalone objective for unsupervised representation learning from video.
