Revisiting Feature Prediction for Learning Visual Representations from Video (2404.08471v1)

Published 15 Feb 2024 in cs.CV, cs.AI, and cs.LG

Abstract: This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaptation of the model's parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.

Summary

  • The paper demonstrates that feature prediction on raw video data yields robust visual representations without relying on pre-trained image encoders.
  • It details the V-JEPA methodology, adapting a joint embedding architecture for video by masking spatio-temporal patches to predict target features.
  • Evaluations on Kinetics-400, Something-Something-v2, and ImageNet-1K indicate competitive accuracies using a frozen backbone.

This paper investigates the efficacy of feature prediction as a primary objective for unsupervised visual representation learning directly from video data (2404.08471). The work introduces V-JEPA (Video Joint Embedding Predictive Architecture), a framework trained exclusively via a feature prediction loss. Notably, this approach eschews common techniques such as pre-trained image encoders (like CLIP or ImageNet pre-training), the use of text data, negative sampling strategies prevalent in contrastive learning, or pixel-level reconstruction objectives.

V-JEPA Methodology

The core idea of V-JEPA adapts the Joint Embedding Predictive Architecture (JEPA) paradigm, originally proposed for static images, to the video domain. The architecture comprises three main components (a minimal code sketch follows the list):

  1. Context Encoder: Processes a spatio-temporal context block (a subset of video patches) and outputs a representation summarizing this visible context.
  2. Predictor: Takes the context representation as input and predicts the representations of target blocks (masked-out portions of the video). The predictor is typically a lighter-weight network (e.g., a shallow transformer) compared to the encoder.
  3. Target Encoder: Computes the target representations for the masked blocks. Crucially, the target encoder shares weights with the context encoder, but its gradients are stopped during backpropagation. This ensures the target representations remain stable within an optimization step, providing a consistent prediction objective.
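
A minimal PyTorch sketch of these interfaces is given below. It is not the authors' implementation: the dimensions, the use of nn.TransformerEncoder as a stand-in for the video ViT blocks, and the way target positions are fed to the predictor are illustrative assumptions. Per the description above, there is no separate target-encoder module in this sketch; target features are produced by the same encoder with gradients stopped (see the training-step sketch later in this section).

```python
# Illustrative sketch only: sizes and the nn.TransformerEncoder stand-in
# are assumptions, not the paper's architecture.
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Maps (already embedded) spatio-temporal patch tokens to features."""
    def __init__(self, dim=768, depth=12, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens):                    # tokens: (B, N, dim)
        return self.blocks(tokens)


class Predictor(nn.Module):
    """Lighter-weight transformer that predicts target-block features from
    the context representation plus tokens marking the target positions."""
    def __init__(self, dim=768, pred_dim=384, depth=6, heads=6):
        super().__init__()
        self.proj_in = nn.Linear(dim, pred_dim)
        layer = nn.TransformerEncoderLayer(d_model=pred_dim, nhead=heads,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj_out = nn.Linear(pred_dim, dim)

    def forward(self, context_repr, target_pos_tokens):
        # Jointly process context features and target-position tokens,
        # then keep only the slots corresponding to the masked targets.
        x = torch.cat([self.proj_in(context_repr),
                       self.proj_in(target_pos_tokens)], dim=1)
        x = self.blocks(x)
        preds = x[:, context_repr.shape[1]:]
        return self.proj_out(preds)
```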

The learning process involves the following steps:

  • A video clip is sampled and divided into spatio-temporal patches.
  • Multiple non-overlapping target blocks are masked out. The remaining patches form the context block.
  • The context encoder processes the visible context patches.
  • The target encoder processes the masked target patches to compute the target features.
  • The predictor takes the context representation and the positions of the target blocks as input and generates predicted features for each target block.
  • The loss function minimizes the L2 distance between the predicted features and the target features, aggregated over all target blocks.

L = \sum_{i \in \text{masked blocks}} \left\| \text{Predictor}\big(\text{Encoder}(\text{Context}),\ \text{Pos}_i\big) - \text{StopGrad}\big(\text{Encoder}(\text{Target}_i)\big) \right\|_2^2

This self-supervised objective forces the model to learn internal representations that capture the underlying structure and dynamics within the video, enabling the prediction of missing spatio-temporal content at the feature level. The masking strategy encourages the model to develop high-level, semantic understanding rather than relying on low-level pixel correlations for reconstruction.
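
To make the objective concrete, the sketch below runs one hypothetical training step in PyTorch. The tiny MLP encoder and predictor, the random 50% mask, and the zeroing of masked tokens in the context path are simplifications chosen only to keep the example self-contained and runnable; they are not the paper's masking strategy or architecture. The key points it mirrors are the shared encoder weights, the stop-gradient on the target path, and the L2 loss restricted to masked positions.

```python
# Hypothetical training step for the feature-prediction objective above.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 256
encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

tokens = torch.randn(8, 196, dim)      # (batch, patches, dim): embedded video patches
mask = torch.rand(8, 196) < 0.5        # True = masked target, False = visible context

# Context path: encode only the visible patches (masked ones zeroed for simplicity).
context_repr = encoder(tokens * (~mask).unsqueeze(-1))

# Target path: same encoder weights, gradients stopped.
with torch.no_grad():
    target_repr = encoder(tokens)

# Predict features at every position, then penalize only the masked ones.
pred = predictor(context_repr)
loss = F.mse_loss(pred[mask], target_repr[mask])   # L2 distance on target blocks

opt.zero_grad()
loss.backward()
opt.step()
print(f"loss: {loss.item():.4f}")
```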

Training and Architecture Details

The V-JEPA models were trained on a large-scale dataset comprising 2 million unlabeled videos sourced from publicly available datasets. The paper utilized Vision Transformer (ViT) architectures as the backbone for the encoders. The largest model reported employed a ViT-H/16 architecture. The training exclusively relied on the feature prediction objective described above, without incorporating any external supervision or pre-trained weights. This "video-only" training regime is a key aspect of the work, aiming to demonstrate the power of learning directly from temporal dynamics and spatial context inherent in video data.
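
For a rough sense of scale, the dictionary below pairs the standard ViT-Huge/16 dimensions with assumed video-specific settings; the clip length, frame resolution, and temporal patch size are illustrative guesses, not figures quoted from the paper.

```python
# ViT-H/16 backbone dimensions (standard ViT-Huge); video-specific entries
# are assumptions for illustration, not values taken from the paper.
vit_h16_video_config = {
    "img_size": 224,       # assumed per-frame resolution
    "num_frames": 16,      # assumed clip length
    "patch_size": 16,      # the "/16": 16x16 spatial patches
    "tubelet_size": 2,     # assumed temporal extent of each 3D patch
    "embed_dim": 1280,     # ViT-Huge hidden size
    "depth": 32,           # ViT-Huge transformer blocks
    "num_heads": 16,       # ViT-Huge attention heads
}

# Token count per clip under these assumptions: 14*14 spatial x 8 temporal.
tokens_per_clip = (224 // 16) ** 2 * (16 // 2)   # = 1568
```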

Evaluation and Performance

The efficacy of the learned representations was evaluated on a diverse set of downstream tasks spanning both image and video domains. A significant aspect of the evaluation protocol was the use of a frozen backbone. This means the pre-trained V-JEPA encoder weights were kept fixed, and only lightweight linear classifiers or adapters were trained on top for each specific downstream task. This evaluation methodology specifically probes the quality and generalizability of the learned representations themselves, independent of task-specific fine-tuning of the entire network.
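
A minimal sketch of this frozen-backbone protocol is shown below, assuming a generic pretrained encoder that returns per-patch features; the average pooling and plain linear head are simplified placeholders for the lightweight probes described above, and the data loader is left abstract.

```python
# Frozen-backbone probing sketch: the encoder stays fixed, only the head trains.
import torch
import torch.nn as nn

def train_frozen_probe(encoder: nn.Module, train_loader, feat_dim: int, num_classes: int):
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad_(False)                  # freeze the pretrained backbone

    probe = nn.Linear(feat_dim, num_classes)     # only this classifier is trained
    opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for clips, labels in train_loader:
        with torch.no_grad():
            feats = encoder(clips)               # (B, N, feat_dim) patch features
        feats = feats.mean(dim=1)                # simple average pooling over tokens
        loss = loss_fn(probe(feats), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe
```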

The results demonstrate strong performance across tasks demanding different capabilities:

  • Action Recognition (Kinetics-400): The ViT-H/16 V-JEPA model achieved 81.9% top-1 accuracy. This indicates the learned representations effectively capture motion patterns and appearance cues relevant for classifying human actions.
  • Temporal Reasoning (Something-Something-v2): The same model obtained 72.2% top-1 accuracy. Success on SSv2 is particularly noteworthy as it heavily relies on understanding temporal relationships and object interactions, suggesting the feature prediction objective successfully internalized motion dynamics.
  • Image Classification (ImageNet-1K): The model achieved 77.9% top-1 accuracy under the linear probing protocol. This result is compelling because the model was trained exclusively on videos without any explicit image-based pre-training, yet it yields strong performance on a standard image benchmark, highlighting the versatility of the learned visual features.

These results collectively suggest that learning solely by predicting spatio-temporal features in video leads to robust and versatile representations applicable to both motion-centric and appearance-based tasks without requiring parameter adaptation.

Contributions and Significance

The primary contribution of this work is the empirical demonstration that a pure feature prediction objective, implemented within the V-JEPA framework, is sufficient for learning high-quality visual representations from large-scale video data. It challenges the necessity of prevalent techniques like contrastive learning (which requires careful negative sampling), generative reconstruction (which can focus on low-level details), or reliance on pre-trained image encoders or multi-modal data (like text).

By achieving strong performance on diverse benchmarks using a frozen backbone trained only on video feature prediction, the paper underscores the potential of self-supervised learning methods that focus on understanding and predicting the inherent structure of the visual world as presented in video sequences. The results indicate that motion and temporal consistency provide a powerful supervisory signal that can be effectively leveraged through predictive objectives at the feature level.

Conclusion

"Revisiting Feature Prediction for Learning Visual Representations from Video" (2404.08471) presents V-JEPA, a self-supervised learning approach based solely on feature prediction in videos. Trained without pre-trained encoders, negative samples, or reconstruction losses, V-JEPA demonstrates that predicting masked spatio-temporal features is a highly effective method for acquiring versatile visual representations. The strong performance achieved with frozen backbones on Kinetics-400, Something-Something-v2, and even ImageNet-1K validates the approach and suggests that feature prediction is a potent standalone objective for unsupervised representation learning from video.
