VideoPrism: A Foundational Visual Encoder for Video Understanding (2402.13217v3)

Published 20 Feb 2024 in cs.CV and cs.AI

Abstract: We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding by global-local distillation of semantic video embeddings and a token shuffling scheme, enabling VideoPrism to focus primarily on the video modality while leveraging the invaluable text associated with videos. We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks. Our models are released at https://github.com/google-deepmind/videoprism.

Summary

  • The paper introduces VideoPrism, a general-purpose video encoder that achieves state-of-the-art results on 31 of 33 evaluated benchmarks.
  • It employs a two-stage pretraining approach that merges vision-language contrastive learning with masked video modeling for deep semantic capture.
  • Its scalable Vision Transformer architecture, pretrained on 36M video-caption pairs and 582M clips with noisy parallel text, delivers robust performance on diverse video analysis tasks.

Introducing VideoPrism: A General-purpose Video Encoder Achieving State-of-the-Art Performance across a Wide Spectrum of Video Understanding Tasks

Overview of VideoPrism

Within the landscape of video foundation models (ViFMs), the quest for a truly generalizable and high-performing video encoder has been ongoing. In response to this challenge, the paper introduces VideoPrism, a general-purpose video encoder designed to comprehend and analyze videos across a broad spectrum of tasks, including classification, localization, retrieval, captioning, and question answering. VideoPrism achieves state-of-the-art performance on 31 out of 33 evaluated video understanding benchmarks, spanning domains from web videos to scientific datasets in fields such as neuroscience and ecology.

Pretraining Strategy and Architectural Insights

Data Preparation and Model Training

A core insight behind VideoPrism is the importance of pretraining data for foundation models. The paper articulates a strategy for assembling a large and varied pretraining corpus consisting of 36 million high-quality video-caption pairs alongside 582 million video clips paired with noisier text, such as Automatic Speech Recognition (ASR) transcripts. This corpus underpins VideoPrism's training regime, enabling it to capture both the motion and appearance cues essential for understanding complex video content. Building on this data, the paper describes a two-stage training approach that combines vision-language contrastive learning with masked video modeling, enhanced by token shuffling and global-local distillation. Together, these mechanisms let VideoPrism capture video semantics at multiple granularities.
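
To make the two-stage recipe concrete, the sketch below illustrates the two objectives in plain NumPy: a stage-1 symmetric video-text contrastive loss, and a stage-2 global-local distillation loss in which a student fed a masked (and, per the paper, shuffled) view regresses the frozen stage-1 teacher's embeddings. This is a minimal sketch under assumptions, not the authors' implementation; all function names, shapes, and default values are hypothetical, and details such as the masking ratio, shuffling mechanics, and loss weighting follow the paper rather than this code.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def stage1_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Stage 1 (sketch): symmetric video-text contrastive (InfoNCE-style) loss.
    video_emb, text_emb: (B, D) pooled embeddings for B paired clips/captions."""
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature            # (B, B) cosine-similarity logits
    idx = np.arange(len(v))                   # matching pairs lie on the diagonal
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2v = -log_softmax(logits.T, axis=1)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)

def stage2_distillation_loss(student_tokens, student_global,
                             teacher_tokens, teacher_global,
                             mask, local_w=1.0, global_w=1.0):
    """Stage 2 (loose sketch): the student, given a masked/shuffled view of the video,
    regresses the frozen stage-1 teacher's token-wise ("local") and pooled
    video-level ("global") embeddings.
    student_tokens/teacher_tokens: (B, N, D); *_global: (B, D);
    mask: (B, N), 1 on tokens hidden from the student."""
    local_err = np.square(l2_normalize(student_tokens) -
                          l2_normalize(teacher_tokens)).sum(-1)        # (B, N)
    local = (local_err * mask).sum() / np.maximum(mask.sum(), 1.0)
    global_err = np.square(l2_normalize(student_global) -
                           l2_normalize(teacher_global)).sum(-1)       # (B,)
    return local_w * local + global_w * global_err.mean()
```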

Architectural Design

Building on the Vision Transformer (ViT), VideoPrism adopts a factorized design that models the spatial and temporal dimensions separately, which is essential for tasks requiring fine-grained video understanding. Experiments with two configurations, VideoPrism-B and VideoPrism-g, show that performance scales with model size, highlighting the role of capacity in achieving strong results across varied benchmarks.
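
As a toy illustration of the factorization idea (spatial attention within each frame, followed by temporal attention across frames at each patch position), consider the NumPy sketch below. It deliberately omits learned projections, multi-head attention, LayerNorm, and MLP sublayers, and none of the shapes or names reflect VideoPrism's actual configuration.

```python
import numpy as np

def attention(x):
    """Single-head scaled dot-product self-attention over the second-to-last axis
    (learned projections omitted for brevity)."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ x

def factorized_encoder(video_tokens):
    """video_tokens: (T, N, D) = frames x patches-per-frame x embedding dim.
    Spatial attention mixes patches within each frame; temporal attention then
    mixes the same patch position across frames (ViViT-style factorization)."""
    x = attention(video_tokens)         # spatial: attends over N within each frame
    x = np.swapaxes(x, 0, 1)            # (N, T, D)
    x = attention(x)                    # temporal: attends over T per patch position
    return np.swapaxes(x, 0, 1)         # back to (T, N, D)

# Toy usage: 8 frames, 14x14 patches per frame, a small embedding size.
T, N, D = 8, 196, 64
tokens = np.random.randn(T, N, D).astype(np.float32)
print(factorized_encoder(tokens).shape)   # (8, 196, 64)
```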

Performance and Evaluation

VideoPrism is extensively evaluated across four major categories of video understanding tasks. Its performance is particularly noteworthy in scenarios requiring the encoding of both appearance and motion information, where the model demonstrates remarkable generalizability and robustness across different datasets. Additionally, VideoPrism's efficacy in zero-shot settings for video-text retrieval and video question answering tasks showcases its potential for practical real-world applications where training data for specific tasks may not be available.
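
To illustrate how a frozen dual encoder is typically used in such zero-shot settings, the hedged sketch below ranks candidate captions (or class-name prompts) for each video by cosine similarity of precomputed embeddings. The embedding dimension, names, and random inputs are placeholders, not VideoPrism's actual evaluation pipeline.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def zero_shot_rank(video_embs, text_embs, k=5):
    """Rank candidate texts for each video by cosine similarity between
    frozen-encoder video embeddings and text embeddings."""
    v = l2_normalize(np.asarray(video_embs, dtype=np.float32))
    t = l2_normalize(np.asarray(text_embs, dtype=np.float32))
    sims = v @ t.T                              # (num_videos, num_texts)
    return np.argsort(-sims, axis=1)[:, :k]     # top-k text indices per video

# Toy usage with random embeddings standing in for real model outputs.
rng = np.random.default_rng(0)
video_embs = rng.standard_normal((4, 256))
text_embs = rng.standard_normal((10, 256))
print(zero_shot_rank(video_embs, text_embs, k=3))   # shape (4, 3)
```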

Future Directions and Potential Impacts

The development and success of VideoPrism illuminate several pathways for future research in video understanding. Notably, the model's scalable performance and general applicability suggest that further exploration of task-specific adapters or fine-tuning approaches could yield even more pronounced benefits across diverse video analysis applications. Moreover, VideoPrism's foundational approach points to exciting possibilities in domains such as scientific research, where advanced video understanding capabilities can accelerate discovery and innovation.

Conclusion

VideoPrism represents a significant advance in video foundation models, achieving state-of-the-art performance across an extensive range of video understanding tasks. By meticulously curating a large-scale pretraining dataset and leveraging a two-stage training pipeline, along with an effective Vision Transformer-based architecture, VideoPrism sets a new standard for what is achievable in video analysis. As the community continues to explore and expand upon this work, the potential for impact across both commercial and scientific domains is substantial.
