VideoPrism: A Foundational Visual Encoder for Video Understanding (2402.13217v3)
Abstract: We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding by global-local distillation of semantic video embeddings and a token shuffling scheme, enabling VideoPrism to focus primarily on the video modality while leveraging the invaluable text associated with videos. We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks. Our models are released at https://github.com/google-deepmind/videoprism.
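Below is a minimal, hypothetical sketch of the idea summarized in the abstract, not the official VideoPrism implementation: a frozen "teacher" video encoder (trained with the text associated with videos) supplies semantic embedding targets, and a student encoder is trained on masked, shuffled video tokens to match those targets both globally (pooled over the clip) and locally (per token). All names here (`toy_teacher`, `toy_student`, `distillation_step`, the toy sizes, and the simple MSE losses) are illustrative assumptions, and a trivial linear map stands in for the real transformer encoders.

```python
# Toy sketch (assumption-laden, not the paper's code) of masked autoencoding with
# global-local distillation of semantic video embeddings and token shuffling.
import numpy as np

rng = np.random.default_rng(0)

NUM_TOKENS, TOKEN_DIM, EMB_DIM = 64, 8, 16   # toy sizes; real models use far more tokens
MASK_RATIO = 0.75                            # fraction of video tokens hidden from the student


def toy_teacher(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen video encoder producing per-token semantic embeddings."""
    fixed_proj = np.linspace(-1.0, 1.0, TOKEN_DIM * EMB_DIM).reshape(TOKEN_DIM, EMB_DIM)
    return tokens @ fixed_proj                                   # (N, EMB_DIM)


def toy_student(visible_tokens: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Stand-in for the trainable encoder; it only sees the visible (unmasked) tokens."""
    return visible_tokens @ weights                              # (n_keep, EMB_DIM)


def distillation_step(video_tokens: np.ndarray, weights: np.ndarray) -> float:
    # 1) Teacher embeds the full, unmasked clip to produce semantic targets.
    targets = toy_teacher(video_tokens)                          # (N, EMB_DIM)

    # 2) Mask most tokens, then shuffle the survivors so the student cannot rely
    #    on positional shortcuts when reconstructing the semantics.
    keep = rng.permutation(NUM_TOKENS)[: int(NUM_TOKENS * (1 - MASK_RATIO))]
    shuffled_keep = rng.permutation(keep)
    student_out = toy_student(video_tokens[shuffled_keep], weights)

    # 3) Local term: match the teacher embedding of each visible token.
    local_loss = np.mean((student_out - targets[shuffled_keep]) ** 2)

    # 4) Global term: match the pooled, clip-level teacher embedding.
    global_loss = np.mean((student_out.mean(axis=0) - targets.mean(axis=0)) ** 2)

    return float(local_loss + global_loss)


if __name__ == "__main__":
    video = rng.normal(size=(NUM_TOKENS, TOKEN_DIM))             # toy "tokenized" video clip
    w = rng.normal(size=(TOKEN_DIM, EMB_DIM))                    # student parameters
    print("toy distillation loss:", distillation_step(video, w))
```

In an actual training loop the student parameters would be updated to minimize this combined loss by gradient descent; the sketch only shows how the masked, shuffled student view is scored against the frozen teacher's global and local targets.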