VideoPrism: A Foundational Visual Encoder for Video Understanding (2402.13217v3)
Abstract: We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding by global-local distillation of semantic video embeddings and a token shuffling scheme, enabling VideoPrism to focus primarily on the video modality while leveraging the invaluable text associated with videos. We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks. Our models are released at https://github.com/google-deepmind/videoprism.
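Below is a minimal, hypothetical sketch of the idea summarized in the abstract, not the official VideoPrism implementation: a frozen "teacher" video encoder (trained with the text associated with videos) supplies semantic embedding targets, and a student encoder is trained on masked, shuffled video tokens to match those targets both globally (pooled over the clip) and locally (per token). All names here (`toy_teacher`, `toy_student`, `distillation_step`, the toy sizes, and the simple MSE losses) are illustrative assumptions, and a trivial linear map stands in for the real transformer encoders.

```python
# Toy sketch (assumption-laden, not the paper's code) of masked autoencoding with
# global-local distillation of semantic video embeddings and token shuffling.
import numpy as np

rng = np.random.default_rng(0)

NUM_TOKENS, TOKEN_DIM, EMB_DIM = 64, 8, 16   # toy sizes; real models use far more tokens
MASK_RATIO = 0.75                            # fraction of video tokens hidden from the student


def toy_teacher(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen video encoder producing per-token semantic embeddings."""
    fixed_proj = np.linspace(-1.0, 1.0, TOKEN_DIM * EMB_DIM).reshape(TOKEN_DIM, EMB_DIM)
    return tokens @ fixed_proj                                   # (N, EMB_DIM)


def toy_student(visible_tokens: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Stand-in for the trainable encoder; it only sees the visible (unmasked) tokens."""
    return visible_tokens @ weights                              # (n_keep, EMB_DIM)


def distillation_step(video_tokens: np.ndarray, weights: np.ndarray) -> float:
    # 1) Teacher embeds the full, unmasked clip to produce semantic targets.
    targets = toy_teacher(video_tokens)                          # (N, EMB_DIM)

    # 2) Mask most tokens, then shuffle the survivors so the student cannot rely
    #    on positional shortcuts when reconstructing the semantics.
    keep = rng.permutation(NUM_TOKENS)[: int(NUM_TOKENS * (1 - MASK_RATIO))]
    shuffled_keep = rng.permutation(keep)
    student_out = toy_student(video_tokens[shuffled_keep], weights)

    # 3) Local term: match the teacher embedding of each visible token.
    local_loss = np.mean((student_out - targets[shuffled_keep]) ** 2)

    # 4) Global term: match the pooled, clip-level teacher embedding.
    global_loss = np.mean((student_out.mean(axis=0) - targets.mean(axis=0)) ** 2)

    return float(local_loss + global_loss)


if __name__ == "__main__":
    video = rng.normal(size=(NUM_TOKENS, TOKEN_DIM))             # toy "tokenized" video clip
    w = rng.normal(size=(TOKEN_DIM, EMB_DIM))                    # student parameters
    print("toy distillation loss:", distillation_step(video, w))
```

In an actual training loop the student parameters would be updated to minimize this combined loss by gradient descent; the sketch only shows how the masked, shuffled student view is scored against the frozen teacher's global and local targets.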