A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames (2312.07395v1)
Abstract: Understanding long, real-world videos requires modeling of long-range visual dependencies. To this end, we explore video-first architectures, building on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion. However, we expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed. To mitigate the memory bottleneck, we systematically analyze the memory/accuracy trade-off of various efficient methods: factorized attention, parameter-efficient image-to-video adaptation, input masking, and multi-resolution patchification. Surprisingly, simply masking large portions of the video (up to 75%) during contrastive pre-training proves to be one of the most robust ways to scale encoders to videos up to 4.3 minutes at 1 FPS. Our simple approach for training long video-to-text models, which scales to 1B parameters, adds no new architectural complexity and outperforms the popular paradigm of using much larger LLMs as an information aggregator over segment-based information on benchmarks with long-range temporal dependencies (YouCook2, EgoSchema).
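To make the masking recipe concrete, here is a minimal PyTorch sketch of the two ingredients the abstract describes: randomly dropping a large fraction of spatio-temporal patch tokens before the video encoder, and a standard symmetric InfoNCE loss over video-text pairs. The function names, the 25% keep ratio, and the temperature are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def random_token_mask(tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep a random subset of patch tokens per example (75% masking = keep_ratio 0.25).

    tokens: (batch, num_tokens, dim) flattened spatio-temporal patch embeddings.
    Returns the kept tokens, shape (batch, num_kept, dim).
    """
    b, n, d = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    # Draw per-example random scores and keep the tokens with the lowest ones,
    # i.e. a uniformly random subset of size n_keep.
    scores = torch.rand(b, n, device=tokens.device)
    keep_idx = scores.argsort(dim=1)[:, :n_keep]  # (b, n_keep)
    return torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))

def contrastive_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched video-text pairs are positives,
    all other pairs in the batch are negatives."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(v.shape[0], device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```

Since self-attention cost grows quadratically with token count, keeping 25% of tokens shrinks the dominant attention term by roughly 16x, which is what allows the frame budget to grow well beyond 16 frames at fixed memory.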
- Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846, 2021.
- Is space-time attention all you need for video understanding? In ICML, page 4, 2021.
- Revisiting the "video" in video-language understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2917–2927, 2022.
- Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015.
- Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
- Litevl: Efficient video-language learning with enhanced spatial-temporal modeling. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7985–7997, 2022.
- Parameter-efficient fine-tuning design spaces. arXiv preprint arXiv:2301.01821, 2023a.
- Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
- Pali-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199, 2023b.
- Vindlu: A recipe for effective video-and-language pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10739–10750, 2023.
- Spatiotemporal residual networks for video action recognition. Advances in Neural Information Processing Systems, 2, 2016.
- A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2634–2641, 2013.
- An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
- Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
- Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019.
- Violet: End-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681, 2021.
- An empirical study of end-to-end video-language transformers with masked visual modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22898–22909, 2023.
- Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304. JMLR Workshop and Conference Proceedings, 2010.
- Turbo training with token dropout. arXiv preprint arXiv:2210.04889, 2022.
- Masked autoencoders are scalable vision learners. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15979–15988, 2022.
- Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
- Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.
- Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023.
- Long movie clip classification with state-space video models. In European Conference on Computer Vision, pages 87–104. Springer, 2022.
- Perceiver: General perception with iterative attention. In International Conference on Machine Learning, pages 4651–4664. PMLR, 2021.
- Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
- Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
- Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 706–715, 2017.
- Revealing single frame bias for video-and-language learning. arXiv preprint arXiv:2206.03428, 2022.
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023a.
- Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023b.
- Lavender: Unifying video-language understanding as masked language modeling. arXiv preprint arXiv:2206.07160, 2022.
- Lavender: Unifying video-language understanding as masked language modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23119–23129, 2023c.
- Mm-vid: Advancing video understanding with gpt-4v(ision), 2023a.
- Smaug: Sparse masked autoencoder for efficient video-language pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2459–2469, 2023b.
- Revisiting temporal modeling for clip-based image-to-video knowledge transferring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6555–6564, 2023.
- Video swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3202–3211, 2022.
- Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing, 508:293–304, 2022.
- Rethinking resolution in the context of efficient video recognition. Advances in Neural Information Processing Systems, 35:37865–37877, 2022a.
- Simvtp: Simple video text pre-training with masked autoencoders. arXiv preprint arXiv:2212.03490, 2022b.
- Egoschema: A diagnostic benchmark for very long-form video language understanding. arXiv preprint arXiv:2308.09126, 2023.
- Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2630–2640, 2019.
- End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9879–9889, 2020.
- Learning audio-video modalities from image captions. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XIV, pages 407–426. Springer, 2022.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- St-adapter: Parameter-efficient image-to-video transfer learning. Advances in Neural Information Processing Systems, 35:26462–26477, 2022.
- Perception test: A diagnostic benchmark for multimodal video models. arXiv preprint arXiv:2305.13786, 2023.
- Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
- Rethinking video vits: Sparse video tubes for joint image and video learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2214–2224, 2023.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- Token turing machines. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19070–19081, 2023.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
- Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems, 27, 2014.
- Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15638–15650, 2022.
- Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Advances in Neural Information Processing Systems, 2022.
- A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
- Image captioners are scalable vision learners too. arXiv preprint arXiv:2306.07915, 2023.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Omnivl: One foundation model for image-language and video-language tasks. Advances in Neural Information Processing Systems, 35:5696–5710, 2022a.
- Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
- Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4581–4591, 2019.
- Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022b.
- Language models with image descriptors are strong few-shot video-language learners. Advances in Neural Information Processing Systems, 35:8483–8497, 2022c.
- Towards long-form video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1884–1894, 2021.
- Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13587–13597, 2022.
- Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
- Videoclip: Contrastive pre-training for zero-shot video-text understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6787–6800, 2021.
- Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5288–5296, 2016.
- Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5036–5045, 2022.
- Vltint: visual-linguistic transformer-in-transformer for coherent video paragraph captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3081–3090, 2023.
- Multiview transformers for video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3333–3343, 2022a.
- Video-text modeling with zero-shot transfer from contrastive captioners. arXiv preprint arXiv:2212.04979, 2022b.
- Zero-shot video question answering via frozen bidirectional language models. Advances in Neural Information Processing Systems, 35:124–141, 2022.
- mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
- Self-chained image-language model for video localization and question answering. arXiv preprint arXiv:2305.06988, 2023.
- Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
- Merlot: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems, 34:23634–23651, 2021.
- Socratic models: Composing zero-shot multimodal reasoning with language. In The Eleventh International Conference on Learning Representations, 2022.
- Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022.
- Multimodal video adapter for parameter efficient video text retrieval. arXiv preprint arXiv:2301.07868, 2023a.
- Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512, 2023b.
- Slow feature analysis for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):436–450, 2012.
- Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
- Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
- Pinelopi Papalampidi (10 papers)
- Skanda Koppula (23 papers)
- Shreya Pathak (12 papers)
- Justin Chiu (13 papers)
- Joe Heyward (2 papers)
- Jiajun Shen (35 papers)
- Antoine Miech (23 papers)
- Andrew Zisserman (248 papers)
- Aida Nematzadeh (1 paper)
- Viorica Patraucean (12 papers)