A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames (2312.07395v1)

Published 12 Dec 2023 in cs.CV and cs.CL

Abstract: Understanding long, real-world videos requires modeling of long-range visual dependencies. To this end, we explore video-first architectures, building on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion. However, we expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed. To mitigate the memory bottleneck, we systematically analyze the memory/accuracy trade-off of various efficient methods: factorized attention, parameter-efficient image-to-video adaptation, input masking, and multi-resolution patchification. Surprisingly, simply masking large portions of the video (up to 75%) during contrastive pre-training proves to be one of the most robust ways to scale encoders to videos up to 4.3 minutes at 1 FPS. Our simple approach for training long video-to-text models, which scales to 1B parameters, does not add new architectural complexity and is able to outperform the popular paradigm of using much larger LLMs as an information aggregator over segment-based information on benchmarks with long-range temporal dependencies (YouCook2, EgoSchema).
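The central trick in the abstract, randomly dropping a large fraction of video patch tokens before the encoder during contrastive pre-training, can be sketched in a few lines. The snippet below is an illustrative sketch only, not the paper's implementation; the 75% masking ratio and the 4.3-minute / 1 FPS clip length come from the abstract, while the frame patch count (256 patches per frame) and embedding dimension (768) are assumed values chosen for the example.

```python
import numpy as np

def mask_video_tokens(patch_tokens, mask_ratio=0.75, rng=None):
    """Randomly drop a fraction of a video's patch tokens.

    patch_tokens: array of shape (num_tokens, dim) holding the flattened
    spatio-temporal patch embeddings of one clip.
    Returns the kept tokens and their indices, so positional embeddings
    can be gathered consistently for the encoder.
    """
    rng = rng if rng is not None else np.random.default_rng()
    num_tokens = patch_tokens.shape[0]
    num_keep = max(1, int(round(num_tokens * (1.0 - mask_ratio))))
    keep_idx = np.sort(rng.choice(num_tokens, size=num_keep, replace=False))
    return patch_tokens[keep_idx], keep_idx

# Example: a 4.3-minute clip at 1 FPS is ~258 frames; with an assumed
# 256 patches per frame and 768-dim embeddings, masking 75% of the
# 66,048 tokens leaves ~16,512 for the encoder to process.
tokens = np.random.randn(258 * 256, 768).astype(np.float32)
kept, idx = mask_video_tokens(tokens, mask_ratio=0.75)
print(kept.shape)  # (16512, 768)
```

Because only the kept tokens enter the encoder, attention cost and activation memory shrink roughly with the square of the keep ratio, which is what lets the same architecture scale to multi-minute clips without structural changes.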

Authors (10)
  1. Pinelopi Papalampidi (10 papers)
  2. Skanda Koppula (23 papers)
  3. Shreya Pathak (12 papers)
  4. Justin Chiu (13 papers)
  5. Joe Heyward (2 papers)
  6. Jiajun Shen (35 papers)
  7. Antoine Miech (23 papers)
  8. Andrew Zisserman (248 papers)
  9. Aida Nematzadeh (1 paper)
  10. Viorica Patraucean (12 papers)
Citations (17)