VidLA: Video-Language Alignment at Scale (2403.14870v1)

Published 21 Mar 2024 in cs.CV, cs.CL, and cs.LG

Abstract: In this paper, we propose VidLA, an approach for video-language alignment at scale. There are two major limitations of previous video-language alignment approaches. First, they do not capture both short-range and long-range temporal dependencies and typically employ complex hierarchical deep network architectures that are hard to integrate with existing pretrained image-text foundation models. To effectively address this limitation, we instead keep the network architecture simple and use a set of data tokens that operate at different temporal resolutions in a hierarchical manner, accounting for the temporally hierarchical nature of videos. By employing a simple two-tower architecture, we are able to initialize our video-language model with pretrained image-text foundation models, thereby boosting the final performance. Second, existing video-language alignment works struggle due to the lack of semantically aligned large-scale training data. To overcome it, we leverage recent LLMs to curate the largest video-language dataset to date with better visual grounding. Furthermore, unlike existing video-text datasets which only contain short clips, our dataset is enriched with video clips of varying durations to aid our temporally hierarchical data tokens in extracting better representations at varying temporal scales. Overall, empirical results show that our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks, especially on longer videos, and performs competitively on classification benchmarks.

VidLA: Video-Language Alignment at Scale

The VidLA paper introduces an approach for video-language alignment that addresses the limitations of prior methods by building on pre-trained image-text foundation models. The authors identify two significant challenges in the field: capturing both short-range and long-range temporal dependencies in video data, and the scarcity of semantically aligned, large-scale video-language training data.

Architectural Innovation

VidLA simplifies the network architecture to a two-tower model while employing data tokens that operate at different temporal resolutions. This design mirrors the hierarchical nature of video, enabling initialization from pre-trained image-text models without intricate architectural modifications. The hierarchical temporal attention mechanism factorizes space-time attention into local and global components, capturing both fine-grained motion and overarching temporal relations. Multi-scale temporal tokens further reinforce this hierarchy, distinguishing VidLA from previous methods that produced either overly localized or excessively aggregated spatio-temporal representations.
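The paper's exact layer design is not reproduced in this summary, but the local/global factorization can be illustrated with a short PyTorch sketch. The class name, the window size, and the shape of the coarse `summary` tokens below are illustrative assumptions, not VidLA's actual implementation: local attention runs within short windows of neighboring frames, while a small set of coarse temporal tokens attends globally over all patch tokens.

```python
import torch
import torch.nn as nn

class HierarchicalTemporalAttention(nn.Module):
    """Illustrative sketch: space-time attention factorized into a local
    pass over short frame windows and a global pass driven by a small set
    of coarse temporal summary tokens (names and shapes are assumptions)."""

    def __init__(self, dim: int, num_heads: int = 8, window: int = 4):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, summary):
        # x: (B, T, N, D) patch tokens for T frames of N patches each
        # summary: (B, S, D) coarse temporal tokens, S << T
        B, T, N, D = x.shape
        w = self.window
        assert T % w == 0, "frame count must be divisible by the window size"
        # Local: tokens attend only within their window of w frames,
        # capturing fine-grained, short-range motion.
        xl = x.view(B * (T // w), w * N, D)
        xl, _ = self.local_attn(xl, xl, xl)
        x = xl.view(B, T, N, D)
        # Global: summary tokens attend over all patch tokens, aggregating
        # long-range temporal context at a coarser resolution.
        flat = x.reshape(B, T * N, D)
        summary, _ = self.global_attn(summary, flat, flat)
        return x, summary

# Example: 8 frames of 7x7 patches, with 4 coarse temporal tokens.
layer = HierarchicalTemporalAttention(dim=256)
patches, coarse = layer(torch.randn(2, 8, 49, 256), torch.randn(2, 4, 256))
```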

Dataset Creation and Utilization

To confront the scarcity of robust training data, VidLA contributes a newly curated dataset of approximately 800 million video-text pairs. Its innovation lies in extracting clips at multiple temporal scales and curating the paired text with LLMs to ensure high semantic correlation between visual content and captions. Unlike existing datasets, which predominantly feature short clips, VidLA's dataset includes clips of varying durations; this variety is crucial for training models that handle diverse temporal scales, a common weakness in video-language alignment.
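The full curation pipeline (LLM-based caption curation at an 800-million-pair scale) is beyond the scope of a summary, but the multi-scale clip idea can be sketched in a few lines. The function name, scale values, and output format here are hypothetical, not taken from the paper:

```python
def multiscale_clips(duration_s: float, scales_s=(10.0, 30.0, 60.0)):
    """Cut one video into clips at several temporal scales, so training
    sees both short and long segments (scale values are hypothetical)."""
    clips = []
    for scale in scales_s:
        start = 0.0
        while start < duration_s:
            end = min(start + scale, duration_s)
            clips.append({"start": start, "end": end, "scale": scale})
            start = end
    return clips

# A 95-second video yields 10 short, 4 medium, and 2 long clips.
print(len(multiscale_clips(95.0)))  # 16
```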

Empirical Results

The paper reports notable improvements over state-of-the-art methods across retrieval and classification benchmarks. In particular, the hierarchical attention mechanism significantly enhances video-text retrieval, especially on longer videos. The results underscore VidLA's effectiveness at both local and global temporal modeling while capitalizing on pretrained image-text foundations, yielding marked gains in Recall@1 and other metrics on datasets such as MSR-VTT, DiDeMo, and ActivityNet Captions.
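For reference, Recall@K, the retrieval metric cited above, can be computed from a query-by-candidate similarity matrix. The minimal sketch below assumes the standard evaluation convention that the ground-truth video for text query i sits at index i:

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int = 1) -> float:
    """Recall@K for retrieval: sim[i, j] scores text query i against
    video j, with the ground-truth pair on the diagonal."""
    order = (-sim).argsort(axis=1)         # best-scoring candidates first
    gt = np.arange(sim.shape[0])[:, None]  # ground-truth index per query
    return float((order[:, :k] == gt).any(axis=1).mean())

# Toy check: a perfectly diagonal similarity matrix gives Recall@1 = 1.0.
print(recall_at_k(np.eye(5)))  # 1.0
```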

Implications and Future Directions

VidLA's contributions have implications for both theoretical and practical aspects of AI research. The proposed alignment architecture not only enhances video-language understanding but also points to future research on deeper integration of hierarchical models with foundation models across other modalities. Further work could explore augmenting VidLA with additional context-aware mechanisms or diversifying its application to other language tasks, possibly extending beyond the vision-language paradigm.

Moreover, the dataset creation methodology employed here can inform ongoing developments in automated data curation, suggesting a scalable avenue for generating rich datasets with low resource investment. The combination of technical innovation with a strategic approach to data curation positions VidLA as an impactful advancement in video-language alignment. Future work might also assess the model's adaptability to new and unseen datasets, focusing on its zero-shot capabilities in both retrieval and classification.

Authors (8)
  1. Mamshad Nayeem Rizve
  2. Fan Fei
  3. Jayakrishnan Unnikrishnan
  4. Son Tran
  5. Benjamin Z. Yao
  6. Belinda Zeng
  7. Mubarak Shah
  8. Trishul Chilimbi