Foundation Models for Video Understanding: A Survey (2405.03770v1)

Published 6 May 2024 in cs.CV

Abstract: Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various video understanding tasks. Leveraging large-scale datasets and powerful models, ViFMs achieve this by capturing robust and generic features from video data. This survey analyzes over 200 video foundation models, offering a comprehensive overview of benchmarks and evaluation metrics across 14 distinct video tasks grouped into 3 main categories. Additionally, we offer an in-depth performance analysis of these models for the 6 most common video tasks. We divide ViFMs into three categories: 1) Image-based ViFMs, which adapt existing image models for video tasks, 2) Video-based ViFMs, which utilize video-specific encoding methods, and 3) Universal Foundational Models (UFMs), which combine multiple modalities (image, video, audio, and text) within a single framework. By comparing the performance of various ViFMs on different tasks, this survey offers valuable insights into their strengths and weaknesses, guiding future advancements in video understanding. Surprisingly, our analysis reveals that image-based foundation models consistently outperform video-based models on most video understanding tasks. Additionally, UFMs, which leverage diverse modalities, demonstrate superior performance on video tasks. We share the comprehensive list of ViFMs studied in this work at https://github.com/NeeluMadan/ViFM_Survey.git
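To make the three-way taxonomy in the abstract concrete, below is a minimal, illustrative Python sketch that encodes the categories as plain data structures. It is not code from the survey; the names (ViFMCategory, TAXONOMY) are hypothetical, and the one-line descriptions are paraphrased from the abstract for illustration only.

```python
# Illustrative sketch (not from the paper): the ViFM taxonomy from the abstract
# represented as simple Python data structures.
from dataclasses import dataclass, field


@dataclass
class ViFMCategory:
    name: str          # taxonomy branch named in the abstract
    description: str   # how the branch is characterized in the abstract
    examples: list = field(default_factory=list)  # placeholder for model entries


TAXONOMY = [
    ViFMCategory(
        "Image-based ViFMs",
        "Adapt existing image foundation models to video tasks.",
    ),
    ViFMCategory(
        "Video-based ViFMs",
        "Use video-specific encoding methods trained on video data.",
    ),
    ViFMCategory(
        "Universal Foundational Models (UFMs)",
        "Combine multiple modalities (image, video, audio, text) in one framework.",
    ),
]

if __name__ == "__main__":
    for category in TAXONOMY:
        print(f"{category.name}: {category.description}")
```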
