Foundation Models for Video Understanding: A Survey (2405.03770v1)

Published 6 May 2024 in cs.CV

Abstract: Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various video understanding tasks. Leveraging large-scale datasets and powerful models, ViFMs achieve this by capturing robust and generic features from video data. This survey analyzes over 200 video foundation models, offering a comprehensive overview of benchmarks and evaluation metrics across 14 distinct video tasks grouped into 3 main categories. Additionally, we offer an in-depth performance analysis of these models for the 6 most common video tasks. We divide ViFMs into three categories: 1) Image-based ViFMs, which adapt existing image models for video tasks, 2) Video-based ViFMs, which utilize video-specific encoding methods, and 3) Universal Foundational Models (UFMs), which combine multiple modalities (image, video, audio, and text) within a single framework. By comparing the performance of various ViFMs on different tasks, this survey offers valuable insights into their strengths and weaknesses, guiding future advancements in video understanding. Surprisingly, our analysis reveals that image-based foundation models consistently outperform video-based models on most video understanding tasks. Additionally, UFMs, which leverage diverse modalities, demonstrate superior performance on video tasks. We share the comprehensive list of ViFMs studied in this work at https://github.com/NeeluMadan/ViFM_Survey.git
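To make the three-way taxonomy in the abstract concrete, below is a minimal, illustrative Python sketch that encodes the categories as plain data structures. It is not code from the survey; the names (ViFMCategory, TAXONOMY) are hypothetical, and the one-line descriptions are paraphrased from the abstract for illustration only.

```python
# Illustrative sketch (not from the paper): the ViFM taxonomy from the abstract
# represented as simple Python data structures.
from dataclasses import dataclass, field


@dataclass
class ViFMCategory:
    name: str          # taxonomy branch named in the abstract
    description: str   # how the branch is characterized in the abstract
    examples: list = field(default_factory=list)  # placeholder for model entries


TAXONOMY = [
    ViFMCategory(
        "Image-based ViFMs",
        "Adapt existing image foundation models to video tasks.",
    ),
    ViFMCategory(
        "Video-based ViFMs",
        "Use video-specific encoding methods trained on video data.",
    ),
    ViFMCategory(
        "Universal Foundational Models (UFMs)",
        "Combine multiple modalities (image, video, audio, text) in one framework.",
    ),
]

if __name__ == "__main__":
    for category in TAXONOMY:
        print(f"{category.name}: {category.description}")
```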
