VideoGLUE: Evaluating Video Understanding in Foundation Models
The paper "VideoGLUE: Video General Understanding Evaluation of Foundation Models" presents a systematic approach to evaluate the video understanding capabilities of foundation models (FMs). The paper explores multiple facets of video tasks using a comprehensive experimental protocol, addressing the gap between video-specialized models and FMs.
Core Contributions and Findings
The authors evaluate six foundation models: CoCa, CLIP, FLAVA, VideoMAE, VATT, and InternVideo. These models are assessed on three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization) across eight widely used datasets. The paper introduces a single VideoGLUE score (VGS) to quantify an FM's efficacy and efficiency when adapted to video understanding tasks.
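The paper defines the exact VGS weighting; as a rough illustration of the underlying efficacy-versus-adaptation-cost idea, a minimal sketch might look like the following. The cost discount used here is a hypothetical placeholder, not the paper's published formula.

```python
# A minimal sketch of an efficacy/efficiency trade-off score in the spirit
# of VGS. The log-cost discount below is a hypothetical placeholder, not
# the exact formula from the paper.

import math

def videoglue_style_score(task_scores: dict[str, float],
                          adaptation_cost: float) -> float:
    """Average per-task metrics, discounted by adaptation cost.

    task_scores: per-task metric in [0, 100] (e.g., top-1 accuracy, mAP).
    adaptation_cost: e.g., number of tunable parameters (must be > 10).
    """
    efficacy = sum(task_scores.values()) / len(task_scores)
    # Hypothetical discount: heavier adaptations (more tunable parameters)
    # earn less credit for the same accuracy.
    efficiency_weight = 1.0 / math.log10(adaptation_cost)
    return efficacy * efficiency_weight

# Example: one FM adapted with ~10M tunable parameters.
scores = {"action_recognition": 78.2,
          "temporal_localization": 34.5,
          "spatiotemporal_localization": 22.1}
print(videoglue_style_score(scores, adaptation_cost=1e7))
```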
Key findings include:
- Performance Discrepancy: Task-specialized models outperform the evaluated FMs on video tasks, in contrast to the success FMs have achieved in natural language and image understanding. This points to a clear need for further work on video-focused FMs.
- Video-native vs. Image-native FMs: Models pretrained on video data (video-native FMs) generally surpass image-native FMs, particularly on tasks that require temporal reasoning. This underscores the importance of motion cues learned during pretraining.
- Adaptation Strategies: Different adaptation methods, such as end-to-end finetuning and frozen features with multi-layer attention pooling, expose different strengths of the FMs; the choice of adaptation method can significantly reshuffle the performance ranking (see the sketch after this list).
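As a concrete illustration of the frozen-features setting, the sketch below wires a small multi-layer attention-pooling head on top of token features taken from several frozen backbone layers. The single learned pooling query, the layer choices, and the dimensions are illustrative assumptions, not the paper's exact head design.

```python
# A minimal PyTorch sketch of "frozen features + multi-layer attention
# pooling": the backbone stays frozen, and only this small head is trained.
# The single-query pooling and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class MultiLayerAttentionPooling(nn.Module):
    def __init__(self, dim: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))  # learned pooling query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, layer_feats: list[torch.Tensor]) -> torch.Tensor:
        # layer_feats: list of (B, N, dim) token features from frozen layers.
        tokens = torch.cat(layer_feats, dim=1)            # (B, L*N, dim)
        q = self.query.expand(tokens.size(0), -1, -1)     # (B, 1, dim)
        pooled, _ = self.attn(q, tokens, tokens)          # attend over all layers
        return self.head(self.norm(pooled.squeeze(1)))    # (B, num_classes)

# Usage: only this head receives gradients; the FM backbone is frozen.
feats = [torch.randn(2, 196, 768) for _ in range(4)]     # 4 layers of tokens
logits = MultiLayerAttentionPooling(768, num_classes=400)(feats)
```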
Adaptation Methods
The paper details four adaptation methods (end-to-end finetuning, frozen backbone, multi-layer attention pooling, and low-rank adapters) that cater to diverse application scenarios and computational constraints. Each method trades tunable parameters against downstream performance, probing how well an FM's representations transfer under different computational budgets.
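Of the four methods, the low-rank adapter touches the fewest weights. Below is a minimal LoRA-style sketch, assuming adapters wrap frozen linear projections; the rank, scaling, and placement are illustrative choices, not the paper's exact configuration.

```python
# A minimal sketch of a low-rank adapter around a frozen linear layer.
# Rank, scaling, and placement are illustrative assumptions.

import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze the FM weights
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                    # start as a no-op update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# Usage: swap a frozen projection for its adapted version; only about
# 2 * dim * rank parameters per wrapped layer are trained.
layer = LowRankAdapter(nn.Linear(768, 768), rank=8)
out = layer(torch.randn(2, 196, 768))
```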
Implications and Future Directions
The results highlight substantial headroom for video-native foundation models, motivating pretraining data and objectives that emphasize motion-rich content. They also show that both the choice of tasks and the choice of adaptation methods materially affect how an FM is judged, underscoring the need for a common evaluation protocol in FM assessments.
Theoretically, the research extends the study of FM adaptation and generalization beyond traditional language and image tasks. Practically, it calls for a sharper focus on developing robust, video-oriented models capable of capturing the temporal dynamics intrinsic to video data.
Conclusion
Overall, this paper systematically examines foundation models in the context of video understanding and provides a framework for future research. The VideoGLUE score offers a quantitative, standardized means of comparing FM performance across video tasks, and the findings should stimulate further exploration and development of foundation models with an emphasis on video data.