TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models (2408.11318v2)

Published 21 Aug 2024 in cs.CV

Abstract: In this work, we discuss evaluating video foundation models in a fair and robust manner. Unlike language or image foundation models, many video foundation models are evaluated with differing parameters (such as sampling rate, number of frames, pretraining steps, etc.), making fair and robust comparisons challenging. Therefore, we present a carefully designed evaluation framework for measuring two core capabilities of video comprehension: appearance and motion understanding. Our findings reveal that existing video foundation models, whether text-supervised like UMT or InternVideo2, or self-supervised like V-JEPA, exhibit limitations in at least one of these capabilities. As an alternative, we introduce TWLV-I, a new video foundation model that constructs robust visual representations for both motion- and appearance-based videos. Based on the average top-1 accuracy of linear probing on five action recognition benchmarks, pretrained only on publicly accessible datasets, our model shows a 4.6%p improvement compared to V-JEPA (ViT-L) and a 7.7%p improvement compared to UMT (ViT-L). Even when compared to much larger models, our model demonstrates a 7.2%p improvement compared to DFN (ViT-H), a 2.7%p improvement compared to V-JEPA (ViT-H) and a 2.8%p improvement compared to InternVideo2 (ViT-g). We provide embedding vectors obtained by TWLV-I from videos of several commonly used video benchmarks, along with evaluation source code that can directly utilize these embeddings. The code is available at https://github.com/twelvelabs-io/video-embeddings-evaluation-framework.

Insights and Evaluation of TWLV-I in Video Foundation Models

In this essay, we will discuss the essential findings and implications presented in "TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models." The paper introduces TWLV-I, a new video foundation model designed to address the intrinsic challenges of evaluating video foundation models (VFMs).

Overview of the Paper

The paper examines why evaluating VFMs is difficult: models are assessed under differing parameters (e.g., sampling rate, number of frames, and pretraining steps), which makes fair and robust comparisons challenging. The authors therefore propose a holistic evaluation framework focused on two core aspects of video comprehension: appearance and motion understanding. Existing VFMs, whether text-supervised like UMT and InternVideo2 or self-supervised like V-JEPA, are shown to exhibit deficiencies in at least one of these aspects.

TWLV-I is introduced as a solution, aiming to construct robust visual representations for both motion and appearance. The paper is thorough in its approach, presenting results from five action recognition benchmarks and demonstrating the efficacy of TWLV-I compared to its peers. In terms of average top-1 linear-probing accuracy, TWLV-I improves over V-JEPA (ViT-L) by 4.6 percentage points and over UMT (ViT-L) by 7.7 percentage points. Even against larger-scale models such as DFN (ViT-H), TWLV-I shows a 7.2-percentage-point improvement.

Methodologies and Results

Evaluation Framework

The evaluation framework is a pivotal contribution, as it provides a robust set of methodologies for assessing VFMs. The framework proposes multiple benchmark tasks such as:

  • Action Recognition (AR)
  • Temporal Action Localization (TAL)
  • Spatio-Temporal Action Localization (STAL)
  • Temporal Action Segmentation (TAS)

These tasks are designed to comprehensively evaluate a model's performance in handling both appearance and motion in videos.
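One practical source of unfairness the framework addresses is how clips are sampled before they are embedded, since sampling rate and frame count vary across prior evaluations. As a minimal sketch of a fixed, uniform sampling policy (my own illustration; the actual sampling logic lives in the authors' released evaluation code), fixing the number of frames per clip looks like this:

```python
import numpy as np

def uniform_frame_indices(num_total_frames: int, num_sampled_frames: int) -> np.ndarray:
    """Pick frame indices spread evenly across the clip, so every model
    under comparison sees the same temporal coverage per video."""
    # Centers of num_sampled_frames equal-width bins over [0, num_total_frames).
    edges = np.linspace(0, num_total_frames, num_sampled_frames + 1)
    centers = (edges[:-1] + edges[1:]) / 2.0
    return np.clip(centers.astype(int), 0, num_total_frames - 1)

# Example: a 300-frame video sampled at 16 frames for every model under test.
print(uniform_frame_indices(300, 16))
```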

Action Recognition

The action recognition task evaluates the model's capability to classify actions based on video frames. TWLV-I shows superior performance across appearance-centric benchmarks such as Kinetics-400 (K400) and motion-centric benchmarks like Something-Something-v2 (SSv2).
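The headline numbers come from linear probing: the backbone is frozen, and only a linear classifier is trained on the clip embeddings. A minimal sketch of that protocol is shown below (the random arrays are stand-ins for embeddings such as those released with the paper; the shapes and the use of scikit-learn's logistic regression are assumptions for illustration, not the authors' exact setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for frozen clip embeddings and action labels (e.g., K400 or SSv2 classes).
rng = np.random.default_rng(0)
train_x, train_y = rng.normal(size=(1000, 768)), rng.integers(0, 10, size=1000)
val_x, val_y = rng.normal(size=(200, 768)), rng.integers(0, 10, size=200)

# Linear probe: the video backbone stays frozen; only this classifier is trained.
probe = LogisticRegression(max_iter=1000).fit(train_x, train_y)
top1 = (probe.predict(val_x) == val_y).mean()
print(f"linear-probe top-1 accuracy: {top1:.3f}")
```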

Temporal Action Localization

In temporal action localization, the model identifies the temporal intervals in which actions occur. The results indicate that TWLV-I outperforms models of similar and larger scale on both the ActivityNet v1.3 and THUMOS14 datasets, highlighting its robust capacity to capture temporal patterns.

Spatio-Temporal Action Localization

This task requires the model to localize actions in both spatial and temporal dimensions. Evaluations on the AVA v2.2 dataset show that TWLV-I delivers competitive performance, comparable to or slightly better than models such as UMT and InternVideo2.

Temporal Action Segmentation

For temporal action segmentation, TWLV-I proves robust across different viewpoints (top-down, egocentric, and broader views), excelling on the 50Salads, GTEA, and Breakfast datasets. Its strong segmentation accuracy and edit scores underline a comprehensive spatial and temporal understanding.
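For reference, the edit score mentioned above is conventionally computed as a normalized Levenshtein distance between the sequences of predicted and ground-truth segment labels, with consecutive repeats collapsed. The following is a small self-contained sketch of that standard metric, not taken from the paper's code:

```python
def collapse(frame_labels):
    """Collapse per-frame labels into an ordered sequence of segment labels."""
    return [lab for i, lab in enumerate(frame_labels) if i == 0 or lab != frame_labels[i - 1]]

def edit_score(pred_frames, gt_frames):
    """Segmental edit score: 100 * (1 - normalized Levenshtein distance)."""
    p, g = collapse(pred_frames), collapse(gt_frames)
    d = [[0] * (len(g) + 1) for _ in range(len(p) + 1)]
    for i in range(len(p) + 1):
        d[i][0] = i
    for j in range(len(g) + 1):
        d[0][j] = j
    for i in range(1, len(p) + 1):
        for j in range(1, len(g) + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return 100.0 * (1.0 - d[-1][-1] / max(len(p), len(g), 1))

# Example: per-frame action labels for a short clip.
print(edit_score(["pour", "pour", "cut", "stir"], ["pour", "cut", "cut", "stir"]))
```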

Embedding Visualization and Analysis

The paper presents t-SNE visualizations showing clear clustering capabilities of TWLV-I, UMT, and InternVideo2 in appearance-centric tasks. However, none of the models effectively cluster the motion-centric SSv2 dataset, indicating a need for further enhancements. Moreover, Linear Discriminant Analysis (LDA) is used to demonstrate how well the models understand directional motion, with TWLV-I and V-JEPA showing remarkable distinguishability between forward and reversed video embeddings.
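The forward-versus-reversed analysis can be reproduced in spirit with off-the-shelf LDA: embed each clip and its temporally reversed copy, label the two sets, and measure how separable they are. The sketch below uses random stand-ins for the embeddings (the shapes, the synthetic offset between the two classes, and the train/test split are assumptions for illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Stand-ins for clip embeddings of forward videos and their reversed copies.
rng = np.random.default_rng(0)
forward = rng.normal(loc=0.0, size=(500, 768))
reversed_ = rng.normal(loc=0.1, size=(500, 768))  # a motion-aware model shifts these apart

X = np.concatenate([forward, reversed_])
y = np.concatenate([np.zeros(500), np.ones(500)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Higher held-out accuracy -> embeddings encode temporal direction more distinctly.
lda = LinearDiscriminantAnalysis(n_components=1).fit(X_tr, y_tr)
print(f"forward/reverse separability (LDA accuracy): {lda.score(X_te, y_te):.3f}")
```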

Future Directions and Conclusion

The paper outlines potential advancements such as scaling up model size and improving image embedding capabilities. Expanding to multi-modal tasks such as video retrieval and captioning is also suggested as a direction for future research.

In conclusion, the paper provides a comprehensive evaluation framework and introduces TWLV-I, which performs strongly in both appearance and motion understanding. The analysis methods and empirical results indicate that TWLV-I advances the state of the art in video foundation models, offering robust performance across a range of video-centric tasks.

Authors (21)
  1. Hyeongmin Lee (13 papers)
  2. Jin-Young Kim (14 papers)
  3. Kyungjune Baek (7 papers)
  4. Jihwan Kim (25 papers)
  5. Hyojun Go (22 papers)
  6. Seongsu Ha (4 papers)
  7. Seokjin Han (1 paper)
  8. Jiho Jang (14 papers)
  9. Raehyuk Jung (4 papers)
  10. Daewoo Kim (6 papers)
  11. GeunOh Kim (1 paper)
  12. Jongmok Kim (4 papers)
  13. Jongseok Kim (3 papers)
  14. Junwan Kim (5 papers)
  15. Soonwoo Kwon (17 papers)
  16. Jangwon Lee (12 papers)
  17. Seungjoon Park (3 papers)
  18. Minjoon Seo (82 papers)
  19. Jay Suh (2 papers)
  20. Jaehyuk Yi (2 papers)