Overview of Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
Large-scale image-language pretrained models such as CLIP have achieved remarkable results by leveraging massive web-scale image-text datasets. Despite their success on a wide range of image-centric tasks, extending these models to video understanding remains challenging. The paper "Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding" presents a structured approach to bridging this gap by addressing two principal barriers: the lack of effective temporal modeling and the partial misalignment between video and text data.
Methodology and Contributions
The paper introduces Mug-STAN, a Spatial-Temporal Auxiliary Network with a Mutual-guided alignment module, as a framework for adapting image-language pretrained models to video understanding. Its two components, STAN and Mug, address temporal modeling and video-text misalignment, respectively.
1. Spatial-Temporal Auxiliary Network (STAN):
STAN operates as a branch alongside the pretrained visual encoder, enabling temporal learning by integrating spatial-temporal contexts at multiple feature levels. Unlike the posterior and intermediate structures used in prior adaptation methods, this branch design enables:
- Multi-Level Feature Utilization: By leveraging features at different abstraction levels from the pretrained model, STAN captures both high-level semantic alignments and low-level spatial-temporal patterns.
- Parameter-Efficient Temporal Modeling: Through a separated spatial-temporal design, STAN reuses the structure of the pretrained visual layers, supporting temporal understanding without disrupting the pretrained knowledge (a sketch follows this list).
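To make the branch design concrete, here is a minimal PyTorch sketch rather than the paper's implementation: it assumes a list of intermediate feature maps of shape (batch, frames, patches, dim) taken from a frozen image encoder, and the class names, block count, and final mean pooling are illustrative choices.

```python
import torch
import torch.nn as nn

class SeparatedSTBlock(nn.Module):
    """One auxiliary block with a separated spatial-temporal design:
    spatial self-attention within each frame, then temporal
    self-attention across frames at the same patch position."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, N, D)
        B, T, N, D = x.shape
        # Spatial attention over patches within each frame.
        s = x.reshape(B * T, N, D)
        s = s + self.spatial_attn(self.norm_s(s), self.norm_s(s), self.norm_s(s))[0]
        # Temporal attention over frames at the same patch position.
        t = s.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        t = t + self.temporal_attn(self.norm_t(t), self.norm_t(t), self.norm_t(t))[0]
        x = t.reshape(B, N, T, D).permute(0, 2, 1, 3)
        return x + self.mlp(x)

class STANBranch(nn.Module):
    """Auxiliary branch fed with multi-level features from a (frozen)
    pretrained image encoder; each level is projected and fused with
    spatial-temporal context, level by level."""
    def __init__(self, dim: int = 768, num_levels: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([SeparatedSTBlock(dim) for _ in range(num_levels)])
        self.projs = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_levels)])

    def forward(self, multi_level_feats):  # list of (B, T, N, D) tensors
        x = torch.zeros_like(multi_level_feats[0])
        for feat, proj, block in zip(multi_level_feats, self.projs, self.blocks):
            x = block(x + proj(feat))       # inject one encoder level per block
        return x.mean(dim=(1, 2))           # pooled video representation (B, D)
```

Because the pretrained encoder is left untouched and only the auxiliary branch is trained, the temporal parameters stay small relative to the backbone, which is what the parameter-efficiency claim refers to.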
2. Mutual-Guided Alignment (Mug):
Mug targets the prevalent partial misalignment issues in video-text datasets by:
- Token-Frame Interaction Modeling: It performs token-wise interaction between frames and text, dynamically identifying and aligning the most relevant parts of the two modalities.
- Feature Aggregation through Mutual Guidance: Cross-modal enhancement yields more accurate representations by amplifying corresponding segments and suppressing irrelevant noise, improving overall alignment (see the sketch after this list).
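The following is a minimal sketch of the mutual-guidance idea, assuming per-frame video embeddings and per-token text embeddings from the two encoders; the function name, the max-over-the-other-modality weighting, and the temperature value are illustrative and not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mug_alignment(frame_feats: torch.Tensor,
                  text_feats: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Mutual-guided alignment sketch.

    frame_feats: (B, T, D) per-frame embeddings of a video
    text_feats:  (B, L, D) per-token embeddings of the paired text
    Returns one similarity score per pair, computed from representations
    in which each modality is pooled under the other's guidance, so
    frames or tokens that do not match the other modality are down-weighted.
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Token-frame interaction: fine-grained similarity matrix (B, T, L).
    sim = torch.einsum('btd,bld->btl', frame_feats, text_feats)

    # Text-guided frame aggregation: weight each frame by how strongly
    # it matches its best text token, then pool frames.
    frame_weights = F.softmax(sim.max(dim=2).values / temperature, dim=1)  # (B, T)
    video_repr = torch.einsum('bt,btd->bd', frame_weights, frame_feats)

    # Video-guided token aggregation: the symmetric pooling on the text side.
    token_weights = F.softmax(sim.max(dim=1).values / temperature, dim=1)  # (B, L)
    text_repr = torch.einsum('bl,bld->bd', token_weights, text_feats)

    # Final video-text similarity after mutual guidance.
    return F.cosine_similarity(video_repr, text_repr, dim=-1)              # (B,)
```

Because the weighting is symmetric, frames that no text token describes and tokens that no frame depicts both receive small weights, which is how partial misalignment in noisy video-text pairs gets suppressed.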
Empirical Evaluation
The efficacy of Mug-STAN is demonstrated through extensive experiments on multiple video tasks, including text-video retrieval, action recognition, and temporal action localization. Notable results include:
- Superior Performance in Zero-Shot and Fine-Tuning Settings: Mug-STAN achieves state-of-the-art results on datasets such as MSR-VTT, DiDeMo, LSMDC, Kinetics-400, and Something-Something-v2. Integrating pretrained Mug-STAN into multimodal dialogue models further enables zero-shot video chatting.
- Improved Generalization: Compared with existing models, Mug-STAN generalizes better across diverse tasks, which the authors attribute to its effective temporal modeling and its mitigation of cross-modal misalignment.
Future Directions
The paper points to several directions for future research:
- Application to Diverse Vision-Language Pretrained Models: The flexibility and strong performance of Mug-STAN suggest it can be adapted to vision-language pretrained architectures beyond CLIP and CoCa.
- Post-Pretraining on Diverse Datasets: The framework shows promise in post-pretraining settings using datasets with varying noise levels, such as WebVid10M and HowTo100M.
- Integration with Multimodal Architectures: Leveraging STAN’s capabilities in video temporal modeling could facilitate enhanced integration in larger multimodal LLM systems.
In summary, Mug-STAN addresses the core challenges hindering the extension of image-language pretrained models to video tasks. By combining branch-based temporal modeling with mutual-guided cross-modal alignment, the framework proves to be a powerful tool for video understanding, laying groundwork for both further research and practical applications.