- The paper presents the Multi-modal Spatio-Temporal Adapter (MSTA), a novel module that cuts trainable parameters to 2-7% of those required by prior methods for efficient video action recognition.
- The paper introduces a spatio-temporal description-guided consistency constraint to balance pre-trained and task-specific knowledge.
- The method outperforms state-of-the-art techniques on benchmarks such as Kinetics-400 and Something-Something V2, while its small trainable footprint enables deployment in resource-constrained environments.
Efficient Transfer Learning for Video-language Foundation Models
The paper addresses efficient transfer learning for video-language foundation models, with a focus on video action recognition. The authors build on established multi-modal foundation models such as CLIP and ViCLIP, which are pre-trained on extensive datasets. Noting that traditional adaptation methods train a large number of parameters, this work introduces the Multi-modal Spatio-Temporal Adapter (MSTA), which aims to preserve the model's generalizability without compromising task performance.
Prevalent approaches to video action recognition attach additional parameter modules to handle the temporal dimension of video data. However, these methods risk overfitting and, more critically, catastrophic forgetting, which erodes the generalizable knowledge acquired during pre-training. To tackle these issues, MSTA takes a streamlined approach that balances pre-trained general knowledge with task-specific knowledge. This balance is enforced through a novel spatio-temporal description-guided consistency constraint, which keeps task-specific adaptations from drifting away from the original model's representations.
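To make the idea of such a constraint concrete, the following is a minimal sketch of how a description-guided consistency term might be computed, assuming it penalizes divergence between the trainable (adapted) branch and the frozen pre-trained branch on LLM-generated spatio-temporal descriptions. The function names and the 1 - cosine-similarity form are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a description-guided consistency loss (PyTorch).
import torch
import torch.nn.functional as F

def consistency_loss(adapted_encoder, frozen_encoder, description_tokens):
    """Keep the trainable branch close to the frozen pre-trained branch
    on LLM-generated spatio-temporal descriptions (assumed setup)."""
    with torch.no_grad():
        ref = frozen_encoder(description_tokens)   # frozen reference features
    out = adapted_encoder(description_tokens)      # features from the adapted branch
    ref = F.normalize(ref, dim=-1)
    out = F.normalize(out, dim=-1)
    # 1 - cosine similarity: zero when the adapted branch matches the frozen one.
    return (1.0 - (out * ref).sum(dim=-1)).mean()
```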
The experiments demonstrate the efficacy of MSTA across several benchmarks, where it outperforms existing state-of-the-art techniques while using only 2-7% of the trainable parameters required by previous models. This is particularly significant for deployment environments with limited computational resources. The results span a diverse set of datasets, including Kinetics-400, Something-Something V2, and ActivityNet, indicating the robustness and general applicability of the proposed method.
Key contributions of MSTA include modality-specific projection layers that enable independent processing and alignment for video and language data. A shared unified feature space at the adapter's core supports knowledge transfer and joint gradient optimization, allowing the model to adapt to new tasks with minimal data while preserving its pre-trained strengths. Furthermore, the consistency constraint, reinforced by LLM-generated multi-modal descriptions, mitigates overfitting by ensuring that the trainable branch does not deviate significantly from the representations of the frozen model.
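As a rough illustration of this design, the sketch below shows an adapter block with modality-specific down-projections feeding a shared up-projection back to the encoder width. The layer sizes, GELU activation, residual connection, and class name are assumptions made for illustration rather than the paper's exact architecture.

```python
# Hypothetical MSTA-style adapter block (PyTorch): modality-specific
# down-projections into a shared bottleneck, with a shared up-projection.
import torch
import torch.nn as nn

class MultiModalAdapter(nn.Module):
    def __init__(self, dim: int = 768, bottleneck: int = 128):
        super().__init__()
        self.down_video = nn.Linear(dim, bottleneck)  # video-specific projection
        self.down_text = nn.Linear(dim, bottleneck)   # language-specific projection
        self.shared_up = nn.Linear(bottleneck, dim)   # shared unified feature space
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        down = self.down_video if modality == "video" else self.down_text
        # Residual connection keeps the frozen backbone's features intact.
        return x + self.shared_up(self.act(down(x)))

# Usage: the same adapter instance serves both branches, so gradients from
# video and text flow through the shared up-projection.
adapter = MultiModalAdapter()
video_tokens = torch.randn(2, 8 * 197, 768)  # (batch, frames * patches, dim)
text_tokens = torch.randn(2, 77, 768)        # (batch, context length, dim)
v_out = adapter(video_tokens, "video")
t_out = adapter(text_tokens, "text")
```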
The development and deployment of models like MSTA present considerable implications for both theory and practice. Theoretically, the work emphasizes the importance of maintaining a synergy between learned and new tasks, encouraging further exploration of modular architectures that balance generalization and specialization. Practically, the reduction in computational cost and parameter count without a trade-off in performance could lead to more widespread adoption in real-world scenarios, e.g., mobile applications and edge devices where resource constraints are a primary concern.
In summary, this paper demonstrates that significant performance improvements and efficient transfer learning can be achieved via strategically designed multi-modal adapters and consistency constraints. As AI continues to advance, the insights provided in this paper could influence future iterations of foundation models, guiding exploratory work into optimized architectures that extract maximum utility from both learned and novel domains.