AIM: Adapting Image Models for Efficient Video Action Recognition
The paper "AIM: Adapting Image Models for Efficient Video Action Recognition" proposes a novel framework that addresses computational inefficiencies in video action recognition utilizing vision transformer-based models. The immense success of vision transformers in image recognition has sparked interest in their application to video data. However, full finetuning of these large models on video tasks is resource-intensive. AIM proposes a solution by introducing parameter-efficient transfer learning, using lightweight Adapters to enhance pre-trained image transformers for video understanding without complete retraining.
Methodology Overview
The AIM framework introduces spatial adaptation, temporal adaptation, and joint adaptation to equip a frozen image transformer model with spatiotemporal reasoning capability. The methodology is structured as follows:
- Spatial Adaptation: With the pre-trained image model frozen, a lightweight Adapter is placed after the self-attention layer in each transformer block. This lets the model refine the spatial representations of video frames without altering any of the backbone's parameters.
- Temporal Adaptation: The frozen self-attention layer is reused along the temporal dimension so that the model captures relationships across frames. This reuse strategy, complemented by an additional Adapter, lets the model incorporate temporal dynamics with only a small number of new parameters.
- Joint Adaptation: A further Adapter is placed in parallel with the transformer's MLP layer to fuse spatial and temporal features, enabling their joint refinement; a code sketch of how the three adaptations fit together follows this list.
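A minimal PyTorch sketch of how the three adaptations could be wired into a single frozen transformer block is shown below. The attribute names (`norm1`, `attn`, `norm2`, `mlp`), the (batch, frames, tokens, channels) tensor layout, and the scaling factor on the parallel adapter are assumptions made for illustration, not the authors' exact implementation; the `Adapter` class is the one sketched earlier.

```python
import torch
import torch.nn as nn

class AIMBlock(nn.Module):
    """A frozen ViT block with spatial, temporal, and joint adaptation (illustrative)."""
    def __init__(self, frozen_block: nn.Module, dim: int):
        super().__init__()
        # Reuse the pre-trained components as-is (timm-style attribute names are assumed).
        self.norm1 = frozen_block.norm1
        self.attn = frozen_block.attn        # reused for both spatial and temporal attention
        self.norm2 = frozen_block.norm2
        self.mlp = frozen_block.mlp
        for p in self.parameters():          # freeze everything registered so far
            p.requires_grad = False
        # Trainable lightweight adapters (added after the freeze loop).
        self.t_adapter = Adapter(dim, skip=False)  # temporal adaptation: zero branch at init
        self.s_adapter = Adapter(dim)              # spatial adaptation
        self.j_adapter = Adapter(dim, skip=False)  # joint adaptation, parallel to the MLP
        self.scale = 0.5                           # assumed weighting of the parallel adapter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) -- batch, frames, spatial tokens, channels (assumed layout).
        B, T, N, D = x.shape
        # Temporal adaptation: frozen attention applied across frames at each spatial location.
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        xt = self.t_adapter(self.attn(self.norm1(xt)))
        x = x + xt.reshape(B, N, T, D).permute(0, 2, 1, 3)
        # Spatial adaptation: the usual per-frame attention, followed by an adapter.
        xs = x.reshape(B * T, N, D)
        xs = self.s_adapter(self.attn(self.norm1(xs)))
        x = x + xs.reshape(B, T, N, D)
        # Joint adaptation: an adapter in parallel with the frozen MLP.
        xn = self.norm2(x)
        x = x + self.mlp(xn) + self.scale * self.j_adapter(xn)
        return x
```

With zero-initialized adapters, the block reproduces the frozen image model on each frame at the start of training, and only the adapter parameters receive gradients, which is where the parameter savings reported below come from.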
Experimental Results
The AIM framework exhibits competitive performance across multiple video action recognition benchmarks: Kinetics-400, Kinetics-700, Something-Something v2, and Diving-48. With pre-trained backbones such as ViT and Swin, AIM consistently achieves comparable or superior accuracy to fully finetuned models while requiring far fewer tunable parameters and less compute. Notably, the method is also data-efficient, holding up well in low-data regimes.
Implications and Future Work
The paper carries important implications for developing and deploying deep learning models for video recognition. By advancing parameter-efficient finetuning, AIM narrows the performance-cost gap inherent in adapting image-centric models to video tasks: it reduces the computational burden while maintaining, or even improving, model performance.
The framework's flexibility suggests applicability to a variety of backbones and pre-trained models, including larger image models and even multi-modal models, underscoring its scalability. Future work could refine the temporal adaptation for datasets that rely heavily on temporal cues, since reusing spatial attention for temporal modeling may not fully capture temporal nuances.
In summary, AIM provides a promising pathway for leveraging existing powerful image models in video recognition tasks through strategic, efficient adaptations, highlighting a significant step towards cost-effective, scalable AI model deployment.