Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling (2208.12257v1)
Abstract: Transformer-based models have achieved top performance on major video recognition benchmarks. Benefiting from the self-attention mechanism, these models show a stronger ability to model long-range dependencies than CNN-based models. However, the significant computation overhead resulting from the quadratic complexity of self-attention over a tremendous number of tokens limits the use of existing video transformers in resource-constrained applications such as mobile devices. In this paper, we extend Mobile-Former to Video Mobile-Former, which decouples the video architecture into a lightweight 3D-CNN for local context modeling and Transformer modules for global interaction modeling in a parallel fashion. To avoid the significant computational cost of computing self-attention among the large number of local patches in a video, we propose to use very few global tokens (e.g., 6) for a whole video in the Transformer, which exchange information with the 3D-CNN through a cross-attention mechanism. Through efficient global spatial-temporal modeling, Video Mobile-Former significantly improves the video recognition performance of lightweight baselines and outperforms other efficient CNN-based models in the low-FLOP regime from 500M to 6G total FLOPs on various video recognition tasks. It is worth noting that Video Mobile-Former is the first Transformer-based video model whose computational budget is constrained within 1G FLOPs.
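The key efficiency idea in the abstract is that a handful of global tokens attend to the 3D-CNN feature map via cross-attention, so attention cost grows linearly with the number of spatio-temporal positions rather than quadratically. Below is a minimal PyTorch sketch of this idea, not the authors' implementation; the module name `GlobalTokenCrossAttention` and all hyperparameters (dim, num_tokens, num_heads) are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): a few learnable global tokens query
# flattened 3D-CNN features through cross-attention.
import torch
import torch.nn as nn

class GlobalTokenCrossAttention(nn.Module):
    def __init__(self, dim=128, num_tokens=6, num_heads=4):
        super().__init__()
        # Very few learnable global tokens shared across the whole video.
        self.global_tokens = nn.Parameter(torch.zeros(1, num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feat):
        # feat: 3D-CNN feature map of shape (B, C, T, H, W)
        b, c, t, h, w = feat.shape
        # Flatten local spatio-temporal positions into a token sequence.
        local = feat.flatten(2).transpose(1, 2)           # (B, T*H*W, C)
        tokens = self.global_tokens.expand(b, -1, -1)     # (B, num_tokens, C)
        # Global tokens (queries) gather information from the local features
        # (keys/values); cost is linear in T*H*W, not quadratic.
        tokens, _ = self.attn(tokens, local, local)
        return tokens

# Usage: 6 global tokens summarize an 8-frame, 14x14 feature map.
feats = torch.randn(2, 128, 8, 14, 14)
print(GlobalTokenCrossAttention()(feats).shape)  # torch.Size([2, 6, 128])
```

In the paper's parallel design, such cross-attention would run alongside the 3D-CNN branch so that global context flows back into the convolutional features as well; the sketch only shows the token-gathering direction.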
- Rui Wang (996 papers)
- Zuxuan Wu (144 papers)
- Dongdong Chen (164 papers)
- Yinpeng Chen (55 papers)
- Xiyang Dai (53 papers)
- Mengchen Liu (48 papers)
- Luowei Zhou (31 papers)
- Lu Yuan (130 papers)
- Yu-Gang Jiang (223 papers)