Multi-entity Video Transformers for Fine-Grained Video Representation Learning (2311.10873v1)

Published 17 Nov 2023 in cs.CV

Abstract: The area of temporally fine-grained video representation learning aims to generate frame-by-frame representations for temporally dense tasks. In this work, we advance the state of the art in this area by re-examining the design of transformer architectures for video representation learning. A salient aspect of our self-supervised method is the improved integration of spatial information in the temporal pipeline by representing multiple entities per frame. Prior works use late fusion architectures that reduce each frame to a single one-dimensional vector before any cross-frame information is shared, while our method represents each frame as a group of entities or tokens. Our Multi-entity Video Transformer (MV-Former) architecture achieves state-of-the-art results on multiple fine-grained video benchmarks. MV-Former leverages image features from self-supervised ViTs, and employs several strategies to maximize the utility of the extracted features while also avoiding the need to fine-tune the complex ViT backbone. These include a Learnable Spatial Token Pooling strategy, which is used to identify and extract features for multiple salient regions per frame. Our experiments show that MV-Former not only outperforms previous self-supervised methods, but also surpasses some prior works that use additional supervision or training data. When combined with additional pre-training data from Kinetics-400, MV-Former achieves a further performance boost. The code for MV-Former is available at https://github.com/facebookresearch/video_rep_learning.
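
To make the contrast in the abstract concrete, here is a minimal NumPy sketch of the two ideas: late fusion, which collapses each frame to one vector, versus pooling several entity tokens per frame. It is an illustration under stated assumptions, not the authors' implementation. The pooling is modeled as softmax attention from a few query vectors over patch features, which is one plausible reading of "Learnable Spatial Token Pooling". The function names, shapes, and random stand-in features are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def late_fusion_pool(patch_feats):
    """Late fusion baseline: average all patches, one vector per frame.

    patch_feats: (T, P, D) array of T frames, P patches, D channels.
    Returns an array of shape (T, D).
    """
    return patch_feats.mean(axis=1)


def spatial_token_pool(patch_feats, queries):
    """Hypothetical multi-entity pooling via attention over patches.

    queries: (K, D) array; these would be learned parameters in training,
             here they are just passed in (assumption for illustration).
    Returns an array of shape (T, K, D): K entity tokens per frame.
    """
    d = patch_feats.shape[-1]
    # Similarity of every query to every patch in every frame: (T, K, P).
    scores = np.einsum("kd,tpd->tkp", queries, patch_feats) / np.sqrt(d)
    # Each query distributes attention over the P patches of a frame.
    weights = softmax(scores, axis=-1)
    # Weighted sum of patch features per query: (T, K, D).
    return np.einsum("tkp,tpd->tkd", weights, patch_feats)


# Random stand-ins for frozen ViT patch features (not real data).
rng = np.random.default_rng(0)
T, P, D, K = 4, 196, 32, 3
patch_feats = rng.standard_normal((T, P, D))
queries = rng.standard_normal((K, D))

single = late_fusion_pool(patch_feats)               # (T, D)
entities = spatial_token_pool(patch_feats, queries)  # (T, K, D)
# Flatten to a sequence of T*K tokens for a temporal transformer.
tokens = entities.reshape(T * K, D)
```

In this sketch the temporal model would receive `T * K` tokens instead of `T`, so cross-frame attention can relate individual regions rather than whole-frame summaries. With all-zero queries the attention weights become uniform and the pooling reduces to the late fusion average.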


Authors (6)

Matthew Walmer (6 papers)
Rose Kanjirathinkal (1 paper)
Kai Sheng Tai (11 papers)
Keyur Muzumdar (2 papers)
Taipeng Tian (5 papers)
Abhinav Shrivastava (120 papers)

GitHub

GitHub - facebookresearch/video_rep_learning: SSL Video Representation Learning project (10 stars)
