Analyzing Multiscale Vision Transformers (MViT) for Effective Video and Image Recognition
The paper on Multiscale Vision Transformers (MViT) presents an approach to video and image recognition that combines the multiscale feature hierarchies of classic vision architectures with modern transformer designs. This essay reviews the paper's main contributions, experimental results, and implications.
Introduction
The development of neural networks for computer vision has evolved significantly, drawing initial inspiration from biological systems. Hubel and Wiesel's hierarchical model of the visual cortex laid the groundwork, followed by architectures such as the Neocognitron and early convolutional neural networks (CNNs). In parallel, multiscale "pyramid" processing emerged as a way to reduce computation at high resolutions while capturing broader context at lower resolutions.
Transformers, originally designed for natural language processing, use attention mechanisms to model dependencies between tokens, and recent adaptations to vision tasks have shown promising results. The Vision Transformer (ViT), however, is a single-scale architecture: it maintains a constant spatial resolution and channel dimension throughout all of its layers.
Multiscale Vision Transformers: Concept and Architecture
MViT introduces a novel architectural approach that integrates multiscale feature hierarchies into the transformer model. The key idea is to expand the channel capacity hierarchically while progressively decreasing the spatiotemporal resolution, creating a feature pyramid within the transformer network. This multiscale design lets early layers operate at high resolution to model simple, low-level visual features, while deeper layers operate at coarse resolution to model complex, high-level features.
Multi Head Pooling Attention (MHPA)
A central component of MViT is the Multi Head Pooling Attention (MHPA) mechanism, which pools the query, key, and value tensors independently to reduce their sequence lengths before attention is computed. Because the cost of attention grows with the product of the query and key sequence lengths, this pooling keeps the computation tractable for long, high-resolution visual inputs. The MHPA operator is what allows MViT to operate at different resolutions in different stages of the network.
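To make the mechanism concrete, the following is a minimal, single-head sketch of pooling attention in PyTorch. It is not the authors' implementation: the class token is omitted, max pooling stands in for the paper's learned pooling, and the strides `q_stride` and `kv_stride` are illustrative assumptions.

```python
# Minimal illustrative sketch of pooling attention (single head, no class token).
import torch
import torch.nn as nn


class PoolingAttention(nn.Module):
    """Single-head pooling attention over a (T, H, W) token grid."""

    def __init__(self, dim, thw, q_stride=(1, 2, 2), kv_stride=(1, 4, 4)):
        super().__init__()
        self.dim, self.thw = dim, thw
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Pooling operators shrink the sequence by striding over the 3D token grid.
        self.pool_q = nn.MaxPool3d(kernel_size=q_stride, stride=q_stride)
        self.pool_kv = nn.MaxPool3d(kernel_size=kv_stride, stride=kv_stride)

    def _pool(self, x, pool):
        # x: (B, N, C) with N = T*H*W -> reshape to the grid, pool, flatten back.
        B, N, C = x.shape
        T, H, W = self.thw
        x = x.transpose(1, 2).reshape(B, C, T, H, W)
        x = pool(x)
        return x.flatten(2).transpose(1, 2)  # (B, N_pooled, C)

    def forward(self, x):
        q = self._pool(self.q_proj(x), self.pool_q)
        k = self._pool(self.k_proj(x), self.pool_kv)
        v = self._pool(self.v_proj(x), self.pool_kv)
        attn = (q @ k.transpose(-2, -1)) / self.dim ** 0.5
        out = attn.softmax(dim=-1) @ v
        return self.out_proj(out)  # output length follows the pooled queries


x = torch.randn(2, 8 * 14 * 14, 96)           # batch of 2, an 8x14x14 token grid, 96 channels
y = PoolingAttention(96, thw=(8, 14, 14))(x)
print(y.shape)                                 # torch.Size([2, 392, 96]) = 8*7*7 query tokens
```

Note that the output sequence length is set by the pooled queries, while the pooled keys and values only affect the attention matrix, which is why key/value pooling reduces cost without changing the block's output resolution.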
Scale Stages and Channel Expansion
MViT's architecture is organized into several scale stages, each consisting of transformer blocks that operate at the same resolution and channel dimension. At the transition between stages, the network reduces the spatiotemporal resolution by pooling the query tensor while expanding the channel capacity. This hierarchical, multiscale design contrasts sharply with the single-scale design of ViT and yields substantial computational savings alongside better modeling of dense visual signals; a short numeric illustration of the stage schedule follows.
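The effect of this schedule on the token grid can be shown with a few lines of arithmetic. The values below follow the general MViT-B pattern of halving spatial resolution and doubling channels at each stage transition; treat them as approximate, illustrative numbers rather than the exact configuration.

```python
# Illustrative multiscale stage schedule (approximate MViT-B-style values).
thw = (8, 56, 56)   # token grid after patchifying a 16-frame clip (assumption)
channels = 96       # channel width of the first scale stage

for stage in range(1, 5):
    tokens = thw[0] * thw[1] * thw[2]
    print(f"stage {stage}: grid={thw}, tokens={tokens}, channels={channels}")
    # Stage transition: query pooling halves H and W, channel width doubles.
    thw = (thw[0], thw[1] // 2, thw[2] // 2)
    channels *= 2
```

The sequence length falls from tens of thousands of tokens in the first stage to a few hundred in the last, while the channel width grows from 96 to 768, which is exactly the coarse-and-wide versus fine-and-narrow trade-off a single-scale ViT cannot make.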
Experimental Results and Analysis
The experimental evaluation of MViT spans various datasets and benchmarks, showcasing its superiority over existing methods in both performance and efficiency.
Video Recognition on Kinetics
MViT demonstrated substantial gains on the Kinetics-400 and Kinetics-600 datasets, outperforming concurrent vision transformer models as well as strong CNN baselines. Notably, MViT achieved these results when trained from scratch, without the large-scale ImageNet-21K pre-training that counterparts such as ViViT and TimeSformer rely on. For example, the MViT-B, 64×3 model reached 81.2% top-1 accuracy on Kinetics-400 while requiring substantially less computation than ViViT.
Temporal Understanding and Frame Shuffling
The paper's frame-shuffling experiment showed that the ViT video baseline makes little use of temporal information: its accuracy barely declined when input frames were randomly shuffled at test time. In contrast, MViT models suffered a significant drop in accuracy under the same perturbation, indicating that they genuinely exploit the temporal structure of videos.
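The diagnostic itself is simple to express. The sketch below assumes a hypothetical `model` and `loader` and only illustrates the idea of destroying temporal order at test time; the paper's exact evaluation protocol may differ in detail.

```python
# Sketch of the frame-shuffling diagnostic (hypothetical model/loader).
import torch


@torch.no_grad()
def evaluate(model, loader, shuffle_frames=False):
    correct = total = 0
    for clips, labels in loader:            # clips: (B, C, T, H, W)
        if shuffle_frames:
            perm = torch.randperm(clips.shape[2])
            clips = clips[:, :, perm]       # scramble temporal order, keep frame content
        preds = model(clips).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total


# acc_drop = evaluate(model, val_loader) - evaluate(model, val_loader, shuffle_frames=True)
# A large drop indicates reliance on temporal structure, as observed for MViT.
```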
Transferability to Other Datasets
MViT's strong performance was further validated on Something-Something-v2, Charades, and AVA. Across these benchmarks, MViT consistently outperformed state-of-the-art methods, demonstrating that the architecture generalizes across diverse video understanding tasks, from temporal reasoning to action detection.
Computational Efficiency
MViT models also proved more computationally efficient than their single-scale ViT counterparts. Multi Head Pooling Attention, combined with hierarchical downsampling of the token grid, reduces both memory and compute costs while maintaining or improving accuracy.
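A back-of-the-envelope calculation shows why pooling the keys and values pays off: the dominant attention term scales with the product of query and key sequence lengths, so shrinking the key/value sequence by a pooling factor reduces that term proportionally. The numbers below are illustrative assumptions, not figures reported in the paper.

```python
# Rough attention-cost comparison, assuming cost ~ (query len) x (key len) x (channels);
# projection layers and constant factors are ignored.
def attn_cost(n_q, n_kv, channels):
    return n_q * n_kv * channels

n = 8 * 56 * 56                        # full-resolution token count (illustrative)
full = attn_cost(n, n, 96)             # single-scale attention: quadratic in n
pooled = attn_cost(n, n // 16, 96)     # keys/values pooled 4x4 in space (assumption)
print(f"relative cost with pooled keys/values: {pooled / full:.3f}")   # ~0.062
```

Query pooling at stage transitions compounds this saving, since later stages attend over a much smaller token grid to begin with.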
Implications and Future Directions
The multiscale architecture of MViT provides a promising direction for future research in visual recognition and transformer-based models. This hierarchical approach offers a pathway for more efficient and scalable deep learning models, particularly in domains requiring processing of large and dense visual datasets.
Practical Integration and Deployment
From a practical standpoint, the design of MViT allows for efficient deployment in resource-constrained environments without compromising performance, making it suitable for real-world applications in video analytics, autonomous systems, and multimedia processing.
Theoretical Advancements
Theoretically, MViT bridges the gap between traditional CNN-inspired multiscale processing and transformer architectures, opening avenues for further exploration of hybrid models that leverage the strengths of both paradigms.
Conclusion
Multiscale Vision Transformers represent a significant advancement in the field of visual recognition, merging the concepts of hierarchical feature extraction with the flexibility of transformers. The empirical results underscore the efficacy of multiscale design in achieving superior accuracy with reduced computational demand, positioning MViT as a foundational architecture for future research and development in computer vision.