- The paper introduces MorphMLP, a lightweight MLP-based architecture that improves top-1 accuracy over transformer backbones while requiring substantially fewer GFLOPs on video tasks.
- It employs dedicated spatial (MorphFC_s) and temporal (MorphFC_t) layers to capture hierarchical token interactions and long-range dependencies.
- Empirical results on Kinetics-400 and Something-Something V2 demonstrate superior performance over transformer models, underscoring its efficiency and scalability.
MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning
The exploration of lightweight, efficient neural network architectures remains at the forefront of computer vision research. The paper "MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning" addresses the challenge of developing an efficient MLP-based framework tailored for video processing, circumventing the heavy computational demands typically associated with self-attention mechanisms. The authors propose MorphMLP, an architecture that leverages fully connected layers for effective spatial-temporal representation learning.
MorphMLP comprises two primary layers: MorphFC_s and MorphFC_t. The MorphFC_s layer handles spatial modeling by mixing tokens along the height and width dimensions of each video frame, with the interaction range growing progressively across stages; this hierarchical token interaction lets the network capture core semantic features in a manner analogous to how stacked convolutional layers expand their receptive fields. The MorphFC_t layer, on the other hand, targets temporal modeling by aggregating corresponding tokens across multiple frames, enabling adaptive learning of long-range dependencies.
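To make the token mixing concrete, below is a minimal PyTorch sketch of the two layers. The tensor layouts, the chunk length `L`, and the class names (`MorphFCSpatial`, `MorphFCTemporal`) are illustrative choices based on our reading of the paper, not the authors' official implementation: the spatial layer folds a chunk of `L` channels into one spatial axis and applies a single fully connected layer, and the temporal layer does the same across frames.

```python
# Hedged sketch of the MorphMLP layers; layouts and names are assumptions,
# not the official code. Input layouts: spatial (B, H, W, C), temporal (B, T, N, C).
import torch
import torch.nn as nn


class MorphFCSpatial(nn.Module):
    """Mixes tokens along one spatial axis (width, as an example) by folding
    a chunk of L channels into that axis and applying one FC layer."""

    def __init__(self, dim: int, chunk_len: int, axis_len: int):
        super().__init__()
        assert dim % chunk_len == 0
        self.L = chunk_len
        # One fully connected layer over the "morphed" (axis_len * L) dimension.
        self.fc = nn.Linear(axis_len * chunk_len, axis_len * chunk_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, H, W, C = x.shape
        x = x.reshape(B, H, W, C // self.L, self.L)            # split channels into chunks
        x = x.permute(0, 1, 3, 2, 4).reshape(B, H, C // self.L, W * self.L)
        x = self.fc(x)                                         # token interaction along width
        x = x.reshape(B, H, C // self.L, W, self.L).permute(0, 1, 3, 2, 4)
        return x.reshape(B, H, W, C)


class MorphFCTemporal(nn.Module):
    """Aggregates the same spatial token across T frames with one FC layer,
    giving every output frame a view of the full clip."""

    def __init__(self, dim: int, num_frames: int, chunk_len: int):
        super().__init__()
        assert dim % chunk_len == 0
        self.L = chunk_len
        self.fc = nn.Linear(num_frames * chunk_len, num_frames * chunk_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, N, C = x.shape                                   # N = H * W spatial tokens
        x = x.reshape(B, T, N, C // self.L, self.L)
        x = x.permute(0, 2, 3, 1, 4).reshape(B, N, C // self.L, T * self.L)
        x = self.fc(x)                                         # long-range temporal mixing
        x = x.reshape(B, N, C // self.L, T, self.L).permute(0, 3, 1, 2, 4)
        return x.reshape(B, T, N, C)
```

Because each FC acts on a folded (tokens × chunk) dimension rather than on all pairwise token interactions, the cost stays linear in the number of tokens, which is the source of the efficiency gains discussed next.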
The authors present a compelling case for MorphMLP's efficiency, demonstrating that it outperforms existing state-of-the-art models in both accuracy and computational cost. For instance, MorphMLP-S improves top-1 accuracy on Kinetics-400 by 0.9% over VideoSwin-T, a contemporary transformer-based model, while requiring only 50% of its GFLOPs. Likewise, MorphMLP-B gains 2.4% top-1 accuracy on Something-Something V2 with just 43% of the GFLOPs of MViT-B, illustrating MorphMLP's scalability and efficacy across model scales and datasets.
The paper also provides insights into the feasibility of transferring MorphMLP's designs across domains. The architecture, initially tailored for video recognition, can be adapted to image recognition tasks on datasets such as ImageNet-1K by simplifying the network to exclude temporal modeling. The results show competitive performance with prior SOTA MLP-Like architectures in the image domain, suggesting MorphMLP's versatility as a backbone in various computer vision tasks.
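For the image domain, dropping the temporal layer leaves a purely spatial block. The sketch below, reusing the hypothetical `MorphFCSpatial` from above, shows one plausible way such a 2D block could be composed; the residual-plus-LayerNorm structure is a common MLP-backbone convention and an assumption here, not a detail confirmed by this summary.

```python
# Hypothetical image-only block: spatial morphing plus a channel MLP,
# mirroring how the paper adapts MorphMLP to ImageNet-1K by excluding
# temporal modeling. Names and structure are illustrative assumptions.
class MorphBlock2D(nn.Module):
    def __init__(self, dim: int, chunk_len: int, axis_len: int):
        super().__init__()
        self.spatial = MorphFCSpatial(dim, chunk_len, axis_len)
        self.channel = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) image tokens; pre-norm residual structure.
        x = x + self.spatial(self.norm1(x))
        x = x + self.channel(self.norm2(x))
        return x
```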
The implications of this work are twofold. Practically, MorphMLP offers a tractable alternative to computationally expensive attention-based backbones, supporting deployment in resource-constrained environments such as mobile and edge devices. Theoretically, this research underscores the potential of MLP-Like architectures in domains traditionally dominated by CNNs and Transformers. As neural network design gravitates towards more efficient computation, the methods demonstrated in MorphMLP signal promising trajectories for future developments in MLP architectures.
Speculatively, continued research could explore extending MorphMLP with mechanisms such as dynamic token scaling or adaptive parameter sharing among layers to further refine efficiency. Additionally, expanding MorphMLP into other spatiotemporal domains, such as 3D point cloud processing or audio-visual fusion, could be fruitful areas of investigation. This paper contributes significantly to the re-evaluation and re-imagination of MLP architectures within the context of state-of-the-art video representation learning.