Mobile Video Networks for Efficient Video Recognition: An Overview
The paper "Mobile Video Networks" (MoViNets) addresses the growing demand for efficient and accurate video recognition models capable of operating on mobile devices. The authors present a comprehensive approach to designing computation and memory-efficient 3D convolutional neural networks (CNNs) suitable for online inference on mobile platforms. This research is particularly relevant given the constraints posed by mobile environments in terms of power, computational capability, and memory availability, alongside the increasing prevalence of streaming and real-time video applications.
Key Contributions
- Neural Architecture Search (NAS) for Video Networks: The authors propose a search space designed specifically for video networks and use NAS to discover architectures that balance spatial, temporal, and spatiotemporal operations, yielding significant gains in both accuracy and efficiency.
- Stream Buffer Technique: MoViNets introduce the stream buffer, which decouples memory usage from video clip duration by caching feature activations at the boundaries between consecutive subclips. This lets the network handle streaming input of any length with a constant memory footprint, making inference on long sequences feasible without substantial memory overhead (a sketch of the idea follows this list).
- Temporal Ensembles: A simple yet effective ensembling technique boosts accuracy without sacrificing computational efficiency: two networks are trained independently at half the original frame rate, offset from each other by one frame, and their logits are averaged. This recovers the accuracy lost to stream buffers and helps the models reach state-of-the-art performance on prominent video recognition datasets such as Kinetics and Moments in Time (see the second sketch below).
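To make the stream buffer concrete, the following is a minimal sketch, not the authors' implementation, of a causal temporal convolution that caches its last few frames of activations between subclips. The class name StreamBufferConv and the shapes are illustrative assumptions; MoViNets apply the same principle across their 3D convolutional blocks.

```python
# Minimal sketch of the stream-buffer idea (illustrative, not the official code).
# A depthwise temporal convolution caches its last (kernel_t - 1) input frames so
# that consecutive subclips can be processed with constant memory, while the
# concatenated outputs match running the same causal convolution over the full video.
import torch
import torch.nn as nn

class StreamBufferConv(nn.Module):
    def __init__(self, channels: int, kernel_t: int = 3):
        super().__init__()
        self.kernel_t = kernel_t
        # Depthwise convolution over time only; spatial kernel kept at 1x1 for clarity.
        self.conv = nn.Conv3d(channels, channels, kernel_size=(kernel_t, 1, 1),
                              groups=channels, bias=False)
        self.buffer = None  # activations carried over from the previous subclip

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width) for one subclip
        b, c, _, h, w = x.shape
        if self.buffer is None:
            # Causal zero-padding before the very first subclip.
            pad = x.new_zeros((b, c, self.kernel_t - 1, h, w))
        else:
            pad = self.buffer
        x_cat = torch.cat([pad, x], dim=2)
        # Cache the trailing frames for the next subclip; detach so gradients
        # do not flow across subclip boundaries.
        self.buffer = x_cat[:, :, -(self.kernel_t - 1):].detach()
        return self.conv(x_cat)  # output has the same temporal length as x
```

Feeding a long video through such a module one short subclip at a time produces the same activations as processing the whole clip at once, but peak memory depends only on the subclip length, which is the property the paper exploits for online inference.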
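The temporal ensemble can be sketched just as briefly. The function below assumes two already-trained classifiers (model_a and model_b are placeholders) that each accept a clip sampled at half the original frame rate and return class logits; it mirrors the paper's description rather than reproducing its code.

```python
# Minimal sketch of a two-model temporal ensemble (illustrative interface).
import torch

def temporal_ensemble(video: torch.Tensor, model_a, model_b) -> torch.Tensor:
    """video: (batch, channels, time, height, width); models return class logits."""
    clip_a = video[:, :, 0::2]   # even-indexed frames -> half the frame rate
    clip_b = video[:, :, 1::2]   # odd-indexed frames, offset by one
    logits = (model_a(clip_a) + model_b(clip_b)) / 2   # average logits before softmax
    return torch.softmax(logits, dim=-1)
```

Because each model sees only half the frames, the two-model ensemble costs roughly the same number of FLOPs as a single model run at the full frame rate, which is why accuracy can be raised without sacrificing efficiency.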
Numerical Results and Claims
The MoViNet family exhibits compelling performance, achieving state-of-the-art accuracy with substantially fewer floating-point operations (FLOPs) and lower memory consumption. For example, MoViNet-A5 matches the accuracy of X3D-XL on the Kinetics 600 dataset while requiring 80% fewer FLOPs and 65% less memory. These results highlight how effectively the proposed methods reduce the computational demands typically associated with 3D CNNs for video tasks.
Implications and Future Directions
From a practical standpoint, the research opens the door to deploying high-accuracy video recognition models on resource-constrained mobile devices, with potential impact on applications ranging from mobile cameras to autonomous systems and IoT devices. Theoretically, the combination of stream buffers and causal operations suggests new avenues for temporal sequence modeling in deep learning, particularly for real-time systems.
Future research may extend the MoViNet framework to other domains requiring temporal modeling, refine NAS for even greater architecture efficiency, or integrate additional modalities like audio and 3D data. The paper's approach may serve as a foundation for further exploration into lightweight and responsive AI models that can operate effectively at the edge.
In conclusion, this work bridges the gap between the computational intensity of 3D CNNs and the practical constraints of mobile environments. The proposed techniques and thorough experimental validation mark a promising step toward democratizing video recognition across diverse application scenarios, and the publicly released code should encourage widespread adoption and further innovation within the community.