MobileNetV2: Inverted Residuals and Linear Bottlenecks (1801.04381v4)

Published 13 Jan 2018 in cs.CV

Abstract: In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes. We also describe efficient ways of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3 which we call Mobile DeepLabv3. The MobileNetV2 architecture is based on an inverted residual structure where the input and output of the residual block are thin bottleneck layers, opposite to traditional residual models which use expanded representations in the input. MobileNetV2 uses lightweight depthwise convolutions to filter features in the intermediate expansion layer. Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design. Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on ImageNet classification, COCO object detection, and VOC image segmentation. We evaluate the trade-offs between accuracy and number of operations measured by multiply-adds (MAdd), as well as the number of parameters.

MobileNetV2: Inverted Residuals and Linear Bottlenecks

The paper "MobileNetV2: Inverted Residuals and Linear Bottlenecks" by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen presents a novel neural network architecture specifically optimized for mobile and resource-constrained environments. MobileNetV2 builds on the foundations of its predecessor, MobileNetV1, but introduces significant innovations such as the inverted residual structure combined with linear bottlenecks. These enhancements lead to notable improvements in both efficiency and accuracy for various computer vision tasks.

Key Innovations

Inverted Residuals with Linear Bottlenecks:

The cornerstone of MobileNetV2 is the introduction of the inverted residual module with linear bottlenecks. Unlike traditional residual connections, which link layers with high-dimensional representations, inverted residuals connect the thin bottleneck layers directly. This approach allows a significant reduction in the number of operations and in memory usage without sacrificing accuracy. The structure consists of three stages:

  1. Expansion of the low-dimensional compressed input to a higher-dimensional space with a 1x1 pointwise convolution.
  2. Filtering of features in the expanded space with a lightweight depthwise convolution followed by a non-linear activation.
  3. Projection back to a low-dimensional representation with a linear 1x1 convolution.
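The per-block cost implied by these three stages can be sketched with a small multiply-add count. The function below mirrors the stage structure; the expansion factor t=6 matches the paper's default, but the feature-map and channel sizes in the example are assumed purely for illustration:

```python
# Sketch of the multiply-add (MAdd) cost of one inverted residual block.
# Stage 1: 1x1 pointwise expansion; Stage 2: k x k depthwise convolution;
# Stage 3: 1x1 linear projection back to a thin bottleneck.
# (The residual shortcut applies only when stride == 1 and c_in == c_out.)

def inverted_residual_madds(h, w, c_in, c_out, t=6, k=3, stride=1):
    """MAdds of one block with expansion factor t (the paper's default is 6)."""
    h_out, w_out = h // stride, w // stride
    expand = h * w * c_in * (t * c_in)               # 1x1 expansion conv
    depthwise = h_out * w_out * (t * c_in) * k * k   # k x k depthwise conv
    project = h_out * w_out * (t * c_in) * c_out     # 1x1 linear projection
    return expand + depthwise + project

# Illustrative example: a 56x56 map with 24 input and 24 output channels.
print(inverted_residual_madds(56, 56, 24, 24))  # -> 25740288
```

Note how the depthwise stage is cheap relative to the two pointwise stages, which is why the block can afford a wide intermediate expansion.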

Removal of Non-linearity in Narrow Layers:

A crucial insight of the paper is the removal of non-linearities in the narrow layers of the network. This design choice helps retain the network's representational power by preventing the information loss that non-linear activations would otherwise cause in these low-dimensional layers.
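A toy illustration of this information loss (the values below are assumed examples, not from the paper's experiments): two distinct low-dimensional activations can collapse to the same vector after ReLU, while a linear layer would keep them distinguishable:

```python
# ReLU zeroes out negative coordinates, so in a narrow layer two different
# inputs can become indistinguishable after the activation.

def relu(v):
    return [max(0.0, x) for x in v]

a = [0.5, -0.3]
b = [0.5, -0.7]               # a distinct input...
print(relu(a), relu(b))       # ...both map to [0.5, 0.0]: information is lost
```

In a wide expansion layer this is unlikely to matter, since the lost coordinates are redundant; in a narrow bottleneck each coordinate carries information, which motivates keeping the projection linear.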

Empirical Evaluation and Performance

MobileNetV2 was evaluated on several benchmarks, including ImageNet classification, COCO object detection, and PASCAL VOC image segmentation. The results demonstrate clear performance advantages over several state-of-the-art models across multiple performance points, particularly for mobile applications where computational efficiency is paramount.

ImageNet Classification:

MobileNetV2 achieves superior performance with significantly fewer parameters and multiply-add operations compared to both MobileNetV1 and other competitive models like ShuffleNet and NASNet-A. For instance, a MobileNetV2 model with 3.4 million parameters delivers a top-1 accuracy of 72% on ImageNet, performing comparably to larger models while being computationally more efficient.

Object Detection:

For object detection, MobileNetV2 was integrated into a modified Single Shot Detector (SSD) framework, referred to as SSDLite. This modification replaces the standard convolutions with separable convolutions in the prediction layers. MobileNetV2 combined with SSDLite achieves competitive accuracy on the COCO dataset with a fraction of the computational cost of models like YOLOv2. Specifically, it is reported to have 20 times fewer operations and 10 times fewer parameters than YOLOv2 while achieving similar mAP.
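The saving from substituting separable convolutions can be sketched by comparing multiply-add counts; the layer sizes below are illustrative assumptions, not figures from the paper:

```python
# MAdds of a standard k x k convolution vs. a depthwise-separable replacement
# of the kind used in the SSDLite prediction layers.

def standard_conv_madds(h, w, c_in, c_out, k=3):
    return h * w * c_in * c_out * k * k

def separable_conv_madds(h, w, c_in, c_out, k=3):
    depthwise = h * w * c_in * k * k    # per-channel spatial filtering
    pointwise = h * w * c_in * c_out    # 1x1 channel mixing
    return depthwise + pointwise

# Illustrative 19x19 feature map, 512 -> 256 channels.
std = standard_conv_madds(19, 19, 512, 256)
sep = separable_conv_madds(19, 19, 512, 256)
print(f"standard: {std}, separable: {sep}, ratio: {std / sep:.1f}x")
```

For these assumed sizes the separable version costs roughly an order of magnitude less, consistent with the direction of the savings the paper reports for SSDLite.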

Semantic Segmentation:

In semantic segmentation tasks, MobileNetV2 serves as the backbone for DeepLabv3, yielding strong performance on the PASCAL VOC 2012 dataset. When configured optimally, MobileNetV2 achieves a good balance between computational cost and accuracy, providing an effective solution for mobile semantic segmentation applications.

Implications and Future Directions

The proposed architecture of MobileNetV2 has several practical and theoretical implications:

Practical Implications:

  1. Memory Efficiency: The inverted residual blocks significantly reduce memory overhead during inference, which is critical for mobile devices where memory is a constrained resource.
  2. Operational Efficiency: The architecture allows for efficient implementation using standard operations available in modern deep learning frameworks, making it easily deployable.

Theoretical Implications:

  1. Decoupling Capacity and Expressiveness: The separation introduced by the inverted residual structure allows for a detailed understanding of the network's capacity (input/output bottlenecks) and expressiveness (expansion layers).
  2. Empirical Validation of Manifold Assumptions: The insights provided into the behavior of ReLU activations and their impact on layer transformations offer a pathway to more refined designs of neural networks.

Conclusion

MobileNetV2 represents a significant advancement in the design of efficient neural networks for mobile applications. The introduction of inverted residuals and linear bottlenecks achieves an optimal balance between performance and resource consumption. Future research could explore further decoupling of network expressiveness and capacity as well as investigate more complex architectural modifications. Additionally, the empirical insights gained from this work could inform the development of more robust and efficient models for a broad range of AI applications.

Authors (5)
  1. Mark Sandler (66 papers)
  2. Andrew Howard (59 papers)
  3. Menglong Zhu (18 papers)
  4. Andrey Zhmoginov (27 papers)
  5. Liang-Chieh Chen (66 papers)
Citations (17,510)