MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
- The paper’s main contribution is the use of depthwise separable convolutions, which drastically reduce computation with only a marginal accuracy drop.
- It introduces adjustable width and resolution multipliers that enable flexible trade-offs between model complexity and performance for various mobile applications.
- Experimental results reveal that MobileNets achieve competitive ImageNet accuracy while significantly reducing parameters and computational cost, making them well suited to resource-constrained devices.
The paper "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications" introduces a novel yet pragmatically efficient architecture for deep neural networks, specifically tailored for mobile and embedded vision applications. This architecture, named MobileNets, capitalizes on depthwise separable convolutions to achieve a high-quality trade-off between model accuracy and computational efficiency.
Principal Contributions
The primary innovation of MobileNets lies in the use of depthwise separable convolutions, which factorize a standard convolution into two separate layers: a depthwise convolution that filters each input channel spatially, and a pointwise 1×1 convolution that combines channels. This factorization drastically reduces computation and parameter count while largely preserving accuracy, which in turn lowers inference latency.
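The block below is a minimal PyTorch sketch of this factorization, mirroring the paper's depthwise 3×3 + BN + ReLU followed by pointwise 1×1 + BN + ReLU pattern; the channel sizes in the usage example are illustrative, not values from the paper's tables.

```python
# Minimal sketch of a MobileNet-style depthwise separable block (assuming PyTorch).
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Depthwise convolution: one 3x3 filter per input channel (groups=in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        # Pointwise convolution: 1x1 filters combine channels into out_channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        x = self.relu(self.bn2(self.pointwise(x)))
        return x

# Example: a 32-channel feature map expanded to 64 channels (illustrative sizes).
block = DepthwiseSeparableConv(32, 64)
out = block(torch.randn(1, 32, 112, 112))  # -> torch.Size([1, 64, 112, 112])
```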
Moreover, MobileNets introduce two hyper-parameters: width multiplier (α) and resolution multiplier (ρ). The width multiplier uniformly thins the neural network layers, whereas the resolution multiplier reduces the input image resolution and correspondingly scales the internal representation. These parameters provide flexibility, enabling developers to tailor models in alignment with specific application constraints on computational resources and latency.
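In the paper's notation, for a layer with $D_K \times D_K$ kernels, $M$ input channels, $N$ output channels, and a $D_F \times D_F$ feature map, the depthwise separable cost with both multipliers applied is

$$D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F \;+\; \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F,$$

with $\alpha, \rho \in (0, 1]$, so computation shrinks roughly quadratically in each multiplier.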
Experimental Results
The paper showcases a thorough evaluation of MobileNets on the ImageNet classification benchmark and various other applications, evidencing their versatility and effectiveness. The key experimental insights are as follows:
- Depthwise Separable Convolutions: Replacing standard convolutions with depthwise separable ones costs only about 1% in accuracy while cutting Mult-Adds from 4866 million to 569 million and parameters from 29.3 million to 4.2 million (a worked version of this cost comparison follows the list).
- Model Shrinking: The paper sweeps the width and resolution multipliers. For instance, with α=0.5 at 224×224 input, MobileNet achieves 63.7% top-1 accuracy using only 149 million Mult-Adds.
- Comparison to Established Models: The full MobileNet (α=1, 224×224 input) reaches 70.6% top-1 accuracy on ImageNet, close to VGG-16 while using roughly 32 times fewer parameters and 27 times fewer Mult-Adds.
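As a sanity check on the first bullet, the paper's per-layer reduction factor for a depthwise separable layer relative to a standard convolution is

$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2},$$

which for 3×3 kernels is roughly 1/9, i.e. about 8 to 9 times fewer Mult-Adds; this is consistent with the network-level ratio of 4866M / 569M ≈ 8.6 quoted above.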
Applications
Beyond classification on ImageNet, MobileNets exhibit robust performance across various tasks, including:
- Fine-grained Recognition: Near state-of-the-art accuracy on the Stanford Dogs dataset, demonstrating applicability to specialized classification tasks.
- Large-scale Geo-localization: In the PlaNet model, MobileNets achieve competitive geolocalization performance while being much more compact and computationally efficient.
- Face Attributes and Embedding Tasks: Using distillation, MobileNets approach the performance of larger face-attribute classifiers and embedding models such as FaceNet while requiring far less computation (a generic sketch of a distillation loss follows this list).
- Object Detection: MobileNets also serve as backbones in detection systems such as SSD and Faster R-CNN, performing competitively with substantially fewer Mult-Adds and parameters.
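The paper describes the distillation setup only at a high level: the MobileNet student is trained to emulate the outputs of a larger teacher network. The snippet below is a generic sketch of such a soft-target distillation loss in PyTorch; the temperature, mixing weight, and optional hard-label term are assumptions for illustration, not details from the paper.

```python
# Generic knowledge-distillation loss sketch (assumed setup, not the paper's exact recipe).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels=None, T=4.0, mix=0.7):
    # Soft targets: KL divergence between temperature-scaled teacher and student distributions.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    if labels is None:
        # The paper's face-attribute setup needs no ground-truth labels:
        # the student simply emulates the teacher's outputs.
        return soft
    # When labels are available, mix in an ordinary cross-entropy term.
    hard = F.cross_entropy(student_logits, labels)
    return mix * soft + (1.0 - mix) * hard
```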
Theoretical and Practical Implications
The theoretical contribution of the MobileNet architecture establishes a foundation for continued work on model efficiency. By decoupling spatial filtering from channel combination, MobileNets advance the state of the art in compact neural networks and make the accuracy/efficiency trade-off explicit and tunable.
Practically, MobileNets have substantial implications for deploying neural networks on mobile and edge devices. The ability to tailor models via the width and resolution multipliers lets a single architecture match the varying computational budgets of such devices, opening opportunities for on-device intelligence in real-time vision tasks such as augmented reality, autonomous navigation, and personalized user experiences, while keeping inference fast and power consumption low.
Future Directions
Continued research on MobileNets can explore further optimization techniques, for example hybrid approaches that combine the architecture with pruning or quantization. Adaptive scaling based on real-time resource availability and application context is another intriguing direction, promising greater flexibility and robustness across operational scenarios. Such advances could yield models that give up progressively less accuracy for their efficiency, setting new benchmarks for mobile and embedded applications.
In conclusion, MobileNets present a pragmatic approach to deep learning for resource-constrained environments. Their balanced architecture fosters adoption across diverse vision applications, raising the standard for efficient neural network design.