Pelee: A Real-Time Object Detection System on Mobile Devices (1804.06882v3)

Published 18 Apr 2018 in cs.CV

Abstract: An increasing need of running Convolutional Neural Network (CNN) models on mobile devices with limited computing power and memory resource encourages studies on efficient model design. A number of efficient architectures have been proposed in recent years, for example, MobileNet, ShuffleNet, and MobileNetV2. However, all these models are heavily dependent on depthwise separable convolution which lacks efficient implementation in most deep learning frameworks. In this study, we propose an efficient architecture named PeleeNet, which is built with conventional convolution instead. On ImageNet ILSVRC 2012 dataset, our proposed PeleeNet achieves a higher accuracy and over 1.8 times faster speed than MobileNet and MobileNetV2 on NVIDIA TX2. Meanwhile, PeleeNet is only 66% of the model size of MobileNet. We then propose a real-time object detection system by combining PeleeNet with Single Shot MultiBox Detector (SSD) method and optimizing the architecture for fast speed. Our proposed detection system, named Pelee, achieves 76.4% mAP (mean average precision) on PASCAL VOC2007 and 22.4 mAP on MS COCO dataset at the speed of 23.6 FPS on iPhone 8 and 125 FPS on NVIDIA TX2. The result on COCO outperforms YOLOv2 in consideration of a higher precision, 13.6 times lower computational cost and 11.3 times smaller model size.

Summary

The paper introduces PeleeNet, an efficient CNN architecture for mobile devices that is built from conventional convolutions rather than depthwise separable ones.
PeleeNet achieves 72.6% top-1 accuracy on ImageNet with 508 million FLOPs and 2.8 million parameters, a higher accuracy than MobileNet at 66% of its model size.
Pelee combines PeleeNet with the SSD framework, reaching 76.4% mAP on PASCAL VOC 2007 and 22.4 mAP on MS COCO at 23.6 FPS on iPhone 8.

Pelee: A Real-Time Object Detection System on Mobile Devices

The paper "Pelee: A Real-Time Object Detection System on Mobile Devices" presents an approach to making Convolutional Neural Networks (CNNs) efficient enough for deployment on mobile devices. Given the tight memory and compute budgets of mobile environments, the authors introduce PeleeNet, which addresses a limitation of existing architectures such as MobileNet and ShuffleNet: their reliance on depthwise separable convolutions, which lack efficient implementations in most deep learning frameworks.

PeleeNet uses conventional convolutional layers and adds several architectural changes: two-way dense layers, bottleneck layers whose channel count is not fixed, and a transition layer that does not compress the feature maps. PeleeNet also uses post-activation (convolution, then batch normalization, then ReLU), so that at inference time each batch normalization layer can be merged into the convolution that precedes it, which speeds up inference.
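The merge of batch normalization into a convolution can be sketched in a few lines. This is a minimal NumPy illustration of the standard folding identity, not code from the paper: the function names, the tensor layout `(out_channels, in_channels, kh, kw)`, the `eps` value, and the use of a 1x1 convolution for the check are all assumptions made for the example.

```python
import numpy as np

def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold an inference-time batch norm into the convolution that precedes it.

    weight has shape (out_ch, in_ch, kh, kw); every other argument has shape (out_ch,).
    Returns a new (weight, bias) pair so that conv(x) alone equals bn(conv(x)).
    """
    scale = gamma / np.sqrt(var + eps)          # per-output-channel multiplier
    folded_weight = weight * scale[:, None, None, None]
    folded_bias = (bias - mean) * scale + beta
    return folded_weight, folded_bias

def conv1x1(x, weight, bias):
    """1x1 convolution on a single image x of shape (in_ch, H, W)."""
    return np.einsum("oi,ihw->ohw", weight[:, :, 0, 0], x) + bias[:, None, None]

def batchnorm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch norm using fixed running statistics."""
    shape = (-1, 1, 1)
    normalized = (x - mean.reshape(shape)) / np.sqrt(var.reshape(shape) + eps)
    return gamma.reshape(shape) * normalized + beta.reshape(shape)

def max_folding_error(seed=0):
    """Largest absolute difference between conv+BN and the folded conv."""
    rng = np.random.default_rng(seed)
    in_ch, out_ch = 8, 16                       # arbitrary sizes for the check
    x = rng.normal(size=(in_ch, 5, 5))
    weight = rng.normal(size=(out_ch, in_ch, 1, 1))
    bias = rng.normal(size=out_ch)
    gamma, beta = rng.normal(size=out_ch), rng.normal(size=out_ch)
    mean, var = rng.normal(size=out_ch), rng.uniform(0.5, 2.0, size=out_ch)

    reference = batchnorm(conv1x1(x, weight, bias), gamma, beta, mean, var)
    folded_w, folded_b = fold_batchnorm(weight, bias, gamma, beta, mean, var)
    return float(np.max(np.abs(reference - conv1x1(x, folded_w, folded_b))))
```

Folding works here because batch normalization sits directly after the convolution and, with fixed running statistics, is a per-channel affine map. With pre-activation ordering the normalization comes before the convolution with a ReLU in between, so the same merge does not apply.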

On the ImageNet ILSVRC 2012 benchmark, PeleeNet achieves a top-1 accuracy of 72.6% with 508 million FLOPs and 2.8 million parameters. The model is 66% of the size of MobileNet and runs over 1.8 times faster than MobileNet and MobileNetV2 on the NVIDIA TX2 platform.

The paper then builds an object detection system by combining PeleeNet with the Single Shot MultiBox Detector (SSD) and optimizing the architecture for speed. This system, named Pelee, achieves 76.4% mean Average Precision (mAP) on PASCAL VOC 2007 and 22.4 mAP on MS COCO while running at 23.6 FPS on iPhone 8 and 125 FPS on NVIDIA TX2.

Key Contributions:

Two-Way Dense Layer: Inspired by GoogLeNet, PeleeNet uses a two-way dense layer to obtain receptive fields at more than one scale, which helps it learn visual patterns for large objects.
Stem Block and Bottleneck Channels: Inspired by DSOD, a stem block is placed before the first dense layer to strengthen feature representation. The number of channels in the bottleneck layer varies with the input shape rather than following a fixed expansion factor, saving up to 28.5% of the computational cost with negligible accuracy loss.
Transition and Composite Layers: The transition layer avoids compression, retaining feature representation strength, and composite layers utilize post-activation for increased inference speed.
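The channel and receptive-field bookkeeping behind the first two contributions can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation: it assumes one branch with a single 3x3 convolution and one with two stacked 3x3 convolutions, each preceded by a 1x1 bottleneck and each producing half the growth rate. The rule in `capped_bottleneck_channels` is a hypothetical stand-in for "varies with the input shape", and all function names are invented for the example.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for kernel in kernel_sizes:
        field += kernel - 1
    return field

def two_way_output_channels(in_channels, growth_rate):
    """Channels after concatenating the input with both branch outputs.

    ASSUMPTION: each branch produces growth_rate // 2 channels.
    """
    per_branch = growth_rate // 2
    return in_channels + 2 * per_branch

def bottleneck_macs(in_channels, bottleneck_channels, height, width):
    """Multiply-accumulates of a 1x1 bottleneck convolution on an HxW map."""
    return in_channels * bottleneck_channels * height * width

def capped_bottleneck_channels(in_channels, growth_rate, expansion=4):
    """HYPOTHETICAL rule: never make the bottleneck wider than its input."""
    return min(expansion * growth_rate, in_channels)

def bottleneck_saving(in_channels, growth_rate, height, width, expansion=4):
    """Fraction of 1x1-bottleneck MACs saved by the capped rule
    compared with a fixed width of expansion * growth_rate."""
    fixed = bottleneck_macs(in_channels, expansion * growth_rate, height, width)
    capped = bottleneck_macs(
        in_channels,
        capped_bottleneck_channels(in_channels, growth_rate, expansion),
        height,
        width,
    )
    return 1 - capped / fixed
```

For instance, `stacked_receptive_field([1, 3])` and `stacked_receptive_field([1, 3, 3])` give 3 and 5, which is the sense in which the two branches see different scales. `bottleneck_saving` covers only the 1x1 layer of a single dense layer, so it is not expected to reproduce the 28.5% whole-network figure quoted above.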

Ablation studies on the Stanford Dogs dataset assess these components systematically and show PeleeNet reaching higher accuracy than DenseNet baselines at lower computational cost.

Implications:

The results show that alternatives to depthwise separable convolutions can deliver competitive accuracy and speed on constrained hardware. This supports real-time deployment of CNNs in settings such as autonomous driving, robotics, and mobile applications, where computational budget and energy consumption are critical factors.

Future Outlook:

Running Pelee on embedded devices suggests a viable path for practical AI on resource-constrained platforms. Future research could further streamline the architecture, extend the model to a broader range of applications, and optimize it for diverse hardware accelerators.
