
ThunderNet: Towards Real-time Generic Object Detection (1903.11752v3)

Published 28 Mar 2019 in cs.CV

Abstract: Real-time generic object detection on mobile platforms is a crucial but challenging computer vision task. However, previous CNN-based detectors suffer from enormous computational cost, which hinders them from real-time inference in computation-constrained scenarios. In this paper, we investigate the effectiveness of two-stage detectors in real-time generic detection and propose a lightweight two-stage detector named ThunderNet. In the backbone part, we analyze the drawbacks in previous lightweight backbones and present a lightweight backbone designed for object detection. In the detection part, we exploit an extremely efficient RPN and detection head design. To generate more discriminative feature representation, we design two efficient architecture blocks, Context Enhancement Module and Spatial Attention Module. At last, we investigate the balance between the input resolution, the backbone, and the detection head. Compared with lightweight one-stage detectors, ThunderNet achieves superior performance with only 40% of the computational cost on PASCAL VOC and COCO benchmarks. Without bells and whistles, our model runs at 24.1 fps on an ARM-based device. To the best of our knowledge, this is the first real-time detector reported on ARM platforms. Our code and models are available at \url{https://github.com/qinzheng93/ThunderNet}.

Authors (7)
  1. Zheng Qin (58 papers)
  2. Zeming Li (53 papers)
  3. Zhaoning Zhang (11 papers)
  4. Yiping Bao (8 papers)
  5. Gang Yu (114 papers)
  6. Yuxing Peng (22 papers)
  7. Jian Sun (415 papers)
Citations (252)

Summary

Review of ThunderNet: Towards Real-time Generic Object Detection on Mobile Devices

The paper "ThunderNet: Towards Real-time Generic Object Detection on Mobile Devices" addresses a critical challenge in computer vision - the real-time detection of objects on mobile platforms, which are constrained in terms of computational resources. The research suggests an innovative approach through a two-stage lightweight detection framework that resonates significantly with the needs of mobile environments.

Methodology and Architectural Innovations

The paper challenges the conventional reliance on one-stage detectors for mobile and real-time applications, which, while efficient, often sacrifice accuracy because of their coarse predictions. Instead, ThunderNet adopts a lightweight two-stage architecture. Its backbone, SNet, is derived from ShuffleNetV2 and tailored to detection: it incorporates 5x5 depthwise convolutions (in place of the usual 3x3) to expand the receptive field without incurring substantial computational overhead.
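To make the backbone idea concrete, below is a minimal PyTorch sketch of a ShuffleNetV2-style unit with the depthwise convolution widened to 5x5, in the spirit of SNet. The channel split, shuffle, and layer ordering follow the standard ShuffleNetV2 basic unit; this is an illustrative approximation rather than the paper's exact SNet configuration (the SNet variants' stage widths and depths are not reproduced here).

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    # Interleave channels across groups, as in ShuffleNetV2.
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class SNetUnit(nn.Module):
    """ShuffleNetV2-style basic unit (stride 1) with a widened 5x5 depthwise conv.
    Assumes an even channel count; channel sizes are illustrative."""
    def __init__(self, channels):
        super().__init__()
        branch_ch = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(branch_ch, branch_ch, 1, bias=False),
            nn.BatchNorm2d(branch_ch),
            nn.ReLU(inplace=True),
            # 5x5 depthwise conv enlarges the receptive field at little extra cost.
            nn.Conv2d(branch_ch, branch_ch, 5, padding=2, groups=branch_ch, bias=False),
            nn.BatchNorm2d(branch_ch),
            nn.Conv2d(branch_ch, branch_ch, 1, bias=False),
            nn.BatchNorm2d(branch_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Split channels: one half passes through untouched, the other is transformed.
        x1, x2 = x.chunk(2, dim=1)
        out = torch.cat([x1, self.branch(x2)], dim=1)
        return channel_shuffle(out)
```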

The detection part of ThunderNet pairs a compact Region Proposal Network (RPN) with a slimmed-down detection head. These choices target the imbalance seen in prior models, where a minimal backbone is paired with a computationally heavy detection head, wasting computation and inviting overfitting.
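As a rough illustration of what such a compact RPN can look like, the sketch below applies a depthwise 5x5 convolution followed by a 1x1 projection before the usual objectness and box-regression heads, consistent with the efficiency-oriented design the review describes. The specific channel and anchor counts are illustrative assumptions, not the paper's exact settings.

```python
import torch.nn as nn

class LightweightRPN(nn.Module):
    """Compact RPN sketch: depthwise 5x5 conv + 1x1 projection, then
    objectness and box-regression heads. Channel/anchor counts are assumptions."""
    def __init__(self, in_ch=245, mid_ch=256, num_anchors=25):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 5, padding=2, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )
        self.cls = nn.Conv2d(mid_ch, num_anchors, 1)      # objectness per anchor
        self.reg = nn.Conv2d(mid_ch, num_anchors * 4, 1)  # box deltas per anchor

    def forward(self, feat):
        x = self.conv(feat)
        # Also return the intermediate feature so an attention module can reuse it.
        return self.cls(x), self.reg(x), x
```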

Key to ThunderNet's architecture are two novel modules: the Context Enhancement Module (CEM) and the Spatial Attention Module (SAM). CEM aggregates multi-scale features to enlarge the effective receptive field and produce richer feature representations, while SAM reuses the RPN's intermediate features to reweight the feature map spatially, accentuating foreground regions and suppressing background noise.
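The following sketch captures the intent of both modules under stated assumptions: CEM fuses a local feature map, an upsampled coarser map, and a globally pooled context vector, while SAM turns the RPN feature into a sigmoid gate applied to the CEM output. Channel counts and layer choices are illustrative, not the paper's exact specification.

```python
import torch.nn as nn
import torch.nn.functional as F

class ContextEnhancementModule(nn.Module):
    """CEM sketch: merge a local feature, a coarser feature, and a globally
    pooled feature so the detector sees both local and scene-level context."""
    def __init__(self, c4_ch, c5_ch, out_ch=245):
        super().__init__()
        self.lat_c4 = nn.Conv2d(c4_ch, out_ch, 1)
        self.lat_c5 = nn.Conv2d(c5_ch, out_ch, 1)
        self.lat_glb = nn.Conv2d(c5_ch, out_ch, 1)

    def forward(self, c4, c5):
        glb = F.adaptive_avg_pool2d(c5, 1)  # global context vector (N, C, 1, 1)
        up_c5 = F.interpolate(self.lat_c5(c5), size=c4.shape[-2:], mode='nearest')
        # The global term broadcasts over the spatial dimensions.
        return self.lat_c4(c4) + up_c5 + self.lat_glb(glb)

class SpatialAttentionModule(nn.Module):
    """SAM sketch: reuse the RPN's intermediate feature to produce a spatial
    gate that emphasizes likely-foreground locations in the CEM feature."""
    def __init__(self, rpn_ch=256, feat_ch=245):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(rpn_ch, feat_ch, 1, bias=False),
            nn.BatchNorm2d(feat_ch),
            nn.Sigmoid(),
        )

    def forward(self, cem_feat, rpn_feat):
        return cem_feat * self.gate(rpn_feat)  # elementwise spatial reweighting
```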

Performance Metrics and Results

ThunderNet outperforms previous lightweight one-stage detectors on benchmark datasets such as PASCAL VOC and MS COCO, striking a strong balance between accuracy and efficiency. In particular, it runs in real time at 24.1 frames per second on an ARM-based device while reaching 19.2 AP on COCO, a notable result given the computational constraints inherent to mobile platforms.

ThunderNet's architecture also demonstrates a clear efficiency advantage, delivering MobileNet-SSD-level accuracy with only 22% of the FLOPs. This matters because ThunderNet not only matches but in some cases outperforms state-of-the-art one-stage detectors at comparable computational budgets, confirming the viability of two-stage detection in resource-constrained environments.

Implications and Future Prospects

The implications of this research are twofold. Practically, it sets a precedent for efficiently deploying high-performance object detection on mobile platforms, broadening the range of applications that require low-latency detection, such as augmented reality and autonomous navigation in resource-limited settings. Theoretically, it enriches the discussion of lightweight network architectures and suggests new directions for improving feature representation under tight computational budgets.

Future work might further optimize the trade-offs between input resolution, network depth, and detection-head complexity to refine detector efficiency. In addition, more advanced hardware-specific optimizations could push ThunderNet and similar architectures into even broader deployment across diverse mobile and embedded systems.

In conclusion, the paper advances the case for lightweight two-stage detectors by demonstrating the feasibility and advantages of ThunderNet. By reducing computational cost without compromising accuracy, it exemplifies a balanced approach to real-time object detection on mobile platforms.