- The paper presents a novel recurrent-convolutional network that integrates convolutional LSTM layers within an SSD framework to harness temporal cues for improved video object detection.
- The proposed Bottleneck-LSTM layer and extended width multiplier design reduce computational cost while achieving real-time speeds of up to 15 FPS on mobile CPUs.
- This efficient architecture broadens mobile application potential by enabling continuous object detection for augmented reality, robotics, and real-time analytics.
Overview of Mobile Video Object Detection with Temporally-Aware Feature Maps
The paper "Mobile Video Object Detection with Temporally-Aware Feature Maps" by Mason Liu and Menglong Zhu addresses the challenge of real-time object detection in video sequences on resource-constrained mobile and embedded devices. The authors propose a novel interweaved recurrent-convolutional network architecture combining single-image object detection with convolutional Long Short-Term Memory (LSTM) layers to leverage temporal cues inherent in video.
Architectural Innovations
The paper introduces key innovations in network architecture:
- Integration of Convolutional LSTMs: The authors interleave convolutional LSTM layers within the SSD (Single Shot MultiBox Detector) framework to propagate temporal context across frames while preserving efficiency. This lets the network refine its feature maps as it processes consecutive video frames, while remaining a single-shot detector.
- Bottleneck-LSTM Layer: To cut the recurrent layers' computational demands, a new Bottleneck-LSTM layer is proposed that first compresses the concatenated input and hidden state to a narrower bottleneck and then computes all gates with depthwise separable convolutions. This makes the recurrent layers substantially cheaper than standard convolutional LSTMs and feasible for deployment on mobile devices (see the sketch after this list).
- Extended Width Multiplier Design: The single MobileNet-style channel width multiplier is split into separate multipliers for the base network, the SSD layers, and the LSTM layers, so each component's channel counts can be tuned independently for the best accuracy/compute trade-off.
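The following PyTorch sketch is one interpretation of the Bottleneck-LSTM design, not the released implementation: the concatenated input and state are compressed once to the narrower output width, and all four gates are derived from that cheap bottleneck tensor. The gate activations here are the standard sigmoid/tanh (the paper tunes such details further), and the per-component multiplier values at the end reflect our reading of the paper's recipe and should be treated as an assumption.

```python
import torch
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, kernel_size=3):
    """Depthwise spatial convolution followed by a 1x1 pointwise projection."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size,
                  padding=kernel_size // 2, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, 1),
    )

class BottleneckLSTMCell(nn.Module):
    """Bottleneck-LSTM sketch: compress [x, h] once to the narrow output
    width, then compute all four gates from that bottleneck tensor."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.bottleneck = depthwise_separable(in_channels + out_channels,
                                              out_channels)
        self.gates = depthwise_separable(out_channels, 4 * out_channels)

    def forward(self, x, state):
        h, c = state
        b = torch.relu(self.bottleneck(torch.cat([x, h], dim=1)))
        i, f, g, o = torch.chunk(self.gates(b), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

# Extended width multipliers: one alpha per component (assumed values:
# base keeps alpha, SSD layers use 0.5*alpha, LSTM layers use 0.25*alpha).
alpha = 1.0
alpha_base, alpha_ssd, alpha_lstm = alpha, 0.5 * alpha, 0.25 * alpha
in_ch = int(1024 * alpha_base)   # last base-network feature map
out_ch = int(1024 * alpha_lstm)  # much narrower recurrent state

cell = BottleneckLSTMCell(in_ch, out_ch)
x = torch.randn(1, in_ch, 10, 10)
h = c = torch.zeros(1, out_ch, 10, 10)
refined, (h, c) = cell(x, (h, c))  # refined: (1, 256, 10, 10)
```

Because the gates operate on the narrow bottleneck rather than the full input width, and every convolution is depthwise separable, the per-frame cost of the recurrent layer stays small relative to the detector itself.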
Performance and Comparison
The proposed architecture is evaluated on the ImageNet VID 2015 dataset. The model achieves real-time inference speeds of up to 15 FPS on a mobile CPU while remaining more efficient than existing single-frame models, and it delivers significant accuracy improvements over comparable single-frame architectures such as MobileNet-SSD at similar computational budgets.
Results:
- Accuracy: Adding convolutional LSTM layers lets the lightweight model reach accuracy comparable to more computationally intensive single-frame models.
- Efficiency: The model uses fewer parameters and multiply-add operations than comparable baselines, confirming the efficiency of the Bottleneck-LSTM design.
Implications
The development of such an architecture holds both practical and theoretical significance:
- Practical Impact: Real-time processing on mobile devices opens new avenues for applications that need continuous object detection, such as augmented reality, robotics, and real-time video analytics.
- Theoretical Contributions: The introduction of temporally-aware feature maps provides valuable insights into enhancing object detection models with sequentially dependent data.
Future Directions
This work sets the groundwork for further exploration in video-based applications of efficient neural networks. Potential avenues for future research include:
- Advanced Temporal Integration: Extending the temporal architecture to capture longer sequences or to integrate richer multi-frame context.
- Broader Application Spectrum: Transferring the principles to other domains requiring temporal feature exploitation, such as activity recognition or anomaly detection in surveillance.
In conclusion, the paper presents a robust framework that improves the efficiency of video object detection by exploiting temporal continuity. Its methodological advances and experimental insights contribute substantially to the domain of mobile and embedded AI applications.