Memory-Guided Mobile Video Object Detection: An In-Depth Analysis
The paper "Looking Fast and Slow: Memory-Guided Mobile Video Object Detection" introduces a novel framework for video object detection that aims to optimize the performance on mobile devices, where computational resources are notably limited. The authors propose an interleaved approach utilizing both lightweight and conventional feature extractors, combined with a memory module, to efficiently handle video streams for object detection tasks.
Overview of Methodology
The guiding principle comes from the human visual system's ability to recognize the "gist" of a scene at a glance. The authors operationalize this idea by interleaving a lightweight feature extractor, which quickly captures the general essence of a frame, with a heavier feature extractor that provides the detailed, robust features needed for precise detection.
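To make the interleaving concrete, the sketch below shows the simplest variant, in which the heavy extractor refreshes the memory on a fixed schedule and the light extractor handles the frames in between. This is a minimal illustration, not the authors' implementation: `heavy_extractor`, `light_extractor`, `memory`, `detector`, and the interval `tau` are hypothetical placeholders, and the fixed schedule can be replaced by a learned policy (see the adaptive inference section below).

```python
# Minimal sketch of interleaved inference with a fixed schedule.
# All objects below (heavy_extractor, light_extractor, memory, detector)
# are hypothetical placeholders, not the paper's actual API.

def detect_video(frames, heavy_extractor, light_extractor, memory, detector, tau=10):
    """Run the heavy extractor every `tau` frames and the light one otherwise,
    fusing both feature streams through a shared memory before detection."""
    state = memory.initial_state()
    detections = []
    for t, frame in enumerate(frames):
        # Periodically refresh the memory with detailed features; use cheap
        # gist-level features on all other frames.
        features = heavy_extractor(frame) if t % tau == 0 else light_extractor(frame)
        fused, state = memory.update(features, state)  # ConvLSTM-style fusion
        detections.append(detector(fused))             # detection head reads fused features
    return detections
```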
The model employs a memory module that fuses features from these disparate extractors into a unified representation, enriching the information available at each detection step. This is achieved with a modified ConvLSTM layer that maintains spatial and temporal context across frames, building a dynamic, adaptive visual memory.
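The memory itself can be pictured as a ConvLSTM cell. The PyTorch sketch below implements the standard cell for illustration only; the paper's actual layer adds speed-oriented modifications that are omitted here, and the sketch assumes features from both extractors have been projected to a common channel count.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Standard ConvLSTM cell, shown for illustration; the paper's memory
    module modifies this design for speed (details omitted here)."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        # A single convolution produces all four gate pre-activations at once.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        # Gates are computed from the incoming features and the previous
        # hidden state, preserving spatial layout via convolution.
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)  # h doubles as the fused feature map for detection
```

Each frame's features, whether from the light or the heavy extractor, are written into this state, so the hidden map carries detail from the last heavy frame forward through the cheap frames.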
Key Numerical Results
The framework achieves state-of-the-art performance among mobile methods on the ImageNet VID 2015 dataset while running at over 70 FPS on a Pixel 3 phone, demonstrating both the efficacy and the real-time applicability of the approach on mobile hardware.
Adaptive Inference Policy
An essential component of the study is the adaptive inference policy, trained with reinforcement learning, which chooses the sequence in which the feature extractors are executed. Using Q-learning, the policy examines the current state of the memory and decides which extractor to run on the next frame, dynamically trading off speed against accuracy.
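A simplified sketch of such a policy follows. The two-action space, flat state featurization, and plain one-step Q-learning update are assumptions made for this summary (the paper's training setup differs in its details), and names such as `InterleavingPolicy` and `q_learning_step` are hypothetical.

```python
import torch
import torch.nn as nn

class InterleavingPolicy(nn.Module):
    """Tiny Q-network: maps a flat summary of the memory state to Q-values
    over two actions (0 = run heavy extractor, 1 = run light extractor)."""

    def __init__(self, state_dim, hidden_dim=64, num_actions=2):
        super().__init__()
        self.q = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, state):
        return self.q(state)

def select_action(policy, state, epsilon=0.1):
    # Epsilon-greedy during training; act greedily at inference time.
    if torch.rand(()) < epsilon:
        return torch.randint(0, 2, ()).item()
    return policy(state).argmax().item()

def q_learning_step(policy, optimizer, state, action, reward, next_state, gamma=0.9):
    # Plain one-step temporal-difference update; the reward would penalize
    # running the heavy extractor while rewarding detection quality.
    q_sa = policy(state)[action]
    with torch.no_grad():
        target = reward + gamma * policy(next_state).max()
    loss = (q_sa - target).pow(2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In use, the policy would be queried once per frame, e.g. `action = select_action(policy, state_summary)`, with `state_summary` derived from the memory's hidden and cell states.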
Implications and Future Directions
Practically, this approach marks a significant advance for deploying real-time object detection on mobile devices, with direct relevance to automated surveillance, augmented reality, and mobile robotics, where resource constraints are a central concern.
Theoretically, the paper opens avenues for exploring memory-guided processes by integrating biological insights into artificial systems. The idea that lightweight, memory-informed extractors can perform complex tasks without significant computational costs challenges current paradigms in neural network design and may stimulate further research into adaptive memory architectures.
Future work could investigate more advanced reinforcement learning techniques to further refine the adaptive policy, and explore extending the framework to video analysis tasks beyond object detection.
Conclusion
"Looking Fast and Slow: Memory-Guided Mobile Video Object Detection" stands as a pivotal contribution to computer vision for mobile environments. By effectively interleaving memory-informed extractors and reinforcing adaptive methodology, the authors have demonstrated an innovative path forward for resource-efficient and accurate real-time video object detection. This work not only advances practical capabilities but also enriches theoretical understanding, offering critical insights that could influence broader AI research trajectories.