- The paper introduces the Learnable Spatio-Temporal Sampling (LSTS) module to robustly align high-level features across video frames, enhancing detection accuracy.
- It employs Sparse Recursive Feature Update (SRFU) to update temporal relations and Dense Feature Aggregation (DFA) to enrich non-keyframe details using keyframe data.
- The framework runs in real time (21–23 FPS) on ImageNet VID while reducing model parameters from approximately 100M to 65M.
Learning Where to Focus for Efficient Video Object Detection
The paper "Learning Where to Focus for Efficient Video Object Detection" addresses the complexities and inefficiencies inherent in adapting image-based object detection frameworks to video sequences. Object detection in video imposes additional challenges such as motion blur, object occlusion, and rare poses. Traditional methods, which focus on frame-by-frame analysis, fail to harness the potential of temporal information inherent in video sequences.
The main contribution of this paper is a Learnable Spatio-Temporal Sampling (LSTS) module that learns spatial correspondences across video frames. Unlike conventional optical-flow warping, which propagates features using pixel-level flow, LSTS aligns high-level semantic features directly. This distinction matters because pixel-level motion estimates do not transfer cleanly to high-level feature maps, and warping with them introduces alignment errors.
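To make the feature-level alignment concrete, below is a minimal PyTorch-style sketch of similarity-weighted aggregation over a set of sampling offsets; the function name, tensor shapes, and the choice of one offset set shared across all positions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def aligned_aggregation(feat_cur, feat_ref, offsets):
    """Similarity-weighted aggregation of reference features at sampled offsets.

    feat_cur: (B, C, H, W) features of the current (query) frame.
    feat_ref: (B, C, H, W) features of the reference / memory frame.
    offsets:  (N, 2) sampling offsets in normalized [-1, 1] coordinates,
              shared across spatial positions (a simplification).
    """
    B, C, H, W = feat_cur.shape
    device = feat_cur.device

    # Base sampling grid in normalized coordinates, shape (H, W, 2) ordered (x, y).
    ys = torch.linspace(-1.0, 1.0, H, device=device)
    xs = torch.linspace(-1.0, 1.0, W, device=device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    base_grid = torch.stack((gx, gy), dim=-1)

    sims, samples = [], []
    for n in range(offsets.shape[0]):
        # Shift the grid by the n-th offset and bilinearly sample the reference features.
        grid = (base_grid + offsets[n]).unsqueeze(0).expand(B, H, W, 2)
        sampled = F.grid_sample(feat_ref, grid, align_corners=True)       # (B, C, H, W)
        # Dot-product similarity between the query feature and the sampled feature.
        sims.append((feat_cur * sampled).sum(dim=1, keepdim=True) / C ** 0.5)
        samples.append(sampled)

    weights = torch.softmax(torch.cat(sims, dim=1), dim=1)                # (B, N, H, W)
    samples = torch.stack(samples, dim=1)                                 # (B, N, C, H, W)
    # Each position takes a similarity-weighted mix of the sampled reference features.
    return (weights.unsqueeze(2) * samples).sum(dim=1)                    # (B, C, H, W)
```

The key point is that alignment comes from feature similarity at sampled locations rather than from an explicit pixel-level flow field.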
In practice, LSTS initializes a set of sampling locations on the feature maps and iteratively refines them during training, guided by the detection loss computed after feature aggregation, which yields progressively more precise alignment between frames. Because the sampling pattern is learned rather than hand-crafted or dictated by an external flow network, the approach also brings significant gains in computational efficiency.
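Assuming the offsets from the sketch above are ordinary learnable parameters, a rough training-step illustration shows how the detection loss alone can refine them; the optimizer choice, initialization scale, and loss callback are placeholders.

```python
import torch

# Hypothetical setup: the sampling offsets used by aligned_aggregation() above are
# plain learnable parameters, initialized as a small local neighborhood and refined
# only by the detector's loss (no flow or sampling supervision). In a full model the
# backbone and detector parameters would be optimized jointly.
num_samples = 9
offsets = torch.nn.Parameter(0.05 * torch.randn(num_samples, 2))
optimizer = torch.optim.SGD([offsets], lr=1e-3)

def training_step(feat_cur, feat_ref, detection_loss_fn):
    """One illustrative update: gradients from the detection loss move the offsets."""
    aligned = aligned_aggregation(feat_cur, feat_ref, offsets)
    loss = detection_loss_fn(aligned)   # loss of the detection head on aligned features
    optimizer.zero_grad()
    loss.backward()                     # grid_sample is differentiable w.r.t. the grid,
    optimizer.step()                    # so the sampling locations themselves get updated
    return float(loss.detach())
```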
Additionally, the paper introduces the Sparse Recursive Feature Update (SRFU) and Dense Feature Aggregation (DFA) modules. SRFU updates the temporal relations carried between keyframes, conserving computation by restricting expensive processing to sparse keyframes, while DFA propagates this keyframe information to non-keyframes, enhancing their features and improving detection precision at low cost.
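A simplified inference loop, reusing the aligned_aggregation helper from the earlier sketch, illustrates how sparse keyframe updates (SRFU) and per-frame aggregation (DFA) might fit together; the backbone split, stride value, matching channel widths, and residual combinations are all simplifications rather than the paper's exact design.

```python
import torch

def run_video(frames, backbone_low, backbone_high, detector,
              srfu_offsets, dfa_offsets, keyframe_stride=10):
    """Keyframe-scheduled inference loop, sketched from the paper's description.

    backbone_low / backbone_high split the feature extractor so that only keyframes
    pay for the expensive high-level stage; srfu_offsets and dfa_offsets are two
    separate learnable offset sets as in the earlier training sketch.
    """
    memory = None
    detections = []
    for t, frame in enumerate(frames):
        low = backbone_low(frame)                  # cheap features, every frame
        if t % keyframe_stride == 0:               # keyframe: run the costly stage
            high = backbone_high(low)
            if memory is None:
                memory = high
            else:
                # SRFU: align the previous memory to the new keyframe, then update it.
                memory = aligned_aggregation(high, memory, srfu_offsets) + high
        # DFA: propagate the keyframe memory onto the current frame's features.
        enhanced = aligned_aggregation(low, memory, dfa_offsets) + low
        detections.append(detector(enhanced))
    return detections
```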
These techniques enable the model to achieve state-of-the-art performance on the ImageNet VID dataset while surpassing previous methods in both speed and model size. The proposed framework delivers notable accuracy gains with fewer parameters and runs in real time at roughly 21–23 frames per second. This efficiency matters for applications with tight compute and latency budgets, such as autonomous vehicles and real-time surveillance systems.
The results indicate that replacing optical-flow warping with LSTS not only improves detection accuracy but also significantly reduces computational overhead, cutting model parameters from approximately 100M to about 65M in one configuration. The experiments further show that the learned sampling locations outperform the hand-crafted sampling designs commonly used in the field.
In conclusion, this work advances video object detection with an approach that accurately aligns features across frames, strengthens temporal relation modeling, and economizes computation. Future work could integrate LSTS and its companion modules into other time-dependent video-analysis tasks beyond object detection. Overall, the method strengthens the case for real-time video analytics on constrained hardware.