
Flow-Guided Feature Aggregation for Video Object Detection (1703.10025v2)

Published 29 Mar 2017 in cs.CV

Abstract: Extending state-of-the-art object detectors from image to video is challenging. The accuracy of detection suffers from degenerated object appearances in videos, e.g., motion blur, video defocus, rare poses, etc. Existing work attempts to exploit temporal information on box level, but such methods are not trained end-to-end. We present flow-guided feature aggregation, an accurate and end-to-end learning framework for video object detection. It leverages temporal coherence on feature level instead. It improves the per-frame features by aggregation of nearby features along the motion paths, and thus improves the video recognition accuracy. Our method significantly improves upon strong single-frame baselines in ImageNet VID, especially for more challenging fast moving objects. Our framework is principled, and on par with the best engineered systems winning the ImageNet VID challenges 2016, without additional bells-and-whistles. The proposed method, together with Deep Feature Flow, powered the winning entry of ImageNet VID challenges 2017. The code is available at https://github.com/msracver/Flow-Guided-Feature-Aggregation.

Flow-Guided Feature Aggregation for Video Object Detection

The paper "Flow-Guided Feature Aggregation for Video Object Detection" addresses the challenge of extending object detection capabilities from still images to video sequences. The primary difficulty in this domain arises from the deterioration of object appearances due to motion blur, defocus, and rare poses present in video data. Traditional approaches have struggled with these problems, often relying on temporal information at the bounding box level without achieving end-to-end training. This paper introduces a novel method called Flow-Guided Feature Aggregation (FGFA), which improves video object detection by leveraging temporal coherence at the feature level.

Methodology and Framework

FGFA harnesses the redundant temporal information across video frames to enhance the feature quality of each frame. The process involves two critical components: motion-guided spatial warping and feature aggregation. Motion between frames is estimated with an optical flow network (FlowNet), and the feature maps of nearby frames are warped along the estimated motion paths so that they align spatially with the reference frame. The aligned features are then aggregated with adaptive weights that emphasize frames contributing beneficial information, which particularly improves the detection of fast-moving objects. The whole pipeline is fully differentiable and trained end-to-end, setting it apart from conventional box-level post-processing methods.
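As a concrete illustration, the warping step can be sketched with bilinear sampling as below. This is a minimal PyTorch sketch with hypothetical function names, not the paper's released implementation; it assumes the flow field is given in pixel units at feature-map resolution.

```python
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Warp a neighbor frame's feature map (N, C, H, W) toward the
    reference frame using a flow field (N, 2, H, W) in pixel units.
    Bilinear sampling via grid_sample keeps the step differentiable."""
    n, _, h, w = feat.shape
    # Base grid of absolute pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=feat.dtype, device=feat.device),
        torch.arange(w, dtype=feat.dtype, device=feat.device),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]  # x coordinates displaced by flow
    grid_y = ys.unsqueeze(0) + flow[:, 1]  # y coordinates displaced by flow
    # Normalize to [-1, 1], the coordinate convention grid_sample expects.
    grid = torch.stack((2.0 * grid_x / (w - 1) - 1.0,
                        2.0 * grid_y / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(feat, grid, align_corners=True)
```

In the full pipeline, this warping would be applied to every neighbor frame inside the temporal window before the aggregation step described below.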

Numerical Results and Improvements

The FGFA framework demonstrates strong results when tested on the ImageNet VID dataset. Quantitative evaluation shows significant improvement in mean Average Precision (mAP) scores compared to strong single-frame baselines. For instance, using a ResNet-101 feature network, FGFA achieves an overall mAP of 76.3%, with a notable enhancement for fast-moving objects (57.6% mAP) compared to the 51.4% mAP of the baseline.

By combining optical flow guidance with adaptive weighting, FGFA improves detection performance across all motion categories (slow, medium, and fast), with the largest gains on fast-moving objects, whose appearance degrades most between frames. The weighting scheme assigns each warped frame a per-pixel weight based on how similar its features are to those of the reference frame, so well-aligned, informative frames contribute more to the aggregated feature. These gains translate into higher-quality detections and more reliable per-frame recognition.
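The adaptive weighting can be sketched as follows: cosine similarity between a small embedding of each warped neighbor and of the reference frame, normalized by a softmax over time, yields per-pixel weights. This is a simplified, hypothetical illustration of the scheme described in the paper, not the released code; the embedding subnetwork that produces these features is assumed to exist elsewhere.

```python
import torch
import torch.nn.functional as F

def adaptive_weights(ref_embed, warped_embeds):
    """Per-pixel aggregation weights. ref_embed: (C, H, W) embedding of
    the reference frame; warped_embeds: (T, C, H, W) embeddings of T
    warped neighbor frames. Cosine similarity along the channel axis is
    normalized with a softmax over the T frames, so the weights at each
    spatial location sum to one."""
    ref = F.normalize(ref_embed, dim=0)                # unit-norm channels
    nbrs = F.normalize(warped_embeds, dim=1)
    similarity = (nbrs * ref.unsqueeze(0)).sum(dim=1)  # (T, H, W) cosine maps
    return torch.softmax(similarity, dim=0)

def aggregate(warped_feats, weights):
    """Weighted temporal sum: (T, C, H, W) warped features combined with
    (T, H, W) weights into a single (C, H, W) aggregated feature map."""
    return (warped_feats * weights.unsqueeze(1)).sum(dim=0)
```

The aggregated feature map then replaces the single-frame feature before the detection head, so frames that are well aligned and visually consistent with the reference dominate the sum.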

Implications and Future Directions

FGFA advances video object detection by emphasizing feature quality within an end-to-end learning framework. This principled approach distinguishes itself by aggregating information at the feature level rather than relying solely on post-processing techniques at the bounding box level.

Practically, this method could benefit video analysis applications where accurate detection of objects in motion is crucial, such as surveillance, autonomous driving, and activity recognition; pairing it with faster flow networks could bring it closer to real-time use. Theoretically, it opens new avenues for research into more efficient and accurate video understanding models.

Future research may explore lightweight flow networks for improved computational efficiency, larger annotated datasets for enhanced model robustness, and advances in adaptive memory mechanisms for further performance gains. Such developments could further strengthen the role of video object detection in emerging AI applications.

Authors (5)
  1. Xizhou Zhu (73 papers)
  2. Yujie Wang (103 papers)
  3. Jifeng Dai (131 papers)
  4. Lu Yuan (130 papers)
  5. Yichen Wei (47 papers)
Citations (598)