Temporal ROI Align for Video Object Recognition

- The paper introduces a Temporal ROI Align operator that aggregates multi-frame ROI features to improve detection accuracy.
- It employs a multi-head temporal attention mechanism to selectively integrate support-frame information and mitigate issues such as motion blur.
- Experiments on ImageNet VID show a boost from 74.0 to 80.5 mAP with a ResNet-101 backbone, demonstrating robust performance gains.
The paper "Temporal ROI Align for Video Object Recognition" presents an approach to video object detection that leverages temporal information across frames through a new Temporal ROI Align operator. Video object recognition is challenging because object appearance deteriorates under motion blur, defocus, and occlusion. Existing methods extract region-of-interest (ROI) features from single-frame feature maps and therefore miss the rich temporal information available in video.
Core Contribution
The authors propose a Temporal ROI Align operator that improves feature extraction by incorporating temporal information from multiple frames of a video. The method exploits the similarity between features of the same object instance across consecutive frames: it identifies the most similar ROI features in support frames and integrates them into the current-frame proposal features through a temporal attention mechanism, allowing the operator to capture detailed temporal dynamics.
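The paper's exact formulation is not reproduced here, but the retrieval step can be illustrated with a minimal NumPy sketch: for a pooled ROI feature from the target frame, find the most similar spatial locations in a support-frame feature map by cosine similarity and gather their features. Shapes, the `top_k` parameter, and the function name are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def most_similar_roi_align(target_roi, support_map, top_k=4):
    """Gather the top-K most similar support-frame features for one ROI bin.

    target_roi:  (C,) pooled ROI feature vector from the target frame (assumed shape)
    support_map: (C, H, W) feature map of a support frame (assumed shape)
    Returns (top_k, C): the K most similar support features, best first.
    """
    C, H, W = support_map.shape
    flat = support_map.reshape(C, -1)                      # (C, H*W) candidate locations
    # Cosine similarity between the ROI feature and every spatial location.
    num = target_roi @ flat                                # (H*W,)
    denom = np.linalg.norm(target_roi) * np.linalg.norm(flat, axis=0) + 1e-8
    sim = num / denom
    idx = np.argsort(sim)[-top_k:][::-1]                   # indices of the K best matches
    return flat[:, idx].T                                  # (K, C)
```

In the full operator this gathering is repeated per ROI bin and per support frame, producing a group of "most similar" features to aggregate in the next stage.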
Methodology and Key Features
- MS ROI Align: The Most Similar ROI Align component extracts features from support frame maps for target frame proposals. This involves identifying and leveraging the most similar spatial location features across frames.
- TAFA: Temporal Attentional Feature Aggregation processes the grouped most similar ROI features using a multi-head attention mechanism. This operation strategically weighs features from different frames, emphasizing clear over blurry instances.
- Integration into Video Detectors: The Temporal ROI Align is integrated with both single-frame and state-of-the-art video detectors to consistently and significantly enhance performance.
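The aggregation step described under TAFA can be sketched as per-head scaled dot-product attention, where the target-frame ROI feature acts as the query and the gathered most-similar support features act as keys and values. The shapes, the residual combination, and the function name are assumptions for illustration; the paper's actual module may differ in detail.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention_aggregate(target_feat, support_feats, num_heads=4):
    """Aggregate support features into a target ROI feature with multi-head attention.

    target_feat:   (C,) target-frame ROI feature (query); C divisible by num_heads
    support_feats: (K, C) most-similar support features (keys/values)
    """
    C = target_feat.shape[0]
    d = C // num_heads
    q = target_feat.reshape(num_heads, d)               # (h, d) per-head queries
    k = support_feats.reshape(-1, num_heads, d)         # (K, h, d) per-head keys/values
    # Per-head scaled dot-product scores over the K support features;
    # clearer (more similar) instances receive larger weights.
    scores = np.einsum('hd,khd->hk', q, k) / np.sqrt(d)  # (h, K)
    w = softmax(scores, axis=-1)
    agg = np.einsum('hk,khd->hd', w, k).reshape(C)       # weighted sum per head
    # Residual combination with the target feature (assumed design choice).
    return target_feat + agg
```

With a single support feature the attention weight is 1, so the output reduces to the target feature plus that support feature, which makes the weighting behavior easy to check.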
Experimental Results
The proposed operator demonstrated its efficacy by achieving substantial mAP improvements across experiments on the ImageNet VID dataset. Notably, it raised performance from a 74.0 baseline to 80.5 mAP with a ResNet-101 backbone. When incorporated into the SELSA framework, it reached 82.0 mAP, competitive with existing state-of-the-art methods.
Broader Implications
The Temporal ROI Align not only boosts performance in video object detection but also shows adaptability across other video-related tasks like video instance segmentation (VIS). Its successful application on complex datasets such as EPIC KITCHENS further underscores its robustness and utility in diverse and challenging environments.
Future Directions
Moving forward, the proposed method opens avenues for exploring broader applications in video-based tasks, such as multi-object tracking. The research suggests that expanding the utilization of temporal context could enhance models' understanding of object dynamics in various video datasets. Additionally, further exploration into the optimal harmonization of temporal and spatial information could yield even more sophisticated and efficient models.
In summary, the paper introduces an innovative approach to leveraging temporal information in video object recognition, providing a significant performance boost and paving the way for future advancements in video analysis tasks.