Sequence Level Semantics Aggregation for Video Object Detection
Building robust and effective Video Object Detection (VID) methods remains an ongoing challenge, particularly when object or camera motion causes significant appearance degradation across video frames. The paper "Sequence Level Semantics Aggregation for Video Object Detection" addresses this problem by proposing a feature aggregation approach that leverages information at the sequence level rather than relying solely on temporally adjacent frames. The strategy is motivated by the limitations of existing methods, which depend primarily on optical flow or recurrent neural networks (RNNs) for feature aggregation, a reliance that restricts aggregation to a narrow temporal window and can miss valuable sequence-level information.
The Sequence Level Semantics Aggregation (SELSA) module is the paper's key contribution. By treating a video as an unordered collection of frames, the module aggregates features based on semantic similarity across the full sequence, removing the constraint of temporal adjacency. The authors argue that incorporating sequence-level information yields more discriminative and robust features for VID. They also relate SELSA to spectral clustering, interpreting the aggregation as reducing the intra-class feature variance that fast motion typically exacerbates.
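The core mechanism can be sketched as similarity-weighted feature aggregation. The snippet below is a minimal, illustrative Python sketch rather than the paper's exact implementation: the function name selsa_aggregate is assumed, and details such as learned projections and how the module operates on region proposal features inside a two-stage detector are omitted.

```python
import torch
import torch.nn.functional as F

def selsa_aggregate(target_feats: torch.Tensor, sequence_feats: torch.Tensor) -> torch.Tensor:
    """Refine proposal features via similarity-weighted aggregation.

    target_feats:   (N, D) proposal features from the frame being detected.
    sequence_feats: (M, D) proposal features gathered from frames sampled
                    anywhere in the video, not just temporal neighbours.
    """
    # Cosine similarity between each target proposal and every
    # sequence-level proposal.
    sim = F.normalize(target_feats, dim=1) @ F.normalize(sequence_feats, dim=1).T  # (N, M)

    # Softmax over the sequence dimension converts similarities into
    # aggregation weights.
    weights = F.softmax(sim, dim=1)

    # Each refined feature is a weighted sum of semantically similar
    # proposal features from across the sequence.
    return weights @ sequence_feats
```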
Experiments with SELSA on the ImageNet VID and EPIC KITCHENS datasets yield new state-of-the-art results. On ImageNet VID, SELSA achieves notable gains in mean Average Precision (mAP), especially on objects with fast motion. Importantly, it does so without complex post-processing such as Seq-NMS or tubelet rescoring, which keeps the detection pipeline simple and efficient.
SELSA reframes the VID task from a sequence-wide perspective, broadening feature aggregation beyond the constraint of temporal contiguity. The entire video is treated as a collection from which semantically relevant features can be extracted and aggregated, a departure from the narrowly defined feature windows imposed by optical-flow-based or RNN-based architectures, as sketched below.
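To make the "unordered collection" view concrete, the following sketch draws support frames uniformly from the whole video rather than from a small temporal window. The function name sample_support_frames and the specific frame counts are illustrative assumptions, not the paper's exact training or inference protocol.

```python
import random

def sample_support_frames(num_frames: int, num_support: int, current: int) -> list[int]:
    """Sample support frames uniformly from the whole video.

    The video is treated as an unordered set of frames, so support
    frames may come from anywhere in the sequence rather than from a
    window around the current frame.
    """
    candidates = [i for i in range(num_frames) if i != current]
    return random.sample(candidates, min(num_support, len(candidates)))

# Example: detect on frame 42 of a 300-frame video while aggregating
# proposal features from support frames drawn across the whole sequence.
support = sample_support_frames(num_frames=300, num_support=15, current=42)
```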
The implications of SELSA are twofold. Practically, VID systems can achieve improved accuracy, even under rapid motion, while maintaining computational efficiency. Theoretically, the paper emphasizes the importance of global contextual understanding across video frames, urging a shift away from sequentially constrained models.
Future developments in video-based object detection may incorporate aspects of SELSA-like architectures, focusing on holistic video representation rather than a series of discrete frames. The potential to reduce intra-class feature variance while maintaining or improving generalization suggests that SELSA could adapt to other video recognition tasks that demand continuity and coherence across frames.
In summary, the SELSA module represents a forward-looking step in video object detection, demonstrating the efficacy of sequence-level feature aggregation in overcoming the challenges posed by fast-moving scenes and object transformations. Subsequent research may build on these principles, exploring optimized methods for sequence-wide feature integration and evaluating them across broader video datasets.