Practical Video Object Detection via Feature Selection and Aggregation (2407.19650v1)

Published 29 Jul 2024 in cs.CV

Abstract: Compared with still image object detection, video object detection (VOD) needs to particularly concern the high across-frame variation in object appearance, and the diverse deterioration in some frames. In principle, the detection in a certain frame of a video can benefit from information in other frames. Thus, how to effectively aggregate features across different frames is key to the target problem. Most of contemporary aggregation methods are tailored for two-stage detectors, suffering from high computational costs due to the dual-stage nature. On the other hand, although one-stage detectors have made continuous progress in handling static images, their applicability to VOD lacks sufficient exploration. To tackle the above issues, this study invents a very simple yet potent strategy of feature selection and aggregation, gaining significant accuracy at marginal computational expense. Concretely, for cutting the massive computation and memory consumption from the dense prediction characteristic of one-stage object detectors, we first condense candidate features from dense prediction maps. Then, the relationship between a target frame and its reference frames is evaluated to guide the aggregation. Comprehensive experiments and ablation studies are conducted to validate the efficacy of our design, and showcase its advantage over other cutting-edge VOD methods in both effectiveness and efficiency. Notably, our model reaches \emph{a new record performance, i.e., 92.9\% AP50 at over 30 FPS on the ImageNet VID dataset on a single 3090 GPU}, making it a compelling option for large-scale or real-time applications. The implementation is simple, and accessible at \url{https://github.com/YuHengsss/YOLOV}.

Insights into Practical Video Object Detection via Feature Selection and Aggregation

The paper "Practical Video Object Detection via Feature Selection and Aggregation" presents an approach to video object detection (VOD) centered on refined feature selection and aggregation. It addresses the difficulty of leveraging temporal information across video frames, in which object appearance varies significantly due to motion, pose changes, and frame deterioration. Traditional VOD methods typically build on two-stage detectors, whose dual-stage nature incurs high computational costs. This work instead targets one-stage detectors, which have advanced steadily on still images but remain underexplored for VOD.

Methodological Approach

The authors propose a practical two-part strategy. First, they introduce a Feature Selection Module (FSM) to cope with the dense prediction characteristic of one-stage detectors. The FSM condenses the candidate features, reducing the heavy computation and memory expense typical of these models. It filters out low-quality predictions using a predetermined confidence threshold, consistent with the dense label assignment strategies commonly employed in one-stage detectors.
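The selection step described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, the threshold value, and the top-k cap are all assumptions chosen for clarity.

```python
import numpy as np

def select_features(features, conf_scores, conf_thresh=0.05, top_k=30):
    """Condense candidate features from a dense prediction map.

    features:    (N, D) array, one feature vector per dense prediction
    conf_scores: (N,) array of per-prediction confidence scores

    Keeps predictions whose confidence clears `conf_thresh`, then
    retains only the `top_k` highest-scoring survivors, so downstream
    aggregation operates on a small candidate set instead of the full
    dense map. Threshold and top_k values here are illustrative.
    """
    keep = conf_scores >= conf_thresh          # drop low-quality predictions
    feats, scores = features[keep], conf_scores[keep]
    order = np.argsort(-scores)[:top_k]        # highest confidence first
    return feats[order], scores[order]
```

The key efficiency point is that attention-based aggregation later scales with the number of retained candidates, not with the size of the dense prediction map.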

Second, the paper describes a Feature Aggregation Module (FAM) built on recent advances in attention mechanisms. This module evaluates the relationship between the target and reference frames, enhancing feature aggregation without substantial computational overhead. The FAM combines classification and IoU scores into an affinity matrix that guides the aggregation. To circumvent the limitations of plain cosine similarity, it incorporates a custom average pooling operator that sharpens the model's focus on relevant temporal information.
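A rough sketch of score-guided aggregation over reference-frame candidates is shown below. This is an illustrative simplification, not the paper's exact formulation: the paper builds its affinity from classification and IoU scores together with an average pooling operator, whereas here a scaled dot-product similarity is simply biased by a per-reference quality score.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(target_feats, ref_feats, ref_quality):
    """Quality-guided attention over reference-frame features.

    target_feats: (T, D) candidate features from the target frame
    ref_feats:    (R, D) candidate features pooled from reference frames
    ref_quality:  (R,)   per-candidate quality score, e.g. the product
                  of classification score and IoU score (an assumption
                  standing in for the paper's affinity construction)
    Returns an aggregated (T, D) feature for each target candidate.
    """
    d = target_feats.shape[1]
    sim = target_feats @ ref_feats.T / np.sqrt(d)   # scaled dot-product
    # Bias the affinity so low-quality references contribute less.
    affinity = sim + np.log(ref_quality + 1e-6)[None, :]
    weights = softmax(affinity, axis=1)             # rows sum to 1
    return weights @ ref_feats
```

The aggregated features then replace (or augment) the per-frame features before the final classification and regression heads.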

Evaluation and Results

Extensive empirical analysis and ablation studies underscore the efficiency and effectiveness of the proposed method. The results on the ImageNet VID dataset reveal notable improvements, with the model achieving 92.9% AP50 at over 30 FPS on a 3090 GPU, showcasing both high accuracy and fast inference capabilities. The paper also highlights how the combination of FSM and FAM elevates the model's performance, significantly outperforming other state-of-the-art VOD methodologies without relying on post-processing techniques.

Implications and Future Directions

The implications of this research extend to both theoretical and practical domains within artificial intelligence and computer vision. The ability to fuse efficiency with accuracy via one-stage detectors could inspire reconsideration of existing video processing workflows, particularly in large-scale applications and real-time scenarios. The introduction of a refined affinity matrix and average pooling mechanisms may prompt further exploration into how feature similarities can be better leveraged in multi-frame contexts.

Looking ahead, future investigations could delve into adaptive threshold mechanisms for feature selection, potentially broadening the applicability and robustness of the approach across diverse datasets. The paper's framework also opens avenues for improving generative models in video synthesis, video surveillance systems, and autonomous vehicle navigation, where quick and accurate object detection is paramount.

In summary, the authors adeptly navigate the complexities of video object detection, proposing a strategy that holds promise for advancing both the computational and application-oriented aspects of VOD, setting a new benchmark in the field.

Authors (3)
  1. Yuheng Shi (7 papers)
  2. Tong Zhang (569 papers)
  3. Xiaojie Guo (49 papers)
Citations (1)