- The paper introduces a novel two-stage framework that models object-object relations to boost video object detection performance.
- It uses stacked relation modules integrating geometric and appearance features to overcome challenges like occlusion and motion blur.
- The method achieves up to 84.7% mAP on ImageNet VID, outperforming state-of-the-art approaches via refined proposal linking and rescoring.
Summary of "Relation Distillation Networks for Video Object Detection"
The paper introduces a novel architecture called Relation Distillation Networks (RDN) designed to advance video object detection by effectively modeling object-to-object relations across video frames. The primary focus is on enhancing object detection through a comprehensive understanding of object interactions within the spatio-temporal context of videos.
Key Contributions and Methodology
Video object detection extends image-based detection by exploiting temporal coherence, which introduces unique challenges such as object occlusion and motion blur. Existing solutions predominantly strengthen per-frame features through box-level association or by aggregating features across frames, for example guided by optical flow. However, these approaches largely overlook the dependencies between objects, which can be leveraged to improve detection performance.
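To make the prior line of work concrete, flow-guided feature aggregation of the kind referenced above can be sketched as follows. This is a simplified NumPy illustration, not the exact pipeline of any particular method: `aggregate_with_flow` is a hypothetical name, the warping is nearest-neighbour rather than bilinear, and the exp-of-cosine weighting stands in for a learned adaptive weight.

```python
import numpy as np

def aggregate_with_flow(ref_feat, sup_feats, flows):
    """Warp each support frame's feature map onto the reference frame via
    optical flow, then average the warped maps with per-position weights
    given by cosine similarity to the reference feature. Nearest-neighbour
    warping and exp(cosine) weights are simplifications."""
    h, w, _ = ref_feat.shape
    ys, xs = np.mgrid[0:h, 0:w]
    agg = np.zeros_like(ref_feat)
    wsum = np.zeros((h, w, 1))
    for feat, flow in zip(sup_feats, flows):
        # Nearest-neighbour warp: sample the support feature at (x+u, y+v).
        xw = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
        yw = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
        warped = feat[yw, xw]
        # Adaptive weight: cosine similarity between warped and reference.
        num = (warped * ref_feat).sum(axis=-1, keepdims=True)
        den = (np.linalg.norm(warped, axis=-1, keepdims=True)
               * np.linalg.norm(ref_feat, axis=-1, keepdims=True) + 1e-6)
        wgt = np.exp(num / den)
        agg += wgt * warped
        wsum += wgt
    return agg / wsum
```

Note that aggregation of this kind operates on whole feature maps; RDN's contribution, described below, instead reasons at the level of object proposals.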
RDN tackles this challenge by proposing a multi-stage reasoning framework for modeling object relations. The proposed method uses Region Proposal Networks (RPN) to generate object proposals from both the reference and support frames. The novelty of RDN lies in its two-stage structure which includes:
- Basic Stage: The first stage performs relational reasoning over all support proposals, irrespective of their objectness scores. Stacked relation modules, which combine geometric and appearance features, identify related objects and enhance each reference proposal's features accordingly.
- Advanced Stage: The second stage restricts attention to high-objectness support proposals and distills relations in a cascaded manner, selectively enhancing reference proposals against this refined supportive set. This staged design reduces computational complexity while improving both per-frame accuracy and cross-frame consistency.
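The stacked relation modules follow the attention-style formulation that RDN builds on: each reference proposal attends to support proposals using a combination of appearance similarity and relative box geometry. A minimal single-head NumPy sketch is below; seeded random matrices stand in for learned projection weights, and the exact gating and normalization details are illustrative rather than the paper's precise equations.

```python
import numpy as np

def relation_module(ref_feats, sup_feats, ref_boxes, sup_boxes, d_k=64, seed=0):
    """One attention-style relation module (single head): each reference
    proposal aggregates support-proposal features, weighted by appearance
    similarity and relative box geometry. Seeded random matrices stand in
    for the learned projections of a trained network."""
    rng = np.random.default_rng(seed)
    d = ref_feats.shape[1]
    w_q = rng.standard_normal((d, d_k)) / np.sqrt(d)
    w_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
    w_v = rng.standard_normal((d, d)) / np.sqrt(d)
    w_g = rng.standard_normal(4)  # stand-in for learned geometric weights

    # Appearance weight: scaled dot-product attention between projections.
    app = (ref_feats @ w_q) @ (sup_feats @ w_k).T / np.sqrt(d_k)  # (N_ref, N_sup)

    # Geometric weight: log-space relative geometry of each box pair,
    # passed through a ReLU gate.
    def cxcywh(b):
        return ((b[:, 0] + b[:, 2]) / 2, (b[:, 1] + b[:, 3]) / 2,
                b[:, 2] - b[:, 0], b[:, 3] - b[:, 1])

    rcx, rcy, rw, rh = cxcywh(ref_boxes)
    scx, scy, sw, sh = cxcywh(sup_boxes)
    dx = np.log(np.abs(rcx[:, None] - scx[None]) / rw[:, None] + 1e-3)
    dy = np.log(np.abs(rcy[:, None] - scy[None]) / rh[:, None] + 1e-3)
    dw = np.log(sw[None] / rw[:, None])
    dh = np.log(sh[None] / rh[:, None])
    geo = np.maximum(np.stack([dx, dy, dw, dh], axis=-1) @ w_g, 0.0)

    # Combine: the geometric term gates the softmax over appearance scores.
    logits = np.log(geo + 1e-6) + app
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)

    # Residual enhancement of the reference proposal features.
    return ref_feats + attn @ (sup_feats @ w_v)
```

In this reading, the basic stage would apply such modules over all support proposals, while the advanced stage would re-apply them over only the highest-objectness subset.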
Results and Implications
The efficacy of RDN is evaluated on the ImageNet VID dataset, where it achieves a mean Average Precision (mAP) of 81.8% with a ResNet-101 backbone, improving to 83.2% with ResNeXt-101. These results surpass contemporary state-of-the-art methods. With detection linking and rescoring, the mAP rises further to 84.7%.
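The linking-and-rescoring step chains overlapping detections across frames into tubelets and propagates high confidence along each chain. A toy sketch of one common rescoring heuristic is below; `rescore_tubelet` is a hypothetical name, and the specific rule (mean of the top-scoring boxes) is illustrative, not necessarily the paper's exact formula.

```python
def rescore_tubelet(scores, top_ratio=0.5):
    """Illustrative rescoring rule: every detection linked into a tubelet
    receives the mean score of the tubelet's top-scoring boxes, boosting
    weak detections that are temporally consistent with strong ones."""
    ranked = sorted(scores, reverse=True)
    k = max(1, int(len(ranked) * top_ratio))
    boosted = sum(ranked[:k]) / k
    return [boosted] * len(scores)
```

For example, a tubelet scored [0.9, 0.5, 0.1] would have its occluded or blurred frames lifted toward the confident one, which is precisely where cross-frame consistency pays off.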
The successful deployment of RDN has direct implications for video surveillance, autonomous navigation, and any domain that relies on robust object detection in dynamic environments. The paper also points to future research on more sophisticated relation modeling and distillation techniques, which could further advance video analytics.
Conclusion
In summary, this research marks a significant stride in video object detection through the framework of Relation Distillation Networks. By capturing and progressively refining object interactions across frames, RDN demonstrates the potential of relational reasoning to push the limits of detection performance. Further exploration of this direction could yield more accurate and computationally efficient solutions for a wide range of machine perception problems.