Disentangle Your Dense Object Detector (2107.02963v2)

Published 7 Jul 2021 in cs.CV

Abstract: Deep learning-based dense object detectors have achieved great success in the past few years and have been applied to numerous multimedia applications such as video understanding. However, the current training pipeline for dense detectors is compromised to lots of conjunctions that may not hold. In this paper, we investigate three such important conjunctions: 1) only samples assigned as positive in classification head are used to train the regression head; 2) classification and regression share the same input feature and computational fields defined by the parallel head architecture; and 3) samples distributed in different feature pyramid layers are treated equally when computing the loss. We first carry out a series of pilot experiments to show disentangling such conjunctions can lead to persistent performance improvement. Then, based on these findings, we propose Disentangled Dense Object Detector (DDOD), in which simple and effective disentanglement mechanisms are designed and integrated into the current state-of-the-art dense object detectors. Extensive experiments on MS COCO benchmark show that our approach can lead to 2.0 mAP, 2.4 mAP and 2.2 mAP absolute improvements on RetinaNet, FCOS, and ATSS baselines with negligible extra overhead. Notably, our best model reaches 55.0 mAP on the COCO test-dev set and 93.5 AP on the hard subset of WIDER FACE, achieving new state-of-the-art performance on these two competitive benchmarks. Code is available at https://github.com/zehuichen123/DDOD.

Summary

The paper "Disentangle Your Dense Object Detector" examines several conjunctions, that is, couplings between components that are usually taken for granted, in the training pipelines of modern dense object detectors. These detectors, typically single-stage architectures, have proven effective in applications ranging from video understanding to real-time multimedia interfaces. The authors argue that some of the assumptions built into these pipelines may not hold and can limit performance, and they propose mechanisms for disentangling them.

Key Contributions

Disentanglement of Training Conjunctions: The authors identify three conjunctions that are built into the usual training of dense object detectors:
- Label Assignment Conjunction: In the prevailing pipeline, only samples assigned as positive for the classification head are used to train the regression head, with positives typically chosen by a criterion such as an IoU threshold. The paper challenges this coupling and proposes separate label assignment strategies for the classification and regression tasks.
- Spatial Feature Conjunction: Classification and regression inherently seek different features within an image, yet the parallel head architecture gives them the same input features and the same receptive fields. The paper suggests a spatial feature disentanglement method based on deformable convolutions, so that each task can extract the features that suit it.
- Pyramid Supervision Conjunction: Standard practice treats samples from all feature pyramid levels equally when computing the loss, ignoring the sample imbalance across levels. The paper introduces a re-weighting method to address this imbalance.
Proposed Disentangled Dense Object Detector (DDOD): Building on these insights, the paper proposes the DDOD framework, which integrates the three disentanglement strategies into state-of-the-art dense detectors such as RetinaNet, FCOS, and ATSS.
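
To make the label assignment idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a plain IoU-threshold assigner and two hypothetical thresholds (`CLS_THRESHOLD`, `REG_THRESHOLD`) chosen only for illustration; the point is simply that the positive set for regression no longer has to equal the positive set for classification.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign(anchors, gt_box, threshold):
    """Return indices of anchors whose IoU with gt_box reaches the threshold."""
    return [i for i, anchor in enumerate(anchors) if iou(anchor, gt_box) >= threshold]


# Hypothetical thresholds, picked for illustration only (not from the paper).
CLS_THRESHOLD = 0.6
REG_THRESHOLD = 0.4

gt = (0.0, 0.0, 10.0, 10.0)
anchors = [
    (0.0, 0.0, 10.0, 10.0),    # identical to the ground-truth box
    (0.0, 0.0, 10.0, 8.0),     # covers 80% of the box
    (0.0, 0.0, 10.0, 5.0),     # covers half of the box
    (20.0, 20.0, 30.0, 30.0),  # no overlap
]

# Coupled pipeline: one positive set would feed both heads.
# Disentangled pipeline: each head gets its own positive set.
cls_positives = assign(anchors, gt, CLS_THRESHOLD)
reg_positives = assign(anchors, gt, REG_THRESHOLD)

print("classification positives:", cls_positives)
print("regression positives:   ", reg_positives)
```

In this toy setup the half-overlapping anchor contributes to the regression loss without being a classification positive, which is the kind of decoupling the first conjunction is about. The actual assignment rule used in DDOD may differ.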

Experimental Findings

The DDOD framework was evaluated on top of three dense detection baselines. On MS COCO it yields absolute improvements of 2.0 mAP on RetinaNet, 2.4 mAP on FCOS, and 2.2 mAP on ATSS, with negligible extra overhead, and the best model reaches 55.0 mAP on the COCO test-dev set. The summary also notes improved performance on large-object metrics, which it attributes to the pyramid supervision disentanglement.
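
The pyramid supervision idea can be illustrated with a small sketch. This is an assumption-laden example rather than the paper's scheme: the helper names `level_weights` and `reweighted_loss` are hypothetical, and the weights are simply taken inversely proportional to the number of positive samples per pyramid level, normalized so that their mean is 1.

```python
def level_weights(positives_per_level):
    """Weights inversely proportional to each level's positive count, mean 1.

    Illustrative rule only; the re-weighting used in DDOD may be different.
    """
    inverse = [1.0 / max(count, 1) for count in positives_per_level]
    scale = len(inverse) / sum(inverse)
    return [value * scale for value in inverse]


def reweighted_loss(loss_per_level, weights):
    """Weighted sum of per-level losses instead of a plain (uniform) sum."""
    return sum(loss * weight for loss, weight in zip(loss_per_level, weights))


# Made-up positive counts for five pyramid levels, finest level first.
positives = [400, 200, 100, 50, 50]
weights = level_weights(positives)

# Made-up per-level losses, used only to show the call.
losses = [0.9, 0.8, 0.7, 0.6, 0.6]
uniform_total = sum(losses)
reweighted_total = reweighted_loss(losses, weights)

print([round(w, 3) for w in weights])
```

Levels with few positives receive weights above 1 and crowded levels receive weights below 1, so sparse levels are not drowned out when the per-level losses are summed.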

Further tests on the WIDER FACE dataset show that DDOD also applies to domain-specific detection tasks such as face detection: it reaches 93.5 AP on the hard subset, which the authors report as a new state of the art.

Implications and Future Directions

The disentangling mechanisms in DDOD are simple yet effective: they deliver consistent improvements across the tested baselines while keeping the computational overhead negligible. By combining them in a single framework, DDOD offers a practical way to improve the accuracy of dense object detectors. More broadly, the results encourage further research into other coupled design choices in detection pipelines, particularly those related to head architecture and task-specific feature learning.

Applying DDOD to two-stage object detectors is a possible next step, since classification and regression are coupled there as well. Progress on feature disentanglement may also benefit image understanding tasks beyond detection.

In conclusion, DDOD invites a reevaluation of established practices in dense object detection pipelines and shows that accuracy can be improved without sacrificing efficiency.

GitHub

GitHub - zehuichen123/DDOD: [ACM MM 2021, TMM 2023] Disentangle your Dense Object Detector (61 stars)