Rethinking Classification and Localization for Object Detection (1904.06493v4)

Published 13 Apr 2019 in cs.CV

Abstract: Two head structures (i.e. fully connected head and convolution head) have been widely used in R-CNN based detectors for classification and localization tasks. However, there is a lack of understanding of how these two head structures work for these two tasks. To address this issue, we perform a thorough analysis and find an interesting fact that the two head structures have opposite preferences towards the two tasks. Specifically, the fully connected head (fc-head) is more suitable for the classification task, while the convolution head (conv-head) is more suitable for the localization task. Furthermore, we examine the output feature maps of both heads and find that fc-head has more spatial sensitivity than conv-head. Thus, fc-head has more capability to distinguish a complete object from part of an object, but is not robust to regress the whole object. Based upon these findings, we propose a Double-Head method, which has a fully connected head focusing on classification and a convolution head for bounding box regression. Without bells and whistles, our method gains +3.5 and +2.8 AP on MS COCO dataset from Feature Pyramid Network (FPN) baselines with ResNet-50 and ResNet-101 backbones, respectively.

Authors (7)

Yue Wu (339 papers)
Yinpeng Chen (55 papers)
Lu Yuan (130 papers)
Zicheng Liu (153 papers)
Lijuan Wang (133 papers)
Hongzhi Li (4 papers)
Yun Fu (131 papers)

Citations (464)

Summary

Analyzing Double-Head Structures in Object Detection

Object detection, a fundamental task in computer vision, requires both accurate classification and precise localization. The paper "Rethinking Classification and Localization for Object Detection" by Yue Wu et al. analyzes how the two common detector heads serve these two tasks and, based on that analysis, proposes a Double-Head structure.

Core Findings

The authors begin by examining two prevalent head structures in R-CNN-based detectors: the fully connected head (fc-head) and the convolution head (conv-head). Their analysis shows that the two heads have opposite preferences: the fc-head is better suited to classification, while the conv-head is better suited to localization. The authors attribute this to spatial sensitivity. The fc-head is more spatially sensitive than the conv-head, so it is better at distinguishing a complete object from part of an object, but it is less robust when regressing the whole object.

The Double-Head Methodology

Leveraging the complementary strengths of both head structures, the paper introduces the Double-Head model. This architecture assigns classification to the fc-head and bounding box regression to the conv-head. On the MS COCO dataset, the Double-Head design improves Average Precision (AP) over Feature Pyramid Network (FPN) baselines by 3.5 points with a ResNet-50 backbone and by 2.8 points with a ResNet-101 backbone.
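The task split described above can be illustrated with a toy sketch. This is not the authors' implementation: the layer sizes, the random weights, the single 1x1 convolution, and all function and parameter names are assumptions chosen to keep the example small and dependency-free. It only shows how one RoI feature feeds a flattening fc branch for class scores and a convolution-plus-pooling branch for box deltas.

```python
import math
import random

def linear(x, weight, bias):
    """Compute y = W x + b for a flat input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def fc_head(roi, weight, bias):
    """Classification branch: flattening gives every spatial position its own weights."""
    flat = [v for channel in roi for row in channel for v in row]
    return softmax(linear(flat, weight, bias))

def conv_head(roi, mix, weight, bias):
    """Regression branch: a 1x1 convolution + ReLU shared across positions,
    followed by global average pooling and a linear layer giving 4 box deltas."""
    C, H, W = len(roi), len(roi[0]), len(roi[0][0])
    pooled = []
    for out_c in mix:  # one row of channel-mixing weights per output channel
        total = 0.0
        for i in range(H):
            for j in range(W):
                act = sum(out_c[c] * roi[c][i][j] for c in range(C))
                total += max(0.0, act)
        pooled.append(total / (H * W))
    return linear(pooled, weight, bias)

def double_head(roi, params):
    """Route one RoI feature (C x H x W nested lists) through both branches."""
    scores = fc_head(roi, params["fc_w"], params["fc_b"])
    deltas = conv_head(roi, params["mix"], params["reg_w"], params["reg_b"])
    return scores, deltas

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes and random weights, for illustration only.
rng = random.Random(0)
C, H, W, NUM_CLASSES, HIDDEN = 2, 3, 3, 3, 4
roi = [rand_matrix(rng, H, W) for _ in range(C)]
params = {
    "fc_w": rand_matrix(rng, NUM_CLASSES, C * H * W),
    "fc_b": [0.0] * NUM_CLASSES,
    "mix": rand_matrix(rng, HIDDEN, C),
    "reg_w": rand_matrix(rng, 4, HIDDEN),
    "reg_b": [0.0] * 4,
}
scores, deltas = double_head(roi, params)

# Mirror the RoI horizontally to probe spatial sensitivity.
flipped = [[row[::-1] for row in channel] for channel in roi]
flipped_scores, flipped_deltas = double_head(flipped, params)
```

In this toy version, mirroring the RoI changes the fc-head scores but leaves the conv-head output unchanged, because a 1x1 convolution followed by global pooling ignores position entirely. That exaggerates the difference: a real conv-head uses larger kernels and is only less spatially sensitive, not insensitive.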

Detailed Analysis

The analysis compares the two heads on each task:

Classification Evaluation: The fc-head's classification scores correlate more strongly with the IoU between proposals and ground truth than the conv-head's do, which suggests greater spatial sensitivity and a better ability to tell a complete object from a partial one.
Localization Evaluation: The conv-head regresses bounding boxes more accurately than the fc-head.
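The classification evaluation above rests on two standard quantities: the IoU of each proposal with its ground-truth box, and the correlation between those IoUs and the classification scores. A minimal sketch follows, assuming boxes in (x1, y1, x2, y2) format and using the Pearson coefficient. The boxes and the two score lists are made-up numbers for illustration, not results from the paper.

```python
import math

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical ground truth and proposals shifted progressively to the right.
ground_truth = (10, 10, 50, 50)
proposals = [(10, 10, 50, 50), (20, 10, 60, 50), (30, 10, 70, 50), (40, 10, 80, 50)]
ious = [iou(p, ground_truth) for p in proposals]

# Hypothetical scores: head A tracks IoU closely, head B stays high until overlap is small.
scores_head_a = [0.95, 0.70, 0.40, 0.15]
scores_head_b = [0.90, 0.85, 0.80, 0.20]

corr_a = pearson(ious, scores_head_a)
corr_b = pearson(ious, scores_head_b)
```

With these made-up numbers, `corr_a` exceeds `corr_b`, which is the pattern the summary attributes to the fc-head relative to the conv-head.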

Numerical Results and Implications

The headline numerical result is the AP gain on MS COCO over the FPN baselines: +3.5 with ResNet-50 and +2.8 with ResNet-101. The paper also examines an extension, Double-Head-Ext, in which each head is additionally supervised on its unfocused task (bounding box regression in the fc-head and classification in the conv-head). The extra supervision helps training, and the classification scores from the two heads are complementary, so fusing them improves results further.
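The summary does not spell out how the two heads' outputs are combined, so the following is only a sketch under stated assumptions. It assumes a probabilistic-OR style fusion of the two classification scores and a simple convex weighting between a head's focused and unfocused losses. The function names and the weight `lam` are hypothetical.

```python
def fuse_scores(s_fc, s_conv):
    """Assumed complementary fusion: 1 - (1 - s_fc) * (1 - s_conv).
    The result is at least as large as either input when both lie in [0, 1]."""
    return s_fc + s_conv * (1.0 - s_fc)

def head_loss(focused_loss, unfocused_loss, lam):
    """Assumed per-head loss: weight lam on the focused task, 1 - lam on the unfocused one."""
    return lam * focused_loss + (1.0 - lam) * unfocused_loss

# Hypothetical values, for illustration only.
fused = fuse_scores(0.8, 0.5)          # 0.8 + 0.5 * 0.2
fc_total = head_loss(1.0, 2.0, 0.7)    # classification is the fc-head's focused task
```

Setting `lam` to 1.0 recovers the plain Double-Head setup, where each head is trained only on its focused task.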

Broader Implications

The Double-Head approach adds a design consideration for object detectors: each task can be given the head structure that suits it best, rather than sharing one head for both. This points toward more modular detection systems in which heads are specialized per task.

Future Directions

This paper opens the door for further investigation into hybrid architectures, potentially integrating other neural structures or learning paradigms. Future work might explore broader applications in real-time detection or multi-task learning, pushing the boundaries of head specialization further within AI frameworks.

In summary, this paper provides a critical contribution to object detection methodologies by advocating a strategic division of labor between classification and localization tasks. It stands as a valuable resource for researchers seeking to enhance detection efficiency and accuracy in complex visual environments.