
Revisiting the Sibling Head in Object Detector (2003.07540v1)

Published 17 Mar 2020 in cs.CV

Abstract: The "shared head for classification and localization" (sibling head), firstly denominated in Fast R-CNN (Girshick, 2015), has been leading the fashion of the object detection community in the past five years. This paper provides the observation that the spatial misalignment between the two object functions in the sibling head can considerably hurt the training process, but this misalignment can be resolved by a very simple operator called task-aware spatial disentanglement (TSD). Considering the classification and regression, TSD decouples them from the spatial dimension by generating two disentangled proposals for them, which are estimated by the shared proposal. This is inspired by the natural insight that for one instance, the features in some salient area may have rich information for classification while these around the boundary may be good at bounding box regression. Surprisingly, this simple design can boost all backbones and models on both MS COCO and Google OpenImage consistently by ~3% mAP. Further, we propose a progressive constraint to enlarge the performance margin between the disentangled and the shared proposals, and gain ~1% more mAP. We show that TSD breaks through the upper bound of nowadays single-model detector by a large margin (mAP 49.4 with ResNet-101, 51.2 with SENet154), and is the core model of our 1st place solution on the Google OpenImage Challenge 2019.

Authors (3)
  1. Guanglu Song (45 papers)
  2. Yu Liu (786 papers)
  3. Xiaogang Wang (230 papers)
Citations (323)

Summary

  • The paper identifies spatial misalignment between classification and localization in sibling head designs and proposes a spatial disentanglement mechanism.
  • The introduced operator improves object detection performance by approximately 3% mAP on MS COCO with an additional 1% boost using a progressive constraint.
  • The findings highlight the importance of task-specific feature extraction, offering a practical enhancement for existing detectors without major architectural changes.

Revisiting the Sibling Head in Object Detector

The paper "Revisiting the Sibling Head in Object Detector" presents an empirical investigation into the role of the sibling head design in object detection architectures, particularly in the Fast R-CNN family. The sibling head, which shares parameters between classification and localization tasks, has been a widely used design in object detectors. However, the authors identify a critical issue—spatial misalignment between these two functions—hindering optimal performance. This misalignment arises because features beneficial for classification may not be equally suitable for precise bounding box localization.

To address this, the authors introduce a new operator, TSD (task-aware spatial disentanglement), which decouples the classification and regression tasks spatially. Starting from the shared proposal, TSD generates two separate proposals, one for classification and one for localization, through a spatial disentanglement mechanism. The approach is built on the observation that different regions of an instance's feature map are better suited to specific tasks: salient areas carry rich information for classification, while regions near object boundaries yield better localization cues.
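A rough sketch of the disentanglement idea is given below. In the paper, the classification branch derives an irregular, deformably pooled proposal and the regression branch a whole-proposal translation; the simplified version here only shows the regression-side translation and re-pooling with torchvision.ops.roi_align. The module name, layer sizes, and the scaling factor gamma are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class TSDRegressionSketch(nn.Module):
    """Derive a translated proposal for the regression branch from the shared one."""
    def __init__(self, channels=256, pool=7, gamma=0.1):
        super().__init__()
        self.pool, self.gamma = pool, gamma
        # Predict a whole-proposal shift (dx, dy) from the shared RoI feature.
        self.delta_r = nn.Sequential(
            nn.Linear(channels * pool * pool, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 2),
        )

    def forward(self, fmap, rois, stride=16.0):
        # rois: (N, 5) rows of (batch_index, x1, y1, x2, y2) in image coordinates.
        scale = 1.0 / stride
        shared = roi_align(fmap, rois, (self.pool, self.pool), spatial_scale=scale)
        shift = self.gamma * self.delta_r(shared.flatten(1))   # (N, 2) relative shift
        shift = shift * (rois[:, 3:5] - rois[:, 1:3])          # scale by box width/height
        # Translate the whole proposal for regression; classification keeps `rois`
        # (the paper instead deforms it cell-wise, which is omitted here).
        rois_r = torch.cat([rois[:, :1], rois[:, 1:3] + shift, rois[:, 3:5] + shift], dim=1)
        feat_r = roi_align(fmap, rois_r, (self.pool, self.pool), spatial_scale=scale)
        return shared, feat_r   # disentangled features for classification / regression
```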

The results reported are compelling: applying TSD leads to consistent improvements of approximately 3% mAP across backbones on both MS COCO and Google OpenImage. A further gain of roughly 1% mAP is achieved through a progressive constraint, which enlarges the performance margin between the disentangled and the shared proposals. This is particularly noteworthy given the reported single-model results of 49.4 mAP with ResNet-101 and 51.2 mAP with SENet154, which push past the ceiling typically reached by contemporary single-model detectors.
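The progressive constraint can be read as a hinge-style margin penalty that pushes the disentangled branch to outperform the shared branch by at least a fixed margin. The snippet below is a schematic version with illustrative names and a placeholder margin, not the paper's exact formulation (which applies such margins to class confidence and to IoU separately).

```python
import torch

def progressive_constraint(shared_score, disentangled_score, margin=0.2):
    """Hinge penalty incurred when the disentangled branch fails to beat the
    shared branch by at least `margin`; scores are assumed to lie in [0, 1]."""
    return torch.clamp(shared_score - disentangled_score + margin, min=0.0).mean()

# Toy example with per-RoI confidences (for regression, IoUs would play this role).
shared = torch.tensor([0.70, 0.85, 0.60])
disentangled = torch.tensor([0.95, 0.80, 0.85])
loss = progressive_constraint(shared, disentangled)  # only the 2nd RoI is penalized
```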

The implications of this work are multifaceted. Practically, it offers a straightforward yet effective means of enhancing the performance of existing object detection frameworks without the need for substantial architectural overhaul or increased computational burden. Theoretically, it reinforces the importance of task-specific features in neural network design, advocating for spatial disentanglement as a path to superior feature learning in multi-task frameworks.

In terms of future developments, this paper opens several avenues for further research. One potential direction is extending the disentanglement concept beyond the spatial dimension, for example to temporal or contextual cues in video-based object detection. Additionally, integrating TSD with newer neural architectures, or combining it with recent advances such as Transformer models in vision, might reveal further performance gains. Exploring TSD in compact mobile models also suggests possible extensions to resource-constrained environments.

Overall, this contribution provides an insightful perspective on a subtle yet significant problem in object detector design. By challenging the convention of parameter sharing across distinct tasks and providing a concrete solution, the paper contributes valuable knowledge that propels the field forward, laying groundwork for simpler yet more effective detector designs in complex vision applications.