RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder (2010.15831v1)

Published 29 Oct 2020 in cs.CV

Abstract: Existing object detection frameworks are usually built on a single format of object/part representation, i.e., anchor/proposal rectangle boxes in RetinaNet and Faster R-CNN, center points in FCOS and RepPoints, and corner points in CornerNet. While these different representations usually drive the frameworks to perform well in different aspects, e.g., better classification or finer localization, it is in general difficult to combine these representations in a single framework to make good use of each strength, due to the heterogeneous or non-grid feature extraction by different representations. This paper presents an attention-based decoder module, similar to that in the Transformer (Vaswani et al., 2017), to bridge other representations into a typical object detector built on a single representation format, in an end-to-end fashion. The other representations act as a set of key instances to strengthen the main query representation features in the vanilla detectors. Novel techniques are proposed towards efficient computation of the decoder module, including a key sampling approach and a shared location embedding approach. The proposed module is named bridging visual representations (BVR). It can perform in-place, and we demonstrate its broad effectiveness in bridging other representations into prevalent object detection frameworks, including RetinaNet, Faster R-CNN, FCOS and ATSS, where about 1.5-3.0 AP improvements are achieved. In particular, we improve a state-of-the-art framework with a strong backbone by about 2.0 AP, reaching 52.7 AP on COCO test-dev. The resulting network is named RelationNet++. The code will be available at https://github.com/microsoft/RelationNet2.

Citations (61)

Summary

  • The paper introduces a novel Bridging Visual Representations (BVR) module that integrates heterogeneous object detection features.
  • It employs Transformer-inspired attention mechanisms with efficient key sampling and shared location embeddings to optimize computation.
  • It achieves significant performance gains, improving detection AP by 1.5-3.0 points and reaching 52.7 AP on the COCO test-dev benchmark.

Overview of RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder

The paper "RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder" presents RelationNet++, an innovative approach to object detection that seeks to harmonize multiple object representation formats within a single framework. Traditionally, object detection frameworks have been built around specific representations like anchor boxes in RetinaNet or center points in FCOS. Each representation offers distinct advantages, but combining them in a singular framework is challenging due to their inherent heterogeneity.

RelationNet++ introduces an attention-based decoder module, inspired by the Transformer architecture, that integrates multiple representations to enhance a conventional object detector. The module, named Bridging Visual Representations (BVR), strengthens the primary query representation features with key instances drawn from other visual representations. BVR is made computationally efficient through key sampling and a shared location embedding.
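
The core mechanism can be pictured as a cross-attention step in which features from the detector's main representation act as queries and sampled features from auxiliary representations act as keys and values. The PyTorch sketch below is a minimal illustration under assumed tensor shapes and class names; it omits the geometric (relative-location) term of the attention and is not the paper's official implementation.

```python
import torch
import torch.nn as nn

class BVRAttention(nn.Module):
    """Illustrative BVR-style cross-attention (not the official implementation).

    Query features come from the detector's main representation (e.g. anchor
    features in RetinaNet); key/value features come from an auxiliary
    representation (e.g. predicted corner or center points). Dimensions and
    names here are assumptions made for this sketch.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, query_feats: torch.Tensor, key_feats: torch.Tensor) -> torch.Tensor:
        # query_feats: (B, Nq, C) main-representation features
        # key_feats:   (B, Nk, C) sampled auxiliary-representation features
        enhanced, _ = self.attn(query_feats, key_feats, key_feats)
        # Strengthen, rather than replace, the original query features,
        # so the module can be dropped into an existing detection head.
        return query_feats + enhanced
```

Because the output has the same shape as the query input, a block like this can in principle be dropped into an existing detection head without altering the rest of the pipeline, which is consistent with the abstract's statement that the module can perform in-place.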

In quantitative terms, integrating the BVR module into prevalent object detection frameworks, specifically RetinaNet, Faster R-CNN, FCOS, and ATSS, yielded improvements of about 1.5 to 3.0 AP. Applying the module to a state-of-the-art framework with a strong backbone added approximately 2.0 AP, reaching 52.7 AP on the COCO test-dev benchmark.

Key Contributions

  1. Generalized Module for Bridging Representations: The paper introduces a versatile BVR module that integrates heterogeneous visual representations, combining their strengths without disrupting the processing pipeline built around the main representation format.
  2. Efficient Computation Techniques: To keep the attention affordable, a key sampling approach and a shared relative location embedding are introduced, markedly reducing the module's computational cost (see the sketch following this list).
  3. Empirical Advancements: Experiments show that the BVR module consistently improves performance across multiple object detection architectures, making it broadly applicable.
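
As a rough illustration of the second contribution, the sketch below shows two pieces: top-k selection of auxiliary points to serve as attention keys, and a relative-location embedding looked up from a single shared table of discretized offsets rather than computed separately for every query-key pair. Tensor names, the value of k, and the grid parameters are assumptions for the sketch; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

def sample_top_keys(point_scores, point_feats, point_locs, k=50):
    """Keep only the k highest-scoring auxiliary points (e.g. predicted
    corners or centers) as attention keys. Shapes and k=50 are assumptions."""
    # point_scores: (B, N); point_feats: (B, N, C); point_locs: (B, N, 2)
    topk = point_scores.topk(k, dim=1).indices                          # (B, k)
    feats = torch.gather(
        point_feats, 1,
        topk.unsqueeze(-1).expand(-1, -1, point_feats.size(-1)))        # (B, k, C)
    locs = torch.gather(
        point_locs, 1, topk.unsqueeze(-1).expand(-1, -1, 2))            # (B, k, 2)
    return feats, locs


class SharedLocationEmbedding(nn.Module):
    """Embed discretized relative offsets via one shared table, so the
    per-offset computation is reused across all query-key pairs."""

    def __init__(self, dim=64, grid=32, max_offset=128.0):
        super().__init__()
        self.grid, self.max_offset = grid, max_offset
        # One learned embedding per discretized (dx, dy) cell.
        self.table = nn.Embedding((2 * grid + 1) ** 2, dim)

    def forward(self, query_locs, key_locs):
        # Relative offsets between every query and sampled key: (B, Nq, Nk, 2)
        rel = query_locs.unsqueeze(2) - key_locs.unsqueeze(1)
        # Discretize offsets onto the shared grid and flatten to table indices.
        idx = (rel / self.max_offset * self.grid).round().clamp(-self.grid, self.grid).long()
        flat = (idx[..., 0] + self.grid) * (2 * self.grid + 1) + (idx[..., 1] + self.grid)
        return self.table(flat)                                          # (B, Nq, Nk, dim)
```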

Implications and Future Directions

The theoretical implication centers on advancing Transformer-like architectures to facilitate feature enhancement across heterogeneous information domains. Practically, the application of RelationNet++ could substantially improve the accuracy and robustness of object detection systems widely deployed in various industries, including surveillance, autonomous driving, and robotics.

Future research could extend this approach to more diverse datasets and to modeling more complex object interactions by refining the attention mechanisms used for bridging visual representations. Additionally, carrying the representation-bridging idea over to other computer vision tasks, such as semantic or instance segmentation, may yield further advances.

Ultimately, the paper delineates a pathway that could lead future computational models to more efficiently integrate diverse visual data types, enhancing the overall efficacy of computer vision systems.