R-FCN: Object Detection via Region-based Fully Convolutional Networks
The paper "R-FCN: Object Detection via Region-based Fully Convolutional Networks" introduces a novel framework for object detection using region-based, fully convolutional networks (R-FCN). This new approach aims to enhance both the accuracy and computational efficiency of object detectors. Building on the foundation of existing region-based detectors such as Fast R-CNN and Faster R-CNN, the authors propose a fully convolutional architecture that circumvents the computational redundancy inherent in previous methods by sharing almost all computations across the entire image.
Key Concepts and Innovations
- Fully Convolutional Architecture: R-FCN uses a fully convolutional network (FCN) so that nearly all computation is shared across the entire image rather than repeated per region. Because fully convolutional image classifiers such as ResNets favor translation invariance, the paper must reintroduce translation variance: shifting an object inside a candidate box should change its score, a property crucial for localization.
- Position-Sensitive Score Maps: To reconcile translation invariance (desirable for classification) with translation variance (required for detection), the last convolutional layer outputs a bank of k² position-sensitive score maps per category, k²(C + 1) channels in total for C object classes plus background. Each map encodes scores for one relative position of a k × k grid over the object (e.g., "top-left", "bottom-right").
- Position-Sensitive RoI Pooling: On top of this FCN, a position-sensitive Region-of-Interest (RoI) pooling layer divides each RoI into k × k bins and pools each bin selectively, from only the score map dedicated to that bin and class. Because no learnable layers follow the pooling, the per-RoI cost is nearly negligible, which is the source of R-FCN's speed advantage (see the equation and code sketch after this list).
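Concretely, the paper defines the pooled response for class c in bin (i, j) of an RoI whose top-left corner is (x₀, y₀) as an average over only the (i, j)-th score map of that class:

$$
r_c(i, j \mid \Theta) = \frac{1}{n} \sum_{(x,\, y) \in \mathrm{bin}(i, j)} z_{i, j, c}\left(x + x_0,\; y + y_0 \mid \Theta\right)
$$

where z_{i,j,c} is one of the k²(C + 1) score maps, n is the number of pixels in the bin, and Θ denotes the network's learnable parameters. The k² bin responses for each class are then averaged ("voted") into a single (C + 1)-dimensional score vector, followed by a softmax over categories; a sibling branch of 4k² score maps is pooled the same way for class-agnostic bounding-box regression.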
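The sketch below shows this head and pooling step in PyTorch using torchvision's ps_roi_pool operator. The layer widths, k = 3, the toy feature-map size, and the single hand-made RoI are illustrative assumptions; the paper's main results use k = 7 bins on ResNet-101 features.

```python
# Minimal sketch of R-FCN's position-sensitive head and pooling.
# Assumptions: toy tensor shapes, k = 3, RoIs given in feature-map
# coordinates (hence spatial_scale is left at its default of 1.0).
import torch
import torch.nn as nn
from torchvision.ops import ps_roi_pool

C, k = 20, 3                            # 20 object classes + background; 3x3 bins
feat = torch.randn(1, 1024, 38, 50)     # shared backbone feature map (toy size)

# 1x1 conv producing the bank of k*k*(C+1) position-sensitive score maps.
score_head = nn.Conv2d(1024, k * k * (C + 1), kernel_size=1)
score_maps = score_head(feat)           # shape (1, 189, 38, 50)

# One RoI in (batch_index, x1, y1, x2, y2) format.
rois = torch.tensor([[0.0, 4.0, 4.0, 28.0, 20.0]])

# Each k x k output bin is average-pooled from only the score-map group
# dedicated to that bin -- the selective pooling that encodes position.
pooled = ps_roi_pool(score_maps, rois, output_size=k)   # (1, C+1, k, k)

# Vote by averaging the k*k bin scores per class, then softmax.
cls_scores = pooled.mean(dim=(2, 3)).softmax(dim=1)     # (1, C+1)
print(cls_scores.shape)
```

Note that nothing after score_head has learnable parameters, so every RoI reuses the same shared computation; the analogous box-regression branch would simply be a second 1×1 conv with 4k² output channels pooled in the same way.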
Numerical Results
The efficacy of the R-FCN framework is demonstrated through extensive experiments. Using ResNet-101 as the backbone network:
- On the PASCAL VOC 2007 dataset, R-FCN achieves 83.6% mean Average Precision (mAP) at a test-time speed of 170 milliseconds per image, 2.5 to 20 times faster than a Faster R-CNN counterpart of comparable accuracy.
- On the PASCAL VOC 2012 dataset, R-FCN reports an mAP of 82.0%.
These results show that R-FCN retains the positional sensitivity that detection requires while reaping the computational benefits of a fully convolutional design.
Practical and Theoretical Implications
Practically, the R-FCN framework delivers a substantially better speed-accuracy trade-off than conventional region-based detectors, matching their accuracy at a fraction of the per-image cost. This makes it attractive for latency-sensitive applications such as autonomous driving, surveillance, and real-time video analytics.
Theoretically, the framework opens new avenues for combining translation-invariant and translation-variant representations within a fully convolutional setting. Future research could explore similar methodologies in other dense prediction tasks like semantic segmentation and instance segmentation.
Future Directions
Possible future directions for this research include:
- Experimenting with alternative backbone networks to further validate and potentially enhance the generalizability of the R-FCN framework.
- Extending the R-FCN architecture to handle multi-modal data (e.g., combining RGB and depth images) for improved scene understanding in robotics.
- Integrating additional context information and iterative refinement processes to further boost detection performance, particularly for small or occluded objects.
In conclusion, R-FCN represents a significant step toward efficient and accurate object detection, using fully convolutional networks to combine fully shared computation with precise localization. Its strong results and simple design make it a solid foundation for future advances in computer vision.