Feature Pyramid Networks for Object Detection (1612.03144v2)

Published 9 Dec 2016 in cs.CV

Abstract: Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.


Feature Pyramid Networks for Object Detection

The paper "Feature Pyramid Networks for Object Detection" by Tsung-Yi Lin et al. introduces a method that improves object detection by exploiting the inherent pyramidal hierarchy of deep convolutional networks. The approach, called the Feature Pyramid Network (FPN), builds multi-scale feature representations while preserving representational power, speed, and memory efficiency, and so lays groundwork for multi-scale object detection.

Introduction and Motivation

Multi-scale object detection is a core challenge in computer vision. It has traditionally been handled with featurized image pyramids, which rescale the input image and compute features at every scale so that objects of different sizes are detected consistently. These pyramids are computationally expensive and memory-intensive, so recent detectors avoided them and relied on single-scale representations for faster inference. This paper presents an alternative: reuse the feature pyramid that already exists within a ConvNet’s layer hierarchy, and complement it with a top-down architecture with lateral connections. The key is to provide high-level semantics at every scale, combining the accuracy of multi-scale representations with the efficiency of a single-scale input.

Architecture

The FPN architecture enhances deep ConvNets by constructing a feature pyramid that combines low-level, high-resolution feature maps with high-level, low-resolution maps. It employs a top-down pathway to successively upsample high-level feature maps and merge them with the corresponding bottom-up maps via lateral connections. This results in multi-resolution feature maps that are rich in semantic information and spatially precise.

Key components of the feature pyramid construction are:

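Bottom-up pathway: The ConvNet’s ordinary feed-forward computation, whose feature hierarchy is inherently pyramidal, with resolution decreasing as depth increases.
Top-down pathway: Produces higher-resolution features by upsampling spatially coarser but semantically stronger maps from higher pyramid levels.
Lateral connections: Merge each upsampled top-down map with the bottom-up map of the same resolution, improving localization and strengthening semantics at each scale.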
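
The three components above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it follows the paper's description of a 1x1 lateral convolution, 2x nearest-neighbor upsampling, and element-wise addition, but it omits the 3x3 convolution the paper applies to each merged map. The helper names, the random stand-in feature maps, and the small channel widths are assumptions chosen for illustration (the paper uses 256 output channels).

```python
import numpy as np

def lateral_1x1(feature, weight):
    """1x1 convolution, i.e. a per-pixel linear map over channels.
    feature: (C_in, H, W), weight: (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", weight, feature)

def upsample_2x_nearest(feature):
    """Nearest-neighbor upsampling by 2 along both spatial axes."""
    return feature.repeat(2, axis=1).repeat(2, axis=2)

def build_top_down(bottom_up, lateral_weights):
    """bottom_up: feature maps ordered fine -> coarse, each level half the
    resolution of the previous one. Returns merged maps in the same order."""
    # The coarsest level has nothing above it: lateral projection only.
    merged = [lateral_1x1(bottom_up[-1], lateral_weights[-1])]
    # Walk from the second-coarsest level down to the finest.
    for feat, w in zip(reversed(bottom_up[:-1]), reversed(lateral_weights[:-1])):
        top_down = upsample_2x_nearest(merged[0])
        merged.insert(0, lateral_1x1(feat, w) + top_down)
    return merged

# Hypothetical stand-ins for backbone outputs (illustrative sizes only).
rng = np.random.default_rng(0)
d = 8                         # shared pyramid width (assumption; paper: 256)
channels = [16, 32, 64, 128]  # bottom-up channel counts, fine -> coarse
sizes = [32, 16, 8, 4]        # spatial sizes, halving at each level
bottom_up = [rng.standard_normal((c, s, s)) for c, s in zip(channels, sizes)]
lateral_weights = [rng.standard_normal((d, c)) for c in channels]

pyramid = build_top_down(bottom_up, lateral_weights)
shapes = [p.shape for p in pyramid]
print(shapes)  # [(8, 32, 32), (8, 16, 16), (8, 8, 8), (8, 4, 4)]
```

Every output level ends up with the same channel width, which is what allows a single shared head to be applied at all scales.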

Experimental Results

The proposed FPN was evaluated extensively on benchmarks such as COCO, yielding strong numerical results across various tasks:

Region Proposal Networks (RPNs): Placing FPN in the RPN raised Average Recall (AR) by 8.0 points over the single-scale RPN baseline, particularly improving recall on small objects.
Fast R-CNN and Faster R-CNN: Faster R-CNN with FPN improved COCO-style AP by 2.3 points over a strong single-scale Faster R-CNN baseline on ResNets, supporting the value of multi-scale feature representations.
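
For the Fast R-CNN head to use a pyramid, each region of interest (RoI) must be assigned to one pyramid level. The paper does this with k = floor(k0 + log2(sqrt(w*h) / 224)), where w and h are the RoI's width and height, 224 is the canonical ImageNet pre-training size, and k0 = 4. The sketch below implements that formula; the function name is hypothetical, and clamping the result to levels 2 through 5 is an assumption borrowed from common implementations rather than something stated in this summary.

```python
import math

def roi_to_pyramid_level(width, height, k0=4, canonical_size=224,
                         k_min=2, k_max=5):
    """Map an RoI of the given size (in input-image pixels) to a level k.
    Smaller RoIs go to finer (lower-numbered) levels."""
    k = math.floor(k0 + math.log2(math.sqrt(width * height) / canonical_size))
    # Assumption: clamp to the levels that actually exist in the pyramid.
    return max(k_min, min(k_max, k))

levels = [roi_to_pyramid_level(s, s) for s in (32, 112, 224, 448, 1000)]
print(levels)  # [2, 3, 4, 5, 5]
```

An RoI at the canonical 224x224 size lands on level 4, and halving its side length moves it one level finer, which mirrors how an image pyramid would be used.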

Ablation studies confirmed the importance of the top-down pathway and lateral connections for maintaining high-level semantics and precise localization across feature maps.

Extensions to Instance Segmentation

Beyond object detection, FPNs were also adapted to generate segmentation proposals. The approach simplifies fully convolutional mask prediction, trains the networks end-to-end, and surpasses previous state-of-the-art results for segmentation proposals by a significant margin.

Implications and Future Directions

From a practical perspective, FPNs enable efficient, scalable object detection and segmentation without computationally expensive image pyramids. The research invites new architectures and better methods for integrating feature pyramids within deep ConvNets, with the aim of faster and more accurate multi-scale detection and segmentation in speed-sensitive settings (the abstract reports 5 FPS on a GPU).

From a theoretical standpoint, the work shows that multi-scale representations remain necessary even as ConvNets grow deeper and more semantically robust. The fusion of high-level semantics with spatially detailed features may inspire further work on optimizing lateral connections and improving top-down pathways.

Conclusion

The proposed Feature Pyramid Networks represent a significant advance in object detection and segmentation by efficiently exploiting the multi-scale nature of deep ConvNets. The approach has implications both for practical applications and for future research on multi-scale feature extraction networks that further improve the accuracy and efficiency of computer vision systems.


Authors

Tsung-Yi Lin
Piotr Dollár
Ross Girshick
Kaiming He
Bharath Hariharan
Serge Belongie
