Scale-Equalizing Pyramid Convolution for Object Detection (2005.03101v1)

Published 6 May 2020 in cs.CV

Abstract: Feature pyramid has been an efficient method to extract features at different scales. Development over this method mainly focuses on aggregating contextual information at different levels while seldom touching the inter-level correlation in the feature pyramid. Early computer vision methods extracted scale-invariant features by locating the feature extrema in both spatial and scale dimensions. Inspired by this, a convolution across the pyramid level is proposed in this study, which is termed pyramid convolution and is a modified 3-D convolution. Stacked pyramid convolutions directly extract 3-D (scale and spatial) features and outperform other meticulously designed feature fusion modules. Based on the viewpoint of 3-D convolution, an integrated batch normalization that collects statistics from the whole feature pyramid is naturally inserted after the pyramid convolution. Furthermore, we also show that the naive pyramid convolution, together with the design of RetinaNet head, actually best applies for extracting features from a Gaussian pyramid, whose properties can hardly be satisfied by a feature pyramid. In order to alleviate this discrepancy, we build a scale-equalizing pyramid convolution (SEPC) that aligns the shared pyramid convolution kernel only at high-level feature maps. Being computationally efficient and compatible with the head design of most single-stage object detectors, the SEPC module brings significant performance improvement ($>4$ AP increase on MS-COCO 2017 dataset) in state-of-the-art one-stage object detectors, and a light version of SEPC also has $\sim3.5$ AP gain with only around 7% inference time increase. The pyramid convolution also functions well as a stand-alone module in two-stage object detectors and is able to improve the performance by $\sim2$ AP. The source code can be found at https://github.com/jshilong/SEPC.

Scale-Equalizing Pyramid Convolution for Object Detection

The paper "Scale-Equalizing Pyramid Convolution for Object Detection" introduces a convolution that operates across the levels of a feature pyramid, targeting the scale variation that makes object detection difficult. The authors propose Pyramid Convolution (PConv) and Scale-Equalizing Pyramid Convolution (SEPC) to model the inter-level correlation in feature pyramids, which earlier feature-fusion designs seldom address.

Methodology Overview

The authors treat the feature pyramid as a 3-D volume with two spatial dimensions and one scale dimension, and apply a 3-D convolution to it so that scale-correlated features are extracted directly. A standard 3-D convolution cannot be used as-is, because feature maps at different pyramid levels have different spatial sizes. PConv handles this mismatch by adjusting the stride of the convolutional kernel according to the level it reads from. With level-dependent strides, the contributions from neighboring levels land on the same spatial grid and can be summed, which is what lets the convolution correlate features across scales.
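The stride adjustment described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes 1-D feature maps instead of 2-D ones, a pyramid whose levels halve in length, a scale kernel that spans three adjacent levels, and nearest-neighbor upsampling applied after the convolution on the coarser level. The names `conv1d`, `upsample2`, `pyramid_conv`, `w_fine`, `w_same`, and `w_coarse` are hypothetical.

```python
def conv1d(x, w, stride=1):
    """1-D correlation with zero padding, so stride 1 preserves the length."""
    pad = len(w) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(w[k] * padded[i + k] for k in range(len(w)))
        for i in range(0, len(x), stride)
    ]


def upsample2(x):
    """Nearest-neighbor upsampling by a factor of 2 (an assumed choice)."""
    return [v for v in x for _ in range(2)]


def pyramid_conv(levels, w_fine, w_same, w_coarse):
    """Sketch of one pyramid convolution over a list of 1-D feature maps.

    levels[0] is the finest map; each following level is half as long.
    Every output level sums three terms that share its spatial size:
      - the same level, convolved with stride 1
      - the finer level below, convolved with stride 2
      - the coarser level above, convolved and then upsampled by 2
    """
    out = []
    for l, x in enumerate(levels):
        y = conv1d(x, w_same)
        if l > 0:
            below = conv1d(levels[l - 1], w_fine, stride=2)
            y = [a + b for a, b in zip(y, below)]
        if l + 1 < len(levels):
            above = upsample2(conv1d(levels[l + 1], w_coarse))
            y = [a + b for a, b in zip(y, above)]
        out.append(y)
    return out


# Toy pyramid with lengths 8, 4, 2 and identity kernels on every branch.
identity = [0.0, 1.0, 0.0]
pyramid = [[1.0] * 8, [1.0] * 4, [1.0] * 2]
result = pyramid_conv(pyramid, identity, identity, identity)
print([len(level) for level in result])
```

Each output level keeps the spatial size of its input level, so the operation can be stacked. The bottom and top levels have only one neighbor each and therefore sum two terms instead of three.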

Contributions and Findings

Pyramid Convolution (PConv):
- PConv convolves explicitly along the scale dimension, so features at neighboring pyramid levels interact directly. Earlier feature-fusion modules seldom modeled this inter-level correlation.
Simplified and Efficient Design:
- PConv extracts scale and spatial features in a single operation and is compatible with existing single-stage object detector heads, notably RetinaNet, while remaining computationally efficient.
Scale-Equalizing Pyramid Convolution (SEPC):
- SEPC is developed to alleviate the discrepancy between the feature pyramid produced by a deep network and a Gaussian pyramid, which is the structure the naive PConv is best suited to. It spatially deforms, and thereby aligns, the shared pyramid convolution kernel only at the high-level feature maps. This gives more consistent feature alignment across scales and improves detection accuracy.
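The abstract also describes an integrated batch normalization that collects statistics from the whole feature pyramid instead of from each level separately. The sketch below shows that idea for a single channel, under simplifying assumptions: each level is a flat list of activations, the learnable affine parameters and running statistics of a real batch-normalization layer are omitted, and the name `integrated_batch_norm` is hypothetical.

```python
import math


def integrated_batch_norm(levels, eps=1e-5):
    """Normalize all levels with one mean and variance pooled over the pyramid.

    levels: list of flat lists holding one channel's activations per level.
    """
    values = [v for level in levels for v in level]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)
    # The same mean and scale are applied to every level.
    return [[(v - mean) * scale for v in level] for level in levels]


# Two levels with different sizes and different activation ranges.
pyramid = [[0.0, 2.0, 4.0, 6.0], [7.0, 11.0]]
normalized = integrated_batch_norm(pyramid)
print(normalized)
```

Because the statistics are shared, the pyramid as a whole has zero mean and roughly unit variance after normalization, while an individual level may not. In this toy input the first level ends up below zero on average and the second above it.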

Numerical Results and Comparisons

The paper's numerical results show notable improvements in detection performance:

- With SEPC, state-of-the-art one-stage detectors gain more than 4 AP on the MS-COCO 2017 dataset.
- A light version of SEPC gains about 3.5 AP with only around a 7% increase in inference time.
- Used as a stand-alone module in two-stage object detectors, pyramid convolution improves performance by about 2 AP, which shows that the approach is not limited to one-stage designs.

Implications and Future Work

PConv and SEPC show that modeling the scale dimension directly can improve feature extraction for objects of varying size. Their computational efficiency and their compatibility with existing detector heads make them a practical addition to a broad range of detection networks. Future work could refine the scale-equalizing step, evaluate the approach on more complex and varied datasets, or extend the idea to tasks beyond object detection.

The paper's main contribution is its demonstration that explicit inter-scale interaction improves state-of-the-art detectors. Because real-world data routinely contains objects at many scales, approaches like SEPC are likely to remain relevant for building more robust and accurate detection systems.