- The paper introduces Adaptively Spatial Feature Fusion (ASFF) to dynamically fuse multi-scale features in single-shot detectors, enhancing their scale invariance.
- The methodology employs feature resizing and adaptive fusion with softmax-based weights to seamlessly integrate pyramidal features.
- Experiments on MS COCO demonstrate significant improvements in accuracy (38.1% AP at 60 FPS and 43.9% AP at 29 FPS) with minimal computational overhead.
Learning Spatial Fusion for Single-Shot Object Detection
The paper under consideration introduces a novel approach to enhance single-shot object detectors by addressing the scale-invariance challenges associated with feature pyramid structures. The proposed method, Adaptively Spatial Feature Fusion (ASFF), is designed to resolve the inconsistencies across feature scales, which is a known bottleneck for single-shot detectors such as YOLOv3.
Problem Statement
Scale variation remains a significant obstacle in object detection, particularly for models that rely on pyramidal feature representations. These models aggregate multi-scale features, but the heuristic that assigns each object to a single pyramid level creates an inconsistency: a region treated as a positive sample at one level is treated as background at the corresponding positions of the other levels, so the levels receive conflicting training signals. This inconsistency degrades detection performance, especially in real-time applications that cannot afford the computational cost of multi-scale image pyramids.
Proposed Solution: Adaptively Spatial Feature Fusion (ASFF)
ASFF is introduced as a data-driven strategy for fusing pyramidal features. It dynamically learns to spatially filter out conflicting information, thereby improving the scale-invariance of the features. The method is distinctive in adaptively determining the spatial importance of each pyramid level while adding only minimal computational overhead. ASFF operates in two steps:
- Feature Resizing: Bringing features from the other pyramid levels to the resolution and channel count of the target level (upsampling coarser maps with interpolation, downsampling finer maps with strided convolutions), so that the levels can be combined element-wise.
- Adaptive Fusion: Learning a spatial weight map for each resized level so that, at every position, only the most relevant level contributes strongly to the final prediction. The weights are produced by a softmax across levels, which keeps them non-negative and summing to one and leaves the fusion fully differentiable (a code sketch of both steps follows this list).
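To make the two steps concrete, below is a minimal PyTorch sketch of ASFF-style fusion for a single output level. It implements the paper's fusion rule, in which the fused value at each position (i, j) of the target level is y_ij = α_ij · x1_ij + β_ij · x2_ij + γ_ij · x3_ij, where x1, x2, x3 are the three levels' features after resizing and the softmax constraint enforces α_ij + β_ij + γ_ij = 1. The module name ASFFSketch, the channel count, and the use of bilinear interpolation for all resizing (the paper uses interpolation only for upsampling, with strided convolutions or max-pooling for downsampling) are simplifying assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFSketch(nn.Module):
    """Fuse three pyramid levels into one map at the first level's resolution."""

    def __init__(self, channels=256):
        super().__init__()
        # One 1x1 conv per level predicts a single-channel importance map.
        self.weight_convs = nn.ModuleList(
            nn.Conv2d(channels, 1, kernel_size=1) for _ in range(3)
        )

    def forward(self, feats):
        # Step 1: feature resizing. Bring every level to the target
        # resolution; bilinear interpolation here stands in for the paper's
        # interpolation (upsampling) and strided convolutions (downsampling).
        target_size = feats[0].shape[-2:]
        resized = [
            f if f.shape[-2:] == target_size
            else F.interpolate(f, size=target_size, mode="bilinear",
                               align_corners=False)
            for f in feats
        ]
        # Step 2: adaptive fusion. Predict one scalar map per level, then
        # softmax across levels so the three weights are positive and sum
        # to one at every spatial position.
        logits = torch.cat(
            [conv(f) for conv, f in zip(self.weight_convs, resized)], dim=1
        )                                          # shape (N, 3, H, W)
        weights = torch.softmax(logits, dim=1)
        return sum(weights[:, i:i + 1] * resized[i] for i in range(3))

# Three levels with a shared channel count but different resolutions.
feats = [torch.randn(1, 256, 64, 64),
         torch.randn(1, 256, 32, 32),
         torch.randn(1, 256, 16, 16)]
fused = ASFFSketch(256)(feats)                     # -> (1, 256, 64, 64)
```

Because the softmax makes the weights differentiable functions of the features, the fusion is trained end-to-end with the detector, and at inference it adds only a few 1x1 convolutions and a softmax per output level.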
Experimental Evaluation
The paper validates ASFF on the MS COCO dataset. Building on a strengthened YOLOv3 baseline, the experiments show substantial accuracy gains, reporting 38.1% AP at 60 FPS and up to 43.9% AP at 29 FPS. The ASFF-enhanced model outperforms comparable real-time detectors, achieving a strong speed-accuracy trade-off without significant computational burden.
Relation to Prior Work
The paper situates its contribution within the broader landscape of multi-scale feature processing, contrasting ASFF with methods such as SSD, FPN, and NAS-FPN. Unlike predecessors that rely on hand-designed fusion structures or incur heavy computational costs, ASFF learns the fusion directly from data while keeping the architecture lightweight.
Implications and Future Directions
The ASFF approach advances both the practical and theoretical understanding of object detection frameworks by introducing a feature fusion method that optimizes performance without hindering real-time detection capabilities. Future research could explore extending ASFF to other detection architectures or integrating it with emerging neural network structures that focus on different vision tasks.
The contribution of ASFF underscores the potential of adaptive methods for feature fusion, paving the way for more efficient and accurate vision systems. Approaches of this kind could further refine how scale and spatial information are managed within neural networks, broadening their applicability across real-world scenarios.