- The paper presents a novel approach by shifting from densely generated object candidates to a fixed, sparse set of learnable proposals.
- It introduces a dynamic instance interactive head and an iterative refinement process to enhance detection accuracy while reducing computational overhead.
- Empirical results on benchmarks like COCO and CrowdHuman show that Sparse R-CNN achieves competitive performance without the need for traditional NMS post-processing.
An Analysis of "Sparse R-CNN: End-to-End Object Detection with Learnable Proposals"
The paper "Sparse R-CNN: End-to-End Object Detection with Learnable Proposals" introduces an innovative approach for object detection, challenging the prevailing dense methods that rely on extensive object candidate generation. Authored by a multi-institutional research team, this work proposes Sparse R-CNN, a method that leverages a fixed set of learned object proposals for classification and location tasks, significantly departing from conventional dense-prior object detectors.
Core Contributions
Sparse R-CNN's primary contribution is its move from densely enumerated object candidates to a small, fixed set of learnable proposals. One-stage detectors such as RetinaNet classify hundreds of thousands of anchors tiled over dense image grids, while two-stage detectors such as Faster R-CNN first filter those dense candidates into a sparse set of region proposals; both rely on non-maximum suppression (NMS) post-processing and hand-crafted label-assignment heuristics. Sparse R-CNN instead operates on a predetermined, learned set of object proposals (e.g., 100), removing the need for both hand-designed object candidates and NMS post-processing.
Methodology
Sparse R-CNN comprises several key components:
- Learnable Proposals:
- A small, fixed set of learnable proposal boxes, each parameterized by four normalized coordinates (center, height, and width), replaces the dense candidates produced by a Region Proposal Network (RPN). These boxes are optimized by back-propagation during training and prove robust to the choice of initialization.
- Dynamic Instance Interactive Head:
- RoI features extracted via RoIAlign are processed by a head whose parameters are generated per proposal: each proposal's feature vector is projected into instance-specific filters that interact with that proposal's RoI features. This conditional (dynamic) design makes the head more flexible and accurate than a shared, static head.
- Iterative Architecture:
- The model stacks several refinement stages: each stage takes the previous stage's boxes and object features as input and predicts refined boxes and updated features, improving localization step by step. This iterative refinement, combined with self-attention over object features, significantly enhances detection performance.
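To make the first component concrete, here is a minimal NumPy sketch of the proposal-box initializations the paper ablates. In the actual model these boxes are learnable parameters updated by back-propagation; the function below only illustrates the four starting configurations (the function name and box sizes for the grid scheme are illustrative choices, not taken from the paper).

```python
import numpy as np

def init_proposals(num=100, scheme="center", seed=0):
    """Initialize `num` proposal boxes as normalized (cx, cy, w, h).

    Sketches the four initialization schemes ablated in the paper:
    center, image, grid, random. In Sparse R-CNN these boxes are
    learnable parameters; here they are plain arrays for illustration.
    """
    rng = np.random.default_rng(seed)
    if scheme == "center":
        # all boxes start at the image center, half the image size
        boxes = np.tile([0.5, 0.5, 0.5, 0.5], (num, 1))
    elif scheme == "image":
        # every box initially covers the whole image
        boxes = np.tile([0.5, 0.5, 1.0, 1.0], (num, 1))
    elif scheme == "random":
        boxes = rng.uniform(0.0, 1.0, size=(num, 4))
    elif scheme == "grid":
        # box centers tiled on a regular grid, fixed illustrative size
        side = int(np.ceil(np.sqrt(num)))
        xs, ys = np.meshgrid(np.linspace(0.1, 0.9, side),
                             np.linspace(0.1, 0.9, side))
        centers = np.stack([xs.ravel(), ys.ravel()], axis=1)[:num]
        boxes = np.concatenate([centers, np.full((num, 2), 0.2)], axis=1)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return boxes.astype(np.float32)

proposals = init_proposals(100, "center")  # shape (100, 4)
```

Because training moves the boxes wherever the data demands, these very different starting points end up with nearly identical accuracy, which is the robustness result reported in the ablations.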
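The dynamic instance interaction can be sketched in the same spirit: a proposal's feature vector is projected into two instance-specific weight matrices that filter that proposal's RoI features. The random projection matrices below stand in for learned linear layers, and the shapes and hidden size are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def dynamic_interaction(roi_feat, prop_feat, hidden=4):
    """One dynamic instance interaction (sketch).

    roi_feat:  (S*S, C) flattened RoIAlign features for one proposal
    prop_feat: (C,)     that proposal's feature vector

    The proposal feature is projected into two per-instance weight
    matrices ("dynamic parameters") that filter the RoI features.
    Random matrices stand in for the learned projection layers.
    """
    C = prop_feat.shape[0]
    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((C, C * hidden)) / np.sqrt(C)
    W2 = rng.standard_normal((C, hidden * C)) / np.sqrt(C)
    # instance-specific parameters generated from this proposal alone
    params1 = (prop_feat @ W1).reshape(C, hidden)
    params2 = (prop_feat @ W2).reshape(hidden, C)
    x = np.maximum(roi_feat @ params1, 0.0)   # (S*S, hidden), ReLU
    x = np.maximum(x @ params2, 0.0)          # (S*S, C)
    return x.mean(axis=0)                     # pooled object feature
```

The key point is that each proposal filters its own RoI features with parameters conditioned on itself, rather than sharing one static head across all proposals.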
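Finally, the iterative architecture amounts to a cascade: each stage consumes the previous stage's boxes and object features and emits box deltas. The toy stage function below (which just halves the gap to a fixed target box) is purely illustrative; a real stage would run the dynamic head and prediction layers.

```python
import numpy as np

def refine(boxes, obj_feats, stage_fn, num_stages=6):
    """Iterative refinement sketch: each stage reuses the previous
    stage's boxes and object features, predicting box deltas that
    nudge the proposals toward the objects."""
    for _ in range(num_stages):
        deltas, obj_feats = stage_fn(boxes, obj_feats)
        boxes = boxes + deltas   # refined boxes feed the next stage
    return boxes, obj_feats

# toy stage: shrink the gap to a fixed 'target' box by half each step
target = np.array([0.4, 0.4, 0.2, 0.3])

def toy_stage(boxes, feats):
    return (target - boxes) * 0.5, feats

boxes0 = np.tile([0.5, 0.5, 0.5, 0.5], (100, 1))  # center-initialized
boxes6, _ = refine(boxes0, None, toy_stage)        # close to target
```

Passing the object features forward (rather than recomputing them) is what the paper's ablation identifies as a major source of the cascade's gains.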
Empirical Performance
Sparse R-CNN demonstrates competitive performance across major benchmarks:
- COCO Dataset:
- Achieved 45.0 AP with a ResNet-50 FPN model in a 3x training schedule, running at 22 fps.
- Converged roughly 10x faster than DETR (a far shorter training schedule) and performed better on small objects (26.7 AP vs. DETR's 22.5 AP).
- CrowdHuman Dataset:
- Outperformed mainstream detectors such as Faster R-CNN, RetinaNet, and even end-to-end detectors like DETR and Deformable DETR.
- Achieved 89.2 AP and 48.3 mMR (log-average miss rate; lower is better) without NMS, demonstrating efficacy in highly crowded scenes.
Implementation Insights
Several critical aspects of Sparse R-CNN highlight its innovative nature:
- The four proposal initialization schemes tested (center, image, grid, random) yield nearly identical final accuracy, indicating the learning process is robust to initialization.
- Ablations on proposal interaction show that the dynamic head outperforms plain multi-head attention for processing proposal features.
- Reusing object features across stages and adding a self-attention module both improve proposal refinement, driving further performance gains.
Future Implications
Sparse R-CNN's purely sparse approach simplifies the object detection pipeline, advocating for a shift away from densely populated candidates towards intelligently learned proposals. This not only reduces computational overhead but also alleviates the complexities associated with NMS and dense candidate design. Future extensions could explore the integration of more sophisticated feature extraction techniques, hybrid CNN-Transformer architectures, and self-supervised pre-training paradigms, potentially broadening the scope and applicability of Sparse R-CNN in various contexts.
In conclusion, Sparse R-CNN signals a pivotal shift in object detection research. Its use of sparse, learned proposals and dynamic instance interaction presents a viable blueprint for the next generation of streamlined, efficient object detectors, setting a benchmark for both academic research and practical deployment in computer vision tasks.