What Makes for End-to-End Object Detection? (2012.05780v2)

Published 10 Dec 2020 in cs.CV

Abstract: Object detection has recently achieved a breakthrough for removing the last one non-differentiable component in the pipeline, Non-Maximum Suppression (NMS), and building up an end-to-end system. However, what makes for its one-to-one prediction has not been well understood. In this paper, we first point out that one-to-one positive sample assignment is the key factor, while, one-to-many assignment in previous detectors causes redundant predictions in inference. Second, we surprisingly find that even training with one-to-one assignment, previous detectors still produce redundant predictions. We identify that classification cost in matching cost is the main ingredient: (1) previous detectors only consider location cost, (2) by additionally introducing classification cost, previous detectors immediately produce one-to-one prediction during inference. We introduce the concept of score gap to explore the effect of matching cost. Classification cost enlarges the score gap by choosing positive samples as those of highest score in the training iteration and reducing noisy positive samples brought by only location cost. Finally, we demonstrate the advantages of end-to-end object detection on crowded scenes. The code is available at: https://github.com/PeizeSun/OneNet.

Summary

The paper establishes that one-to-one positive sample assignment is crucial for end-to-end object detection, reducing redundant predictions.
It introduces classification cost in matching to widen the score gap and eliminate the need for Non-Maximum Suppression.
Experiments on crowded-scene data such as CrowdHuman support this approach: end-to-end versions of existing detectors improve on their NMS-based counterparts.

An Overview of "What Makes for End-to-End Object Detection?"

The paper "What Makes for End-to-End Object Detection?" addresses a pivotal topic in computer vision: moving object detection from traditional pipelines, which depend on a non-differentiable post-processing step, Non-Maximum Suppression (NMS), to truly end-to-end systems whose raw outputs are already the final, non-redundant predictions.

Core Contributions

The paper posits that one-to-one positive sample assignment, in which each ground-truth object is matched to exactly one prediction during training, is vital for end-to-end object detection. Previous methods relied on one-to-many assignment, which trains several predictions to fire on the same object and therefore yields redundant predictions at inference. One-to-one assignment alone is not sufficient, however: the paper identifies classification cost in the matching cost as the key ingredient that lets these systems produce non-redundant predictions without NMS.
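The effect of the matching cost can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the L1 location cost, the `1 - score` classification cost, and the weights `w_cls` and `w_loc` are all assumptions chosen for illustration, and only a single ground-truth object is handled. With location cost alone, the best-located sample becomes the positive even if its score is low; adding classification cost shifts the choice to a well-located sample that already scores highly.

```python
def l1_distance(box_a, box_b):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def min_cost_assign(pred_scores, pred_boxes, gt_box, gt_label, w_cls=1.0, w_loc=1.0):
    """Return the index of the single positive sample for one ground-truth object.

    Hypothetical cost: w_cls * (1 - score of the ground-truth class)
                       + w_loc * L1 distance to the ground-truth box.
    """
    costs = []
    for scores, box in zip(pred_scores, pred_boxes):
        cls_cost = 1.0 - scores[gt_label]
        loc_cost = l1_distance(box, gt_box)
        costs.append(w_cls * cls_cost + w_loc * loc_cost)
    # One-to-one: exactly one sample (the cheapest) is assigned as positive.
    return min(range(len(costs)), key=costs.__getitem__)


# Boxes are (cx, cy, w, h) in normalized coordinates; one class (index 0).
gt_box = (0.50, 0.50, 0.20, 0.40)
pred_boxes = [(0.50, 0.50, 0.20, 0.40), (0.51, 0.50, 0.20, 0.40)]
pred_scores = [[0.10], [0.90]]

# Location cost only: the perfectly located but low-scoring sample wins.
loc_only = min_cost_assign(pred_scores, pred_boxes, gt_box, 0, w_cls=0.0)
# Location + classification cost: the high-scoring neighbour wins.
with_cls = min_cost_assign(pred_scores, pred_boxes, gt_box, 0, w_cls=1.0)
print(loc_only, with_cls)
```

With several ground-truth objects, a global matching step is needed so that no prediction is assigned twice; DETR, for instance, uses Hungarian matching for this.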

Empirical Analysis

Through a systematic empirical analysis of both non-end-to-end detectors (e.g., RetinaNet, CenterNet, and FCOS) and end-to-end detectors (such as DETR, Deformable DETR, and Sparse R-CNN), the paper reveals:

Redundant Predictions: Traditional non-end-to-end detectors still produce redundant predictions when trained with one-to-one assignment, as long as the matching cost considers location only.
Introduction of Classification Cost: Adding classification cost to the location cost in the matching step is enough for previous non-end-to-end detectors to produce one-to-one predictions at inference.
Score Gap: The "score gap" is the difference between the highest classification score and the scores of the other candidate predictions. It must be large enough that only one prediction per object survives, otherwise redundancy remains.
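To make the score gap concrete, the sketch below measures it as the top score minus the runner-up among the candidate predictions for one object. This definition, the helper names, the example scores, and the 0.3 score threshold are illustrative assumptions rather than values from the paper. A small gap lets two candidates pass the threshold (a duplicate detection), while a large gap leaves exactly one.

```python
def score_gap(scores):
    """Top score minus the runner-up among candidates for one object."""
    ranked = sorted(scores, reverse=True)
    return ranked[0] - ranked[1]


def surviving(scores, threshold):
    """Indices of candidates that would be kept by a plain score threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


# Hypothetical scores of three candidate predictions around the same object.
small_gap_scores = [0.62, 0.58, 0.10]   # e.g. matching with location cost only
large_gap_scores = [0.85, 0.08, 0.05]   # e.g. matching with classification cost too

for name, scores in [("small gap", small_gap_scores), ("large gap", large_gap_scores)]:
    print(name, round(score_gap(scores), 2), surviving(scores, threshold=0.3))
```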

Theoretical Implications

The paper supports these findings with a theoretical analysis of how one-to-one sample assignment converges when classification cost is included. Using the perceptron update rule in a linear setting, the analysis shows that choosing the positive sample by classification cost can significantly widen the score gap and lead to stable, end-to-end predictions.
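The intuition can be sketched with a toy perceptron. This is not the paper's derivation: the two orthogonal feature vectors, the alternating choice that stands in for noisy location-only assignment, and all function names are assumptions made for illustration. When the positive sample flips between iterations, the perceptron keeps undoing its own updates and the score gap collapses; when the positive is always the current highest-scoring sample, the labels stay consistent and training settles with a clear gap.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train(features, choose_positive, iterations=10):
    """Perceptron training where one sample per iteration is labelled positive.

    Returns (number of weight updates, final score gap between top two samples).
    """
    w = [0.0] * len(features[0])
    updates = 0
    for step in range(iterations):
        scores = [dot(w, x) for x in features]
        positive = choose_positive(step, scores)
        for index, x in enumerate(features):
            y = 1.0 if index == positive else -1.0
            # Perceptron rule: update only on a mistake (or a zero margin).
            if y * dot(w, x) <= 0.0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updates += 1
    final = sorted((dot(w, x) for x in features), reverse=True)
    return updates, final[0] - final[1]


# Two candidate samples for the same object (hypothetical features).
features = [(1.0, 0.0), (0.0, 1.0)]


def noisy_location_choice(step, scores):
    """Stand-in for location-only matching whose winner flips every iteration."""
    return step % 2


def highest_score_choice(step, scores):
    """Stand-in for classification cost: the current top-scoring sample is positive."""
    return max(range(len(scores)), key=scores.__getitem__)


noisy_updates, noisy_gap = train(features, noisy_location_choice)
stable_updates, stable_gap = train(features, highest_score_choice)
print(noisy_updates, noisy_gap)
print(stable_updates, stable_gap)
```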

Results in Crowded Scenes

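The paper tests its framework on crowded-scene data, such as the CrowdHuman dataset, where end-to-end versions of existing detectors show improved performance. This matters because NMS fundamentally struggles with crowding: when two real objects overlap heavily, NMS cannot tell a duplicate prediction from a correct detection of the neighbouring object and may suppress the latter. An end-to-end detector avoids this failure mode because it has no suppression step.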
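The failure mode of NMS under crowding can be reproduced with a few lines of standard greedy NMS. The boxes, scores, and the 0.5 IoU threshold below are made-up illustrative values, not data from CrowdHuman. Two people standing close together produce two correct boxes with an IoU of about 0.54, so the lower-scoring one is suppressed even though it is not a duplicate.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep boxes in descending score order, dropping any box whose IoU
    with an already kept box exceeds the threshold."""
    order = sorted(range(len(boxes)), key=scores.__getitem__, reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two distinct, heavily overlapping objects, each detected correctly.
boxes = [(0.0, 0.0, 10.0, 20.0), (3.0, 0.0, 13.0, 20.0)]
scores = [0.9, 0.8]

kept = greedy_nms(boxes, scores)
print(round(iou(boxes[0], boxes[1]), 3), kept)  # only one of the two objects survives
```

Raising the IoU threshold would keep both boxes here, but it would also let more true duplicates through elsewhere, which is the trade-off that one-to-one prediction sidesteps.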

Implications and Future Directions

The implications of this work are multifaceted. Practically, it suggests that current object detectors can be transformed into end-to-end systems through relatively straightforward modifications in training procedures, specifically the integration of classification cost. Theoretically, it challenges the longstanding emphasis on location-based cost in prior sample assignment methodologies, prompting a reevaluation of the criteria for effective object detection training.

Future research could focus on optimizing the training process further, potentially extending these findings to other domains within computer vision. Additionally, exploring how these principles can be adapted or enhanced within multi-object scenarios or different network architectures could provide considerable advancements.

In conclusion, the paper pinpoints the essential changes needed for true end-to-end object detection, namely one-to-one assignment with classification cost in the matching, and lays a foundation for simpler detection systems that do not depend on NMS.

GitHub

GitHub - PeizeSun/OneNet: [ICML2021] What Makes for End-to-End Object Detection (650 stars)