End-to-End Object Detection with Fully Convolutional Network (2012.03544v3)

Published 7 Dec 2020 in cs.CV and cs.AI

Abstract: Mainstream object detectors based on the fully convolutional network have achieved impressive performance. However, most of them still need a hand-designed non-maximum suppression (NMS) post-processing step, which impedes fully end-to-end training. In this paper, we analyze the effect of discarding NMS, and the results reveal that a proper label assignment plays a crucial role. To this end, for fully convolutional detectors, we introduce a Prediction-aware One-To-One (POTO) label assignment for classification to enable end-to-end detection, which obtains performance comparable to that with NMS. In addition, a simple 3D Max Filtering (3DMF) is proposed to utilize the multi-scale features and improve the discriminability of convolutions in the local region. With these techniques, our end-to-end framework achieves competitive performance against many state-of-the-art detectors with NMS on the COCO and CrowdHuman datasets. The code is available at https://github.com/Megvii-BaseDetection/DeFCN.

Citations (179)

Summary

The paper introduces a prediction-aware one-to-one (POTO) label assignment that dynamically balances classification and regression for enhanced detection quality.
The fully convolutional network leverages 3D Max Filtering to integrate multi-scale features and suppress duplicate predictions without post-processing.
The model achieves competitive performance against many state-of-the-art detectors that use NMS on the COCO and CrowdHuman datasets, a step toward efficient, end-to-end object detection.

End-to-End Object Detection with Fully Convolutional Network

The paper targets a specific obstacle to end-to-end training in fully convolutional detectors: the hand-designed non-maximum suppression (NMS) step applied as post-processing. The authors introduce a Prediction-aware One-To-One (POTO) label assignment for classification, which makes it possible to remove NMS while maintaining competitive performance.
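To make the one-to-one idea concrete, here is a minimal Python sketch, not the authors' implementation. It assumes a single object class, scores each ground-truth/prediction pair by a weighted geometric mean of classification score and IoU, and picks the one-to-one assignment with the highest total quality. The helper names (`iou`, `quality`, `one_to_one_assign`) and the value `alpha=0.8` are illustrative assumptions, and a brute-force search over permutations stands in for an efficient bipartite matching algorithm. Any further constraints the paper places on candidate locations are omitted.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def quality(score, overlap, alpha=0.8):
    """Weighted geometric mean of classification score and IoU.

    alpha is an illustrative weight, not a value taken from this summary.
    """
    return (score ** (1.0 - alpha)) * (overlap ** alpha)


def one_to_one_assign(gt_boxes, pred_boxes, pred_scores, alpha=0.8):
    """Return {gt_index: pred_index} maximizing the summed quality.

    Brute force over permutations: only suitable for tiny examples, and it
    requires at least as many predictions as ground-truth boxes.
    """
    q = [
        [quality(pred_scores[j], iou(g, pred_boxes[j]), alpha)
         for j in range(len(pred_boxes))]
        for g in gt_boxes
    ]
    best, best_total = None, -1.0
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        total = sum(q[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best, best_total = perm, total
    return dict(enumerate(best))


gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 31), (50, 50, 60, 60)]
pred_scores = [0.6, 0.9, 0.7, 0.8]

assignment = one_to_one_assign(gt_boxes, pred_boxes, pred_scores)
foreground = set(assignment.values())
print(assignment)   # ground truth index -> the single prediction labeled foreground
print(foreground)   # every other prediction is treated as background
```

In this toy input, prediction 1 has the highest classification score but overlaps the first object less precisely than prediction 0, so it ends up as background. Training the classifier against such targets is what discourages duplicate high-scoring boxes, which is the role NMS would otherwise play at inference time.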

Key Contributions

Prediction-aware One-To-One (POTO) Label Assignment: Labels are assigned dynamically according to prediction quality, so that each ground-truth object is matched to exactly one foreground prediction and NMS is no longer needed to remove duplicates. Quality combines classification confidence and regression quality through a weighted geometric mean, and the best-scoring prediction for each object is designated as its foreground sample.
3D Max Filtering (3DMF): This module operates across adjacent scales of a Feature Pyramid Network (FPN), using multi-scale information to improve the discriminability of convolutions in a local region. It is embedded in the network as a differentiable module and helps suppress duplicate predictions locally, which matters for end-to-end detection.
Auxiliary Loss for Improved Representation: To strengthen feature representation when training with POTO, an auxiliary loss based on a one-to-many label assignment is added. It provides additional supervision, which improves the model’s ability to localize and classify objects accurately.
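The max-filtering idea behind 3DMF can be illustrated with a small pure-Python sketch. This is a simplified illustration under stated assumptions, not the module from the paper: it works on plain score maps rather than learned features, uses nearest-neighbor resizing to bring neighboring pyramid levels to a common size (the original implementation may use a different interpolation), and the names `resize_nearest`, `max_filter_3d`, `scale_range`, and `kernel` are hypothetical.

```python
def resize_nearest(fmap, out_h, out_w):
    """Resize a 2D list to (out_h, out_w) with nearest-neighbor sampling."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [fmap[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def max_filter_3d(pyramid, level, scale_range=1, kernel=3):
    """Max over a kernel x kernel spatial window and over adjacent levels.

    pyramid is a list of 2D score maps, one per pyramid level.
    """
    h, w = len(pyramid[level]), len(pyramid[level][0])
    lo = max(0, level - scale_range)
    hi = min(len(pyramid) - 1, level + scale_range)
    # Bring every neighboring level to the spatial size of the target level.
    stack = [resize_nearest(pyramid[k], h, w) for k in range(lo, hi + 1)]
    r = kernel // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = max(
                fm[yy][xx]
                for fm in stack
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
    return out


level0 = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.0],
    [0.1, 0.3, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
level1 = [
    [0.5, 0.0],
    [0.0, 0.0],
]
pyramid = [level0, level1]

filtered = max_filter_3d(pyramid, level=0)
# A position is a local peak if its own score equals the filtered maximum.
peaks = [
    (y, x)
    for y in range(4)
    for x in range(4)
    if level0[y][x] == filtered[y][x]
]
print(peaks)
```

Only the strongest response in each spatial and scale neighborhood equals the filtered value, so nearby duplicates (here the 0.3 and 0.2 responses around the 0.9 peak) can be told apart from the peak itself. In the paper this kind of filtering is part of the network and trained end to end rather than applied as a separate post-processing step.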

Experiments and Results

The proposed framework was evaluated on the COCO and CrowdHuman datasets, where it achieved performance competitive with many state-of-the-art detectors that rely on NMS. On COCO with a ResNeXt-101 backbone, it improved mAP over the baseline FCOS model. Experiments on CrowdHuman demonstrated the framework's robustness in crowded scenes, with improvements in both AP$_{50}$ and mMR (for mMR, lower is better).

Implications and Future Directions

The removal of NMS as a post-processing requirement opens up potential for more streamlined and efficient detector architectures, enhancing adaptability in various deployment scenarios. The implications of this work extend into scenarios where real-time processing and scalability are critical, such as autonomous driving and surveillance systems.

Future work could combine the approach with other components, such as attention mechanisms, or train on larger and more diverse datasets to improve generalization. Studying how POTO interacts with other architectural choices, such as deformable convolutions, could also clarify how to optimize end-to-end performance across object detection tasks.

Conclusion

In summary, the paper combines prediction-aware label assignment with multi-scale max filtering to achieve effective end-to-end object detection without NMS. By keeping the detector fully convolutional, it takes a step toward more efficient and adaptable object detection systems.

Related Papers

FCOS: Fully Convolutional One-Stage Object Detection (2019)
Dense Distinct Query for End-to-End Object Detection (2023)
Object Detection Made Simpler by Eliminating Heuristic NMS (2021)
What Makes for End-to-End Object Detection? (2020)
Learning non-maximum suppression (2017)

GitHub

GitHub - Megvii-BaseDetection/DeFCN: End-to-End Object Detection with Fully Convolutional Network (495 stars)