Detection in Crowded Scenes: One Proposal, Multiple Predictions

Published 20 Mar 2020 in cs.CV | (2003.09163v2)

Abstract: We propose a simple yet effective proposal-based object detector, aiming at detecting highly-overlapped instances in crowded scenes. The key of our approach is to let each proposal predict a set of correlated instances rather than a single one, as in previous proposal-based frameworks. Equipped with new techniques such as EMD Loss and Set NMS, our detector can effectively handle the difficulty of detecting highly overlapped objects. On an FPN-Res50 baseline, our detector can obtain 4.9\% AP gains on the challenging CrowdHuman dataset and 1.0\% $\text{MR}^{-2}$ improvements on the CityPersons dataset, without bells and whistles. Moreover, on less crowded datasets like COCO, our approach can still achieve moderate improvement, suggesting the proposed method is robust to crowdedness. Code and pre-trained models will be released at https://github.com/megvii-model/CrowdDetection.




Summary

The paper introduces a multi-instance prediction framework that allows proposals to predict several overlapping objects, effectively addressing detection challenges in crowded scenes.
It introduces EMD loss, which measures the distance between a proposal's predicted instance set and its ground-truth set, and Set NMS, which avoids suppressing predictions that come from the same proposal so that close objects stay separate.
On an FPN-Res50 baseline, the method gains 4.9\% AP on CrowdHuman and 1.0\% $\text{MR}^{-2}$ on CityPersons, and it still improves moderately on the less crowded COCO dataset.

Essay on "Detection in Crowded Scenes: One Proposal, Multiple Predictions"

The paper "Detection in Crowded Scenes: One Proposal, Multiple Predictions" addresses a significant challenge in the domain of object detection: accurately identifying heavily overlapped instances in crowded scenes. The authors propose a novel approach where each detection proposal is tasked with predicting a set of correlated instances, diverging from traditional single-instance predictions within proposal-based object detectors.

Key Contributions

Multiple Instance Prediction Framework: The primary innovation is to replace the traditional single prediction per proposal with a set of predictions. Each proposal outputs up to $K$ instance predictions ($K=2$ in the experiments), so that one proposal can account for several heavily overlapped objects instead of only one of them.
EMD Loss and Set NMS: The paper introduces the Earth Mover's Distance (EMD) Loss, which scores a proposal's predicted set against its ground-truth set by taking the one-to-one matching between predictions and ground-truth instances with the smallest total classification and regression loss. It also devises Set Non-Maximum Suppression (Set NMS) to address the failure of traditional NMS on heavily overlapped objects: a box is never suppressed by another box predicted from the same proposal, which prevents nearby true instances from being removed in dense scenes.
Refinement Module: To handle the increased risk of false positives associated with multiple predictions per proposal, the authors propose an optional refinement module. This module leverages proposal features alongside instance predictions to deliver a second-pass prediction, which helps in correcting false positives.
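The two techniques above can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the function names `emd_loss` and `set_nms`, the `(box, score, proposal_id)` tuple layout, and the 0.5 IoU threshold are assumptions made for illustration. The per-pair costs passed to `emd_loss` stand in for whatever classification plus regression loss a real detector would compute, and the cost matrix is assumed to be square ($K \times K$).

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def emd_loss(pair_cost):
    """Smallest total cost over all one-to-one prediction/ground-truth matchings.

    pair_cost[k][j] is a hypothetical precomputed cost of assigning
    prediction k to ground-truth slot j (assumed K x K).
    """
    k = len(pair_cost)
    return min(
        sum(pair_cost[i][perm[i]] for i in range(k))
        for perm in permutations(range(k))
    )


def set_nms(dets, iou_thresh=0.5):
    """Greedy NMS that skips suppression between boxes of the same proposal.

    dets is a list of (box, score, proposal_id) tuples (assumed layout).
    """
    kept = []
    for box, score, pid in sorted(dets, key=lambda d: d[1], reverse=True):
        suppressed = any(
            kept_pid != pid and iou(box, kept_box) > iou_thresh
            for kept_box, _, kept_pid in kept
        )
        if not suppressed:
            kept.append((box, score, pid))
    return kept


# Two overlapping boxes from proposal 0, one overlapping box from proposal 1.
dets = [
    ((0, 0, 10, 10), 0.9, 0),
    ((1, 0, 11, 10), 0.8, 0),
    ((0, 1, 10, 11), 0.7, 1),
]
kept = set_nms(dets)
```

In this toy input all three boxes overlap the top-scoring one with IoU above 0.5, so plain greedy NMS would keep only the first. The sketch keeps both boxes from proposal 0 and drops the one from proposal 1. Enumerating permutations is only practical for small $K$, such as the $K=2$ setting mentioned above.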

Experimental Findings

CrowdHuman Dataset Performance:

On the CrowdHuman dataset, the proposed method improves AP by 4.9\% over the FPN-Res50 baseline without additional tricks. The gain is reported to hold across varying crowd densities, indicating that the model adapts well to crowded conditions.

CityPersons and COCO Datasets:

On the CityPersons dataset, the method improves $\text{MR}^{-2}$ by 1.0\% (a lower miss rate is better), and on the COCO dataset, which is far less crowded, it raises AP by 1.0\%. These results suggest that the gains in crowded scenes do not come at the cost of performance in sparse ones.

Implications and Future Directions

The implications of this research are substantial both theoretically and practically. Theoretically, it challenges the single-instance paradigm by proposing a multiple-instance framework, showcasing the benefits of predicting correlated objects. Practically, it opens avenues for improved detector performance in real-world applications where scenes frequently include overlapping objects, such as pedestrian crowds or densely packed urban environments.

Looking forward, future work could explore dynamic $K$ values that adapt to varying object densities and apply the framework to other domains where object overlap is prevalent, such as medical imaging or satellite imagery. Evaluating the approach on more diverse and complex datasets would further test its applicability and expose potential edge cases.

In summary, the paper presents a conceptual and practical advance in object detection for crowded scenes. Predicting multiple instances per proposal, together with the new loss and suppression techniques, is a promising direction that may extend to other settings where objects overlap heavily.
