Progressive End-to-End Object Detection in Crowded Scenes (2203.07669v3)

Published 15 Mar 2022 in cs.CV

Abstract: In this paper, we propose a new query-based detection framework for crowd detection. Previous query-based detectors suffer from two drawbacks: first, multiple predictions will be inferred for a single object, typically in crowded scenes; second, the performance saturates as the depth of the decoding stage increases. Benefiting from the nature of the one-to-one label assignment rule, we propose a progressive predicting method to address the above issues. Specifically, we first select accepted queries prone to generate true positive predictions, then refine the remaining noisy queries according to the previously accepted predictions. Experiments show that our method can significantly boost the performance of query-based detectors in crowded scenes. Equipped with our approach, Sparse RCNN achieves 92.0% AP, 41.4% $\text{MR}^{-2}$ and 83.2% JI on the challenging CrowdHuman \cite{shao2018crowdhuman} dataset, outperforming the box-based method MIP \cite{chu2020detection} that specializes in handling crowded scenarios. Moreover, the proposed method, robust to crowdedness, can still obtain consistent improvements on moderately and slightly crowded datasets like CityPersons \cite{zhang2017citypersons} and COCO \cite{lin2014microsoft}. Code will be made publicly available at https://github.com/megvii-model/Iter-E2EDET.

Summary

The paper presents a progressive query-based detection framework that uses one-to-one label assignment to reduce false positives.
It integrates a prediction selector, local self-attention, and relation extractor to refine noisy queries using high-confidence outputs.
Empirical tests on CrowdHuman, CityPersons, and COCO show significant performance gains over existing methods.

Progressive End-to-End Object Detection in Crowded Scenes

The paper "Progressive End-to-End Object Detection in Crowded Scenes" presents a novel approach to query-based object detection specifically designed to enhance performance in crowded scenes. While existing query-based detectors often predict multiple outputs for a single object and suffer from performance saturation issues as decoding stages deepen, this work introduces a progressive prediction method to address these challenges effectively.

The researchers propose a query-based detection framework built around progressive prediction, which exploits the one-to-one label assignment rule (each ground-truth object is matched to at most one query). The framework first selects accepted queries that are likely to yield true positive predictions, then refines the remaining noisy queries using the predictions already accepted. The method introduces several components: a prediction selector, a relation information extractor, a query updater, and a novel label assignment method.
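To make the one-to-one label assignment rule concrete, the following is a minimal sketch, not the paper's label assignment method. It assumes boxes in `(x1, y1, x2, y2)` form, uses total IoU as the matching objective, and finds the best matching by brute force; the function names `iou` and `one_to_one_assign` are hypothetical. Real detectors typically use a richer matching cost and a polynomial-time solver.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def one_to_one_assign(pred_boxes, gt_boxes):
    """Return {gt_index: pred_index} maximizing total IoU.

    Each ground-truth box receives exactly one prediction and no prediction
    is used twice. Brute force, so only suitable for tiny inputs; assumes
    len(pred_boxes) >= len(gt_boxes). The IoU objective is an assumption.
    """
    best_total, best_perm = -1.0, ()
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        total = sum(iou(pred_boxes[p], gt_boxes[g]) for g, p in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    return dict(enumerate(best_perm))


# Two overlapping people, three predictions (index 1 is a near-duplicate).
gts = [(0, 0, 10, 10), (5, 0, 15, 10)]
preds = [(0, 0, 10, 10), (1, 0, 11, 10), (5, 0, 15, 10)]
assignment = one_to_one_assign(preds, gts)
unmatched = [i for i in range(len(preds)) if i not in assignment.values()]
```

In this toy case the near-duplicate prediction is left unmatched, so under one-to-one assignment it would be trained as background rather than as a second positive for the same person.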

The prediction selector identifies queries that produce high-confidence predictions and marks them as accepted; the remaining queries are treated as noisy. The relation information extractor then uses the accepted predictions to refine the noisy ones, modeling the spatial context between each noisy query and nearby accepted predictions so the detector can better tell true positives from duplicates.
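The accepted/noisy split can be sketched as a simple confidence threshold. This is an illustrative sketch only: the function name `select_predictions` and the default threshold of 0.7 are assumptions made for the example, not values taken from this summary.

```python
def select_predictions(scores, accept_threshold=0.7):
    """Split query indices into accepted and noisy by confidence score.

    accept_threshold is a hypothetical value chosen for illustration.
    """
    accepted = [i for i, s in enumerate(scores) if s >= accept_threshold]
    noisy = [i for i, s in enumerate(scores) if s < accept_threshold]
    return accepted, noisy


# Four queries with their predicted confidence scores.
scores = [0.95, 0.40, 0.72, 0.10]
accepted, noisy = select_predictions(scores)
```

Only the noisy indices would be passed on for refinement; the accepted ones serve as context for that refinement.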

Notably, the query updater employs a local self-attention mechanism, allowing the model to process spatially related neighbors rather than the entire image context, thereby enhancing precision in crowded environments. This component, in conjunction with the label assignment scheme, aims to ensure each object is detected precisely once, reducing false positives and bolstering true positives.
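One way to picture the restriction to spatially related neighbors is a boolean attention mask between noisy queries and accepted predictions. The sketch below is an assumption-laden illustration: it defines a neighbor as an accepted box whose IoU with the noisy box is at least `iou_threshold`, and both that criterion and the 0.4 default are hypothetical choices, as are the function names.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def local_attention_mask(noisy_boxes, accepted_boxes, iou_threshold=0.4):
    """mask[i][j] is True when noisy query i may attend to accepted box j.

    The IoU-based neighbor rule and its threshold are assumptions.
    """
    return [
        [iou(n, a) >= iou_threshold for a in accepted_boxes]
        for n in noisy_boxes
    ]


accepted_boxes = [(0, 0, 10, 10), (100, 0, 110, 10)]
noisy_boxes = [(2, 0, 12, 10), (50, 50, 60, 60)]
mask = local_attention_mask(noisy_boxes, accepted_boxes)
```

Here the first noisy box overlaps only the first accepted box, and the second noisy box has no neighbors at all, so it would receive no context from accepted predictions under this rule.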

Empirical evaluations on CrowdHuman, CityPersons, and COCO show that the approach significantly improves query-based detectors in crowded scenes. Sparse RCNN equipped with this method achieves 92.0% AP, 41.4% MR^-2, and 83.2% JI on CrowdHuman, outperforming MIP, a box-based method specialized for crowded scenarios. The method is also robust to crowd density: it still yields consistent improvements on the moderately crowded CityPersons dataset and the slightly crowded COCO dataset.
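The MR^-2 figure quoted in the abstract is the log-average miss rate commonly used in pedestrian detection: the geometric mean of the miss rate sampled at nine false-positives-per-image (FPPI) values spaced evenly in log space between 10^-2 and 10^0 (lower is better). A minimal sketch follows, assuming the miss-rate curve is already available as two parallel lists sorted by increasing FPPI; the function name and the handling of missing samples are illustrative choices rather than the benchmark's official evaluation code.

```python
import math


def log_average_miss_rate(fppi, miss_rate):
    """Geometric mean of miss rate at nine FPPI points in [1e-2, 1e0].

    fppi and miss_rate are parallel lists sorted by increasing fppi.
    If the curve has no point at or below a reference FPPI, the miss
    rate there is taken as 1.0 (an assumption of this sketch).
    """
    refs = [10 ** (-2 + 0.25 * k) for k in range(9)]
    samples = []
    for ref in refs:
        # Miss rate at the last operating point whose FPPI does not exceed ref.
        eligible = [mr for f, mr in zip(fppi, miss_rate) if f <= ref]
        samples.append(eligible[-1] if eligible else 1.0)
    # Clamp before the log so a zero miss rate does not raise an error.
    return math.exp(sum(math.log(max(s, 1e-10)) for s in samples) / len(samples))


# A flat curve: the detector misses half the objects at every operating point.
flat = log_average_miss_rate([0.001, 0.1, 1.0], [0.5, 0.5, 0.5])
```

For a flat curve the log-average equals the constant miss rate, which makes it a convenient sanity check.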

The work has both practical and conceptual implications. Practically, the reported results place the method ahead of MIP on CrowdHuman, which makes it a strong candidate for applications that involve crowded scenes. Conceptually, refining uncertain predictions in light of high-confidence ones is a design pattern that may influence future query-based detection frameworks.

Looking forward, further work could improve the framework's efficiency so that it is practical to deploy on resource-constrained devices. Exploring better feature representations or loss functions, and integrating emerging vision transformer architectures, might also yield further gains in crowded-scene detection.