Point in, Box out: Beyond Counting Persons in Crowds (1904.01333v2)

Published 2 Apr 2019 in cs.CV

Abstract: Modern crowd counting methods usually employ deep neural networks (DNN) to estimate crowd counts via density regression. Despite their significant improvements, the regression-based methods are incapable of providing the detection of individuals in crowds. The detection-based methods, on the other hand, have not been largely explored in recent trends of crowd counting due to the needs for expensive bounding box annotations. In this work, we instead propose a new deep detection network with only point supervision required. It can simultaneously detect the size and location of human heads and count them in crowds. We first mine useful person size information from point-level annotations and initialize the pseudo ground truth bounding boxes. An online updating scheme is introduced to refine the pseudo ground truth during training; while a locally-constrained regression loss is designed to provide additional constraints on the size of the predicted boxes in a local neighborhood. In the end, we propose a curriculum learning strategy to train the network from images of relatively accurate and easy pseudo ground truth first. Extensive experiments are conducted in both detection and counting tasks on several standard benchmarks, e.g. ShanghaiTech, UCF_CC_50, WiderFace, and TRANCOS datasets, and the results show the superiority of our method over the state-of-the-art.

Authors (4)

Yuting Liu (62 papers)
Miaojing Shi (53 papers)
Qijun Zhao (46 papers)
Xiaofang Wang (30 papers)

Summary

Overview of "Point in, Box out: Beyond Counting Persons in Crowds"

The paper "Point in, Box out: Beyond Counting Persons in Crowds" proposes a crowd counting approach that goes beyond density regression: a deep detection network that localizes and counts individuals while requiring only point-level supervision, with no bounding box annotations.

Conventional crowd counting relies mostly on regression-based methods, in which deep neural networks (DNNs) estimate the crowd count by regressing a density map. These methods have improved substantially, but they cannot detect individuals. Detection-based approaches can, yet they have seen little recent use in crowd counting because bounding box annotations are expensive to obtain.
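To make the limitation of density regression concrete, here is a minimal sketch of how a density map encodes a count. It assumes the common construction of placing one normalized Gaussian per annotated point; the function name `density_map` and the `sigma` value are illustrative choices, not taken from the paper.

```python
import math

def density_map(points, height, width, sigma=2.0):
    """Build a density map with one unit-mass Gaussian per annotated point.

    Assumption: each kernel is normalised over the image grid, so every
    person contributes exactly 1 to the sum of the map.
    """
    dmap = [[0.0] * width for _ in range(height)]
    for px, py in points:
        # Unnormalised Gaussian weights over the whole grid for this point.
        weights = [
            [math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(map(sum, weights))
        for y in range(height):
            for x in range(width):
                dmap[y][x] += weights[y][x] / total
    return dmap

# Three hypothetical head annotations, given as (x, y) pixel coordinates.
points = [(5, 5), (20, 8), (12, 15)]
dmap = density_map(points, height=24, width=32)

# The count is recovered by summing the map; per-person boxes are not.
count = sum(map(sum, dmap))
```

Summing the map recovers the number of annotated people, but nothing in the map says where one person ends and the next begins, which is the gap a detection-based method is meant to close.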

The proposed point-supervised deep detection network (PSDDN) simultaneously detects the size and location of human heads and counts them. It mines person size information from the point-level annotations and uses it to initialize pseudo ground truth bounding boxes. An online updating scheme refines these pseudo ground truth boxes during training, and a locally-constrained regression loss adds constraints on the size of predicted boxes within a local neighborhood.
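The summary does not spell out how sizes are mined from points or when a pseudo box is updated, so the following is only a sketch under stated assumptions. It assumes that the distance from a point to its nearest annotated neighbor serves as the initial box side, and that a pseudo box is replaced only by a confident prediction that is smaller than the current box. The names `init_pseudo_boxes` and `update_pseudo_boxes` and the threshold `score_thr` are hypothetical.

```python
import math

def init_pseudo_boxes(points):
    """Return one square box (x1, y1, x2, y2) centred on each head point.

    Assumption: the box side equals the distance to the nearest other
    annotated point. A lone point gets a degenerate zero-size box.
    """
    boxes = []
    for i, (x, y) in enumerate(points):
        dists = [math.hypot(x - ox, y - oy)
                 for j, (ox, oy) in enumerate(points) if j != i]
        half = (min(dists) if dists else 0.0) / 2.0
        boxes.append((x - half, y - half, x + half, y + half))
    return boxes

def box_side(box):
    """Width of a box given as (x1, y1, x2, y2)."""
    return box[2] - box[0]

def update_pseudo_boxes(pseudo, predictions, score_thr=0.8):
    """Refine pseudo boxes with predictions aligned one-to-one with them.

    Each prediction is (box, score) or None. Assumption: a prediction
    replaces the pseudo box only if it is confident and smaller.
    """
    updated = []
    for gt, pred in zip(pseudo, predictions):
        if pred is not None:
            box, score = pred
            if score >= score_thr and box_side(box) < box_side(gt):
                updated.append(box)
                continue
        updated.append(gt)
    return updated

# Hypothetical annotations: two nearby heads and one isolated head.
points = [(10.0, 10.0), (13.0, 14.0), (40.0, 10.0)]
boxes = init_pseudo_boxes(points)

# Hypothetical predictions: a smaller confident box, a larger one, none.
predictions = [((8.0, 8.0, 12.0, 12.0), 0.9),
               ((9.0, 10.0, 17.0, 18.0), 0.95),
               None]
refined = update_pseudo_boxes(boxes, predictions)
```

In this example only the first pseudo box is replaced: the second prediction is confident but larger than its pseudo box, and the third point has no matching prediction. The isolated head also shows why initial boxes can be too large in sparse regions, which is the kind of error a refinement step and a local size constraint would need to correct.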

A further contribution is a curriculum learning strategy: the network is trained first on images whose pseudo ground truth is relatively accurate and easy, and harder images are introduced gradually. This progression is intended to help the model cope with variations in crowd density and perspective distortion.
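An easy-to-hard schedule of this kind can be sketched as follows. The summary does not say how image difficulty is measured, so the sketch simply assumes each training image already has a difficulty score (lower meaning more reliable pseudo ground truth). The function name `curriculum_stages` and the even split into stages are illustrative assumptions.

```python
import math

def curriculum_stages(difficulty, num_stages):
    """Return cumulative training subsets ordered from easy to hard.

    difficulty: one hypothetical score per image (lower is easier).
    Each stage contains every image of the previous stage plus the next
    easiest ones; the final stage contains all images.
    """
    order = sorted(range(len(difficulty)), key=lambda i: difficulty[i])
    stages = []
    for s in range(1, num_stages + 1):
        cutoff = math.ceil(len(order) * s / num_stages)
        stages.append(order[:cutoff])
    return stages

# Hypothetical difficulty scores for four training images.
difficulty = [0.9, 0.1, 0.5, 0.3]
stages = curriculum_stages(difficulty, num_stages=2)
```

With these scores the first stage trains on images 1 and 3 only, and the second stage adds images 2 and 0. A training loop would iterate over `stages` and run some number of epochs on each subset before moving on.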

Experiments cover both detection and counting on standard benchmarks: ShanghaiTech, UCF_CC_50, WiderFace, and TRANCOS. The authors report that PSDDN outperforms state-of-the-art methods on both tasks. The TRANCOS results indicate that the approach also transfers from crowd counting to vehicle counting.

Implications and Future Directions

The implications are both practical and theoretical. Practically, accurate counting and detection of individuals in crowded scenes is useful for video surveillance, behavioral modeling, and safety monitoring. Because only point annotations are needed, annotation cost drops, which could make such crowd analysis more widely accessible.

Theoretically, the paper contributes to weakly-supervised learning by showing that point-level annotations can yield meaningful size and detection information. It could inspire further work on reducing reliance on detailed annotations in other domains.

Future work could integrate curriculum learning more deeply into network training to improve adaptability to complex datasets. Refining the online updating scheme and the locally-constrained regression loss could further improve the balance between detection precision and computational cost.

In summary, the paper presents an approach to crowd counting and detection that needs only minimal supervision while reporting state-of-the-art performance across multiple datasets. It opens the way to further work on efficient, scalable analysis of crowded scenes.
