CNN-based Cascaded Multi-task Learning of High-level Prior and Density Estimation for Crowd Counting (1707.09605v2)

Published 30 Jul 2017 in cs.CV

Abstract: Estimating crowd count in densely crowded scenes is an extremely challenging task due to non-uniform scale variations. In this paper, we propose a novel end-to-end cascaded network of CNNs to jointly learn crowd count classification and density map estimation. Classifying crowd count into various groups is tantamount to coarsely estimating the total count in the image thereby incorporating a high-level prior into the density estimation network. This enables the layers in the network to learn globally relevant discriminative features which aid in estimating highly refined density maps with lower count error. The joint training is performed in an end-to-end fashion. Extensive experiments on highly challenging publicly available datasets show that the proposed method achieves lower count error and better quality density maps as compared to the recent state-of-the-art methods.

Summary

The paper presents a novel cascaded multi-task CNN model that jointly performs crowd count classification and density map estimation to enhance accuracy.
It employs spatial pyramid pooling and fractionally strided convolutions to handle variable image sizes and recover fine details.
Experiments on the ShanghaiTech and UCF_CC_50 datasets demonstrate reduced MAE and MSE, underscoring its robustness in diverse crowd scenarios.

CNN-Based Cascaded Multi-Task Learning for Crowd Counting

The paper "CNN-based Cascaded Multi-task Learning of High-level Prior and Density Estimation for Crowd Counting" addresses crowd counting in densely populated scenes. It proposes an end-to-end cascaded network of CNNs that jointly learns crowd count classification and density map estimation. The authors argue that feeding the coarse count class into the density estimation network as a high-level prior lets the network learn more globally relevant discriminative features, which yields more refined density maps with lower count error.

Methodology

The method uses a cascaded architecture with two interconnected stages: crowd count classification (the high-level prior) and density map estimation. Both tasks share an initial set of convolutional layers, which then feed two parallel branches, one specialized for each task.

High-Level Prior Stage:
- Utilizes convolutional layers and fully connected layers to classify crowd count into distinct groups.
- Employs Spatial Pyramid Pooling to handle varying image sizes without the constraints typical of networks with fully connected layers.
- Introduces a high-level prior that indirectly informs the density estimation by categorizing the total crowd count into specific groups.

Density Estimation Stage:
- Features fractionally strided convolutional layers aimed at counteracting detail loss during downsampling.
- Processes feature maps from the classification stage and refines them into high-resolution density estimates.
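
The two building blocks named in the lists above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, and the pyramid levels (1, 2, 4), the use of max pooling, the kernel, and the stride are assumptions chosen only for illustration. The first function shows why Spatial Pyramid Pooling yields a fixed-length vector for any input size; the second shows how a fractionally strided (transposed) convolution enlarges a signal, here in 1-D.

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a non-empty 2-D feature map (list of rows) into a fixed-length vector.

    For each level n the map is split into an n x n grid of bins and the
    maximum of every bin is kept, so the output length is sum(n * n)
    regardless of the input height and width.
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            r0 = (i * h) // n                  # floor(i * h / n)
            r1 = ((i + 1) * h + n - 1) // n    # ceil((i + 1) * h / n)
            for j in range(n):
                c0 = (j * w) // n
                c1 = ((j + 1) * w + n - 1) // n
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


def transposed_conv1d(signal, kernel, stride=2):
    """1-D fractionally strided convolution: each input value scatters a
    scaled copy of the kernel into the output, `stride` positions apart."""
    out = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, x in enumerate(signal):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += x * weight
    return out


# Two feature maps of different sizes give vectors of the same length.
small = [[r * 7 + c for c in range(7)] for r in range(5)]     # 5 x 7
large = [[r * 12 + c for c in range(12)] for r in range(9)]   # 9 x 12
vec_small = spatial_pyramid_pool(small)
vec_large = spatial_pyramid_pool(large)

# A length-2 signal is upsampled to length 4.
upsampled = transposed_conv1d([1.0, 2.0], [1.0, 1.0], stride=2)
```

In a real network the kernel weights of the transposed convolution are learned, and pooling is applied per channel; the sketch only shows the shape behaviour.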

The network's loss function comprises two parts: a cross-entropy loss for the classification task and a pixel-wise Euclidean loss for density estimation, weighted to ensure a balanced training emphasis between tasks.
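
A minimal sketch of such a two-part loss for a single image follows. The bin edges used to turn a count into a class label, the weight `lam`, the choice to apply the weight to the classification term, and all function names are hypothetical; the summary does not give these values.

```python
import bisect
import math


def count_to_class(count, bin_edges):
    """Map a total crowd count to a class index using ascending bin edges."""
    return bisect.bisect_right(bin_edges, count)


def cross_entropy(class_probs, true_class):
    """Cross-entropy for one sample, given predicted class probabilities."""
    return -math.log(class_probs[true_class])


def euclidean_loss(pred_map, gt_map):
    """Sum of squared pixel-wise differences between two density maps."""
    return sum((p - g) ** 2
               for pred_row, gt_row in zip(pred_map, gt_map)
               for p, g in zip(pred_row, gt_row))


def combined_loss(class_probs, true_class, pred_map, gt_map, lam=0.1):
    """Weighted sum of the two task losses; lam is an assumed value."""
    return lam * cross_entropy(class_probs, true_class) + euclidean_loss(pred_map, gt_map)


# Hypothetical bin edges: counts below 50 -> class 0, 50..99 -> class 1, and so on.
edges = [50, 100, 200]
label = count_to_class(75, edges)

probs = [0.25, 0.5, 0.25, 0.0]          # predicted class probabilities
pred = [[1.0, 2.0], [3.0, 4.0]]         # predicted density map
gt = [[1.0, 2.0], [3.0, 3.0]]           # ground-truth density map
loss = combined_loss(probs, label, pred, gt, lam=0.1)
```

The weight controls how strongly the classification branch influences the shared layers relative to the density branch; in practice it would be tuned on validation data.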

Experimental Results

The method was evaluated on two challenging datasets, ShanghaiTech and UCF_CC_50, both of which contain wide variation in crowd density.

ShanghaiTech Dataset:
- The proposed method demonstrated substantial improvements in both MAE and MSE over prior state-of-the-art methods, including Zhang et al.'s MCNN approach.

UCF_CC_50 Dataset:
- Achieved the lowest MAE, showcasing its robustness across varying image resolutions and density complexities.

These results substantiate the model's ability to generalize across diverse densities and image contexts, highlighting the benefits of incorporating a high-level prior via a cascaded CNN architecture.
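
For reference, the two count-error metrics can be computed as in the sketch below. It assumes the convention common in the crowd counting literature, in which the figure reported as "MSE" is the square root of the mean squared count error; that convention, the function names, and the example numbers are assumptions and not taken from this summary.

```python
import math


def count_from_density(density_map):
    """The estimated count is the sum of all pixels in the density map."""
    return sum(sum(row) for row in density_map)


def mean_absolute_error(predicted_counts, true_counts):
    """Average absolute difference between predicted and true counts."""
    n = len(true_counts)
    return sum(abs(p - t) for p, t in zip(predicted_counts, true_counts)) / n


def root_mean_squared_error(predicted_counts, true_counts):
    """Square root of the mean squared count error (often reported as 'MSE')."""
    n = len(true_counts)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted_counts, true_counts)) / n)


# Made-up counts for two test images, for illustration only.
predicted = [10.0, 20.0]
actual = [12.0, 16.0]
mae = mean_absolute_error(predicted, actual)
rmse = root_mean_squared_error(predicted, actual)
total = count_from_density([[0.5, 0.5], [1.0, 0.0]])
```

MAE reflects average accuracy, while the squared-error metric penalizes occasional large misses more heavily, which is why both are usually reported together.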

Implications and Future Work

This research potentially offers a significant step forward in automated crowd counting, with implications for practical applications such as public safety design, surveillance, and event management. The integration of a high-level prior to inform density estimation introduces a new dimension of interpretability and accuracy to CNN-based approaches.

Looking forward, expanding this methodology to accommodate additional tasks or optimize processing speed could further enhance real-world applicability. Additionally, exploring similar architectures for other multi-task computer vision challenges might yield promising results. Future research could also involve scaling these models efficiently while maintaining their accuracy, particularly in real-time applications.

Conclusion

The paper demonstrates the advantages of a cascaded multi-task learning framework for crowd counting. By using a high-level prior to cope with variation in crowd density, the proposed technique reports lower count error than the methods it was compared against, a result relevant to video and signal-based surveillance research.