Switching Convolutional Neural Network for Crowd Counting (1708.00199v2)

Published 1 Aug 2017 in cs.CV

Abstract: We propose a novel crowd counting model that maps a given crowd scene to its density. Crowd analysis is compounded by a myriad of factors like inter-occlusion between people due to extreme crowding, high similarity of appearance between people and background elements, and large variability of camera view-points. Current state-of-the-art approaches tackle these factors by using multi-scale CNN architectures, recurrent networks and late fusion of features from multi-column CNN with different receptive fields. We propose a switching convolutional neural network that leverages variation of crowd density within an image to improve the accuracy and localization of the predicted crowd count. Patches from a grid within a crowd scene are relayed to independent CNN regressors based on the crowd count prediction quality of the CNN established during training. The independent CNN regressors are designed to have different receptive fields and a switch classifier is trained to relay the crowd scene patch to the best CNN regressor. We perform extensive experiments on all major crowd counting datasets and evidence better performance compared to current state-of-the-art methods. We provide interpretable representations of the multichotomy of the space of crowd scene patches inferred from the switch. It is observed that the switch relays an image patch to a particular CNN column based on the density of the crowd.

Authors (3)

Deepak Babu Sam
Shiv Surya
R. Venkatesh Babu

Citations (862)

Summary

Switching Convolutional Neural Network for Crowd Counting

The paper "Switching Convolutional Neural Network for Crowd Counting" by Deepak Babu Sam, Shiv Surya, and R. Venkatesh Babu introduces Switch-CNN, an architecture for crowd counting. Its central idea is to exploit the variation in crowd density within a single image: a switch classifier routes each patch of the image to one of several independent CNN regressors. The aim is to improve both the accuracy of the predicted count and the localization of the predicted density.

Challenges in Crowd Counting

Crowd counting is difficult for several reasons: people occlude one another in dense crowds, crowd members can look very similar to background elements, and camera viewpoints vary widely. Earlier methods, ranging from HOG-based head detection to CNN-based regressors, have been applied to the problem, but they often fall short because of these factors.

Proposed Architecture

Switch-CNN addresses these variations and shortcomings with two core components:

Multiple CNN Regressors: Three CNN regressors with distinct receptive fields are trained to map a given crowd scene to its density. The different receptive fields let each regressor specialize in the scales and perspectives found in different crowd distributions.
Switch Classifier: A classifier is trained to relay each image patch to the regressor that predicted the count of such patches most accurately during training. Each patch is therefore analyzed by the regressor expected to suit it best, which yields more precise density localization.
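
The routing step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 3x3 grid default, the full-resolution density output, and the toy switch and regressors are all hypothetical stand-ins for trained networks.

```python
import numpy as np

def grid_slices(height, width, rows=3, cols=3):
    """Return (row_slice, col_slice) pairs that tile a height x width image."""
    row_edges = np.linspace(0, height, rows + 1, dtype=int)
    col_edges = np.linspace(0, width, cols + 1, dtype=int)
    return [
        (slice(row_edges[i], row_edges[i + 1]), slice(col_edges[j], col_edges[j + 1]))
        for i in range(rows)
        for j in range(cols)
    ]

def switch_cnn_predict(image, switch, regressors, rows=3, cols=3):
    """Route each grid patch to one regressor and stitch the density maps.

    Assumptions of this sketch: `switch(patch)` returns a regressor index and
    `regressors[k](patch)` returns a density map with the patch's height and width.
    """
    height, width = image.shape[:2]
    density = np.zeros((height, width), dtype=np.float64)
    routes = []
    for row_slice, col_slice in grid_slices(height, width, rows, cols):
        patch = image[row_slice, col_slice]
        k = int(switch(patch))                      # switch picks a regressor
        density[row_slice, col_slice] = regressors[k](patch)
        routes.append(k)
    # The crowd count is the sum of the predicted density map.
    return density, float(density.sum()), routes

# Toy stand-ins so the sketch runs without trained networks (hypothetical).
def toy_switch(patch):
    # Hypothetical rule: brighter patches are treated as denser.
    return int(np.digitize(patch.mean(), [0.33, 0.66]))

toy_regressors = [
    (lambda patch, d=d: np.full(patch.shape[:2], d)) for d in (0.0, 0.5, 1.0)
]
```

Calling `switch_cnn_predict(image, toy_switch, toy_regressors)` on a 2D array returns the stitched density map, its sum as the count, and the list of regressor indices chosen per patch, which is the kind of information the paper inspects when it interprets the switch.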

The training process for the Switch-CNN involves several stages:

Pretraining: Each regressor is pretrained separately on the entire training data to regress crowd density maps.
Differential Training: The regressors are then fine-tuned on disjoint subsets of the training patches. Each patch is used to update only the regressor that predicts its count most accurately, so each regressor specializes in patches with particular attributes.
Switch Training: The switch classifier is trained to classify patches, using as the label for each patch the regressor that differential training identified as best for it.
Coupled Training: Finally, the switch classifier and the CNN regressors are trained alternately so that they adapt to each other.
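
The label generation that links differential training to switch training can be illustrated as follows. This is a sketch under assumptions rather than the paper's code: it assumes the per-patch counts predicted by each regressor are already available as an array, and the function names are hypothetical.

```python
import numpy as np

def differential_labels(predicted_counts, true_counts):
    """Pick, for every patch, the regressor with the lowest absolute count error.

    predicted_counts: array of shape (num_patches, num_regressors)
    true_counts:      array of shape (num_patches,)
    Ties go to the lowest regressor index (an arbitrary choice in this sketch).
    """
    predicted_counts = np.asarray(predicted_counts, dtype=float)
    true_counts = np.asarray(true_counts, dtype=float)
    errors = np.abs(predicted_counts - true_counts[:, None])
    return errors.argmin(axis=1)

def patches_per_regressor(labels, num_regressors):
    """Group patch indices by winning regressor.

    Each group is the subset a regressor would be fine-tuned on, and the
    labels themselves are the targets for training the switch classifier.
    """
    labels = np.asarray(labels)
    return [np.flatnonzero(labels == k).tolist() for k in range(num_regressors)]
```

In an actual training loop these labels would be recomputed as the regressors change, which is one reason the final stage alternates between updating the switch and updating the regressors.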

Results

The paper evaluates Switch-CNN on four crowd counting datasets: ShanghaiTech, UCF_CC_50, UCSD, and WorldExpo’10. Results are reported as mean absolute error (MAE) and mean squared error (MSE), and the authors report better performance than the state-of-the-art methods of the time. For instance:

On the ShanghaiTech Part A dataset, Switch-CNN achieved an MAE of 90.4, outperforming the previous state-of-the-art model by 19.8 points.
On the challenging UCF_CC_50 dataset, Switch-CNN recorded an MAE of 318.1, an improvement over the previous best of 333.73.
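
The two metrics and the quoted margins can be made concrete with a short sketch. The summary does not define the metrics, so the definitions here are an assumption: the crowd counting literature commonly reports "MSE" as the square root of the mean squared count error, and the code follows that convention. The function names are illustrative.

```python
import numpy as np

def count_mae(predicted, actual):
    """Mean absolute error between predicted and ground-truth image counts."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(predicted - actual)))

def count_mse(predicted, actual):
    """'MSE' as assumed here: square root of the mean squared count error."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Arithmetic implied by the figures quoted above (no new results).
implied_previous_mae_part_a = round(90.4 + 19.8, 2)   # previous best on Part A
ucf_cc_50_mae_margin = round(333.73 - 318.1, 2)       # margin on UCF_CC_50
```

Read this way, the Part A figure implies a previous best MAE of 110.2, and the UCF_CC_50 margin is 15.63 MAE points.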

Analysis

The effectiveness of Switch-CNN is attributed to its ability to assign each image patch to the most appropriate regressor, which lets it handle variations in crowd density and scale more effectively than a single regressor. Additionally, the multichotomy inferred from differential training suggests a natural clustering of patches based on attributes like inter-head distance, which correlates with crowd density.

The investigation revealed that while manual clustering based on patch count or mean inter-head distance offers some improvements, the automatic clustering via differential training leads to superior performance. Coupled training further enhances the switch classifier's efficacy, highlighting the importance of co-adaptation between the switch and regressors.
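
As a rough illustration of the manual baseline mentioned above, the sketch below groups a patch by its mean inter-head distance. The summary does not say how that distance is computed, so the nearest-neighbour definition, the thresholds, and the function names are all assumptions made for this example.

```python
import numpy as np

def mean_inter_head_distance(heads):
    """Mean distance from each annotated head to its nearest neighbour.

    heads: iterable of (x, y) head positions inside one patch.
    Returns infinity when fewer than two heads are present.
    """
    heads = np.asarray(heads, dtype=float).reshape(-1, 2)
    if len(heads) < 2:
        return float("inf")
    diffs = heads[:, None, :] - heads[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)          # ignore distance to self
    return float(dists.min(axis=1).mean())

def manual_cluster(heads, thresholds=(10.0, 30.0)):
    """Map a patch to group 0 (dense), 1 (medium), or 2 (sparse).

    The pixel thresholds are hypothetical and would need tuning per dataset.
    """
    return int(np.digitize(mean_inter_head_distance(heads), thresholds))
```

A hand-set rule like this fixes the grouping before training, whereas differential training lets the grouping emerge from which regressor actually counts each patch best, which is the contrast the analysis draws.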

Implications and Future Work

This research is relevant to real-world applications that depend on accurate crowd density estimation, such as urban planning and disaster management. Better localization and counting precision can help in planning events and managing large gatherings.

Future work may explore enhancements in switch accuracy and investigate the potential of more advanced classifiers. Additionally, extending this approach to video-based crowd counting could address temporal variations and improve robustness further.

In summary, Switch-CNN combines independent regressors with a learned switching mechanism to handle the variability of crowd density within a scene, and it reported state-of-the-art results at the time of publication.
