Crowd counting via scale-adaptive convolutional neural network (1711.04433v4)

Published 13 Nov 2017 in cs.CV

Abstract: The task of crowd counting is to automatically estimate the pedestrian number in crowd images. To cope with the scale and perspective changes that commonly exist in crowd images, state-of-the-art approaches employ multi-column CNN architectures to regress density maps of crowd images. Multiple columns have different receptive fields corresponding to pedestrians (heads) of different scales. We instead propose a scale-adaptive CNN (SaCNN) architecture with a backbone of fixed small receptive fields. We extract feature maps from multiple layers and adapt them to have the same output size; we combine them to produce the final density map. The number of people is computed by integrating the density map. We also introduce a relative count loss along with the density map loss to improve the network generalization on crowd scenes with few pedestrians, where most representative approaches perform poorly. We conduct extensive experiments on the ShanghaiTech, UCF_CC_50 and WorldExpo datasets as well as a new dataset SmartCity that we collect for crowd scenes with few people. The results demonstrate significant improvements of SaCNN over the state-of-the-art.

Authors (3)
  1. Lu Zhang (373 papers)
  2. Miaojing Shi (53 papers)
  3. Qiaobo Chen (3 papers)
Citations (235)

Summary

  • The paper introduces a novel single-column CNN architecture that adaptively fuses multi-scale feature maps to enhance crowd counting accuracy.
  • It employs a dual-loss framework combining density map and relative count losses to improve performance in both dense and sparse scenarios.
  • Experimental evaluation shows lower MAE and MSE on benchmarks, including a new SmartCity dataset, confirming the model's robust performance.

Analysis of "Crowd Counting via Scale-Adaptive Convolutional Neural Network"

The paper introduces a novel approach to the task of crowd counting, proposing the Scale-Adaptive Convolutional Neural Network (SaCNN). This approach addresses common challenges in crowd analysis, particularly the variable scales and perspectives inherent in crowd images, which strain traditional computer vision models. Unlike conventional methods that rely on multi-column convolutional neural networks (CNNs) with different receptive fields, SaCNN employs a single-column architecture with adaptive features to improve counting accuracy and model efficiency.

Methodological Overview

The proposed architecture of SaCNN features a single-column network with consistently small receptive fields. This design choice facilitates the extraction of high-resolution spatial features, allowing for deeper network construction while maintaining the capacity to generalize across images with varied scales and perspectives. Core to SaCNN’s architecture is the combination of multi-scale feature maps from various network layers, harmonized to produce an effective density map, which, when integrated, yields the pedestrian count.
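The density-map formulation can be illustrated with a toy example. Ground-truth maps in this line of work are typically built by placing a normalized Gaussian at each annotated head position, so summing (integrating) the map recovers the head count. A minimal numpy sketch, where the kernel size and sigma are illustrative choices rather than the paper's exact settings:

```python
import numpy as np

def gaussian_kernel(size=15, sigma=4.0):
    """2D Gaussian normalized to sum to 1, so each head contributes exactly 1 to the count."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def density_map(shape, head_points, size=15, sigma=4.0):
    """Place one normalized Gaussian per annotated head (points assumed well inside the image)."""
    dm = np.zeros(shape)
    k = gaussian_kernel(size, sigma)
    r = size // 2
    for (y, x) in head_points:
        dm[y - r:y + r + 1, x - r:x + r + 1] += k
    return dm

heads = [(30, 40), (60, 80), (90, 20)]          # three annotated head positions
dm = density_map((128, 128), heads)
count = dm.sum()                                # integrating the density map yields the count
print(round(count))                             # → 3
```

In practice the network regresses such a map directly, so a single forward pass plus a sum gives the estimated pedestrian count.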

Key innovations of SaCNN include:

  1. Low-Level Feature Sharing and Multi-Scale Adaptation: By sharing low-level features across scales, SaCNN reduces the parameter count and training data requirements while improving processing speed. Deconvolutional layers bring the multi-scale feature maps to a common output size so they can be fused effectively.
  2. Loss Function Design: The paper introduces a dual-loss framework that pairs a density map loss with a relative count loss. This design improves performance on scenes with sparse pedestrian distributions, where absolute count estimates are easily skewed by noise and variability.
  3. New Dataset Introduction: A new SmartCity dataset is collected specifically to evaluate performance on sparse pedestrian scenes, supplementing established benchmarks such as ShanghaiTech, UCF_CC_50, and WorldExpo'10.
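The dual-loss idea above can be sketched as follows. The exact normalization of the relative count term and the weighting between the two losses are not reproduced from the paper, so treat the constants (`eps`, `lam`) as illustrative assumptions:

```python
import numpy as np

def density_loss(pred, gt):
    """Pixel-wise L2 loss between predicted and ground-truth density maps."""
    return 0.5 * np.mean((pred - gt) ** 2)

def relative_count_loss(pred, gt, eps=1.0):
    """Relative (not absolute) count error; eps guards against empty scenes.
    Dividing by the true count means sparse scenes are penalized for errors
    that are small in absolute terms but large relative to the crowd size."""
    pred_count, gt_count = pred.sum(), gt.sum()
    return 0.5 * ((pred_count - gt_count) / (gt_count + eps)) ** 2

def total_loss(pred, gt, lam=0.1):
    # lam is an illustrative weight, not the paper's reported value
    return density_loss(pred, gt) + lam * relative_count_loss(pred, gt)

pred = np.full((4, 4), 0.1)   # predicted map, count = 1.6
gt = np.zeros((4, 4))         # empty scene, count = 0
print(total_loss(pred, gt))
```

The key point is that the second term operates on integrated counts rather than pixels, which is what gives the network a training signal attuned to sparse scenes.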

Performance Evaluation

Extensive experiments compare SaCNN against leading state-of-the-art models. SaCNN achieves notably lower Mean Absolute Error (MAE) and Mean Squared Error (MSE) across multiple datasets, outperforming prior methods in both dense and sparse scenarios.

Key Results

  • ShanghaiTech Dataset: SaCNN achieves MAEs of 86.8 and 16.2 on Part_A and Part_B, respectively.
  • WorldExpo'10 Dataset: An average MAE of 8.5 is attained, surpassing prior methods.
  • UCF_CC_50 Dataset: Leads with an MAE of 314.9 and MSE of 424.8.
  • SmartCity Dataset: Demonstrates effectiveness in sparse environments with an MAE of 8.6.
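For reference, crowd-counting papers conventionally report MAE and MSE over per-image counts, where the reported "MSE" is in fact the root of the mean squared error. A minimal sketch of both metrics (the counts here are made-up illustrative values, not results from the paper):

```python
import numpy as np

def mae(pred_counts, gt_counts):
    """Mean absolute error over per-image counts."""
    pred, gt = np.asarray(pred_counts, float), np.asarray(gt_counts, float)
    return np.mean(np.abs(pred - gt))

def mse(pred_counts, gt_counts):
    """'MSE' as conventionally reported in crowd counting: root mean squared error."""
    pred, gt = np.asarray(pred_counts, float), np.asarray(gt_counts, float)
    return np.sqrt(np.mean((pred - gt) ** 2))

preds = [100, 250, 40]   # hypothetical predicted counts
gts   = [110, 230, 45]   # hypothetical ground-truth counts
print(mae(preds, gts))   # (10 + 20 + 5) / 3 ≈ 11.67
print(mse(preds, gts))   # sqrt((100 + 400 + 25) / 3) ≈ 13.23
```

MAE reflects average counting accuracy while MSE, being sensitive to large per-image deviations, serves as a proxy for robustness.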

The integration of a relative count loss in SaCNN is experimentally validated via comparative trials, demonstrating marked improvements particularly in handling sparse crowd scenes.

Theoretical and Practical Implications

The research positions SaCNN as a versatile tool for crowd counting, able to handle scenes with widely varying crowd densities and spatial distributions. Its gains in speed and accuracy make it well suited to real-time applications in surveillance and public event management, where efficient crowd size estimation is paramount.

Future Directions

The paper suggests incorporating perspective information directly into the network as a learnable weighting mechanism. This could further improve SaCNN's adaptability to different viewing angles and distances in crowd images, and further refine density map regression accuracy relative to conventional fixed-scale models.

In sum, SaCNN is a robust framework for crowd counting that advances the state of the art through its architecture and loss design, supported by thorough evaluation across a varied collection of datasets.