- The paper introduces two novel architectures that enhance crowd counting by integrating multi-scale-aware modules for better context and feature extraction.
- It leverages atrous spatial pyramid pooling and context-aware modules within M-SFANet to address scale variations and dense scenes.
- Both models are end-to-end trainable and validated on multiple datasets, demonstrating improved accuracy, with M-SegNet additionally offering the inference speed needed for real-time surveillance applications.
Overview of Encoder-Decoder Based Convolutional Neural Networks with Multi-Scale-Aware Modules for Crowd Counting
The paper presents modifications to convolutional neural network architectures aimed at improving both the accuracy and the efficiency of crowd counting. It introduces two novel models, M-SFANet and M-SegNet, built by enhancing the existing SFANet and SegNet architectures.
Key Contributions
- M-SFANet Architecture: This model builds upon the SFANet framework by integrating atrous spatial pyramid pooling (ASPP) and a context-aware module (CAN) into the VGG16-bn encoder. These additions let the encoder capture multi-scale features and wider context, addressing challenges such as scale variation and densely packed scenes. The dual-path decoder produces both a density map and an attention map, the latter helping to suppress background noise.
- M-SegNet Architecture: Adapted from SegNet, M-SegNet replaces bilinear upsampling with max unpooling to achieve faster computational performance while maintaining competitive counting accuracy. Designed with surveillance applications in mind, this model avoids the complexity of multi-scale-aware modules to prioritize speed.
- End-to-End Trainability: Both models are designed as encoder-decoder-based architectures that are fully trainable end-to-end, facilitating straightforward implementation and optimization.
Evaluation and Results
The authors conduct rigorous experiments on five crowd counting datasets and a vehicle counting dataset, demonstrating accuracy competitive with or exceeding prior state-of-the-art methods. Notably:
- ShanghaiTech Part A and B: M-SFANet and M-SegNet deliver competitive results, with M-SFANet showing impressive performance in dense scenes due to its multi-scale and context-aware modules.
- UCF-CC-50: Despite the dataset's small size, M-SFANet achieves a substantial reduction in mean absolute error compared to competing approaches.
- UCF-QNRF: By integrating the Bayesian Loss approach, M-SFANet* notably improves counting accuracy on this highly challenging dataset.
- TRANCOS Vehicle Dataset: The reported experiments indicate the robustness of M-SFANet in scenarios beyond human crowd counting, highlighting its generalization capabilities.
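Counting benchmarks like those above are typically scored with mean absolute error (MAE) and root mean squared error (RMSE) over per-image counts. A minimal sketch of both metrics, using their standard definitions rather than code from the paper:

```python
import math


def mae(predicted, ground_truth):
    """Mean Absolute Error over per-image crowd counts."""
    n = len(predicted)
    return sum(abs(p - g) for p, g in zip(predicted, ground_truth)) / n


def rmse(predicted, ground_truth):
    """Root Mean Squared Error; penalizes large miscounts more heavily."""
    n = len(predicted)
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(predicted, ground_truth)) / n)
```

Because RMSE squares each error before averaging, a single badly miscounted image raises RMSE far more than MAE, which is why papers usually report both.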
Implications
The proposed models represent meaningful advancements in handling crowd scenes characterized by significant variations in density and camera perspectives, common challenges in crowd counting tasks. By incorporating multi-scale and contextual information directly into the model architecture, these methods enable enhanced feature extraction, especially beneficial in complex environments.
Future Directions
The authors suggest potential improvements by integrating adaptive mechanisms into the multi-scale-aware modules, allowing dynamic adjustment of sampling and dilation rates. Such modifications could offer further gains in accuracy across diverse, unseen datasets.
Conclusion
The paper demonstrates how architectural innovations, namely enhanced multi-scale feature integration and efficient upsampling, can significantly improve crowd counting performance, offering practical solutions for real-time surveillance and monitoring. Its emphasis on balancing computational efficiency with counting accuracy is promising for urban traffic and public safety management systems, where scalability and speed are paramount.