Hierarchical Attention-Based Crowd Counting Network (HA-CCN) Overview
The paper introduces the Hierarchical Attention-based Crowd Counting Network (HA-CCN), a method designed to improve counting accuracy, especially in highly congested scenes. The research focuses on overcoming traditional challenges in this domain: perspective distortion, scale variation, heavy occlusion, illumination changes, and non-uniform distribution of people.
The proposed HA-CCN framework leverages attention mechanisms at multiple levels to refine feature extraction within the network. Specifically, it introduces two key components: the Spatial Attention Module (SAM) and the Global Attention Module (GAM), both of which are integrated into a VGG16-based network architecture.
Technical Specifics and Components
- Spatial Attention Module (SAM): This module enriches low-level features by injecting spatial segmentation information, helping the network prioritize relevant spatial regions early in the feature extraction process. Unlike attention that is learned in a self-supervised fashion, SAM relies on explicit foreground-background segmentation supervision, leading to faster convergence and improved feature discrimination.
- Global Attention Modules (GAMs): GAMs are applied to the higher layers of the network (conv4 and conv5), where they enhance channel-wise features by emphasizing informative channels and suppressing less relevant ones. This channel-wise re-weighting helps the network cope with large variations in crowd density. A minimal code sketch of both modules follows this list.
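To make the two attention mechanisms concrete, here is a minimal PyTorch sketch of how a spatial attention module and a channel-wise (global) attention module can be wired onto VGG16 features. The placement (SAM on low-level conv3 features, GAMs on conv4/conv5) follows the description above, but the layer widths, reduction ratio, head design, and the class names `SpatialAttention`, `ChannelAttention`, and `HACCNSketch` are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class SpatialAttention(nn.Module):
    """Predicts a single-channel foreground/background mask and re-weights
    low-level features spatially (illustrative stand-in for SAM)."""
    def __init__(self, in_channels):
        super().__init__()
        self.mask_head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels // 2, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels // 2, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        mask = self.mask_head(x)      # B x 1 x H x W, values in (0, 1)
        return x * mask, mask         # re-weighted features + mask for supervision

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel re-weighting (illustrative GAM)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)  # global average pool -> channel weights
        return x * w

class HACCNSketch(nn.Module):
    """Toy HA-CCN-like network: VGG16 trunk, spatial attention on conv3 features,
    channel attention on conv4/conv5, and a small head producing a density map."""
    def __init__(self):
        super().__init__()
        feats = vgg16(weights=None).features
        self.conv1_3 = feats[:16]   # conv1_1 through relu3_3 (256-channel low-level features)
        self.conv4 = feats[16:23]   # pool3 + conv4 block (512 channels)
        self.conv5 = feats[23:30]   # pool4 + conv5 block (512 channels)
        self.sam = SpatialAttention(256)
        self.gam4 = ChannelAttention(512)
        self.gam5 = ChannelAttention(512)
        self.density_head = nn.Sequential(
            nn.Conv2d(512, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, 1),
        )

    def forward(self, x):
        low, mask = self.sam(self.conv1_3(x))
        mid = self.gam4(self.conv4(low))
        high = self.gam5(self.conv5(mid))
        density = self.density_head(high)
        return density, mask
```

The second output of the forward pass (the attention mask) is what allows the explicit foreground-background supervision described for SAM: it can be trained against a binary segmentation target alongside the density regression.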
Notably, HA-CCN uses a single-step training framework, making it simpler to implement than the often cumbersome multi-stage training schemes used by other methods. Performance-wise, HA-CCN achieves state-of-the-art results across multiple benchmark datasets, including ShanghaiTech, UCF-QNRF, and UCF_CC_50.
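The single-step training can be pictured as one loop in which the density regression loss and an auxiliary segmentation loss on the spatial attention mask are optimized jointly. The sketch below reuses the hypothetical `HACCNSketch` model from the previous block; the loss weighting, the loader interface yielding `(image, density_gt, fg_mask_gt)` triples, and the assumption that targets are pre-resized to the output resolutions are all illustrative choices, not the authors' training recipe.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, optimizer, seg_weight=0.1, device="cuda"):
    """Joint (single-step) optimization of the density and attention-mask losses.
    `loader` is assumed to yield (image, density_gt, fg_mask_gt) triples whose
    spatial sizes already match the corresponding network outputs."""
    model.train()
    for images, density_gt, fg_mask_gt in loader:
        images = images.to(device)
        density_gt = density_gt.to(device)
        fg_mask_gt = fg_mask_gt.to(device)

        density_pred, mask_pred = model(images)

        # Pixel-wise regression loss on the predicted density map.
        density_loss = F.mse_loss(density_pred, density_gt)
        # Binary segmentation loss supervising the spatial attention mask.
        seg_loss = F.binary_cross_entropy(mask_pred, fg_mask_gt)

        loss = density_loss + seg_weight * seg_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```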
Comparative Analysis
In comparison with previous methods such as Switch-CNN, CP-CNN, and CSRNet, HA-CCN demonstrates superior performance across various metrics. For example, it achieves an MAE of 62.9 and an MSE of 94.9 on the ShanghaiTech Part A dataset, outperforming both earlier multi-column architectures and more recent approaches built on sophisticated learning paradigms.
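For reference, the MAE and MSE figures quoted in crowd counting papers are computed on image-level counts obtained by summing each predicted density map, and "MSE" is, by the field's convention, reported as the root of the mean squared count error. The snippet below shows this standard computation; the function name and inputs are illustrative.

```python
import torch

def counting_metrics(pred_density_maps, gt_counts):
    """Standard crowd counting metrics: image-level counts come from summing the
    density map, MAE is the mean absolute count error, and 'MSE' is (by the
    field's convention) the root of the mean squared count error."""
    pred_counts = torch.stack([d.sum() for d in pred_density_maps])
    gt_counts = torch.as_tensor(gt_counts, dtype=pred_counts.dtype)
    errors = pred_counts - gt_counts
    mae = errors.abs().mean().item()
    mse = errors.pow(2).mean().sqrt().item()
    return mae, mse
```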
Advancements in Weakly Supervised Learning
A notable feature of this research is its exploration of domain adaptation through a weakly supervised learning setup. The authors introduce a framework that uses image-level labels to adapt pre-trained models to new datasets, reducing the burden of acquiring point-wise annotations. This approach makes use of class activation maps to produce pseudo ground-truth density maps, providing a viable path for improving cross-dataset generalization without requiring extensive new annotations.
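As a rough illustration of this weak-supervision idea, the sketch below trains an image-level classifier over coarse crowd-density categories and turns its class activation map into a normalized pseudo spatial target for an unlabeled image. The classifier design, the number of density levels, and the normalization are assumptions made for illustration of the general CAM technique, not the authors' exact adaptation procedure.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class CrowdLevelClassifier(nn.Module):
    """Image-level classifier over coarse crowd-density categories
    (e.g. low / medium / high); purely illustrative."""
    def __init__(self, num_levels=3):
        super().__init__()
        self.features = vgg16(weights=None).features
        self.classifier = nn.Conv2d(512, num_levels, kernel_size=1)  # 1x1 conv = per-location class scores

    def forward(self, x):
        fmap = self.features(x)           # B x 512 x h x w
        scores = self.classifier(fmap)    # B x num_levels x h x w
        logits = scores.mean(dim=(2, 3))  # global average pooling -> image-level logits
        return logits, scores

def pseudo_map_from_cam(scores, logits):
    """Class activation map for the predicted density level, rescaled to [0, 1]
    so it can serve as a pseudo spatial target on an unlabeled image."""
    level = logits.argmax(dim=1)                           # predicted crowd level per image
    cam = scores[torch.arange(scores.size(0)), level]      # B x h x w activation map
    cam = cam - cam.amin(dim=(1, 2), keepdim=True)
    cam = cam / cam.amax(dim=(1, 2), keepdim=True).clamp_min(1e-6)
    return cam
```

Because only an image-level label is needed to train such a classifier on the target domain, the resulting pseudo maps can guide adaptation without collecting new point-wise head annotations, which is the core appeal of the weakly supervised setup described above.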
Implications and Future Directions
The implications of HA-CCN's design are both practical and theoretical. Practically, this approach offers a scalable and efficient method for deploying crowd counting systems in diverse environments, particularly where accurate counts in dense scenes are crucial. Theoretically, it opens new avenues for integrating attention-based mechanisms more deeply into feature extraction, suggesting potential explorations in other computer vision tasks beyond crowd counting.
From a future development perspective, extending this hierarchical attention framework to different backbone networks could further enhance its applicability and robustness. Further research into innovative semi-supervised or weakly supervised methodologies could also enhance the network's adaptability to new domains, offering increased resilience against dataset biases.
In conclusion, the HA-CCN represents a significant methodological advancement in the field of crowd counting, demonstrating notable improvements in accuracy and efficiency. Its attention-based architecture provides critical insights into the evolution of feature enhancement techniques in complex visual domains.