- The paper introduces TransCrowd, a novel Transformer-based framework for weakly-supervised crowd counting that relies solely on count-level annotations.
- It reformulates crowd counting as a sequence-to-count problem, employing self-attention to capture global contextual information within images.
- Experiments on five benchmark datasets show significant improvements, with TransCrowd-GAP reducing MAE by up to 17.5% relative to prior weakly-supervised methods.
Overview of "TransCrowd: Weakly-Supervised Crowd Counting with Transformers"
The paper introduces TransCrowd, a novel approach that tackles crowd counting in images with a weakly-supervised, Transformer-based framework. Traditional crowd counting methods typically rely on convolutional neural networks (CNNs) that regress density maps and therefore require point-level annotations, i.e., a labeled coordinate for every person in every training image. While these methods achieve strong accuracy, collecting such annotations is labor-intensive; moreover, evaluation at test time uses only the total count, so the point locations play no role beyond training. Methods that rely on count-level annotations alone are therefore considerably cheaper to supervise.
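To make the contrast concrete, the sketch below (PyTorch, with a placeholder model and a Smooth L1 loss chosen purely for illustration, not taken from the paper) shows what count-level supervision amounts to: the only label per image is a scalar head count.

```python
import torch
import torch.nn as nn

# Placeholder counting model for illustration only; TransCrowd itself uses a
# Transformer encoder (see the methodology section below).
counting_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1))

images = torch.randn(4, 3, 224, 224)                         # a batch of crowd images
gt_counts = torch.tensor([[120.0], [37.0], [950.0], [8.0]])  # one scalar count per image

pred_counts = counting_model(images)                         # (4, 1) predicted counts
loss = nn.SmoothL1Loss()(pred_counts, gt_counts)             # regression on counts only
loss.backward()                                              # no density maps or point labels involved
```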
Motivations and Methodology
The proposed method departs from traditional CNN-based paradigms, whose inherently limited receptive fields constrain context modeling and, in turn, counting performance. Transformers, in contrast, offer a global receptive field by operating on a sequence of tokens with self-attention. The paper argues that crowd counting can be beneficially reformulated as a sequence-to-count problem and, to the authors' knowledge, makes the first attempt to apply a pure Transformer to this domain; a sketch of the image-to-sequence step follows.
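The sketch below illustrates the standard way an image becomes a token sequence for a Transformer: split it into fixed-size patches and linearly embed each patch. The 16x16 patch size, 256-dimensional width, and 224x224 input are illustrative assumptions, not necessarily the paper's exact settings.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Turn an image into a sequence of patch tokens (the Transformer's input)."""

    def __init__(self, patch: int = 16, in_ch: int = 3, dim: int = 256):
        super().__init__()
        # A strided convolution implements "split into patches + linear projection".
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        x = self.proj(img)                   # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)  # (B, N, dim): one token per patch

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 256]) -> a sequence of 196 patch tokens
```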
TransCrowd leverages the Transformer's self-attention to extract semantic information from crowd images, and comes in two variants: TransCrowd-Token and TransCrowd-GAP. The former prepends an additional learnable token whose output embedding is regressed to the count, while the latter applies global average pooling over the Transformer encoder's output sequence before regression; a sketch of both heads is given below.
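The following minimal, self-contained PyTorch sketch shows the two counting heads on top of a generic Transformer encoder, consuming patch tokens like those produced above. The encoder depth, width, and head count are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

def make_encoder(dim: int = 256, depth: int = 4, heads: int = 4) -> nn.Module:
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class TransCrowdTokenHead(nn.Module):
    """TransCrowd-Token style: an extra learnable token carries the count."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.count_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = make_encoder(dim)
        self.regressor = nn.Linear(dim, 1)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        b = patch_tokens.size(0)
        x = torch.cat([self.count_token.expand(b, -1, -1), patch_tokens], dim=1)
        x = self.encoder(x)                 # (B, 1 + N, dim)
        return self.regressor(x[:, 0])      # read the count off the extra token

class TransCrowdGAPHead(nn.Module):
    """TransCrowd-GAP style: global average pooling over the output sequence."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.encoder = make_encoder(dim)
        self.regressor = nn.Linear(dim, 1)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        x = self.encoder(patch_tokens)        # (B, N, dim)
        return self.regressor(x.mean(dim=1))  # pool every token, then regress

patches = torch.randn(2, 196, 256)  # e.g. output of the patch embedding above
print(TransCrowdTokenHead()(patches).shape, TransCrowdGAPHead()(patches).shape)  # (2, 1) each
```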
Quantitative and Qualitative Results
Experiments on five benchmark datasets show that TransCrowd significantly outperforms previous weakly-supervised CNN-based methods and remains competitive with strong fully-supervised methods. On the ShanghaiTech Part A dataset, for instance, TransCrowd-GAP improves MAE by 17.5% and MSE by 18.8% over the best prior weakly-supervised approach. This highlights TransCrowd's ability to extract and exploit semantic information from images annotated only with counts.
The self-attention mechanism also lets the model focus on crowd-relevant regions of the image. Visualizations of the attention maps show that TransCrowd-GAP produces more accurate attention weights than TransCrowd-Token, further supporting its more effective use of the encoded information.
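As a rough illustration of how such maps can be obtained, the sketch below averages one attention layer's weights over its query tokens and reshapes them onto the patch grid; the layer choice, grid size, and head-averaging are assumptions for illustration, not the paper's exact visualization procedure.

```python
import torch
import torch.nn as nn

dim, heads, grid = 256, 4, 14
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

tokens = torch.randn(1, grid * grid, dim)  # encoded patch tokens for one image
# need_weights=True returns attention weights (averaged over heads by default),
# shape (B, N, N): how much each query token attends to every other token.
_, weights = attn(tokens, tokens, tokens, need_weights=True)

# Averaging over the query dimension yields a per-patch saliency score that can
# be reshaped onto the 14x14 patch grid and overlaid on the input image.
saliency = weights.mean(dim=1).reshape(grid, grid)
print(saliency.shape)  # torch.Size([14, 14])
```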
Implications and Future Directions
The practical implications are significant. By relying solely on count-level annotations, TransCrowd sharply reduces the annotation burden, making deployment in real-world scenarios more feasible. Its competitive performance against fully-supervised models also suggests that dense point-level labels may not be strictly necessary for accurate crowd counting.
Looking ahead, the approach could be extended to fully-supervised counting with Transformer architectures. Applying similar sequence-to-count formulations to video-based crowd analysis is another promising direction: given the Transformer's strength at modeling sequential data, temporal dynamics within crowd scenes could be captured to further improve performance.
In summary, TransCrowd marks a meaningful step in crowd counting methodology, exploiting the Transformer's global context modeling to address the intrinsic challenges of the task. The approach both simplifies annotation and is well suited to real-world applications where crowds are dynamic and diverse.