CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model (2304.04231v1)

Published 9 Apr 2023 in cs.CV

Abstract: Supervised crowd counting relies heavily on costly manual labeling, which is difficult and expensive, especially in dense scenes. To alleviate the problem, we propose a novel unsupervised framework for crowd counting, named CrowdCLIP. The core idea is built on two observations: 1) the recent contrastive pre-trained vision-language model (CLIP) has presented impressive performance on various downstream tasks; 2) there is a natural mapping between crowd patches and count text. To the best of our knowledge, CrowdCLIP is the first to investigate the vision language knowledge to solve the counting problem. Specifically, in the training stage, we exploit the multi-modal ranking loss by constructing ranking text prompts to match the size-sorted crowd patches to guide the image encoder learning. In the testing stage, to deal with the diversity of image patches, we propose a simple yet effective progressive filtering strategy to first select the highly potential crowd patches and then map them into the language space with various counting intervals. Extensive experiments on five challenging datasets demonstrate that the proposed CrowdCLIP achieves superior performance compared to previous unsupervised state-of-the-art counting methods. Notably, CrowdCLIP even surpasses some popular fully-supervised methods under the cross-dataset setting. The source code will be available at https://github.com/dk-liang/CrowdCLIP.

Unsupervised Crowd Counting via Vision-Language Models: An Exploration with CrowdCLIP

The paper "CrowdCLIP: Unsupervised Crowd Counting via Vision-LLM" introduces a novel approach to crowd counting that leverages vision-language pre-trained models, specifically CLIP, for unsupervised learning. The primary motivation is to address the intensive manual labeling required for supervised crowd counting, especially in dense scenes, by providing a framework that does not rely on labeled data.

Core Concept and Methodology

CrowdCLIP is built on two foundational observations. First, CLIP, a contrastively pre-trained vision-language model, demonstrates strong performance on diverse downstream tasks by exploiting the image-text correlations it has learned. Second, there is an inherent mapping between crowd image patches and textual representations of count intervals. The paper pioneers the use of vision-language knowledge to tackle the crowd counting problem without labeled supervision.
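To make the patch-to-count-text mapping concrete, the minimal sketch below scores a single crowd patch against a handful of count prompts using OpenAI's CLIP package. The prompt wording, interval values, and file name are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of mapping a crowd patch to count text via CLIP similarity.
# Assumes OpenAI's CLIP package (https://github.com/openai/CLIP) is installed.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical count intervals and prompt template (illustrative only).
intervals = ["0", "10", "20", "40", "80", "160"]
prompts = [f"There are {c} persons in the crowd." for c in intervals]
text_tokens = clip.tokenize(prompts).to(device)

patch = preprocess(Image.open("crowd_patch.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(patch)
    text_feat = model.encode_text(text_tokens)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)  # cosine similarity per prompt

print("Predicted count interval:", intervals[sims.argmax().item()])
```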

Training Phase: The authors introduce a ranking-based fine-tuning strategy in which image patches are sorted by size and paired with ordered ranking text prompts. The image encoder is refined with a multi-modal ranking loss that maps the patches into the language space defined by the CLIP text encoder while preserving their size ordering. This approach harnesses CLIP's ability to capture semantic correlations between visual and textual data.
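As a rough illustration, a margin-based pairwise ranking loss over an image-text similarity matrix could look like the sketch below, where `sim[i, j]` is the cosine similarity between the i-th size-sorted patch and the j-th ranking prompt. The exact loss formulation, margin, and pairing scheme in CrowdCLIP may differ; treat this as a hedged stand-in, not the paper's implementation.

```python
# Sketch of a margin-based multi-modal ranking loss (illustrative assumption).
import torch

def multimodal_ranking_loss(sim: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """sim: (N, N) image-text similarity matrix for N size-ranked patch/prompt pairs."""
    n = sim.size(0)
    diag = sim.diag()  # similarity of each patch to its own ranking prompt
    loss = sim.new_zeros(())
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Each patch should match its own ranking prompt more closely than
            # any other prompt, by at least `margin`.
            loss = loss + torch.clamp(margin + sim[i, j] - diag[i], min=0)
    return loss / (n * (n - 1))

# Example with random features standing in for CLIP image/text embeddings.
img = torch.nn.functional.normalize(torch.randn(4, 512), dim=-1)
txt = torch.nn.functional.normalize(torch.randn(4, 512), dim=-1)
print(multimodal_ranking_loss(img @ txt.T))
```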

Testing Phase: A progressive filtering strategy with three stages selects the relevant crowd patches before mapping them to count intervals. The model first filters for likely crowd-related patches, then refines them to emphasize human heads, and finally predicts the count for the remaining patches via the fine-tuned image encoder.
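One possible way to realize such a pipeline, assuming each stage is a CLIP zero-shot classification over a different prompt set, is sketched below. The prompt wording, crop grid, and count intervals are invented for illustration and are not taken from the paper or its repository.

```python
# Hedged sketch of a three-stage progressive filtering pipeline at test time.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_scores(images, prompts):
    """Cosine similarities between a list of PIL crops and a prompt list."""
    batch = torch.stack([preprocess(im) for im in images]).to(device)
    tokens = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        img = model.encode_image(batch)
        txt = model.encode_text(tokens)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
    return img @ txt.T  # shape: (num_crops, num_prompts)

def progressive_filter(crops):
    # Stage 1: keep crops that look like a crowd rather than background.
    s1 = clip_scores(crops, ["a photo of a crowd", "a photo of a background scene"])
    crowd = [c for c, s in zip(crops, s1) if s.argmax().item() == 0]
    if not crowd:
        return 0
    # Stage 2: keep crops dominated by human heads (the counting cue).
    s2 = clip_scores(crowd, ["a photo of human heads", "a photo of other objects"])
    heads = [c for c, s in zip(crowd, s2) if s.argmax().item() == 0]
    if not heads:
        return 0
    # Stage 3: map each remaining crop to a count interval and sum the intervals.
    intervals = [0, 10, 20, 40, 80, 160]
    prompts = [f"There are {k} persons in the crowd." for k in intervals]
    s3 = clip_scores(heads, prompts)
    return sum(intervals[s.argmax().item()] for s in s3)

image = Image.open("scene.jpg").convert("RGB")
w, h = image.size
crops = [image.crop((i * w // 4, j * h // 4, (i + 1) * w // 4, (j + 1) * h // 4))
         for i in range(4) for j in range(4)]  # simple 4x4 crop grid
print("Estimated crowd count:", progressive_filter(crops))
```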

Results and Comparative Analysis

The experimental results across five benchmark datasets (UCF-QNRF, JHU-Crowd++, ShanghaiTech Part A, ShanghaiTech Part B, and UCF_CC_50) show that CrowdCLIP achieves strong accuracy, notably surpassing previous unsupervised methods. On dense datasets such as UCF-QNRF, CrowdCLIP improves MAE by 35.2% over CSS-CCNN, the prior best-performing unsupervised method, highlighting its effectiveness at extracting crowd semantics under challenging conditions.

Furthermore, the model exhibits strong cross-dataset generalization, comparable to some fully-supervised methods. For instance, when adapted from ShanghaiTech Part A to Part B, CrowdCLIP shows competitive transferability, supporting its robustness across varied crowd scenes.

Implications and Future Directions

Practically, CrowdCLIP offers a viable solution for scenarios where large-scale labeled data is infeasible, making it particularly beneficial for real-time applications such as surveillance and urban management. Theoretically, it expands the frontier of unsupervised learning by integrating vision-language models into dense prediction tasks, which may inspire future innovations in similar domains.

Looking ahead, integrating localization capabilities into such unsupervised frameworks could enable more comprehensive crowd analysis, addressing not only counting but also spatial distribution and dynamics, and thus improving overall crowd management systems. Additionally, improving computational efficiency could broaden the method's applicability, enabling deployment on resource-constrained devices or in scenarios demanding rapid inference.

In summary, CrowdCLIP represents a significant stride in unsupervised crowd counting, offering a performance-validated framework that combines vision-language models with innovative ranking and filtering strategies.

Authors (6)
  1. Dingkang Liang (37 papers)
  2. Jiahao Xie (22 papers)
  3. Zhikang Zou (25 papers)
  4. Xiaoqing Ye (42 papers)
  5. Wei Xu (536 papers)
  6. Xiang Bai (222 papers)
Citations (37)