CLIP-EBC: CLIP Can Count Accurately through Enhanced Blockwise Classification (2403.09281v2)
Abstract: We propose CLIP-EBC, the first fully CLIP-based model for accurate crowd density estimation. While the CLIP model has demonstrated remarkable success in addressing recognition tasks such as zero-shot image classification, its potential for counting has been largely unexplored due to the inherent challenges in transforming a regression problem, such as counting, into a recognition task. In this work, we investigate and enhance CLIP's ability to count, focusing specifically on the task of estimating crowd sizes from images. Existing classification-based crowd-counting frameworks have significant limitations, including the quantization of count values into bordering real-valued bins and the sole focus on classification errors. These practices result in label ambiguity near the shared borders and inaccurate prediction of count values. Hence, directly applying CLIP within these frameworks may yield suboptimal performance. To address these challenges, we first propose the Enhanced Blockwise Classification (EBC) framework. Unlike previous methods, EBC utilizes integer-valued bins, effectively reducing ambiguity near bin boundaries. Additionally, it incorporates a regression loss based on density maps to improve the prediction of count values. Within our backbone-agnostic EBC framework, we then introduce CLIP-EBC to fully leverage CLIP's recognition capabilities for this task. Extensive experiments demonstrate the effectiveness of EBC and the competitive performance of CLIP-EBC. Specifically, our EBC framework can improve existing classification-based methods by up to 44.5% on the UCF-QNRF dataset, and CLIP-EBC achieves state-of-the-art performance on the NWPU-Crowd test set, with an MAE of 58.2 and an RMSE of 268.5, representing improvements of 8.6% and 13.3% over the previous best method, STEERER. The code and weights are available at https://github.com/Yiming-M/CLIP-EBC.
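The two ideas the abstract attributes to EBC, integer-valued bins and a count-aware regression term added to the classification loss, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the bin ranges in `BINS`, the midpoint values in `BIN_VALUES`, the `reg_weight` parameter, and the plain absolute-error term, which stands in for the density-map-based regression loss the abstract mentions.

```python
import math

# Hypothetical integer-valued bins: each bin is an inclusive (low, high) range
# of per-block counts. Because block counts are integers, no count can fall on
# a border shared by two bins.
BINS = [(0, 0), (1, 1), (2, 2), (3, 4), (5, 8)]

# Assumed representative count for each bin (the midpoint of its range).
BIN_VALUES = [(lo + hi) / 2 for lo, hi in BINS]


def count_to_bin(count: int) -> int:
    """Return the index of the bin that contains an integer block count."""
    for idx, (lo, hi) in enumerate(BINS):
        if lo <= count <= hi:
            return idx
    return len(BINS) - 1  # clamp counts above the last bin


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_count(probs):
    """Turn bin probabilities into a count via the probability-weighted mean."""
    return sum(p * v for p, v in zip(probs, BIN_VALUES))


def block_loss(logits, true_count, reg_weight=1.0):
    """Cross-entropy on the bin label plus an absolute error on the count.

    The absolute-error term is a simplified stand-in for a regression loss;
    it penalizes predictions whose implied count is far from the truth, which
    a classification loss alone does not do.
    """
    probs = softmax(logits)
    ce = -math.log(probs[count_to_bin(true_count)])
    reg = abs(expected_count(probs) - true_count)
    return ce + reg_weight * reg
```

For a block whose true count is 3, `block_loss([0, 0, 0, 5, 0], 3)` is smaller than `block_loss([5, 0, 0, 0, 0], 3)`: the second prediction is penalized both for choosing the wrong bin and for implying a count far from 3.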
Authors: Yiming Ma, Victor Sanchez, Tanaya Guha