
CLIP-EBC: CLIP Can Count Accurately through Enhanced Blockwise Classification (2403.09281v2)

Published 14 Mar 2024 in cs.CV

Abstract: We propose CLIP-EBC, the first fully CLIP-based model for accurate crowd density estimation. While the CLIP model has demonstrated remarkable success in addressing recognition tasks such as zero-shot image classification, its potential for counting has been largely unexplored due to the inherent challenges in transforming a regression problem, such as counting, into a recognition task. In this work, we investigate and enhance CLIP's ability to count, focusing specifically on the task of estimating crowd sizes from images. Existing classification-based crowd-counting frameworks have significant limitations, including the quantization of count values into bordering real-valued bins and the sole focus on classification errors. These practices result in label ambiguity near the shared borders and inaccurate prediction of count values. Hence, directly applying CLIP within these frameworks may yield suboptimal performance. To address these challenges, we first propose the Enhanced Blockwise Classification (EBC) framework. Unlike previous methods, EBC utilizes integer-valued bins, effectively reducing ambiguity near bin boundaries. Additionally, it incorporates a regression loss based on density maps to improve the prediction of count values. Within our backbone-agnostic EBC framework, we then introduce CLIP-EBC to fully leverage CLIP's recognition capabilities for this task. Extensive experiments demonstrate the effectiveness of EBC and the competitive performance of CLIP-EBC. Specifically, our EBC framework can improve existing classification-based methods by up to 44.5% on the UCF-QNRF dataset, and CLIP-EBC achieves state-of-the-art performance on the NWPU-Crowd test set, with an MAE of 58.2 and an RMSE of 268.5, representing improvements of 8.6% and 13.3% over the previous best method, STEERER. The code and weights are available at https://github.com/Yiming-M/CLIP-EBC.
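The two ideas the abstract attributes to EBC, integer-valued bins and a regression term added to the classification loss, can be illustrated with a minimal per-block sketch. This is not the authors' implementation: the bin layout, the midpoint representative values, the absolute-error term (a simple stand-in for the density-map-based regression loss, whose form the abstract does not specify), and all function names are assumptions made for illustration.

```python
import math

# Hypothetical integer-valued bins: each bin is an inclusive range of integer
# counts, so every possible block count belongs to exactly one bin and no two
# bins share a border.
BINS = [(0, 0), (1, 1), (2, 2), (3, 4), (5, 8)]

# Assumed representative count for each bin (here, the midpoint of its range).
BIN_VALUES = [(lo + hi) / 2 for lo, hi in BINS]


def bin_index(count):
    """Return the index of the bin containing an integer block count."""
    for i, (lo, hi) in enumerate(BINS):
        if lo <= count <= hi:
            return i
    return len(BINS) - 1  # counts above the last range are clamped to the last bin


def softmax(logits):
    """Numerically stable softmax over a list of per-bin logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_count(probs):
    """Predicted block count as the probability-weighted sum of bin values."""
    return sum(p * v for p, v in zip(probs, BIN_VALUES))


def block_loss(logits, true_count, weight=1.0):
    """Cross-entropy on the bin label plus an absolute count error.

    The absolute error is only a placeholder for the regression term; the
    weight balancing the two terms is likewise an assumption.
    """
    probs = softmax(logits)
    classification = -math.log(probs[bin_index(true_count)])
    regression = abs(expected_count(probs) - true_count)
    return classification + weight * regression
```

With this layout a block containing 3 or 4 people maps to the same bin, and a prediction is penalized both for choosing the wrong bin and for how far its expected count lies from the true count, which a classification-only loss would ignore.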

Authors (3)
  1. Yiming Ma
  2. Victor Sanchez
  3. Tanaya Guha