VLCounter: Text-aware Visual Representation for Zero-Shot Object Counting (2312.16580v2)

Published 27 Dec 2023 in cs.CV

Abstract: Zero-Shot Object Counting (ZSOC) aims to count referred instances of arbitrary classes in a query image without human-annotated exemplars. To deal with ZSOC, preceding studies proposed a two-stage pipeline: discovering exemplars and then counting. However, this sequential two-stage design remains vulnerable to error propagation. In this work, a one-stage baseline, Visual-Language Baseline (VLBase), which explores the implicit association of the semantic-patch embeddings of CLIP, is proposed. VLBase is then extended to Visual-Language Counter (VLCounter) by incorporating three modules devised to tailor VLBase for object counting. First, Semantic-conditioned Prompt Tuning (SPT) is introduced within the image encoder to acquire target-highlighted representations. Second, Learnable Affine Transformation (LAT) is employed to adapt the semantic-patch similarity map to the counting task. Lastly, the layer-wise encoded features are transferred to the decoder through Segment-aware Skip Connection (SaSC) to preserve the generalization capability for unseen classes. Extensive experiments on FSC147, CARPK, and PUCPR+ demonstrate the benefits of the end-to-end framework, VLCounter.
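
The idea of a semantic-patch similarity map followed by an affine adjustment can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the scalar `scale` and `shift` parameters (fixed here rather than learned), and the 2-dimensional toy embeddings are all assumptions made for illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_map(text_embedding, patch_embeddings):
    """Compare one text embedding with every patch embedding in a 2D grid."""
    return [[cosine_similarity(text_embedding, patch) for patch in row]
            for row in patch_embeddings]

def affine_transform(sim_map, scale, shift):
    """Apply an element-wise affine map (hypothetical scalar parameters)."""
    return [[scale * s + shift for s in row] for row in sim_map]

# Toy data (assumed): a 2x2 grid of 2-dimensional patch embeddings.
text = [1.0, 0.0]
patches = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 3.0]],
]

sim = similarity_map(text, patches)
adjusted = affine_transform(sim, scale=2.0, shift=-1.0)
```

In this toy grid, patches aligned with the text embedding score high and orthogonal patches score zero; the affine step only rescales and shifts those scores. In a trained model the scale and shift would be learned parameters, and a decoder would turn the adjusted map into a count.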
