VLCounter: Text-aware Visual Representation for Zero-Shot Object Counting (2312.16580v2)
Abstract: Zero-Shot Object Counting (ZSOC) aims to count referred instances of arbitrary classes in a query image without human-annotated exemplars. To deal with ZSOC, preceding studies proposed a two-stage pipeline: discovering exemplars and counting. However, this sequentially designed two-stage process remains vulnerable to error propagation. In this work, a one-stage baseline, Visual-Language Baseline (VLBase), is proposed that explores the implicit association of the semantic-patch embeddings of CLIP. VLBase is then extended to Visual-language Counter (VLCounter) by incorporating three modules devised to tailor VLBase for object counting. First, Semantic-conditioned Prompt Tuning (SPT) is introduced within the image encoder to acquire target-highlighted representations. Second, Learnable Affine Transformation (LAT) is employed to adapt the semantic-patch similarity map to the counting task. Lastly, the layer-wise encoded features are transferred to the decoder through Segment-aware Skip Connection (SaSC) to preserve the generalization capability for unseen classes. Extensive experiments on FSC147, CARPK, and PUCPR+ demonstrate the benefits of the end-to-end framework, VLCounter.