Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting (2305.04440v2)
Abstract: Class-agnostic counting (CAC) aims to count objects of interest in a query image given a few exemplars. This task is typically addressed by extracting the features of the query image and the exemplars separately and then matching their feature similarity, leading to an extract-then-match paradigm. In this work, we show that CAC can be simplified in an extract-and-match manner, particularly using a vision transformer (ViT), where feature extraction and similarity matching are executed simultaneously within the self-attention. We reveal the rationale of such simplification from a decoupled view of the self-attention. The resulting model, termed CACViT, simplifies the CAC pipeline into a single pretrained plain ViT. Further, to compensate for the loss of scale and order-of-magnitude information caused by resizing and normalization in a plain ViT, we present two effective strategies for scale and magnitude embedding. Extensive experiments on the FSC147 and CARPK datasets show that CACViT significantly outperforms state-of-the-art CAC approaches in both effectiveness (23.60% error reduction) and generalization, which suggests that CACViT provides a concise and strong baseline for CAC. Code will be available.
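The extract-and-match idea can be illustrated with a minimal sketch. It is not the CACViT implementation: it assumes a single attention head, random projection weights, and no positional, scale, or magnitude embedding. The function `joint_self_attention` and the block names are hypothetical, chosen here only for illustration. When query-image tokens and exemplar tokens are concatenated into one sequence, the self-attention map splits into four blocks: the query-to-query block acts like feature extraction, while the query-to-exemplar block acts like similarity matching.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(matrix):
    """Apply a numerically stable softmax to every row."""
    result = []
    for row in matrix:
        row_max = max(row)
        exps = [math.exp(value - row_max) for value in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def joint_self_attention(query_tokens, exemplar_tokens, w_q, w_k, w_v):
    """Single-head self-attention over concatenated query and exemplar tokens.

    Returns the attended tokens and the four blocks of the attention map.
    All names here are illustrative assumptions, not the paper's API.
    """
    tokens = query_tokens + exemplar_tokens  # concatenate along the sequence axis
    q = matmul(tokens, w_q)
    k = matmul(tokens, w_k)
    v = matmul(tokens, w_v)

    scale = math.sqrt(len(k[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_transposed)]
    attn = softmax_rows(scores)

    n = len(query_tokens)
    blocks = {
        "query_to_query": [row[:n] for row in attn[:n]],        # feature extraction
        "query_to_exemplar": [row[n:] for row in attn[:n]],     # similarity matching
        "exemplar_to_query": [row[:n] for row in attn[n:]],
        "exemplar_to_exemplar": [row[n:] for row in attn[n:]],
    }
    return matmul(attn, v), blocks


random.seed(0)
dim, num_query, num_exemplar = 4, 6, 2


def random_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


query_tokens = random_matrix(num_query, dim)
exemplar_tokens = random_matrix(num_exemplar, dim)
w_q, w_k, w_v = (random_matrix(dim, dim) for _ in range(3))

out, blocks = joint_self_attention(query_tokens, exemplar_tokens, w_q, w_k, w_v)
```

In this toy setup, each row of the `query_to_exemplar` block gives the attention weight one query token places on each exemplar token, so both roles are computed in the same softmax rather than in two separate stages.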