CountFormer: Multi-View Crowd Counting Transformer (2407.02047v1)
Abstract: Multi-view counting (MVC) methods have shown their superiority over single-view counterparts, particularly in situations characterized by heavy occlusion and severe perspective distortion. However, hand-crafted heuristic features and identical camera layout requirements in conventional MVC methods limit their applicability and scalability in real-world scenarios. In this work, we propose a concise 3D MVC framework called CountFormer, which elevates multi-view image-level features to a scene-level volume representation and estimates the 3D density map from the volume features. By incorporating a camera encoding strategy, CountFormer embeds camera parameters into the volume query and image-level features, enabling it to handle camera layouts that differ significantly from one another. Furthermore, we introduce a feature lifting module, built on the attention mechanism, that transforms image-level features into a 3D volume representation for each camera view. Subsequently, the multi-view volume aggregation module attentively aggregates the per-view volumes into a comprehensive scene-level volume representation, allowing CountFormer to handle images captured by arbitrary dynamic camera layouts. The proposed method performs favorably against state-of-the-art approaches across various widely used datasets, demonstrating its greater suitability for real-world deployment compared to conventional MVC frameworks.
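The aggregation and counting steps described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes that each view has already been lifted to a per-voxel feature vector, that each view supplies a scalar relevance score per voxel, and that a hypothetical linear head with a ReLU maps fused features to a non-negative density. The names `aggregate_views`, `density_head`, and `count_from_density` are illustrative only. The one standard ingredient is that, in density-based counting, the predicted count is the sum of the density map.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_views(view_volumes, view_scores):
    """Fuse per-view voxel features with attention weights across views.

    view_volumes[v][i] is the feature vector of voxel i lifted from view v.
    view_scores[v][i] is an assumed scalar relevance score for that voxel.
    Returns one fused feature vector per voxel.
    """
    num_views = len(view_volumes)
    num_voxels = len(view_volumes[0])
    fused = []
    for i in range(num_voxels):
        # Normalise the scores of all views that observe voxel i.
        weights = softmax([view_scores[v][i] for v in range(num_views)])
        dim = len(view_volumes[0][i])
        fused.append([
            sum(weights[v] * view_volumes[v][i][d] for v in range(num_views))
            for d in range(dim)
        ])
    return fused


def density_head(fused, weight, bias):
    """Hypothetical linear head with ReLU: one non-negative density per voxel."""
    return [
        max(0.0, sum(w * f for w, f in zip(weight, feat)) + bias)
        for feat in fused
    ]


def count_from_density(density):
    """The predicted count is the sum of the density map."""
    return sum(density)


# Two views, two voxels, two feature channels; equal scores give a plain average.
volumes = [
    [[1.0, 0.0], [2.0, 2.0]],  # view 0
    [[3.0, 4.0], [0.0, 0.0]],  # view 1
]
scores = [[0.0, 0.0], [0.0, 0.0]]
fused = aggregate_views(volumes, scores)
density = density_head(fused, weight=[0.5, 0.5], bias=0.0)
total_count = count_from_density(density)
```

Because the softmax runs over views independently for each voxel, the sketch accepts any number of views, and a view that does not observe a voxel well can be down-weighted by giving it a low score there.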