
CountFormer: Multi-View Crowd Counting Transformer (2407.02047v1)

Published 2 Jul 2024 in cs.CV

Abstract: Multi-view counting (MVC) methods have shown their superiority over single-view counterparts, particularly in situations characterized by heavy occlusion and severe perspective distortions. However, hand-crafted heuristic features and identical camera layout requirements in conventional MVC methods limit their applicability and scalability in real-world scenarios. In this work, we propose a concise 3D MVC framework called CountFormer to elevate multi-view image-level features to a scene-level volume representation and estimate the 3D density map based on the volume features. By incorporating a camera encoding strategy, CountFormer embeds camera parameters into the volume query and image-level features, enabling it to handle camera layouts that differ significantly. Furthermore, we introduce a feature lifting module built on the attention mechanism to transform image-level features into a 3D volume representation for each camera view. Subsequently, the multi-view volume aggregation module attentively aggregates the per-view volumes to create a comprehensive scene-level volume representation, allowing CountFormer to handle images captured by arbitrary dynamic camera layouts. The proposed method performs favorably against state-of-the-art approaches across various widely used datasets, demonstrating its greater suitability for real-world deployment compared to conventional MVC frameworks.
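
The aggregation and counting steps described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the function names (`aggregate_views`, `count_from_density`), the scaled dot-product scoring, and the nested-list volume layout are all assumptions chosen for illustration. It shows only the general idea that attention weights normalized across views let the same code fuse any number of cameras, and that a count is the sum of a density volume.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def aggregate_views(query, view_features):
    """Fuse one voxel's features across camera views with attention.

    query: feature vector of length d (a hypothetical volume query).
    view_features: one length-d feature vector per camera view.
    Returns the fused vector and the per-view attention weights.
    """
    d = len(query)
    # Scaled dot-product score between the query and each view (assumed scoring).
    scores = [dot(query, f) / math.sqrt(d) for f in view_features]
    # Normalizing across views makes the fusion independent of the view count.
    weights = softmax(scores)
    fused = [
        sum(w * f[i] for w, f in zip(weights, view_features))
        for i in range(d)
    ]
    return fused, weights


def count_from_density(density_volume):
    """Sum a 3D density volume (nested lists) to obtain a count estimate."""
    return sum(v for plane in density_volume for row in plane for v in row)


# Three views with identical features receive equal weights.
fused, weights = aggregate_views([1.0, 0.0], [[1.0, 0.0]] * 3)
total = count_from_density([[[0.5, 0.5], [1.0, 0.0]]])
```

Because the weights are computed per voxel and normalized over whatever views are present, adding or removing a camera changes only the length of `view_features`, not the code.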

