From a Bird's Eye View to See: Joint Camera and Subject Registration without the Camera Calibration (2212.09298v3)
Abstract: We tackle a new problem: multi-view camera and subject registration in the bird's eye view (BEV) without pre-given camera calibration. The problem is challenging because the only input is several RGB images of a multi-person scene taken from different first-person views (FPVs), with neither a BEV image nor FPV calibration, while the output is a unified plane giving the location and orientation of both the subjects and the cameras in a BEV. We propose an end-to-end framework for this problem whose main idea can be divided into the following parts: i) a view-transform subject detection module that transforms each FPV into a virtual BEV containing the location and orientation of each pedestrian; ii) a geometric-transformation-based method that estimates each camera's location and view direction, i.e., the camera registration in a unified BEV; iii) the use of spatial and appearance information to aggregate the subjects into the unified BEV. We collect a new large-scale synthetic dataset with rich annotations for evaluation. Experimental results demonstrate the effectiveness of the proposed method.
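To make part ii) concrete, the following is a minimal sketch of one way a camera could be registered in another camera's BEV. It is not the paper's method: it assumes subject correspondences between the two virtual BEVs are already known, that each camera sits at the origin of its own BEV looking along +y, and it swaps in a standard least-squares rigid alignment (the Kabsch/Procrustes solution). The names `estimate_rigid_2d` and `register_camera` are hypothetical.

```python
import numpy as np


def estimate_rigid_2d(src, dst):
    """Least-squares rotation R and translation t such that dst ~ R @ src + t."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    # 2x2 cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Force a proper rotation (determinant +1), never a reflection.
    sign = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, sign]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def register_camera(subjects_in_b, subjects_in_a):
    """Locate camera B in camera A's BEV from matched subject positions.

    Assumption (not from the paper): each camera is at the origin of its
    own BEV and looks along the +y axis.
    """
    R, t = estimate_rigid_2d(subjects_in_b, subjects_in_a)
    position = t                          # image of B's origin in A's BEV
    direction = R @ np.array([0.0, 1.0])  # image of B's +y view axis
    return position, direction


# Synthetic check: camera B is rotated by 30 degrees and shifted by (2, -1).
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
t_true = np.array([2.0, -1.0])
subjects_b = np.array([[0.0, 3.0], [1.0, 5.0], [-2.0, 4.0]])
subjects_a = subjects_b @ R_true.T + t_true

cam_pos, cam_dir = register_camera(subjects_b, subjects_a)
print(cam_pos, cam_dir)
```

At least two non-coincident matched subjects are needed for the rotation to be determined. With noisy detections the same least-squares solution still applies, though a robust variant would likely be preferable in practice.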