MSSPlace: Multi-Sensor Place Recognition with Visual and Text Semantics (2407.15663v1)
Abstract: Place recognition is a challenging task in computer vision and a crucial capability for autonomous vehicles and robots navigating previously visited environments. While significant progress has been made in learnable multimodal methods that combine onboard camera images and LiDAR point clouds, the full potential of these methods remains largely unexplored in localization applications. In this paper, we study the impact of leveraging a multi-camera setup and integrating diverse data sources for multimodal place recognition, incorporating explicit visual semantics and text descriptions. Our proposed method, MSSPlace, uses images from multiple cameras, LiDAR point clouds, semantic segmentation masks, and text annotations to generate comprehensive place descriptors, and employs a late fusion approach to integrate these modalities into a unified representation. Through extensive experiments on the Oxford RobotCar and NCLT datasets, we systematically analyze the impact of each data source on the overall quality of place descriptors. Our experiments demonstrate that combining data from multiple sensors significantly improves place recognition performance compared to single-modality approaches and achieves state-of-the-art quality. We also show that visual or textual semantics alone, which are more compact representations of the sensory data, can achieve promising results in place recognition. The code for our method is publicly available: https://github.com/alexmelekhin/MSSPlace
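As a rough illustration of the late-fusion idea described in the abstract, the minimal PyTorch sketch below encodes each modality independently into a fixed-size descriptor and only then merges the results into a single place descriptor. The class name, the concatenate-and-project fusion, and the stand-in encoders are illustrative assumptions, not the authors' implementation; see the linked repository for the actual MSSPlace code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LateFusionDescriptor(nn.Module):
    """Sketch of late fusion for place recognition (assumed design,
    not the MSSPlace implementation).

    `encoders` maps a modality name (e.g. 'image', 'lidar', 'semantic',
    'text') to a module producing a (B, dim) embedding. Modalities do
    not interact until their descriptors are fused at the end.
    """

    def __init__(self, encoders: dict[str, nn.Module], dim: int, out_dim: int):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)
        # One plausible fusion choice: concatenate per-modality
        # descriptors, then project to the final descriptor size.
        self.project = nn.Linear(dim * len(encoders), out_dim)

    def forward(self, inputs: dict[str, torch.Tensor]) -> torch.Tensor:
        # Encode each modality separately (the "late" in late fusion).
        feats = [self.encoders[name](inputs[name]) for name in self.encoders]
        fused = self.project(torch.cat(feats, dim=-1))
        # L2-normalize so descriptors can be compared by cosine or
        # Euclidean distance during retrieval.
        return F.normalize(fused, dim=-1)


# Toy usage with stand-in encoders; real backbones would be CNN/transformer
# image encoders, a sparse-voxel point-cloud network, a text encoder, etc.
encoders = {
    "image": nn.Sequential(nn.Flatten(), nn.LazyLinear(256)),
    "lidar": nn.Sequential(nn.Flatten(), nn.LazyLinear(256)),
}
model = LateFusionDescriptor(encoders, dim=256, out_dim=512)
desc = model({
    "image": torch.randn(2, 3, 64, 64),   # batch of camera images
    "lidar": torch.randn(2, 4096, 3),     # batch of point clouds
})
print(desc.shape)  # torch.Size([2, 512])
```

A practical advantage of this scheme, consistent with the ablation-style analysis the abstract describes, is that individual modality branches can be trained, evaluated, or dropped independently, since fusion happens only at the descriptor level.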