Context-Based Visual-Language Place Recognition (2410.19341v1)
Abstract: Visual Place Recognition (VPR), the task of accurately recognizing the location corresponding to a given query image, is essential for vision-based robot localization and SLAM. A popular approach to VPR relies on low-level visual features, and despite significant progress in recent years, it remains challenging when scene appearance changes. End-to-end training approaches have been proposed to overcome the limitations of hand-crafted features, but they still fail under drastic changes and require large amounts of labeled data to train, which is a significant limitation. Methods that leverage high-level semantic information, such as objects or categories, have been proposed to handle variations in appearance. In this paper, we introduce a novel VPR approach that remains robust to scene changes and requires no additional training. Our method constructs semantic image descriptors by extracting pixel-level embeddings with a zero-shot, language-driven semantic segmentation model. We validate our approach in challenging place recognition scenarios on a real-world public dataset. The experiments demonstrate that our method outperforms non-learned image representation techniques and off-the-shelf convolutional neural network (CNN) descriptors. Our code is available at https://github.com/woo-soojin/context-based-vlpr.
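The abstract describes the pipeline only at a high level: per-pixel embeddings from a zero-shot, language-driven segmentation model are aggregated into a semantic image descriptor, and a query is localized by matching its descriptor against a database. The sketch below is a minimal NumPy illustration of one plausible realization; `pixel_embeddings`, the category vocabulary, and the histogram-style aggregation are illustrative assumptions for this sketch, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def pixel_embeddings(image: np.ndarray) -> np.ndarray:
    """Placeholder for a zero-shot, language-driven segmentation
    backbone (e.g., LSeg): should return an (H, W, D) array of
    per-pixel embeddings in the joint image-text space. Random
    values stand in for the real model output here."""
    h, w = image.shape[:2]
    return rng.standard_normal((h, w, 512))

def semantic_descriptor(image: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Aggregate pixel embeddings into one per-image descriptor.

    text_embs: (C, D) embeddings of C category names from the model's
    text encoder. Each pixel is assigned its most similar category; the
    descriptor is the normalized category histogram, a simple
    'context' summary of which semantic classes appear in the scene.
    """
    emb = pixel_embeddings(image)                              # (H, W, D)
    emb = emb / np.linalg.norm(emb, axis=-1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    sims = emb @ txt.T                                         # (H, W, C)
    labels = sims.argmax(axis=-1)                              # per-pixel class
    hist = np.bincount(labels.ravel(), minlength=txt.shape[0]).astype(float)
    return hist / hist.sum()

def retrieve(query_desc: np.ndarray, db_descs: np.ndarray) -> int:
    """Return the index of the database image whose descriptor is
    most similar to the query under cosine similarity."""
    q = query_desc / (np.linalg.norm(query_desc) + 1e-12)
    db = db_descs / (np.linalg.norm(db_descs, axis=1, keepdims=True) + 1e-12)
    return int(np.argmax(db @ q))

# Usage sketch: build database descriptors, then match a query.
cats = rng.standard_normal((20, 512))  # stand-in text embeddings
db = np.stack([semantic_descriptor(rng.random((120, 160, 3)), cats)
               for _ in range(5)])
query = semantic_descriptor(rng.random((120, 160, 3)), cats)
print(retrieve(query, db))
```

Matching category-level descriptors like this is one simple way to gain robustness to appearance change, since the descriptor depends on which semantic classes are present rather than on low-level texture or lighting.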