Active Semantic Localization with Graph Neural Embedding (2305.06141v5)
Abstract: Semantic localization, i.e., robot self-localization from the semantic image modality, is critical in recently emerging embodied AI applications (e.g., point-goal navigation, object-goal navigation, vision-language navigation) and topological mapping applications (e.g., graph neural SLAM, ego-centric topological maps). However, most existing work on semantic localization addresses passive vision tasks without viewpoint planning, or relies on additional rich modalities (e.g., depth measurements); the problem therefore remains largely unsolved. In this work, we explore a lightweight, entirely CPU-based, domain-adaptive semantic localization framework called the graph neural localizer. Our approach is inspired by two recently emerging technologies: (1) scene graphs, which combine the viewpoint- and appearance-invariance of local and global features; (2) graph neural networks, which enable direct learning and recognition of graph-structured (i.e., non-vector) data. Specifically, a graph convolutional neural network is first trained as a scene graph classifier for passive vision, and its knowledge is then transferred to a reinforcement-learning planner for active vision. Experiments on two scenarios, self-supervised learning and unsupervised domain adaptation, using the photo-realistic Habitat simulator validate the effectiveness of the proposed method.
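The two-stage design described in the abstract (a passive scene graph classifier whose knowledge is transferred to an active viewpoint planner) can be sketched in code. The following is a minimal illustration only: the choice of DGL and PyTorch, the layer sizes, and the `ViewpointPolicy` head are all assumptions for exposition, not the authors' actual architecture.

```python
# Hedged sketch of the pipeline: (1) a GCN classifies a scene graph into
# place classes; (2) its graph-level embedding is reused as the state input
# of a small viewpoint-planning policy (the knowledge-transfer step).
# Library choices and dimensions are illustrative assumptions.
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import GraphConv


class SceneGraphClassifier(nn.Module):
    """Stage 1: classify a scene graph into one of n_places place classes."""

    def __init__(self, in_feats: int, hidden: int, n_places: int):
        super().__init__()
        # allow_zero_in_degree avoids errors on nodes with no incoming edges
        self.conv1 = GraphConv(in_feats, hidden, allow_zero_in_degree=True)
        self.conv2 = GraphConv(hidden, hidden, allow_zero_in_degree=True)
        self.head = nn.Linear(hidden, n_places)

    def embed(self, g: dgl.DGLGraph, x: torch.Tensor) -> torch.Tensor:
        h = F.relu(self.conv1(g, x))
        h = F.relu(self.conv2(g, h))
        g.ndata["h"] = h
        return dgl.mean_nodes(g, "h")  # graph-level readout embedding

    def forward(self, g: dgl.DGLGraph, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(g, x))


class ViewpointPolicy(nn.Module):
    """Stage 2 (hypothetical): action logits over next-view motions,
    conditioned on the (frozen) classifier embedding."""

    def __init__(self, embed_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, n_actions)
        )

    def forward(self, state_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(state_embedding)


if __name__ == "__main__":
    # Toy scene graph: 4 detected objects (nodes) with 16-d semantic features.
    g = dgl.graph(([0, 1, 2], [1, 2, 3]), num_nodes=4)
    x = torch.randn(4, 16)

    clf = SceneGraphClassifier(in_feats=16, hidden=32, n_places=10)
    policy = ViewpointPolicy(embed_dim=32, n_actions=4)

    place_logits = clf(g, x)                 # passive-vision prediction
    action_logits = policy(clf.embed(g, x))  # active-vision planning input
    print(place_logits.shape, action_logits.shape)  # (1, 10), (1, 4)
```

In such a setup the policy would typically be trained with a standard RL objective while the classifier stays fixed, so the planner inherits the viewpoint- and appearance-invariant scene graph representation learned during the passive stage.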