Leveraging Large Language Model-based Room-Object Relationships Knowledge for Enhancing Multimodal-Input Object Goal Navigation (2403.14163v1)
Abstract: Object-goal navigation is a crucial engineering task for the embodied-navigation community; it requires navigating to an instance of a specified object category in unseen environments. Although both end-to-end and modular data-driven approaches have been investigated extensively, fully enabling an agent to comprehend the environment through perceptual knowledge and to perform object-goal navigation as efficiently as humans remains a significant challenge. Recently, large language models (LLMs) have shown potential for this task, thanks to their powerful capabilities for knowledge extraction and integration. In this study, we propose a data-driven, modular approach trained on a dataset that incorporates common-sense knowledge of object-to-room relationships extracted from an LLM. We use a multi-channel Swin-Unet architecture to perform multi-task learning with multimodal inputs. Results in the Habitat simulator show that our framework outperforms the baseline by an average of 10.6% on the efficiency metric, Success weighted by Path Length (SPL). A real-world demonstration shows that the proposed approach can perform the task efficiently while traversing several rooms. For more details and real-world demonstrations, please visit our project webpage (https://sunleyuan.github.io/ObjectNav).
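The central idea of the abstract, biasing navigation with LLM-derived common-sense object-to-room priors, can be illustrated with a minimal sketch. The prompt wording, the numeric scores, and the `room_preference` helper below are hypothetical illustrations under assumed conventions, not the authors' actual pipeline:

```python
# Hypothetical illustration (not the paper's exact method): object-to-room
# prior scores of the kind one might elicit from an LLM with a prompt such
# as "On a scale of 0 to 1, how likely is a <object> to be found in a
# <room>?", cached as a lookup table used to bias exploration toward
# promising rooms. All values here are illustrative.
OBJECT_ROOM_PRIOR = {
    "bed":    {"bedroom": 0.95, "living_room": 0.10, "kitchen": 0.01},
    "toilet": {"bathroom": 0.95, "bedroom": 0.02, "kitchen": 0.01},
    "sofa":   {"living_room": 0.90, "bedroom": 0.15, "kitchen": 0.02},
}

def room_preference(goal_object: str, candidate_rooms: list) -> str:
    """Return the candidate room with the highest prior for the goal object."""
    scores = OBJECT_ROOM_PRIOR.get(goal_object, {})
    return max(candidate_rooms, key=lambda room: scores.get(room, 0.0))

# A goal-driven agent choosing which observed room to explore next:
print(room_preference("toilet", ["kitchen", "bathroom", "bedroom"]))  # bathroom
```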
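For reference, the SPL metric quoted in the abstract follows the standard definition of Anderson et al. (arXiv:1807.06757). A minimal sketch of its computation:

```python
# Success weighted by Path Length (SPL), per Anderson et al.:
#   SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i)
# where S_i is the binary success indicator for episode i, l_i is the
# shortest-path length from start to goal, and p_i is the length of the
# path the agent actually traveled.

def spl(episodes):
    """episodes: iterable of (success: bool, shortest_len: float, agent_len: float)."""
    total = 0.0
    n = 0
    for success, shortest_len, agent_len in episodes:
        total += float(success) * shortest_len / max(agent_len, shortest_len)
        n += 1
    return total / n if n else 0.0

# Example: one success with a near-optimal path, one failure.
print(spl([(True, 5.0, 6.0), (False, 4.0, 10.0)]))  # ~0.4167
```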
Authors: Leyuan Sun, Asako Kanezaki, Guillaume Caron, Yusuke Yoshiyasu