
Leveraging Large Language Model-based Room-Object Relationships Knowledge for Enhancing Multimodal-Input Object Goal Navigation (2403.14163v1)

Published 21 Mar 2024 in cs.RO, cs.AI, and cs.CV

Abstract: Object-goal navigation is a crucial engineering task for the community of embodied navigation; it involves navigating to an instance of a specified object category within unseen environments. Although extensive investigations have been conducted on both end-to-end and modular-based, data-driven approaches, fully enabling an agent to comprehend the environment through perceptual knowledge and perform object-goal navigation as efficiently as humans remains a significant challenge. Recently, LLMs have shown potential in this task, thanks to their powerful capabilities for knowledge extraction and integration. In this study, we propose a data-driven, modular-based approach, trained on a dataset that incorporates common-sense knowledge of object-to-room relationships extracted from an LLM. We utilize the multi-channel Swin-Unet architecture to conduct multi-task learning with multimodal inputs. The results in the Habitat simulator demonstrate that our framework outperforms the baseline by an average of 10.6% in the efficiency metric, Success weighted by Path Length (SPL). The real-world demonstration shows that the proposed approach can efficiently conduct this task by traversing several rooms. For more details and real-world demonstrations, please check our project webpage (https://sunleyuan.github.io/ObjectNav).
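The efficiency metric cited above, Success weighted by Path Length (SPL), is the standard measure for object-goal navigation: each episode contributes its binary success weighted by the ratio of the shortest-path length to the length of the path the agent actually took. A minimal sketch in Python, with the episode fields assumed for illustration:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    success: bool          # did the agent stop within the success radius of a goal instance?
    shortest_path: float   # geodesic distance (m) from the start to the nearest goal
    agent_path: float      # length (m) of the path the agent actually traversed

def spl(episodes: List[Episode]) -> float:
    """Success weighted by Path Length, averaged over all evaluation episodes."""
    total = 0.0
    for ep in episodes:
        if ep.success:
            # A success counts fully only if the agent's path was near-optimal.
            total += ep.shortest_path / max(ep.agent_path, ep.shortest_path)
    return total / len(episodes)

# Two successes (one near-optimal, one inefficient) and one failure -> SPL ~ 0.41
print(spl([Episode(True, 5.0, 5.5),
           Episode(True, 4.0, 12.0),
           Episode(False, 6.0, 9.0)]))
```

The abstract describes the LLM-extracted object-to-room knowledge only at a high level; the fragment below shows just one plausible way such a prior could be collected by prompting an LLM for a score per (object, room) pair. The object and room lists, the prompt wording, and the `query_llm` callable are assumptions for illustration, not the authors' actual pipeline.

```python
from itertools import product

# Hypothetical category lists; the paper's actual label sets may differ.
OBJECTS = ["bed", "toilet", "tv_monitor", "sofa", "chair", "plant"]
ROOMS = ["bedroom", "bathroom", "living room", "kitchen", "hallway"]

def query_llm(prompt: str) -> str:
    """Stand-in for any chat-completion client; wire up a real backend here."""
    raise NotImplementedError

def build_room_object_prior() -> dict:
    """Ask the LLM how likely each object category is to appear in each room type."""
    prior = {}
    for obj, room in product(OBJECTS, ROOMS):
        prompt = (f"On a scale from 0 to 1, how likely is a {obj} to be found "
                  f"in a {room}? Answer with a single number.")
        prior[(obj, room)] = float(query_llm(prompt))
    return prior
```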

Authors (4)
  1. Leyuan Sun (4 papers)
  2. Asako Kanezaki (25 papers)
  3. Guillaume Caron (5 papers)
  4. Yusuke Yoshiyasu (13 papers)
Citations (1)
