OrionNav: Online Planning for Robot Autonomy with Context-Aware LLM and Open-Vocabulary Semantic Scene Graphs (2410.06239v2)
Abstract: Enabling robots to autonomously navigate unknown, complex, dynamic environments and perform diverse tasks remains a fundamental challenge in developing robust autonomous physical agents. These agents must effectively perceive their surroundings while leveraging world knowledge for decision-making. Although recent approaches utilize vision-language models (VLMs) and large language models (LLMs) for scene understanding and planning, they often rely on offline processing or offboard compute and make simplifying assumptions about the environment and perception, which limits real-world applicability. We present a novel framework for real-time onboard autonomous navigation in unknown environments that change over time by integrating multi-level abstraction in both the perception and planning pipelines. Our system fuses data from multiple onboard sensors for localization and mapping and integrates it with open-vocabulary semantics to generate hierarchical scene graphs from a continuously updated semantic object map. The LLM-based planner uses these graphs to create multi-step plans that guide low-level controllers in executing navigation tasks specified in natural language. The system's real-time operation enables the LLM to adjust its plans based on updates to the scene graph and task execution status, ensuring continuous adaptation when new situations arise or when the current plan cannot accomplish the task, a key advantage over static or rule-based systems. We demonstrate our system's efficacy on a quadruped navigating dynamic environments, showcasing its adaptability and robustness in diverse scenarios.
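The abstract describes a closed loop in which onboard perception continuously updates a hierarchical scene graph and an LLM planner revises its multi-step plan whenever the graph or the task execution status changes. The sketch below illustrates that loop in Python under stated assumptions: the class and function names (`SceneGraph`, `perceive`, `llm_plan`, `execute_step`) are hypothetical placeholders for illustration, not the paper's actual interfaces.

```python
# Minimal sketch of the perceive -> scene-graph -> LLM-plan -> act loop described
# in the abstract. All names below are illustrative placeholders, not the
# authors' actual API.
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    label: str       # open-vocabulary label, e.g. "red backpack"
    position: tuple  # (x, y, z) in the map frame
    room_id: int     # parent room node in the hierarchy

@dataclass
class SceneGraph:
    rooms: dict = field(default_factory=dict)    # room_id -> room descriptor
    objects: list = field(default_factory=list)  # ObjectNode instances

    def to_prompt(self) -> str:
        """Serialize the hierarchy into compact text an LLM planner can consume."""
        lines = [f"room {rid}: {desc}" for rid, desc in self.rooms.items()]
        lines += [f"  {o.label} at {o.position} (room {o.room_id})" for o in self.objects]
        return "\n".join(lines)

def run_task(task: str, perceive, llm_plan, execute_step):
    """Replanning loop: perception updates the graph; the LLM revises the plan
    whenever a step fails or the graph changes."""
    graph = SceneGraph()
    done = False
    while not done:
        perceive(graph)                           # fuse onboard sensors, update the semantic object map / graph
        plan = llm_plan(task, graph.to_prompt())  # multi-step plan grounded in the current scene graph
        for step in plan:
            status = execute_step(step)           # low-level controller executes one step
            if status != "success":
                break                             # re-plan against the updated scene graph
        else:
            done = True                           # all steps succeeded
```

The point this sketch is meant to capture is that replanning is event-driven rather than scheduled: the LLM is re-queried both when a step fails and when perception changes the scene graph, which is what the abstract contrasts with static or rule-based planners.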