LLMs for Robotic Object Disambiguation (2401.03388v1)

Published 7 Jan 2024 in cs.RO, cs.CL, and cs.LG

Abstract: The advantages of pre-trained LLMs are apparent in a variety of language processing tasks. But can an LLM's knowledge be further harnessed to effectively disambiguate objects and navigate decision-making challenges within the realm of robotics? Our study reveals the LLM's aptitude for solving complex decision-making problems that have traditionally been modeled as Partially Observable Markov Decision Processes (POMDPs). A pivotal focus of our research is the object disambiguation capability of LLMs. We detail the integration of an LLM into a tabletop disambiguation task, a decision-making problem in which the robot must discern and retrieve a user's desired object from an arbitrarily large and complex cluster of objects. Despite multiple query attempts with zero-shot prompt engineering (details can be found in the Appendix), the LLM struggled to inquire about features not explicitly provided in the scene description. In response, we developed a few-shot prompt engineering system that improves the LLM's ability to pose disambiguating queries. The result is a model that both uses given features when they are available and infers new, relevant features when necessary, generating and navigating a precise decision tree down to the correct object, even when faced with identical options.
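The paper's actual prompts live in its appendix; as a rough illustration of the approach the abstract describes, the following Python sketch shows what a few-shot disambiguation loop of this shape could look like. The few-shot examples, the `FETCH:` stop convention, the model name, and the helper names (`call_llm`, `disambiguate`, `answer_fn`) are illustrative assumptions, not the authors' code.

```python
# A minimal sketch (not the authors' implementation) of few-shot prompting
# for tabletop object disambiguation. The in-context examples demonstrate
# asking about stated attributes (color) and, when candidates are identical,
# inferring new distinguishing features (position), as the paper describes.

# Assumed few-shot examples; the real ones are in the paper's appendix.
FEW_SHOT_EXAMPLES = """\
Scene: a red mug, a blue mug, a red plate
User: I want the mug.
Robot: Would you like the red mug or the blue mug?
User: The blue one.
Robot: FETCH: the blue mug

Scene: two identical white bowls, one near the table edge, one in the center
User: I want the bowl.
Robot: Is your bowl the one near the edge or the one in the center?
User: The one in the center.
Robot: FETCH: the white bowl in the center
"""

PROMPT_TEMPLATE = """\
You are a robot that must fetch exactly one object from a tabletop scene.
If the user's request is ambiguous, ask ONE short question that best splits
the remaining candidates; infer distinguishing features (e.g., position,
size) when the stated attributes do not disambiguate. Once a single object
is determined, reply with exactly: FETCH: <object description>

{examples}
Scene: {scene}
{dialogue}
Robot:"""


def call_llm(prompt: str) -> str:
    """One chat-completion call; any LLM client works here (OpenAI shown).

    The model name is an assumption; swap in whatever your setup uses.
    """
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content


def disambiguate(scene: str, request: str, answer_fn, max_turns: int = 5) -> str:
    """Walk the implicit decision tree: question the user until the LLM
    commits to a single object or the turn budget runs out."""
    dialogue = f"User: {request}"
    for _ in range(max_turns):
        reply = call_llm(
            PROMPT_TEMPLATE.format(
                examples=FEW_SHOT_EXAMPLES, scene=scene, dialogue=dialogue
            )
        ).strip()
        if reply.startswith("FETCH:"):  # the LLM committed to one object
            return reply
        # Otherwise the reply is a disambiguating question; get an answer.
        dialogue += f"\nRobot: {reply}\nUser: {answer_fn(reply)}"
    return "FETCH: (unresolved after max_turns)"


# Example: two identical red mugs, so color alone cannot disambiguate and
# the model must fall back on an inferred feature such as position.
print(
    disambiguate(
        scene="a red mug on the left, an identical red mug on the right, a fork",
        request="Please get me the red mug.",
        answer_fn=lambda q: "the one on the left",
    )
)
```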

Authors (3)
  1. Connie Jiang (1 paper)
  2. Yiqing Xu (33 papers)
  3. David Hsu (73 papers)