Task and Motion Planning with Large Language Models for Object Rearrangement (2303.06247v4)

Published 10 Mar 2023 in cs.RO

Abstract: Multi-object rearrangement is a crucial skill for service robots, and commonsense reasoning is frequently needed in this process. However, achieving commonsense arrangements requires knowledge about objects, which is hard to transfer to robots. LLMs are one potential source of this knowledge, but they do not naively capture information about plausible physical arrangements of the world. We propose LLM-GROP, which uses prompting to extract commonsense knowledge about semantically valid object configurations from an LLM and instantiates them with a task and motion planner in order to generalize to varying scene geometry. LLM-GROP allows us to go from natural-language commands to human-aligned object rearrangement in varied environments. Based on human evaluations, our approach achieves the highest rating while outperforming competitive baselines in terms of success rate while maintaining comparable cumulative action costs. Finally, we demonstrate a practical implementation of LLM-GROP on a mobile manipulator in real-world scenarios. Supplementary materials are available at: https://sites.google.com/view/LLM-grop

Task and Motion Planning with LLMs for Object Rearrangement

This paper introduces LLM-GROP, a method that combines LLMs with task and motion planning (TAMP) for semantically valid object rearrangement by service robots. The primary objective is to leverage the commonsense reasoning capabilities of LLMs to arrange tableware in semantically valid configurations, addressing a weakness of current robotic systems, which often struggle with this kind of high-level reasoning.

Methodology Overview

LLM-GROP is designed to bridge the gap between natural-language commands and robotic task execution by using LLMs to infer spatial relationships among objects and employing task and motion planning to execute the rearrangements. The methodology consists of three main components:

  1. Symbolic Spatial Relationships: The method prompts an LLM with a structured, predefined template to extract symbolic spatial relationships between objects, such as "to the left of" or "on top of." To ensure logical consistency and avoid contradictory arrangements, the extracted relationships are checked with Answer Set Programming (ASP), which performs recursive reasoning over logical constraints (see the first sketch after this list).
  2. Geometric Spatial Relationships: After establishing symbolic relationships, LLM-GROP generates feasible geometric configurations that realize them. This is achieved through Gaussian sampling combined with rejection sampling, ensuring the sampled positions respect constraints such as non-overlapping objects and staying within table boundaries (second sketch below).
  3. Task-Motion Planning: Given candidate geometric configurations, LLM-GROP uses TAMP to compute efficient and feasible navigation and manipulation plans. This involves selecting navigation goals and action sequences that maximize long-term utility, balancing the feasibility and efficiency of the rearrangement plan (third sketch below).
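
To make the symbolic-relationship step concrete, here is a minimal sketch. The prompt template, relation vocabulary, and helper functions are assumptions for illustration, not the authors' exact prompts, and the simple contradiction check stands in for the paper's full ASP-based reasoning.

```python
# Minimal sketch of step 1 (assumed prompt and checks, not the paper's exact ones):
# query an LLM for pairwise spatial relations, then flag contradictory pairs.
# The paper uses Answer Set Programming (ASP) for this reasoning; the check
# below is a simplified stand-in.

PROMPT_TEMPLATE = (
    "You are setting a dinner table with these objects: {objects}.\n"
    "For each pair of objects, give one spatial relation using only the phrases "
    "'to the left of', 'to the right of', or 'on top of'.\n"
    "Answer with lines of the form: <object A> <relation> <object B>."
)

RELATIONS = ("to the left of", "to the right of", "on top of")
OPPOSITE = {"to the left of": "to the right of",
            "to the right of": "to the left of"}

def parse_relations(llm_output: str):
    """Parse lines like 'fork to the left of plate' into (subject, relation, object)."""
    triples = []
    for line in llm_output.strip().splitlines():
        for rel in RELATIONS:
            if rel in line:
                subj, obj = (part.strip() for part in line.split(rel, 1))
                triples.append((subj, rel, obj))
                break
    return triples

def find_contradictions(triples):
    """Return relations that directly contradict another one in the set,
    e.g. 'A left of B' together with 'B left of A' or 'A right of B'."""
    seen = set(triples)
    return [(s, r, o) for s, r, o in triples
            if r in OPPOSITE and ((o, r, s) in seen or (s, OPPOSITE[r], o) in seen)]

# Example: the last line contradicts the first one and is flagged.
reply = "fork to the left of plate\nknife to the right of plate\nplate to the left of fork"
print(find_contradictions(parse_relations(reply)))
```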
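
The geometric grounding step can likewise be sketched as Gaussian sampling around a nominal offset implied by each symbolic relation, rejecting samples that overlap already-placed objects or leave the table. The offsets, object radius, and table dimensions below are assumed values, not those used in the paper.

```python
# Sketch of step 2 (illustrative values): Gaussian sampling of candidate positions
# with rejection of samples that collide or fall outside the table surface.
import math
import random

TABLE_W, TABLE_H = 1.2, 0.8           # table extent in metres (assumed)
NOMINAL_OFFSET = {                     # rough 2D offset implied by each relation
    "to the left of":  (-0.15, 0.0),
    "to the right of": ( 0.15, 0.0),
    "on top of":       ( 0.0,  0.0),
}

def sample_position(anchor, relation, placed, radius=0.05, sigma=0.03, max_tries=200):
    """Rejection-sample a position near `anchor` that satisfies `relation`,
    stays within the table, and keeps clear of already-placed objects."""
    dx, dy = NOMINAL_OFFSET[relation]
    for _ in range(max_tries):
        x = random.gauss(anchor[0] + dx, sigma)
        y = random.gauss(anchor[1] + dy, sigma)
        on_table = radius <= x <= TABLE_W - radius and radius <= y <= TABLE_H - radius
        clear = all(math.hypot(x - px, y - py) > 2 * radius for px, py in placed)
        if on_table and clear:
            return (x, y)
    return None                        # no feasible placement found

# Example: place a fork to the left of a plate at the table centre.
plate = (0.6, 0.4)
print(sample_position(plate, "to the left of", placed=[plate]))
```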
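
Finally, the task-motion planning step selects among candidate plans by expected utility. The formulation below is an assumed simplification of that trade-off, with made-up feasibility and cost numbers; the paper's planner evaluates actual navigation and manipulation feasibility rather than fixed estimates.

```python
# Sketch of step 3 (assumed utility formulation): choose the navigation goal whose
# expected utility (feasibility-weighted reward minus motion cost) is highest.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    feasibility: float   # estimated probability the manipulation succeeds
    motion_cost: float   # estimated navigation + manipulation cost

def best_navigation_goal(candidates, success_reward=100.0):
    """Pick the candidate maximizing feasibility * reward - cost."""
    return max(candidates, key=lambda c: c.feasibility * success_reward - c.motion_cost)

goals = [
    Candidate("near side of table", feasibility=0.9, motion_cost=25.0),
    Candidate("far side of table",  feasibility=0.7, motion_cost=12.0),
]
print(best_navigation_goal(goals).name)   # -> near side of table
```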

Experimental Results

The evaluation of LLM-GROP involves comparing it to three baselines across a variety of object rearrangement tasks. The baselines range from simple task planning with random arrangements to more sophisticated approaches like GROP. Key findings from experiments indicate that LLM-GROP consistently achieves higher user ratings for arrangement quality while maintaining or improving task execution efficiency. This demonstrates the advantage of integrating LLM-derived commonsense knowledge with robotic planning.

Implications and Future Directions

The LLM-GROP framework highlights the potential for LLMs to address challenges in robotic task planning by supplying commonsense reasoning capabilities. Integrating these models with traditional planning techniques enables robots to perform complex tasks that require human-like understanding of object relationships and spatial arrangements.

The successful demonstration on both simulated and real-world platforms underscores the practical viability of LLM-GROP. As LLMs continue to evolve, their application in robotics could be expanded to encompass a wider range of domains, potentially improving robots' ability to autonomously handle more complex and dynamic environments. Future work could focus on integrating perception-based methods with LLM-GROP to handle unknown objects and environments, extending the model's predictive capabilities beyond predefined scenarios.

In conclusion, this research sets a foundation for further exploration into the intersection between LLMs and robotics, offering a promising approach to enhance robots' ability to execute tasks requiring high-level reasoning and adaptability in diverse contexts.

Authors (4)
  1. Yan Ding
  2. Xiaohan Zhang
  3. Chris Paxton
  4. Shiqi Zhang