Embodied Task Planning with Large Language Models (2307.01848v1)

Published 4 Jul 2023 in cs.CV, cs.AI, and cs.RO

Abstract: Equipping embodied agents with commonsense is important for robots to successfully complete complex human instructions in general environments. Recent large language models (LLMs) can embed rich semantic knowledge for agents in plan generation for complex tasks, but they lack information about the real world and often yield infeasible action sequences. In this paper, we propose a TAsk Planning Agent (TaPA) for grounded planning in embodied tasks under physical scene constraints, where the agent generates executable plans according to the objects existing in the scene by aligning LLMs with visual perception models. Specifically, we first construct a multimodal dataset containing triplets of indoor scenes, instructions, and action plans, where we provide designed prompts and the list of objects existing in the scene for GPT-3.5 to generate a large number of instructions and corresponding planned actions. The generated data is leveraged for grounded plan tuning of pre-trained LLMs. During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected at different reachable locations. Experimental results show that the plans generated by our TaPA framework achieve a higher success rate than those of LLaVA and GPT-3.5 by a sizable margin, which indicates the practicality of embodied task planning in general and complex environments.
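The inference pipeline the abstract describes — aggregating open-vocabulary detections across multi-view RGB images into a scene object list, then prompting the tuned LLM with that list and the instruction — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the detector is replaced by precomputed per-view label lists, and the `min_views` filtering threshold and prompt wording are hypothetical.

```python
from collections import Counter

def merge_detections(per_view_detections, min_views=1):
    """Merge detections from multiple viewpoints into one deduplicated
    scene object list, keeping labels seen in at least min_views views.
    (Hypothetical aggregation rule; the paper's filtering may differ.)"""
    counts = Counter()
    for detections in per_view_detections:
        for label in set(detections):  # count each label once per view
            counts[label] += 1
    return sorted(label for label, n in counts.items() if n >= min_views)

def build_planning_prompt(instruction, scene_objects):
    """Compose a grounded-planning prompt pairing the instruction with the
    scene object list, so the planner only uses objects that exist."""
    return (
        f"Objects in the scene: {', '.join(scene_objects)}.\n"
        f"Instruction: {instruction}\n"
        "Generate a step-by-step action plan using only the listed objects."
    )

# Per-view detection results (stand-ins for open-vocabulary detector output)
views = [
    ["mug", "table", "fridge"],   # view 1
    ["table", "fridge", "sink"],  # view 2
    ["fridge", "mug"],            # view 3
]
objects = merge_detections(views, min_views=2)   # ['fridge', 'mug', 'table']
prompt = build_planning_prompt("Make a cup of coffee.", objects)
```

Requiring a label to appear in several views is one simple way to suppress spurious single-view detections; the resulting prompt is what would be fed to the grounded-plan-tuned LLM.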

Authors (5)
  1. Zhenyu Wu (112 papers)
  2. Ziwei Wang (128 papers)
  3. Xiuwei Xu (16 papers)
  4. Jiwen Lu (192 papers)
  5. Haibin Yan (9 papers)
Citations (50)