LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents (2402.08178v1)
Abstract: Large language models (LLMs) have recently received considerable attention as alternative solutions for task planning. However, comparing the performance of language-oriented task planners is difficult, and there has been little detailed exploration of the effects of factors such as pre-trained model selection and prompt construction. To address this, we propose a benchmark system that automatically quantifies the task-planning performance of home-service embodied agents. Task planners are tested on two pairs of datasets and simulators: 1) ALFRED and AI2-THOR, and 2) an extension of Watch-And-Help and VirtualHome. Using the proposed benchmark system, we perform extensive experiments with LLMs and prompts, and explore several enhancements of the baseline planner. We expect the proposed benchmark tool to accelerate the development of language-oriented task planners.
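The abstract describes benchmarking planners that choose executable skills from language. A minimal sketch of such a score-and-select planning loop is shown below; the LLM is replaced by a toy keyword-overlap scorer, and all names (`mock_llm_score`, `SKILLS`, `plan`) are illustrative assumptions, not the paper's API.

```python
# Sketch of a greedy "score the next skill" planning loop, in the spirit of
# LLM-based task planners. A real system would score each candidate skill by
# its log-likelihood under a pre-trained LLM; here a crude keyword-overlap
# stand-in keeps the example self-contained and runnable.

def mock_llm_score(prompt: str, candidate: str) -> float:
    """Stand-in for an LLM's likelihood of `candidate` given `prompt`:
    fraction of the candidate's words that appear in the prompt."""
    prompt_words = set(prompt.lower().split())
    cand_words = candidate.lower().split()
    return sum(w in prompt_words for w in cand_words) / len(cand_words)

# A fixed, illustrative skill set (a real agent's executable action repertoire).
SKILLS = [
    "find apple", "pick up apple", "find fridge",
    "open fridge", "put apple in fridge", "done",
]

def plan(instruction: str, max_steps: int = 5) -> list[str]:
    """Greedily extend the plan with the highest-scoring unused skill,
    stopping when the terminator skill 'done' wins or max_steps is hit."""
    history: list[str] = []
    for _ in range(max_steps):
        candidates = [s for s in SKILLS if s not in history]
        prompt = (f"Task: {instruction}\n"
                  f"Plan so far: {', '.join(history) or 'none'}\n"
                  f"Next:")
        best = max(candidates, key=lambda s: mock_llm_score(prompt, s))
        if best == "done":
            break
        history.append(best)
    return history
```

With the toy scorer, `plan("put the apple in the fridge")` selects "put apple in fridge" first, since all four of its words appear in the prompt; a benchmark like the one described then executes such plans in a simulator and checks goal conditions automatically.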
- Do as I can, not as I say: Grounding language in robotic affordances. In Proceedings of The 6th Conference on Robot Learning, pp. 287–318. PMLR, 2023.
- GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, March 2021. URL https://doi.org/10.5281/zenodo.5297715.
- GPT-NeoX-20B: An open-source autoregressive language model. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, pp. 95–136, 2022.
- RT-1: Robotics Transformer for Real-World Control at Scale. In Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901, 2020.
- Open-vocabulary queryable scene representations for real world planning. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 11509–11522. IEEE, 2023.
- PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
- A survey of embodied AI: From simulators to research tasks. IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022. doi: 10.1109/TETCI.2022.3141105.
- Search on the replay buffer: Bridging planning and reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.
- STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2(3-4):189–208, 1971.
- PDDLStream: Integrating symbolic planners and blackbox samplers via optimistic adaptive planning. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 30, pp. 440–448, 2020.
- Jörg Hoffmann. FF: The fast-forward planning system. AI Magazine, 22(3):57–62, 2001.
- LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.
- Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In Proceedings of the 39th International Conference on Machine Learning, pp. 9118–9147. PMLR, 2022.
- Inner monologue: Embodied reasoning through planning with language models. In Proceedings of The 6th Conference on Robot Learning, pp. 1769–1782. PMLR, 2023.
- Broadly-exploring, local-policy trees for long-horizon task planning. In Proceedings of the 5th Conference on Robot Learning, pp. 59–69. PMLR, 2022.
- BC-Z: Zero-shot task generalization with robotic imitation learning. In Proceedings of the 5th Conference on Robot Learning, pp. 991–1002. PMLR, 2022.
- AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474, 2017.
- BEHAVIOR-1K: A benchmark for embodied AI with 1,000 everyday activities and realistic simulation. In Proceedings of The 6th Conference on Robot Learning, pp. 80–93. PMLR, 2023.
- Pre-trained language models for interactive decision-making. In Advances in Neural Information Processing Systems, volume 35, pp. 31199–31212, 2022.
- Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 9493–9500. IEEE, 2023a.
- Holistic evaluation of language models. Transactions on Machine Learning Research, 2023b.
- On grounded planning for embodied tasks with language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 13192–13200, 2023.
- Microsoft. Guidance. https://github.com/guidance-ai/guidance, 2023.
- Hierarchical foresight: Self-supervised learning of long-horizon tasks via visual subgoal generation. In International Conference on Learning Representations, 2020.
- OpenAI. GPT API. https://platform.openai.com/docs/api-reference, 2023. [Online; accessed 19-September-2023].
- VirtualHome: Simulating household activities via programs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8494–8502, 2018.
- Watch-and-help: A challenge for social perception and human-ai collaboration. In International Conference on Learning Representations, 2021.
- Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, November 2019.
- Value function spaces: Skill-centric state abstractions for long-horizon reasoning. In International Conference on Learning Representations, 2022.
- ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10740–10749, 2020.
- ALFWorld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations, 2021.
- Learning neuro-symbolic skills for bilevel planning. In Conference on Robot Learning, pp. 701–714. PMLR, 2023.
- ProgPrompt: Generating situated robot task plans using large language models. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 11523–11530. IEEE, 2023.
- Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- MosaicML NLP Team. Introducing MPT-30B: Raising the bar for open-source foundation models, 2023a. URL www.mosaicml.com/blog/mpt-30b. Accessed: 2023-06-22.
- MosaicML NLP Team. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs, 2023b. URL www.mosaicml.com/blog/mpt-7b. Accessed: 2023-05-05.
- LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. Association for Computational Linguistics, October 2020.
- Plan, eliminate, and track: Language models are good teachers for embodied agents. arXiv preprint arXiv:2305.02412, 2023.
- Regression planning networks. Advances in Neural Information Processing Systems, 32, 2019.
- ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023.
- Parsel: A unified natural language framework for algorithmic reasoning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.