AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents (2401.12963v2)
Abstract: Foundation models that incorporate language, vision, and more recently actions have revolutionized the ability to harness internet-scale data to reason about useful tasks. However, one of the key challenges of training embodied foundation models is the lack of data grounded in the physical world. In this paper, we propose AutoRT, a system that leverages existing foundation models to scale up the deployment of operational robots in completely unseen scenarios with minimal human supervision. AutoRT leverages vision-language models (VLMs) for scene understanding and grounding, and further uses large language models (LLMs) for proposing diverse and novel instructions to be performed by a fleet of robots. Guiding data collection by tapping into the knowledge of foundation models enables AutoRT to effectively reason about autonomy tradeoffs and safety while significantly scaling up data collection for robot learning. We demonstrate AutoRT proposing instructions to over 20 robots across multiple buildings and collecting 77k real robot episodes via both teleoperation and autonomous robot policies. We experimentally show that such "in-the-wild" data collected by AutoRT is significantly more diverse, and that AutoRT's use of LLMs allows for instruction-following data-collection robots that can align to human preferences.
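The abstract describes a pipeline in which a VLM grounds the scene, an LLM proposes candidate tasks, and candidates are filtered for safety before being routed to a robot policy or a teleoperator. The following is a minimal, hypothetical sketch of that loop; all function names and the banned-word filter are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of an AutoRT-style data-collection loop.
# describe_scene / propose_tasks stand in for real VLM / LLM calls.

def describe_scene(image):
    # Stand-in for a VLM call; would return objects grounded in the image.
    return ["sponge", "can", "countertop"]

def propose_tasks(objects, n=5):
    # Stand-in for an LLM prompted with the scene description to
    # propose diverse manipulation instructions.
    return [f"pick up the {obj}" for obj in objects][:n]

def passes_constitution(task):
    # Stand-in for rule-based or LLM self-critique filtering, e.g.
    # rejecting tasks that involve humans or dangerous objects.
    banned = ("human", "knife")
    return not any(word in task for word in banned)

def collect_episode(image):
    objects = describe_scene(image)
    tasks = [t for t in propose_tasks(objects) if passes_constitution(t)]
    # Each surviving task would be dispatched to an autonomous policy
    # or a teleoperator; here we simply return the filtered proposals.
    return tasks

episodes = collect_episode(image=None)
```

The key design point this sketch illustrates is that task proposal and task execution are decoupled: the foundation models only decide *what* to attempt, while execution (autonomous or teleoperated) produces the logged episodes.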