AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents (2401.12963v2)

Published 23 Jan 2024 in cs.RO, cs.AI, cs.CL, cs.CV, and cs.LG

Abstract: Foundation models that incorporate language, vision, and more recently actions have revolutionized the ability to harness internet scale data to reason about useful tasks. However, one of the key challenges of training embodied foundation models is the lack of data grounded in the physical world. In this paper, we propose AutoRT, a system that leverages existing foundation models to scale up the deployment of operational robots in completely unseen scenarios with minimal human supervision. AutoRT leverages vision-language models (VLMs) for scene understanding and grounding, and further uses LLMs for proposing diverse and novel instructions to be performed by a fleet of robots. Guiding data collection by tapping into the knowledge of foundation models enables AutoRT to effectively reason about autonomy tradeoffs and safety while significantly scaling up data collection for robot learning. We demonstrate AutoRT proposing instructions to over 20 robots across multiple buildings and collecting 77k real robot episodes via both teleoperation and autonomous robot policies. We experimentally show that such "in-the-wild" data collected by AutoRT is significantly more diverse, and that AutoRT's use of LLMs allows for instruction following data collection robots that can align to human preferences.


Summary

  • The paper presents AutoRT, a system that uses foundation models to orchestrate fleets of robots for autonomous, diverse, in-the-wild data collection.
  • It integrates an open-vocabulary object detector for scene understanding, LLM-based task proposal, and a Robot Constitution that constrains robots to safe, feasible behavior.
  • The system allows one human to supervise up to five robots, significantly increasing both the diversity of collected data and the scale of deployment.

Introduction

Autonomous robotics research is steadily moving toward agents that can operate without human intervention across a broad range of tasks. Despite progress in robot learning methods and the integration of powerful LLMs and VLMs, getting an autonomous system to infer and perform open-ended tasks in varied environments remains a significant hurdle. The crux of the challenge is data scarcity: operating in the physical world requires real-world experience far beyond what can be captured in controlled lab environments. This paper introduces AutoRT, a system designed to surmount this obstacle by autonomously orchestrating a fleet of robots for large-scale data collection.

Related Work

AutoRT builds on prior work in autonomous data collection and on recent advances in language and vision models. Autonomous data collection has traditionally been confined to narrow, lab-bound tasks, with more varied and less structured environments only recently being considered. Teleoperated data collection remains valuable because of the diversity it yields, but it is constrained by the availability of human operators, motivating a hybrid approach. Using LLMs to drive agent behavior has been explored in other domains; this work extends that line by letting LLM-driven robots execute self-proposed goals in the real world.

Problem Statement

The work formalizes the setting of large-scale "in-the-wild" robotic data collection. With multiple robots deployed across differing environments, the goal is a data collection mechanism that is both diverse and operationally efficient. Because human supervision is limited, each robot must be able to interpret the state of its surroundings and execute tasks under a spectrum of policies ranging from full autonomy to teleoperation, ensuring broad data coverage while managing the tradeoff between independence and safety.
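To make this setting concrete, below is a minimal sketch of the per-robot collection loop implied above. All names here (`Episode`, `CollectionMode`, `choose_mode`, and the toy `is_simple_manipulation` heuristic) are illustrative assumptions, not interfaces from the paper: each robot logs an episode for whatever task it runs, and a simple rule decides whether that task runs autonomously, via a scripted primitive, or under teleoperation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class CollectionMode(Enum):
    """Hypothetical spectrum of collection policies, from full autonomy to teleoperation."""
    AUTONOMOUS_POLICY = "autonomous_policy"  # a learned policy runs the task
    SCRIPTED = "scripted"                    # a hand-coded primitive (e.g. pick, wipe)
    TELEOPERATION = "teleoperation"          # a human drives the robot

@dataclass
class Episode:
    """One logged unit of in-the-wild data collection (illustrative schema)."""
    robot_id: str
    scene_description: str                     # VLM-produced summary of what the robot sees
    instruction: str                           # natural-language task proposed for the robot
    mode: CollectionMode
    observations: List[bytes] = field(default_factory=list)  # e.g. camera frames
    actions: List[list] = field(default_factory=list)        # executed action vectors

def is_simple_manipulation(instruction: str) -> bool:
    """Toy heuristic standing in for a real skill-feasibility check."""
    return any(verb in instruction.lower() for verb in ("pick up", "move", "push"))

def choose_mode(instruction: str, teleoperators_free: int) -> CollectionMode:
    """Toy autonomy/safety tradeoff: simple tasks run autonomously, harder tasks
    go to a human operator if one is available, otherwise fall back to scripts."""
    if is_simple_manipulation(instruction):
        return CollectionMode.AUTONOMOUS_POLICY
    if teleoperators_free > 0:
        return CollectionMode.TELEOPERATION
    return CollectionMode.SCRIPTED
```

In the paper's actual system the analogous decision is made with the help of LLMs; the rule-based `choose_mode` above only illustrates where the autonomy/safety tradeoff sits in the loop.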

AutoRT: Exploring and Executing in the Wild

AutoRT orchestrates robots by combining user directives with environmental observations, generating candidate tasks, and then carrying out the assigned tasks. Its core components are an open-vocabulary object detector for scene comprehension, an LLM that proposes tasks in line with high-level goals, and a further LLM stage that determines how each task should be executed. A standout feature is the Robot Constitution, a set of operational guideposts covering foundational rules, safety measures, and embodiment limitations that the system consults when vetting tasks. Experiments show that the data AutoRT collects is substantially more diverse, and that the LLM-based filtering allows robot behavior to be aligned with human preferences. In the demonstrated deployment, a single human could supervise up to five robots, significantly amplifying both the scale and the diversity of autonomously collected data.
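The sketch below shows one plausible way the task-generation loop described above could be wired together. It is an illustration under assumed interfaces: `detector.detect` and `llm.complete` stand in for whichever VLM and LLM a deployment uses, and the constitution text is an abbreviated paraphrase rather than the paper's actual rules.

```python
ROBOT_CONSTITUTION = """
Foundational rules: a robot may not injure a human being ...
Safety rules: avoid tasks involving humans, animals, or sharp objects ...
Embodiment rules: the robot has a single arm and cannot lift heavy objects ...
"""  # abbreviated placeholder, not the paper's full rule text

def propose_and_filter_tasks(image, llm, detector, num_tasks: int = 5) -> list[str]:
    # 1. Scene understanding: open-vocabulary detection of what the robot can see.
    objects = detector.detect(image)                  # e.g. ["cup", "sponge", "table"]
    scene = f"The robot sees: {', '.join(objects)}."

    # 2. Task proposal: ask the LLM for diverse instructions grounded in the scene,
    #    with the constitution included in the prompt as high-level guidance.
    proposals = llm.complete(
        f"{ROBOT_CONSTITUTION}\n{scene}\n"
        f"Propose {num_tasks} diverse manipulation tasks the robot could attempt:"
    ).splitlines()

    # 3. Safety and feasibility filtering: a second LLM pass checks each proposal
    #    against the constitution and the robot's embodiment limits.
    valid = []
    for task in proposals:
        verdict = llm.complete(
            f"{ROBOT_CONSTITUTION}\nTask: {task}\n"
            "Answer 'accept' if this task is safe and feasible for the robot, "
            "otherwise answer 'reject':"
        )
        if verdict.strip().lower().startswith("accept"):
            valid.append(task)
    return valid
```

Placing the constitution in both the proposal and the filtering prompts is one way to realize the Robot Constitution idea; the paper's exact prompting scheme may differ.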

Conclusion and Future Work

AutoRT represents a step toward autonomous, large-scale, "in-the-wild" data acquisition by robotic systems, and the orchestrated approach has already yielded a substantial corpus of diverse, real-world robot interactions. The system nonetheless has limitations, rooted in the complexity of real-world interaction, the demands of data diversity, and the continued need for human oversight. Going forward, coupling robotic data collection more tightly with policy improvement and improving how the system handles "sparse" data are pivotal directions for further research. The paper points toward a future in which robots interface with everyday environments autonomously and robotic data collection approaches the scale at which foundation models are trained.
