Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis (2312.08782v3)

Published 14 Dec 2023 in cs.RO, cs.AI, cs.CV, and cs.LG
Abstract: Building general-purpose robots that operate seamlessly in any environment, with any object, and utilizing various skills to complete diverse tasks has been a long-standing goal in Artificial Intelligence. However, as a community, we have been constraining most robotic systems by designing them for specific tasks, training them on specific datasets, and deploying them within specific environments. These systems require extensively labeled data and task-specific models. When deployed in real-world scenarios, such systems face several generalization issues and struggle to remain robust to distribution shifts. Motivated by the impressive open-set performance and content generation capabilities of web-scale, large-capacity pre-trained models (i.e., foundation models) in research fields such as NLP and Computer Vision (CV), we devote this survey to exploring (i) how these existing foundation models from NLP and CV can be applied to the field of general-purpose robotics, and also exploring (ii) what a robotics-specific foundation model would look like. We begin by providing a generalized formulation of how foundation models are used in robotics, and the fundamental barriers to making generalist robots universally applicable. Next, we establish a taxonomy to discuss current work exploring ways to leverage existing foundation models for robotics and develop ones catered to robotics. Finally, we discuss key challenges and promising future directions in using foundation models for enabling general-purpose robotic systems. We encourage readers to view our living GitHub repository of resources, including papers reviewed in this survey, as well as related projects and repositories for developing foundation models for robotics.

Evolution of Robotics with Foundation Models

Introduction to Foundation Models in Robotics

The field of robotics has long focused on developing systems designed for particular tasks, trained on specific datasets, and limited to defined environments. These systems often suffer from data scarcity, poor generalization, and a lack of robustness when faced with real-world scenarios. Encouraged by the success of foundation models in NLP and computer vision (CV), researchers are now exploring their application to robotics. Foundation models such as LLMs and Vision Foundation Models (VFMs) possess qualities that align well with the vision of general-purpose robots: systems that operate seamlessly across varied tasks and environments without extensive retraining.

Robotics and Foundation Models

Robotics systems comprise several core functionalities, including perception, decision-making and planning, and action generation. Each of these functionalities presents its own set of challenges. For example, perception systems need varied data to understand scenes and objects, while planning and control must adapt to new environments. Bringing foundation models into this domain aims to leverage their strong generalization and learning abilities to address these hurdles, potentially smoothing the path toward truly adaptable and intelligent robotic systems.
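The perception, planning, and action stages above can be sketched as a minimal control loop in which a pretrained open-vocabulary vision model replaces a task-specific perception module. All class and function names here are illustrative assumptions, not from any real library; a production system would call an actual detector and low-level controllers where this sketch uses stubs.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    image: list  # raw camera pixels (placeholder)

@dataclass
class SceneDescription:
    objects: list  # open-vocabulary object labels

def perceive(obs: Observation) -> SceneDescription:
    # Stand-in for a pretrained open-vocabulary detector (e.g. a
    # CLIP-style model); here we return a canned scene description.
    return SceneDescription(objects=["mug", "table"])

def plan(scene: SceneDescription, goal: str) -> list:
    # A planner (possibly an LLM) maps the scene and goal to skill names.
    if goal == "fetch the mug" and "mug" in scene.objects:
        return ["locate(mug)", "grasp(mug)", "deliver(mug)"]
    return []

def act(steps: list) -> str:
    # Low-level controllers would execute each skill; we just report.
    return f"executed {len(steps)} skills"

scene = perceive(Observation(image=[]))
steps = plan(scene, "fetch the mug")
print(act(steps))  # executed 3 skills
```

The design point is modularity: the foundation model slots into the perception (and possibly planning) stage without the downstream control interface changing.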

Addressing Core Robotics Challenges

Foundation models show particular promise when measured against the classical challenges of robotics:

  • Generalization: Taking cues from the human brain's modularity and the adaptability seen in nature, foundation models offer a promising route to achieve a similar level of function-centric generalization in robotics.
  • Data Scarcity: Through the ability to generate synthetic data and learn from limited examples, foundation models are positioned to tackle the constraints imposed by the requirement for large and diverse datasets.
  • Model Dependency: Reducing the reliance on meticulously crafted models for the environment and robot dynamics can be advanced with model-agnostic foundation models.
  • Task Specification: Foundation models open up avenues for natural and intuitive ways of specifying goals for robotic tasks, such as through language, images, or code.
  • Uncertainty and Safety: Ensuring safe operation and managing uncertainty remain underexplored, but are areas where foundation models could potentially offer rigorous frameworks and contributions.
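The task-specification point above can be made concrete with a small language-to-code sketch: a stubbed "LLM" maps a natural-language goal to a program over robot primitives, in the spirit of code-based task specification. `call_llm` and the primitive names are hypothetical placeholders, not a real API.

```python
def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; returns a canned primitive program.
    canned = {
        "stack the red block on the blue block":
            "pick('red_block'); place_on('blue_block')",
    }
    return canned.get(prompt, "noop()")

def compile_task(goal: str) -> list:
    # Turn the generated program into an executable list of skill calls.
    program = call_llm(goal)
    return [step.strip() for step in program.split(";")]

steps = compile_task("stack the red block on the blue block")
print(steps)  # ["pick('red_block')", "place_on('blue_block')"]
```

A goal image or a demonstration could specify the same task through other channels; the appeal of language and code is that the resulting plan is inspectable before execution.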

Research Methodologies and Evaluations

Numerous studies have explored applying foundation models to various tasks, leading to several observations:

  • Task Focus: There is a notable skew toward general pick-and-place tasks. Translating text to motion, particularly with LLMs, remains less explored, especially for complex tasks such as dexterous manipulation.
  • Simulation and Real-World Data: The balance between simulations and real-world data is critical. Robust simulators enable vast data generation, yet may lack the diversity and richness of real-world data, highlighting the need for ongoing efforts in both areas.
  • Performance and Benchmarking: Advances are being made in testing foundation models on diverse tasks, but a unified approach to performance measurement and benchmarking has yet to emerge.
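The simulation-versus-real-world trade-off above often reduces, in practice, to how training batches mix the two data sources. A toy sketch of such mixing follows; the 80/20 split, dataset contents, and function name are illustrative assumptions, not values from the survey.

```python
import random

def sample_batch(sim_data, real_data, batch_size=10, sim_fraction=0.8, seed=0):
    # Draw a fixed fraction of the batch from cheap simulated trajectories
    # and fill the remainder from the scarcer real-world trajectories.
    rng = random.Random(seed)
    n_sim = int(batch_size * sim_fraction)
    batch = [rng.choice(sim_data) for _ in range(n_sim)]
    batch += [rng.choice(real_data) for _ in range(batch_size - n_sim)]
    return batch

sim = [f"sim_traj_{i}" for i in range(100)]    # abundant but less diverse
real = [f"real_traj_{i}" for i in range(5)]    # scarce but rich
batch = sample_batch(sim, real)
print(sum(t.startswith("sim") for t in batch))  # 8
```

Tuning `sim_fraction` is one knob for trading simulator scale against real-world richness; more sophisticated schemes reweight samples by estimated sim-to-real gap.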

Future Directions in Foundation Models and Robotics

Looking ahead, several areas are ripe for exploration:

  • Enhanced Grounding: Developing a profound connection between model output and physical robotic actions remains a fruitful avenue for research.
  • Continual Learning: Adapting to changing environments and tasks without forgetting past learning is a frontier yet to be fully conquered by robotic foundation models.
  • Hardware Innovations: Complementary hardware innovations are necessary to enrich the data available for training foundation models and to expand the conceptual learning space.
  • Cross-Embodiment Adaptability: Learning control policies that are adaptable to diverse physical embodiments is a critical step toward creating more universal robotic systems.
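The grounding direction above has a well-known SayCan-style instantiation: combine an LLM's preference for each candidate skill with an affordance score reflecting whether the robot can actually perform it in the current state. The numeric scores below are made-up numbers for illustration only.

```python
def select_skill(llm_scores: dict, affordances: dict) -> str:
    # Pick the skill maximizing P_llm(skill) * P_affordance(skill):
    # "do as I can, not as I say."
    combined = {s: llm_scores[s] * affordances.get(s, 0.0) for s in llm_scores}
    return max(combined, key=combined.get)

llm_scores = {"grasp(mug)": 0.7, "open(fridge)": 0.3}   # language preference
affordances = {"grasp(mug)": 0.9, "open(fridge)": 0.1}  # mug is reachable
print(select_skill(llm_scores, affordances))  # grasp(mug)
```

The product form means a skill the LLM favors is still rejected when the affordance model reports it is infeasible, which is the essence of grounding model output in physical capability.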

The application of foundation models to robotics holds the promise of achieving a higher level of autonomy, adaptability, and intelligence in robotic systems. As the field progresses, the blend of robust AI models and robotics could usher in a new era of smart, versatile machines ready to meet the complexities and unpredictability of the real world.

References (353)
  1. Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR), 2013.
  2. Real-time semantic mapping for autonomous off-road navigation. In Field and Service Robotics, pages 335–350. Springer, 2018.
  3. Yale-cmu-berkeley dataset for robotic manipulation research. In International Journal of Robotics Research, page 261 – 268, 2017.
  4. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
  5. Adversarial discriminative domain adaptation. In CVPR, 2017.
  6. Learning domain-independent planning heuristics with hypergraph networks. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 30, pages 574–584, 2020.
  7. Learning value functions with relational state representations for guiding task-and-motion planning. In Conference on Robot Learning, pages 955–968. PMLR, 2020.
  8. Aggressive driving with model predictive path integral control. In ICRA, 2016.
  9. Motion planning networks: Bridging the gap between learning-based and classical motion planners. IEEE Transactions on Robotics, pages 1–9, 2020.
  10. Motion policy networks. In Proceedings of the 6th Conference on Robot Learning (CoRL), 2022.
  11. Learning agile robotic locomotion skills by imitating animals. In RSS, 2020.
  12. End-to-end training of deep visuomotor policies. In Journal of Machine Learning Research, 2016.
  13. Learning agile and dynamic motor skills for legged robots. In Science Robotics, 30 Jan 2019.
  14. Deep dynamics models for learning dexterous manipulation. In CoRL, 2019.
  15. Dmitry Kalashnkov and Jake Varley and Yevgen Chebotar and Ben Swanson and Rico Jonschkowski and Chelsea Finn and Sergey Levine and Karol Hausman. Mt-opt: Continuous multi-task robotic reinforcement learning at scale. arXiv:2104.08212, 2021.
  16. BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning. In 5th Annual Conference on Robot Learning, 2021.
  17. Language models are few-shot learners, 2020.
  18. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  19. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
  20. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), 2021.
  21. Segment anything. arXiv:2304.02643, 2023.
  22. Dinov2: Learning robust visual features without supervision, 2023.
  23. Rishi Bommasani et. al. from the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). On the opportunities and risks of foundation models. In arXiv:2108.07258, 2021.
  24. Ahn et. al. Do as i can, not as i say: Grounding language in robotic affordances. In CoRL, 2022.
  25. Open-vocabulary queryable scene representations for real world planning. In arXiv:2209.09874, 2022.
  26. Saytap: Language to quadrupedal locomotion, 2023.
  27. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
  28. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In arXiv preprint arXiv:2307.15818, 2023.
  29. Vint: A foundation model for visual navigation. In arxiv preprint arXiv:2306.14846, 2023.
  30. A generalist agent. In Transactions on Machine Learning Research (TMLR), November 10, 2022.
  31. PaLM-E: An embodied multimodal language model. ArXiv, abs/2303.03378, 2023.
  32. Challenges and applications of large language models. arXiv:2307.10169, 2023.
  33. Text-to-image diffusion models in generative ai: A survey. arXiv:2303.07909, 2023.
  34. Harnessing the power of llms in practice: A survey on chatgpt and beyond. arXiv:2304.13712, 2023.
  35. Foundation models for decision making: Problems, methods, and opportunities. arXiv:2303.04129, 2023.
  36. A survey on segment anything model (sam): Vision foundation model meets prompt engineering, 2023.
  37. Foundational models defining a new era in vision: A survey and outlook, 2023.
  38. A survey of vision-language pre-trained models. IJCAI-2022 survey track, 2022.
  39. A systematic survey of prompt engineering on vision-language foundation models. arXiv:2307.12980, 2023.
  40. A survey on large language model based autonomous agents. arXiv:2308.11432, 2023.
  41. The development of llms for embodied navigation. In IEEE/ASME TRANSACTIONS ON MECHATRONICS, volume 1, Sept. 2023.
  42. Anirudha Majumdar. Robotics: An idiosyncratic snapshot in the age of llms, 8 2023.
  43. Robot learning in the era of foundation models: A survey, 2023.
  44. Foundation models in robotics: Applications, challenges, and the future, 2023.
  45. Vincent Vanhoucke. The end-to-end false dichotomy: Roboticists arguing lego vs. playmo. Medium, October 28 2018.
  46. Yuke Zhu. Cs391r: Robot learning, 2021.
  47. Core challenges in embodied vision-language planning. Journal of Artificial Intelligence Research, 74:459–515, 2022.
  48. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  49. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
  50. Airobject: A temporally evolving graph embedding for object identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8407–8416, 2022.
  51. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017.
  52. Nerf-slam: Real-time dense monocular slam with neural radiance fields. arXiv preprint arXiv:2210.13641, 2022.
  53. Learning transferable visual models from natural language supervision. In ICML, 2021.
  54. Orb-slam: a versatile and accurate monocular slam system. IEEE transactions on robotics, 31(5):1147–1163, 2015.
  55. Direct sparse odometry. IEEE transactions on pattern analysis and machine intelligence, 40(3):611–625, 2017.
  56. Ldso: Direct sparse odometry with loop closure. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2198–2204. IEEE, 2018.
  57. Lsd-slam: Large-scale direct monocular slam. In European conference on computer vision, pages 834–849. Springer, 2014.
  58. Direct sparse mapping. IEEE Transactions on Robotics, 2020.
  59. Tp-tio: A robust thermal-inertial odometry with deep thermalpoint. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4505–4512. IEEE, 2020.
  60. Real-time loop closure in 2d lidar slam. In 2016 IEEE international conference on robotics and automation (ICRA), pages 1271–1278. IEEE, 2016.
  61. A flexible and scalable slam system with full 3d motion estimation. In Proc. IEEE International Symposium on Safety, Security and Rescue Robotics (SSRR). IEEE, November 2011.
  62. Ji Zhang and Sanjiv Singh. Loam: Lidar odometry and mapping in real-time. In Robotics: Science and systems, volume 2, pages 1–9. Berkeley, CA, 2014.
  63. Cerberus: Low-drift visual-inertial-leg odometry for agile locomotion. ICRA, 2023.
  64. Imu preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation. Technical report, EPFL, 2015.
  65. Ji Zhang and Sanjiv Singh. Visual-lidar odometry and mapping: Low-drift, robust, and fast. In ICRA, 2015.
  66. Limo: Lidar-monocular visual odometry. In 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 7872–7879. IEEE, 2018.
  67. Viral slam: Tightly coupled camera-imu-uwb-lidar slam. arXiv preprint arXiv:2105.03296, 2021.
  68. Lio-sam: Tightly-coupled lidar inertial odometry via smoothing and mapping. In 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 5135–5142. IEEE, 2020.
  69. Super odometry: Imu-centric lidar-visual-inertial estimator for challenging environments. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8729–8736. IEEE, 2021.
  70. Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In 2017 IEEE international conference on robotics and automation (ICRA), pages 2043–2050. IEEE, 2017.
  71. Tartanvo: A generalizable learning-based vo. In CoRL, 2020.
  72. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.
  73. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. In NeurIPS, 2021.
  74. Nicer-slam: Neural implicit scene encoding for rgb slam. arXiv preprint arXiv:2302.03594, 2023.
  75. Splatam: Splat, track & map 3d gaussians for dense rgb-d slam. arXiv preprint arXiv:2312.02126, 2023.
  76. The curious robot: Learning visual representations via physical interactions. In ECCV, 2016.
  77. Learning to look around: Intelligently exploring unseen environments for unknown tasks. In CVPR, 2018.
  78. Neu-nbv: Next best view planning using uncertainty estimation in image-based neural rendering. arXiv preprint arXiv:2303.01284, 2023.
  79. Off-Policy Evaluation with Online Adaptation for Robot Exploration in Challenging Environments. In IEEE Robotics and Automation Letters (RA-L), 2023.
  80. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.
  81. Anytime safe interval path planning for dynamic environments. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4708–4715, 2012.
  82. Multi-heuristic A. In Dieter Fox, Lydia E. Kavraki, and Hanna Kurniawati, editors, Robotics: Science and Systems X, University of California, Berkeley, USA, July 12-16, 2014, 2014.
  83. Path planning for non-circular micro aerial vehicles in constrained environments. In 2013 IEEE International Conference on Robotics and Automation, pages 3933–3940, 2013.
  84. Single- and dual-arm motion planning with heuristic search. Int. J. Robotics Res., 33(2):305–320, 2014.
  85. Steven M LaValle et al. Rapidly-exploring random trees: A new tool for path planning. Technical report, Iowa State University, 1998.
  86. Rrt-connect: An efficient approach to single-query path planning. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No.00CH37065), volume 2, pages 995–1001 vol.2, 2000.
  87. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 12(4):566–580, 1996.
  88. Sampling-based algorithms for optimal motion planning. The international journal of robotics research, 30(7):846–894, 2011.
  89. Batch informed trees (bit): Sampling-based optimal planning via the heuristically guided search of implicit random geometric graphs. In 2015 IEEE international conference on robotics and automation (ICRA), pages 3067–3074. IEEE, 2015.
  90. Regionally accelerated batch informed trees (rabit): A framework to integrate local information into optimal path planning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 4207–4214. IEEE, 2016.
  91. Automated Planning and Acting. Cambridge University Press, 2016.
  92. Integrated Task and Motion Planning. In arXiv:2010.01083, 2010.
  93. Offline reinforcement learning for visual navigation. In CoRL, 2022.
  94. Fastrlap: A system for learning high-speed driving via deep rl and autonomous practicing. arXiv pre-print, 2023.
  95. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. In NeurIPS 2020 Tutorial, 2020.
  96. Peorl: Integrating symbolic planning and hierarchical reinforcement learning for robust decision-making. arXiv preprint arXiv:1804.07779, 2018.
  97. Task-motion planning with reinforcement learning for adaptable mobile service robots. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7529–7534. IEEE, 2019.
  98. Active exploration for learning symbolic representations. Advances in Neural Information Processing Systems, 30, 2017.
  99. From skills to symbols: Learning symbolic representations for abstract high-level planning. Journal of Artificial Intelligence Research, 61:215–289, 2018.
  100. A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Transactions on intelligent vehicles, 1(1):33–55, 2016.
  101. Distribution-aware goal prediction and conformant model-based planning for safe autonomous driving. ICML Workshop on Safe Learning for Autonomous Driving, 2022.
  102. Learn-to-race: A multimodal control environment for autonomous racing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9793–9802, 2021.
  103. Learn-to-race challenge 2022: Benchmarking safe learning and cross-domain generalisation in autonomous racing. ICML Workshop on Safe Learning for Autonomous Driving, 2022.
  104. Model predictive path integral control: From theory to parallel computation. Journal of Guidance, Control, and Dynamics, 40(2):344–357, 2017.
  105. Robot learning from demonstration. In ICML, 1997.
  106. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 2013.
  107. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  108. Playing atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.
  109. Mastering the game of go with deep neural networks and tree search. In Nature, 2016.
  110. Deep drone acrobatics. In Proceedings of Robotics: Science and Systems, Corvalis, Oregon, USA, July 2020.
  111. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. In arXiv:1603.02199, 2016.
  112. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In CoRL, 2018.
  113. Learning quadrupedal locomotion over challenging terrain. In Science Robotics, 21 Oct 2020.
  114. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.
  115. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
  116. Maximum entropy inverse reinforcement learning. In AAAI, 2008.
  117. Generative adversarial networks. In NIPS, 2014.
  118. Generative adversarial imitation learning. In NIPS, 2016.
  119. Agile autonomous driving using end-to-end deep imitation learning. In RSS, 2018.
  120. Learning agile skills via adversarial imitation of rough partial demonstrations. In CoRL, 2022.
  121. Reinforcement Learning: An Introduction , second edition. The MIT Press, 2018.
  122. Safe autonomous racing via approximate reachability on ego-vision. arXiv preprint arXiv:2110.07699, 2021.
  123. Deepak Pathak Zipeng Fu, Xuxin Cheng. Deep whole-body control: Learning a unified policy for manipulation and locomotion. In CoRL, 2022.
  124. Extreme parkour with legged robots. In arXiv:2309.14341, 2023.
  125. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Neural Information Processing Systems, 2018.
  126. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.
  127. Deep visual foresight for planning robot motion. In ICRA, 2017.
  128. S. Levine I. Kostrikov, A. Nair. Offline reinforcement learning with implicit q-learning. In ICLR, 2022.
  129. Mopo: Model-based offline policy optimization. In NeurIPS, 2020.
  130. Plas: Latent action space for offline reinforcement learning. In Conference on Robot Learning (CoRL), 2020.
  131. Offline reinforcement learning from images with latent space models. In Proceedings of Machine Learning Research, volume 144:1–15, 2021.
  132. Masked autoencoders are scalable vision learners. In CVPR, 2022.
  133. What do self-supervised vision transformers learn? In ICLR, 2023.
  134. Objectives matter: Understanding the impact of self-supervised objectives on vision transformer representations. arXiv:2304.13089, 2023.
  135. Conceptfusion: Open-set multimodal 3d mapping. RSS, 2023.
  136. Deep vit features as dense visual descriptors. arXiv:2112.05814, 2021.
  137. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2955–2966, 2023.
  138. Yann LeCun. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review, 62, 2022.
  139. Imagebind: One embedding space to bind them all. arXiv preprint arXiv:2211.05778, 2022.
  140. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020.
  141. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023.
  142. Mc-jepa: A joint-embedding predictive architecture for self-supervised learning of motion and content features. arXiv preprint arXiv:2307.12698, 2023.
  143. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, 2006.
  144. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  145. Improving language understanding by generative pre-training. Technical report, OpenAI, 2018.
  146. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
  147. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
  148. Harnessing the power of llms in practice: A survey on chatgpt and beyond, 2023.
  149. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
  150. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022.
  151. Language-driven semantic segmentation. In ICLR, 2022.
  152. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In arXiv:1908.02265, 2019.
  153. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, 2021.
  154. Vl-beit: Generative vision-language pretraining, 2022.
  155. Flamingo: a visual language model for few-shot learning. ArXiv, abs/2204.14198, 2022.
  156. GIT: A generative image-to-text transformer for vision and language. Transactions on Machine Learning Research, 2022.
  157. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, 2022.
  158. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. CoRR, abs/2301.12597, 2023.
  159. OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023.
  160. Next-gpt: Any-to-any multimodal llm, 2023.
  161. Audiogpt: Understanding and generating speech, music, sound, and talking head, 2023.
  162. Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities, 2023.
  163. Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding, 2023.
  164. What went wrong? closing the sim-to-real gap via differentiable causal discovery. In 7th Annual Conference on Robot Learning, 2023.
  165. Jonathan Francis. Knowledge-enhanced Representation Learning for Multiview Context Understanding. PhD thesis, Carnegie Mellon University, 2022.
  166. Transferring implicit knowledge of non-visual object properties across heterogeneous robot morphologies. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11315–11321. IEEE, 2023.
  167. Cross-tool and cross-behavior perceptual knowledge transfer for grounded object recognition. arXiv preprint arXiv:2303.04023, 2023.
  168. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  169. Deep rl at scale: Sorting waste in office buildings with a fleet of mobile manipulators. In Robotics: Science and Systems (RSS), 2023.
  170. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012.
  171. Isaac gym: High performance gpu-based physics simulation for robot learning, 2021.
  172. Orbit: A unified simulation framework for interactive robot learning environments, 2023.
  173. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019.
  174. Habitat 2.0: Training home assistants to rearrange their habitat, 2022.
  175. Habitat 3.0: A co-habitat for humans, avatars and robots, 2023.
  176. Embodiment Collaboration. Open x-embodiment: Robotic learning datasets and rt-x models, 2023.
  177. Bayesian multi-task learning mpc for robotic mobile manipulation, 2023.
  178. TARE: A Hierarchical Framework for Efficiently Exploring Complex 3D Environments. In ICRA, 2023.
  179. Provably constant-time planning and replanning for real-time grasping objects off a conveyor belt. In RSS, 2020.
  180. Manipulation planning among movable obstacles using physics-based adaptive motion primitives. In 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, May 2021.
  181. Can foundation models perform zero-shot task specification for robot manipulation? In Learning for Dynamics and Control Conference, pages 893–905. PMLR, 2022.
  182. Eureka: Human-level reward design via coding large language models. In 2nd Workshop on Language and Robot Learning: Language as Grounding, 2023.
  183. A survey of uncertainty in deep neural networks. Artificial Intelligence Review, 56(Suppl 1):1513–1589, 2023.
  184. Interpretable deep learning: Interpretation, interpretability, trustworthiness, and beyond. Knowledge and Information Systems, 64(12):3197–3234, 2022.
  185. Just ask: An interactive learning framework for vision and language navigation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 2459–2466, 2020.
  186. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.
  187. Robots that ask for help: Uncertainty alignment for large language model planners. In 7th Annual Conference on Robot Learning, 2023.
  188. Data-driven safety filters: Hamilton-jacobi reachability, control barrier functions, and predictive methods for uncertain systems. IEEE Control Systems Magazine, 43(5):137–177, 2023.
  189. The safety filter: A unified view of safety-critical control in autonomous systems. arXiv preprint arXiv:2309.05837, 2023.
  190. Control barrier functions: Theory and applications. In 2019 18th European control conference (ECC), pages 3420–3431. IEEE, 2019.
  191. Hamilton-jacobi reachability: A brief overview and recent advances. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 2242–2253. IEEE, 2017.
  192. Backpropagation through signal temporal logic specifications: Infusing logical structure into gradient-based methods. The International Journal of Robotics Research, 42(6):356–370, 2023.
  193. A review of safe reinforcement learning: Methods, theory and applications. arXiv preprint arXiv:2205.10330, 2022.
  194. Safe control with learned certificates: A survey of neural lyapunov, barrier, and contraction methods for robotics and control. IEEE Transactions on Robotics, 2023.
  195. Robocat: A self-improving foundation agent for robotic manipulation, 2023.
  196. Cliport: What and where pathways for robotic manipulation. In CoRL, 2021.
  197. Distilled feature fields enable few-shot language-guided manipulation. CoRL, 2023.
  198. Multi-task real robot learning with generalizable neural feature fields. CoRL, 2023.
  199. Fm-loc: Using foundation models for improved vision-based localization. arXiv:2304.07058, 2023.
  200. Mosaic: Learning unified multi-sensory object property representations for robot perception. arXiv preprint arXiv:2309.08508, 2023.
  201. Visual language maps for robot navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023.
  202. Clip-fields: Weakly supervised semantic fields for robotic memory. In RSS, 2023.
  203. Conceptfusion: Open-set multimodal 3d mapping. In arXiv:2302.07241, 2023.
  204. Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action. In CoRL, 2022.
  205. The homerobot open vocab mobile manipulation challenge. In Thirty-seventh Conference on Neural Information Processing Systems: Competition Track, 2023.
  206. Act3d: 3d feature field transformers for multi-task robotic manipulation, 2023.
  207. Language-extended indoor slam (lexis): A versatile system for real-time visual scene understanding. arXiv preprint arXiv:2309.15065, 2023.
  208. Anyloc: Towards universal visual place recognition. RA-L, 2023.
  209. Foundloc: Vision-based onboard aerial localization in the wild, 2023.
  210. Grounding semantic categories in behavioral interactions: Experiments with 100 objects. Robotics and Autonomous Systems, 62(5):632–645, May 2014.
  211. Learning relational object categories using behavioral exploration and multimodal perception. In International Conference on Robotics and Automation (ICRA), pages 5691–5698, Hong Kong, China, May 2014. IEEE.
  212. Learning haptic representation for manipulating deformable food objects. In Intelligent Robots and Systems (IROS), pages 638–645, Chicago, IL, USA, September 2014. IEEE.
  213. Sensorimotor cross-behavior knowledge transfer for grounded category recognition. In International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob). IEEE, 2019.
  214. A framework for sensorimotor cross-perception and cross-behavior knowledge transfer for object categorization. Frontiers in Robotics and AI, 7:137, 2020.
  215. Haptic knowledge transfer between heterogeneous robots using kernel manifold alignment. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020.
  216. Code as policies: Language model programs for embodied control. arXiv preprint arXiv:2209.07753, 2023.
  217. Video language planning. arXiv preprint arXiv:2310.10625, 2023.
  218. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683, 2018.
  219. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022.
  220. Text2motion: From natural language instructions to feasible plans. arXiv preprint arXiv:2303.12153, 2023.
  221. ProgPrompt: Generating situated robot task plans using large language models, 2022.
  222. Gensim: Generating robotic simulation tasks via large language models. In CoRL, 2023.
  223. Pddl generators, 2022.
  224. Sayplan: Grounding large language models using 3d scene graphs for scalable task planning. In 7th Annual Conference on Robot Learning, 2023.
  225. Reasoning about the unseen for efficient outdoor object navigation, 2023.
  226. Language to rewards for robotic skill synthesis. arXiv preprint arXiv:2306.08647, 2023.
  227. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023.
  228. How to not train your dragon: Training-free embodied object goal navigation with semantic frontiers. arXiv preprint arXiv:2305.16925, 2023.
  229. Prompt a robot to walk with large language models, 2023.
  230. Guiding pretraining in reinforcement learning with large language models, 2023.
  231. Reward design with language models, 2023.
  232. Text2reward: Automated dense reward function generation for reinforcement learning, 2023.
  233. Grounded decoding: Guiding text generation with grounded models for robot control, 2023.
  234. Grounding large language models in interactive environments with online reinforcement learning, 2023.
  235. Scaling up and distilling down: Language-guided robot skill acquisition. In Proceedings of the 2023 Conference on Robot Learning, 2023.
  236. Robogen: Towards unleashing infinite data for automated robot learning via generative simulation, 2023.
  237. Scaling robot learning with semantically imagined experience. arXiv preprint arXiv:2302.11550, 2023.
  238. Rt-trajectory: Robotic task generalization via hindsight trajectory sketches, 2023.
  239. Zero-shot robotic manipulation with pretrained image-editing diffusion models, 2023.
  240. Learning universal policies via text-guided video generation. In NeurIPS, 2023.
  241. Chain-of-thought prompting elicits reasoning in large language models, 2023.
  242. Faith and fate: Limits of transformers on compositionality, 2023.
  243. Graph of thoughts: Solving elaborate problems with large language models, 2023.
  244. Tree of thoughts: Deliberate problem solving with large language models, 2023.
  245. Planning with large language models for code generation. In The Eleventh International Conference on Learning Representations, 2023.
  246. Llm+p: Empowering large language models with optimal planning proficiency, 2023.
  247. Inner monologue: Embodied reasoning through planning with language models, 2022.
  248. Open-world object manipulation using pre-trained vision-language models. arXiv preprint arXiv:2303.00905, 2023.
  249. Pre-trained language models for interactive decision-making. Advances in Neural Information Processing Systems, 35:31199–31212, 2022.
  250. Lila: Language-informed latent actions. In Conference on Robot Learning, pages 1379–1390. PMLR, 2022.
  251. Language instructed reinforcement learning for human-ai coordination. arXiv preprint arXiv:2304.07297, 2023.
  252. Interactive language: Talking to robots in real time. arXiv preprint arXiv:2210.06407, 2022.
  253. Pi-qt-opt: Predictive information improves multi-task robotic reinforcement learning at scale. CoRL, 2022.
  254. Deep rl at scale: Sorting waste in office buildings with a fleet of mobile manipulators. In RSS, 2023.
  255. Q-transformer: Scalable offline reinforcement learning via autoregressive q-functions. In CoRL, 2023.
  256. Pre-training for robots: Offline rl enables learning new tasks from a handful of trials. In RSS, 2023.
  257. Conservative q-learning for offline reinforcement learning. In NeurIPS, 2020.
  258. The unsurprising effectiveness of pre-trained vision models for control, 2022.
  259. R3m: A universal visual representation for robot manipulation, 2022.
  260. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, pages 416–426. PMLR, 2023.
  261. Robot learning with sensorimotor pre-training, 2023.
  262. On pre-training for visuo-motor control: Revisiting a learning-from-scratch baseline. In ICML, 2023.
  263. Where are we in the search for an artificial visual cortex for embodied intelligence?, 2023.
  264. Pali-x: On scaling up a multilingual vision and language model, 2023.
  265. Affordances from human videos as a versatile representation for robotics. CVPR, 2023.
  266. Gnm: A general navigation model to drive any robot. In ICRA, 2023.
  267. Indoorsim-to-outdoorreal: Learning to navigate outdoors without any outdoor experience. arXiv preprint arXiv:2305.01098, 2023.
  268. Pact: Perception-action causal transformer for autoregressive robotics pre-training. arXiv preprint arXiv:2209.11133, 2022.
  269. Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. arXiv preprint arXiv:2309.16650, 2023.
  270. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023.
  271. VIMA: General robot manipulation with multimodal prompts. arXiv preprint arXiv:2210.03094, 2022.
  272. Human-to-robot imitation in the wild. In RSS, 2022.
  273. Transformers are adaptable task planners. In 6th Annual Conference on Robot Learning, 2022.
  274. Modular multitask reinforcement learning with policy sketches. In International conference on machine learning, pages 166–175. PMLR, 2017.
  275. Using a hand-drawn sketch to control a team of robots. Autonomous Robots, 22:399–410, 2007.
  276. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  277. Towards open vocabulary learning: A survey. arXiv preprint arXiv:2306.15880, 2023.
  278. Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges. arXiv preprint arXiv:2311.03287, 2023.
  279. Physically grounded vision-language models for robotic manipulation. arXiv preprint, 2023.
  280. Robonet: Large-scale multi-robot learning, 2020.
  281. Bridge data: Boosting generalization of robotic skills with cross-domain datasets, 2021.
  282. Bridgedata v2: A dataset for robot learning at scale. arXiv preprint arXiv:2308.12952, 2023.
  283. Leap hand: Low-cost, efficient, and anthropomorphic hand for robot learning. arXiv preprint arXiv:2309.06440, 2023.
  284. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017.
  285. Gibson env: Real-world perception for embodied agents. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9068–9079, 2018.
  286. Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  287. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474, 2017.
  288. Film: Following instructions in language with modular methods, 2022.
  289. Airsim: High-fidelity visual and physical simulation for autonomous vehicles, 2017.
  290. Google DeepMind. Mujoco 3.0. https://github.com/google-deepmind/mujoco/releases/tag/3.0.0, 2023. Accessed: [Insert date of access].
  291. Clvr jaco play dataset. https://github.com/clvrai/clvr-jaco-play-dataset, 2023.
  292. Multi-stage cable routing through hierarchical imitation learning, 2023.
  293. The surprising effectiveness of representation learning for visual imitation. arXiv preprint arXiv:2112.01511, 2021.
  294. Berkeley ur5 demonstration dataset. https://sites.google.com/view/berkeley-ur5/home. Accessed: [Insert Date Here].
  295. Latent plans for task agnostic offline reinforcement learning. In Proceedings of the 6th Conference on Robot Learning (CoRL), 2022.
  296. Chat with the environment: Interactive multimodal perception using large language models, 2023.
  297. Coppelia Robotics. Coppeliasim. https://www.coppeliarobotics.com/. Accessed: [Insert Date Here].
  298. E. Coumans and Y. Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning.
  299. Instruct2act: Mapping multi-modality instructions to robotic actions with large language model, 2023.
  300. Sapien: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11097–11107, 2020.
  301. Where2act: From pixels to actions for articulated 3d objects, 2021.
  302. Beyond pick-and-place: Tackling robotic stacking of diverse shapes, 2021.
  303. Large language models as generalizable policies for embodied tasks, 2023.
  304. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
  305. Transporter networks: Rearranging the visual world for robotic manipulation. Conference on Robot Learning (CoRL), 2020.
  306. Liv: Language-image representations and rewards for robotic control, 2023.
  307. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning, 2021.
  308. Tidybot: Personalized robot assistance with large language models. arXiv preprint arXiv:2305.05658, 2023.
  309. Task and motion planning with large language models for object rearrangement. arXiv preprint arXiv:2303.06247, 2023.
  310. N. Koenig and A. Howard. Design and use paradigms for gazebo, an open-source multi-robot simulator. In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566), volume 3, pages 2149–2154, 2004.
  311. Habitat-matterport 3d semantics dataset, 2023.
  312. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203, 2017.
  313. Babyai: A platform to study the sample efficiency of grounded language learning. arXiv preprint arXiv:1810.08272, 2018.
  314. Leveraging procedural generation to benchmark reinforcement learning. In International conference on machine learning, pages 2048–2056. PMLR, 2020.
  315. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  316. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.
  317. Grounded decoding: Guiding text generation with grounded models for robot control. arXiv preprint arXiv:2303.00855, 2023.
  318. Minimalistic gridworld environment for gymnasium. https://github.com/pierg/environments-rl, 2018.
  319. Towards human-level bimanual dexterous manipulation with reinforcement learning. Advances in Neural Information Processing Systems, 35:5150–5163, 2022.
  320. Predictive sampling: Real-time behaviour synthesis with mujoco, 2022.
  321. The mit humanoid robot: Design, motion planning, and control for acrobatic behaviors. In 2020 IEEE-RAS 20th International Conference on Humanoid Robots (Humanoids), pages 1–8. IEEE, 2021.
  322. Model-free safe control for zero-violation reinforcement learning. In 5th Annual Conference on Robot Learning, 2021.
  323. Scalable learning of safety guarantees for autonomous systems using hamilton-jacobi reachability, 2021.
  324. Safety assurances for human-robot interaction via confidence-aware game-theoretic human models. In 2022 International Conference on Robotics and Automation (ICRA), pages 11229–11235. IEEE, 2022.
  325. A new concept of safety affordance map for robots object manipulation. In 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pages 565–570, 2018.
  326. Plug in the safety chip: Enforcing constraints for LLM-driven robot agents. In 2nd Workshop on Language and Robot Learning: Language as Grounding, 2023.
  327. Modular brain networks. Annual review of psychology, 67:613–640, 2016.
  328. Modular and hierarchically modular organization of brain networks. Frontiers in neuroscience, 4:200, 2010.
  329. How to prompt your robot: A promptbook for manipulation skills with code as policies. In 2nd Workshop on Language and Robot Learning: Language as Grounding, 2023.
  330. Keto: Learning keypoint representations for tool manipulation, 2019.
  331. Gift: Generalizable interaction-aware functional tool affordances without labels. arXiv preprint arXiv:2106.14973, 2021.
  332. Learning generalizable tool-use skills through trajectory generation. arXiv preprint arXiv:2310.00156, 2023.
  333. Shadow Robot Company. Dexterous hand series. https://www.shadowrobot.com/dexterous-hand-series/, 2023. Accessed: 2023-12-10.
  334. Improved gelsight tactile sensor for measuring geometry and slip. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, September 2017.
  335. Robotsweater: Scalable, generalizable, and customizable machine-knitted tactile skins for robots. arXiv preprint arXiv:2303.02858, 2023.
  336. Learning fine-grained bimanual manipulation with low-cost hardware, 2023.
  337. Realtime qa: What’s the answer right now?, 2022.
  338. Dsi++: Updating transformer memory with new documents. arXiv preprint arXiv:2212.09744, 2022.
  339. An empirical investigation of the role of pre-training in lifelong learning. Journal of Machine Learning Research, 24(214):1–50, 2023.
  340. Construct-vl: Data-free continual structured vl concepts learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14994–15004, 2023.
  341. Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges. Information Fusion, 58:52–68, June 2020.
  342. Active incremental learning of robot movement primitives. In Sergey Levine, Vincent Vanhoucke, and Ken Goldberg, editors, Proceedings of the 1st Annual Conference on Robot Learning, volume 78 of Proceedings of Machine Learning Research, pages 37–46. PMLR, 13–15 Nov 2017.
  343. Rma: Rapid motor adaptation for legged robots. In Robotics: Science and Systems, 2021.
  344. Inverse preference learning: Preference-based rl without a reward function, 2023.
  345. Reinforcement learning with human feedback: Learning dynamic choices via pessimism, 2023.
  346. Fine-tuning can distort pretrained features and underperform out-of-distribution. In International Conference on Learning Representations, 2022.
  347. Solving rubik’s cube with a robot hand, 2019.
  348. Sample efficient reinforcement learning from human feedback via active exploration, 2023.
  349. Direct preference-based policy optimization without reward modeling, 2023.
  350. Compositional foundation models for hierarchical planning, 2023.
  351. Meta learning shared hierarchies, 2017.
  352. Hierarchical foresight: Self-supervised learning of long-horizon tasks via visual subgoal generation. In International Conference on Learning Representations, 2020.
  353. Slowfast networks for video recognition, 2019.
Authors (23)
  1. Yafei Hu (7 papers)
  2. Quanting Xie (3 papers)
  3. Vidhi Jain (12 papers)
  4. Jonathan Francis (48 papers)
  5. Jay Patrikar (17 papers)
  6. Nikhil Keetha (10 papers)
  7. Seungchan Kim (12 papers)
  8. Yaqi Xie (23 papers)
  9. Tianyi Zhang (262 papers)
  10. Chen Wang (599 papers)
  11. Katia Sycara (93 papers)
  12. Matthew Johnson-Roberson (72 papers)
  13. Dhruv Batra (160 papers)
  14. Xiaolong Wang (243 papers)
  15. Sebastian Scherer (163 papers)
  16. Zsolt Kira (110 papers)
  17. Fei Xia (111 papers)
  18. Yonatan Bisk (91 papers)
  19. Shibo Zhao (14 papers)
  20. Hao-Shu Fang (38 papers)
Citations (43)