Abstract

We survey applications of pretrained foundation models in robotics. Traditional deep learning models in robotics are trained on small datasets tailored for specific tasks, which limits their adaptability across diverse applications. In contrast, foundation models pretrained on internet-scale data appear to have superior generalization capabilities, and in some instances display an emergent ability to find zero-shot solutions to problems that are not present in the training data. Foundation models hold the potential to enhance various components of the robot autonomy stack, from perception to decision-making and control. For example, LLMs can generate code or provide common-sense reasoning, while vision-language models enable open-vocabulary visual recognition. However, significant open research challenges remain, particularly around the scarcity of robot-relevant training data, safety guarantees and uncertainty quantification, and real-time execution. In this survey, we study papers that have used or built foundation models to solve robotics problems. We explore how foundation models contribute to improving robot capabilities in the domains of perception, decision-making, and control. We discuss the challenges hindering the adoption of foundation models in robot autonomy and outline opportunities and potential pathways for future advancements. The GitHub project corresponding to this paper (a preliminary release that we are committed to enhancing and updating to keep it current) can be found here: https://github.com/robotics-survey/Awesome-Robotics-Foundation-Models

Overview

  • Foundation models are large-scale machine learning models that are pre-trained on diverse datasets and can be fine-tuned for various robotics tasks.

  • They can enhance decision-making, control, perception, and task planning in robotics, leveraging models like GPT-3, CLIP, and DALL-E.

  • Robotics faces challenges integrating foundation models, including data scarcity, safety concerns, uncertain decision-making, and the need for real-time processing.

  • Techniques such as learning from unstructured play data, uncertainty quantification methods, and high-fidelity simulators are proposed to address these challenges.

  • Future research in robotics aims to develop reliable, efficient, and safe robots that can adapt to a broad range of tasks using foundation models.

Understanding Foundation Models in Robotics

Introduction to Foundation Models

Foundation models are a type of machine learning model that is pre-trained on massive, diverse data sets, enabling them to learn general-purpose representations and skills. These models can then be fine-tuned or adapted to a wide array of downstream tasks. Examples include BERT for text processing and GPT for text generation, as well as models like CLIP and DALL-E that work across both vision and language. In robotics, these models hold promise for enhancing perception, decision-making, control, and even task planning. They can generate code, provide common-sense reasoning, and recognize visual concepts in an open-ended manner. However, realizing their potential in robotics also presents unique challenges, particularly regarding training data scarcity, safety, uncertainty quantification, and achieving real-time performance.
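
As a concrete illustration of open-vocabulary recognition, the minimal sketch below scores a robot camera frame against an arbitrary list of text labels with CLIP via the Hugging Face `transformers` library; the checkpoint name, image path, and label set are illustrative assumptions rather than part of the survey.

```python
# Minimal sketch: zero-shot, open-vocabulary recognition with CLIP.
# Checkpoint, image path, and label list are illustrative, not prescriptive.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("robot_camera_frame.jpg")  # any RGB image from the robot
labels = ["a coffee mug", "a screwdriver", "a power drill", "a sponge"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.2f}")
```

Because the labels are free-form text, the same model can be re-queried with a new vocabulary at runtime, without retraining.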

Applications and Advancements

Foundation models offer significant advancements for robotics in several areas:

Decision Making and Control:

  • Robots can learn policies from human demonstrations, including from unstructured play data, which is easier to collect.
  • Robots can be trained to follow language instructions and reinforcement learning signals, with LLMs such as GPT-3 used to decompose high-level instructions into executable subtasks, giving robots a more natural interface for goals and control (see the sketch below).
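
The sketch below illustrates the task-decomposition idea under stated assumptions: `query_llm` is a hypothetical stand-in for any LLM completion API, and the primitive skill names are assumptions about the robot's skill library, not part of any specific surveyed system.

```python
# Sketch: LLM-based decomposition of a language instruction into primitive skills.
# `query_llm` and the skill names are hypothetical placeholders.
SKILLS = ["go_to(location)", "pick(object)", "place(object, location)"]

PROMPT = """You control a household robot with these primitive skills:
{skills}
Rewrite the user instruction as an ordered list of skill calls, one per line.

Instruction: {instruction}
Plan:"""

def query_llm(prompt: str) -> str:
    """Placeholder: call your preferred language model and return its text."""
    raise NotImplementedError

def plan(instruction: str) -> list[str]:
    response = query_llm(PROMPT.format(skills="\n".join(SKILLS),
                                       instruction=instruction))
    # Each non-empty line of the response is treated as one skill call.
    return [line.strip() for line in response.splitlines() if line.strip()]

# Expected style of output (illustrative):
# plan("put the mug from the table into the sink")
# -> ["go_to(table)", "pick(mug)", "go_to(sink)", "place(mug, sink)"]
```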

Perception Capabilities:

  • Vision-language models enable open-vocabulary object detection, segmentation, and recognition, letting robots perceive concepts beyond a fixed set of training labels.
  • Foundation models also support 3D scene understanding and language-grounded scene representations, connecting perception to downstream planning and manipulation.

Embodied AI and Generalist Agents:

  • Research in embodied AI focuses on using foundation models to endow robots with versatile skills, such as navigation and task planning.
  • Generalist agents are trained on various simulations or real-world tasks to become adaptable across multiple scenarios and tasks.

Challenges in Robotic Integration

Incorporating foundation models into robotics comes with several challenges:

Training Data Scarcity:

  • Robotics-specific data is limited compared to the internet-scale text and image data used to train many foundation models.
  • Techniques to tackle this issue include leveraging unstructured play data, data augmentation methods, and high-fidelity simulators.
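
One way to picture the data-augmentation idea is semantic augmentation of recorded demonstrations: a generative model edits task-irrelevant pixels while the original actions are reused. The sketch below assumes a hypothetical `inpaint` wrapper around any text-guided inpainting model and is not a specific system from the survey.

```python
# Sketch: grow a robot demonstration dataset by inpainting backgrounds/distractors
# while keeping the recorded actions fixed. `inpaint` is a hypothetical wrapper
# around any text-guided inpainting model.
import random

SCENE_PROMPTS = ["a wooden table top", "a cluttered kitchen counter",
                 "a metal workbench"]

def inpaint(image, mask, prompt):
    """Placeholder for a diffusion-based, text-guided inpainting call."""
    raise NotImplementedError

def augment_episode(frames, background_masks, actions):
    """Return a visually novel copy of one demonstration; the actions are reused
    unchanged because only task-irrelevant pixels are edited."""
    prompt = random.choice(SCENE_PROMPTS)
    new_frames = [inpaint(f, m, prompt) for f, m in zip(frames, background_masks)]
    return new_frames, list(actions)
```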

Uncertainty and Safety in Decision Making:

  • Since foundation models can sometimes produce incorrect outputs, quantifying uncertainty and ensuring safety in robotic applications is crucial.
  • Research efforts focus on uncertainty quantification methods that give robots the ability to ask for help when unsure, ensuring more reliable operations.
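
A simple way to realize "ask for help when unsure" is to build a prediction set over candidate actions and defer to a human whenever that set is not a singleton. The sketch below is a toy illustration of this idea; the scores, threshold, and option names are made up, and a real system would calibrate the threshold on held-out data (e.g., with conformal prediction).

```python
# Toy sketch: keep every candidate action whose confidence clears a calibrated
# threshold; ask a human for help whenever more than one action survives.
import numpy as np

def prediction_set(scores, threshold):
    """Candidate indices whose softmax probability is at least `threshold`."""
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return [i for i, p in enumerate(probs) if p >= threshold]

options = ["pick up the red mug", "pick up the blue mug", "wait"]
scores = np.array([2.1, 1.9, -1.0])   # placeholder LLM log-scores
threshold = 0.3                        # would come from a calibration set

kept = prediction_set(scores, threshold)
if len(kept) == 1:
    print("Execute:", options[kept[0]])
else:
    print("Ambiguous - asking the human to choose among:",
          [options[i] for i in kept])
```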

Real-Time Performance:

  • The inference latency of large foundation models can exceed the update rates required by robot control loops; smaller, distilled, or quantized models and efficient deployment are therefore needed for closed-loop use (a minimal quantization sketch follows).
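
One common mitigation, sketched below under stated assumptions, is to deploy a smaller, quantized language model on the robot; the checkpoint name is illustrative, and the snippet assumes a CUDA GPU with the `bitsandbytes` and `accelerate` packages installed.

```python
# Sketch: load a 4-bit-quantized causal LM for lower-latency on-robot inference.
# Model name is illustrative; requires a CUDA GPU, bitsandbytes, and accelerate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"   # any causal LM checkpoint
quant_cfg = BitsAndBytesConfig(load_in_4bit=True,
                               bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=quant_cfg,
                                             device_map="auto")

prompt = "List the steps to clear the dinner table:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```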

Variability in Robotic Settings:

  • Robots operate in diverse environments with different physical attributes and tasks. Creating general-purpose, cross-embodiment foundation models that capture a wide range of robotic data is essential for broader applicability.

Benchmarking and Reproducibility:

  • The variation in simulation environments and hardware specifics makes benchmarking and reproducing results challenging. Open hardware initiatives and transparent experimental setups can help address this issue.

The Road Ahead

The integration of foundation models in robotics is an active area of development. Future research directions include creating reliable, real-time capable models, generating robotics-specific training data, and building safety mechanisms for autonomous operations. The ultimate goal is to develop versatile robots that can operate safely and effectively in complex real-world scenarios, leveraging the vast learning potential of foundation models.

