
Foundation Models in Robotics: Applications, Challenges, and the Future (2312.07843v1)

Published 13 Dec 2023 in cs.RO

Abstract: We survey applications of pretrained foundation models in robotics. Traditional deep learning models in robotics are trained on small datasets tailored for specific tasks, which limits their adaptability across diverse applications. In contrast, foundation models pretrained on internet-scale data appear to have superior generalization capabilities, and in some instances display an emergent ability to find zero-shot solutions to problems that are not present in the training data. Foundation models may hold the potential to enhance various components of the robot autonomy stack, from perception to decision-making and control. For example, LLMs can generate code or provide common sense reasoning, while vision-language models enable open-vocabulary visual recognition. However, significant open research challenges remain, particularly around the scarcity of robot-relevant training data, safety guarantees and uncertainty quantification, and real-time execution. In this survey, we study papers that have used or built foundation models to solve robotics problems. We explore how foundation models contribute to improving robot capabilities in the domains of perception, decision-making, and control. We discuss the challenges hindering the adoption of foundation models in robot autonomy and provide opportunities and potential pathways for future advancements. The GitHub project corresponding to this paper (Preliminary release. We are committed to further enhancing and updating this work to ensure its quality and relevance) can be found here: https://github.com/robotics-survey/Awesome-Robotics-Foundation-Models

Understanding Foundation Models in Robotics

Introduction to Foundation Models

Foundation models are a type of machine learning model that is pre-trained on massive, diverse data sets, enabling them to learn general-purpose representations and skills. These models can then be fine-tuned or adapted to a wide array of downstream tasks. Examples include BERT for text processing and GPT for text generation, as well as models like CLIP and DALL-E that work across both vision and language. In robotics, these models hold promise for enhancing perception, decision-making, control, and even task planning. They can generate code, provide common-sense reasoning, and recognize visual concepts in an open-ended manner. However, realizing their potential in robotics also presents unique challenges, particularly regarding training data scarcity, safety, uncertainty quantification, and achieving real-time performance.
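To make the adaptation idea concrete, the sketch below shows how a pretrained vision-language model such as CLIP can be queried zero-shot for open-ended recognition, with no robot- or task-specific fine-tuning. The checkpoint name and the Hugging Face transformers API are implementation assumptions chosen for illustration, not prescriptions from the surveyed paper.

```python
# Minimal sketch: zero-shot, open-vocabulary recognition with CLIP.
# Assumes the Hugging Face `transformers` library and the public
# "openai/clip-vit-base-patch32" checkpoint (illustrative choices only).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("tabletop_scene.jpg")                    # any robot camera frame
candidate_labels = ["a coffee mug", "a screwdriver", "an apple", "a sponge"]

inputs = processor(text=candidate_labels, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image's similarity to every text prompt.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for label, p in zip(candidate_labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```

Because the label set is just a list of strings, the same model can be repurposed at run time for whatever objects a task mentions, which is the property that makes such models attractive for the perception applications discussed below.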

Applications and Advancements

Foundation models offer significant advancements for robotics in several areas:

  1. Decision Making and Control:
    • Robots can learn policies from human demonstrations, including from unstructured play data, which is cheaper to collect than curated, task-specific demonstrations.
    • Robots can be conditioned on language instructions and reinforcement learning signals, with LLMs such as GPT-3 used to decompose high-level instructions into executable subtasks (see the task-decomposition sketch after this list).
  2. Perception Capabilities:
    • Open-vocabulary object detection enables robots to identify and localize objects they have never encountered during training, with models like GLIP, OWL-ViT, and Grounding DINO providing object-level recognition (see the detection sketch after this list).
    • Semantic segmentation leverages vision-language models to assign a semantic label to each pixel in an image, aiding tasks like scene understanding and navigation.
  3. Embodied AI and Generalist Agents:
    • Research in embodied AI focuses on using foundation models to endow robots with versatile skills, such as navigation and task planning.
    • Generalist agents are trained on various simulations or real-world tasks to become adaptable across multiple scenarios and tasks.
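As referenced in the decision-making item above, a common pattern is to prompt an LLM to decompose a natural-language instruction into calls to primitive skills the robot already has, in the spirit of SayCan or Code as Policies. The sketch below is purely illustrative: the skill names, prompt format, and query_llm helper are hypothetical placeholders rather than any paper's actual interface, and a real system would additionally ground each step in affordance or feasibility checks.

```python
# Illustrative sketch: LLM-based decomposition of an instruction into primitive skills.
# `query_llm` is a hypothetical stand-in for any completion/chat API, and the
# skill vocabulary below is an example, not an interface from the surveyed work.
from typing import List

PRIMITIVE_SKILLS = ["move_to(location)", "pick(object)",
                    "place(object, location)", "open(container)", "close(container)"]

PROMPT_TEMPLATE = """You control a robot with exactly these skills: {skills}.
Rewrite the instruction as one skill call per line and output nothing else.
Instruction: {instruction}
Plan:"""


def query_llm(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., a hosted or local chat endpoint)."""
    raise NotImplementedError


def plan_from_instruction(instruction: str) -> List[str]:
    prompt = PROMPT_TEMPLATE.format(skills=", ".join(PRIMITIVE_SKILLS),
                                    instruction=instruction)
    completion = query_llm(prompt)
    # Keep only lines that invoke a known skill; everything else is discarded.
    allowed = {s.split("(")[0] for s in PRIMITIVE_SKILLS}
    steps = [line.strip() for line in completion.splitlines() if line.strip()]
    return [s for s in steps if s.split("(")[0] in allowed]

# With a working query_llm, "put the apple in the drawer" might decompose to
# ["open(drawer)", "pick(apple)", "place(apple, drawer)", "close(drawer)"].
```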
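For the open-vocabulary detection mentioned in the perception item above, the sketch below queries OWL-ViT for objects described only by free-form text. The checkpoint and post-processing call follow the Hugging Face transformers documentation and are an implementation assumption here; any open-set detector covered by the survey could be substituted.

```python
# Sketch: open-vocabulary object detection with OWL-ViT (assumed toolchain:
# Hugging Face `transformers` and the "google/owlvit-base-patch32" checkpoint).
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("tabletop_scene.jpg")
queries = [["a coffee mug", "a screwdriver", "a red apple"]]    # free-form text classes

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs into boxes, scores, and label indices in image coordinates.
target_sizes = torch.tensor([image.size[::-1]])                 # (height, width)
detections = processor.post_process_object_detection(
    outputs=outputs, threshold=0.2, target_sizes=target_sizes)[0]

for box, score, label in zip(detections["boxes"], detections["scores"], detections["labels"]):
    print(f"{queries[0][int(label)]}: score={score:.2f}, box={[round(v, 1) for v in box.tolist()]}")
```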

Challenges in Robotic Integration

Incorporating foundation models into robotics comes with several challenges:

  1. Training Data Scarcity:
    • Robotics-specific data is limited compared to the internet-scale text and image data used to train many foundation models.
    • Techniques to tackle this issue include leveraging unstructured play data, data augmentation with generative models (see the inpainting sketch after this list), and high-fidelity simulators.
  2. Uncertainty and Safety in Decision Making:
    • Since foundation models can sometimes produce incorrect outputs, quantifying uncertainty and ensuring safety in robotic applications is crucial.
    • Research efforts focus on uncertainty quantification methods that let robots ask for help when unsure (see the prediction-set sketch after this list), enabling more reliable operation.
  3. Real-Time Performance:
    • The high inference times of foundation models are a bottleneck for real-time robotic applications, motivating further research on computational efficiency (e.g., model compression and quantization) and on reliable access to remotely hosted models.
  4. Variability in Robotic Settings:
    • Robots operate in diverse environments with different physical attributes and tasks. Creating general-purpose, cross-embodiment foundation models that capture a wide range of robotic data is essential for broader applicability.
  5. Benchmarking and Reproducibility:
    • The variation in simulation environments and hardware specifics makes benchmarking and reproducing results challenging. Open hardware initiatives and transparent experimental setups can help address this issue.
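One concrete flavor of the data-augmentation methods mentioned under training-data scarcity is text-guided editing of logged robot observations, as explored in works such as ROSIE and GenAug: a region of the image is masked, a text prompt describes new scenery or distractors, and a generative model repaints it while the original action labels are reused unchanged. The sketch below assumes the diffusers library and a public Stable Diffusion inpainting checkpoint; it illustrates the general recipe rather than those papers' exact pipelines.

```python
# Sketch: augmenting a robot camera image by text-guided inpainting.
# Assumes the `diffusers` library and the public
# "runwayml/stable-diffusion-inpainting" checkpoint (illustrative choices only).
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16).to("cuda")

image = Image.open("observation.png").convert("RGB").resize((512, 512))
mask = Image.open("background_mask.png").convert("RGB").resize((512, 512))   # white = repaint

# Repaint the masked region so the same demonstration now shows a different scene.
augmented = pipe(prompt="a cluttered wooden table with a ceramic bowl",
                 image=image, mask_image=mask).images[0]
augmented.save("observation_augmented.png")
```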
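Similarly, the "ask for help when unsure" behavior mentioned under uncertainty and safety can be reduced to a prediction-set rule: score each candidate action, retain every option whose score clears a calibrated threshold, and defer to a human whenever more than one option survives. The sketch below is a simplified illustration with placeholder scores; in practice the threshold would come from a calibration procedure such as conformal prediction, and this is not the exact method of any single paper.

```python
# Simplified sketch of uncertainty-aware "ask for help":
# keep every candidate action whose score clears a calibrated threshold and
# request human clarification whenever the surviving set is not a singleton.
# Scores and threshold are placeholders for illustration.
from typing import Dict, List


def prediction_set(option_scores: Dict[str, float], threshold: float) -> List[str]:
    """Return every option whose score is at least `threshold`."""
    return [opt for opt, s in option_scores.items() if s >= threshold]


def act_or_ask(option_scores: Dict[str, float], threshold: float) -> str:
    candidates = prediction_set(option_scores, threshold)
    if len(candidates) == 1:
        return f"execute: {candidates[0]}"
    return f"ask for help: ambiguous among {candidates or list(option_scores)}"


# Example: model-assigned likelihoods for candidate next actions.
print(act_or_ask({"place bowl in sink": 0.48,
                  "place bowl in cabinet": 0.44,
                  "do nothing": 0.08}, threshold=0.30))              # asks for help
print(act_or_ask({"pick up the red block": 0.90,
                  "pick up the blue block": 0.10}, threshold=0.30))  # executes
```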

The Road Ahead

The integration of foundation models in robotics is an active area of development. Future research directions include creating reliable, real-time capable models, generating robotics-specific training data, and building safety mechanisms for autonomous operations. The ultimate goal is to develop versatile robots that can operate safely and effectively in complex real-world scenarios, leveraging the vast learning potential of foundation models.

Authors (15)
  1. Roya Firoozi
  2. Johnathan Tucker
  3. Stephen Tian
  4. Anirudha Majumdar
  5. Jiankai Sun
  6. Weiyu Liu
  7. Yuke Zhu
  8. Shuran Song
  9. Ashish Kapoor
  10. Karol Hausman
  11. Brian Ichter
  12. Danny Driess
  13. Jiajun Wu
  14. Cewu Lu
  15. Mac Schwager