Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis

(2312.08782)
Published Dec 14, 2023 in cs.RO, cs.AI, cs.CV, and cs.LG

Abstract

Building general-purpose robots that can operate seamlessly, in any environment, with any object, and utilizing various skills to complete diverse tasks has been a long-standing goal in Artificial Intelligence. Unfortunately, however, most existing robotic systems have been constrained - having been designed for specific tasks, trained on specific datasets, and deployed within specific environments. These systems usually require extensively-labeled data, rely on task-specific models, have numerous generalization issues when deployed in real-world scenarios, and struggle to remain robust to distribution shifts. Motivated by the impressive open-set performance and content generation capabilities of web-scale, large-capacity pre-trained models (i.e., foundation models) in research fields such as NLP and Computer Vision (CV), we devote this survey to exploring (i) how these existing foundation models from NLP and CV can be applied to the field of robotics, and also exploring (ii) what a robotics-specific foundation model would look like. We begin by providing an overview of what constitutes a conventional robotic system and the fundamental barriers to making it universally applicable. Next, we establish a taxonomy to discuss current work exploring ways to leverage existing foundation models for robotics and develop ones catered to robotics. Finally, we discuss key challenges and promising future directions in using foundation models for enabling general-purpose robotic systems. We encourage readers to view our living GitHub repository of resources, including papers reviewed in this survey as well as related projects and repositories for developing foundation models for robotics.

Overview

  • Foundation models in robotics aim to overcome data scarcity and enhance generalizability, leveraging successes from NLP and CV.

  • These models can potentially ease the creation of adaptable, intelligent robots capable of operating in varied environments.

  • Properties of foundation models offer promising ways to address long-standing challenges such as task specification, model dependency, and safety, though some of these remain underexplored.

  • Research shows a focus on pick-and-place tasks, with the need for better simulations, real-world data, and unified performance benchmarks.

  • Future exploration includes enhanced grounding, continual learning, cross-embodiment adaptability, and hardware innovations.

Evolution of Robotics with Foundation Models

Introduction to Foundation Models in Robotics

The field of robotics has long focused on systems built for particular tasks, trained on specific datasets, and limited to defined environments. These systems often suffer from data scarcity, poor generalization, and a lack of robustness when faced with real-world scenarios. Encouraged by the success of foundation models in NLP and computer vision (CV), researchers are now exploring their application to robotics. Foundation models such as LLMs, Vision Foundation Models (VFMs), and others possess qualities that align well with the vision for general-purpose robots: robots that can seamlessly operate across various tasks and environments without extensive retraining.

Robotics and Foundation Models

Robotic systems comprise several core functionalities, including perception, decision-making and planning, and action generation, each of which presents its own set of challenges. For example, perception systems need varied data to understand scenes and objects, while planning and control must adapt to new environments. Introducing foundation models into this domain aims to leverage their strong generalization and learning abilities to address these hurdles, potentially smoothing the path toward truly adaptable and intelligent robotic systems.
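
To make this decomposition concrete, the following is a minimal Python sketch of the conventional sense-plan-act pipeline that foundation models aim to generalize; all class and function names here are hypothetical placeholders rather than any specific system's API.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Observation:
    rgb: np.ndarray          # camera image
    depth: np.ndarray        # depth map
    joint_state: np.ndarray  # proprioception

class Perception:
    def estimate_state(self, obs: Observation) -> dict:
        # Task-specific detectors / SLAM turn raw sensor data into a state estimate.
        return {"objects": [], "robot_pose": np.eye(4)}

class Planner:
    def plan(self, state: dict, goal: dict) -> list:
        # Environment- and task-specific planning produces a sequence of waypoints.
        return []

class Controller:
    def track(self, waypoints: list, state: dict) -> np.ndarray:
        # Low-level control converts the plan into joint commands.
        return np.zeros(7)

def sense_plan_act(obs: Observation, goal: dict) -> np.ndarray:
    """One iteration of the classical pipeline."""
    perception, planner, controller = Perception(), Planner(), Controller()
    state = perception.estimate_state(obs)
    waypoints = planner.plan(state, goal)
    return controller.track(waypoints, state)
```

Each stage above is typically engineered and trained separately; the surveyed work asks which of these stages can be replaced or augmented by a single pre-trained foundation model.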

Addressing Core Robotics Challenges

Foundation models show particular promise when examined against the classical challenges in robotics:

  • Generalization: Taking cues from the human brain's modularity and the adaptability seen in nature, foundation models offer a promising route to achieve a similar level of function-centric generalization in robotics.
  • Data Scarcity: Through the ability to generate synthetic data and learn from limited examples, foundation models are positioned to tackle the constraints imposed by the requirement for large and diverse datasets.
  • Model Dependency: Model-agnostic foundation models can reduce the reliance on meticulously hand-crafted models of the environment and robot dynamics.
  • Task Specification: Foundation models open up natural and intuitive ways of specifying goals for robotic tasks, such as through language, images, or code (see the sketch after this list).
  • Uncertainty and Safety: Ensuring safe operation and managing uncertainty remain underexplored, but are areas where foundation models could potentially offer rigorous frameworks and contributions.
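
As an illustration of language-based task specification, the following minimal sketch follows the spirit of code-as-policies approaches: an LLM is prompted to emit a short program over a fixed set of robot primitives. The primitive list and the query_llm helper are hypothetical placeholders, not a specific system's API.

```python
PRIMITIVES_DOC = """
Available functions:
  detect_objects() -> list[str]
  pick(object_name: str)
  place(object_name: str, location: str)
"""

def query_llm(prompt: str) -> str:
    # Placeholder for a call to a large language model.
    raise NotImplementedError

def specify_task(instruction: str) -> str:
    """Turn a natural-language instruction into candidate policy code."""
    prompt = (
        "Write Python that accomplishes the instruction using only the "
        f"functions documented below.\n{PRIMITIVES_DOC}\n"
        f"Instruction: {instruction}\nCode:"
    )
    return query_llm(prompt)

# Example (the returned code would be validated before execution on the robot):
# policy_code = specify_task("put the red block in the blue bowl")
```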

Research Methodologies and Evaluations

Numerous studies have explored applying foundation models to various tasks, leading to several observations:

  • Task Focus: There is a notable skew toward general pick-and-place tasks. Translating text to motion, particularly with LLMs, has received less attention, especially for complex tasks such as dexterous manipulation.
  • Simulation and Real-World Data: The balance between simulated and real-world data is critical. Robust simulators enable vast data generation, yet may lack the diversity and richness of real-world data, highlighting the need for ongoing effort in both areas (a data-collection sketch follows this list).
  • Performance and Benchmarking: Foundation models are being tested on increasingly diverse tasks, but a unified approach to performance measurement and benchmarking has yet to emerge.
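
To illustrate the simulation side of this trade-off, below is a minimal sketch of scripted demonstration collection, assuming a Gymnasium-style environment API and a hypothetical scripted_policy expert; real pipelines add domain randomization and careful calibration to narrow the sim-to-real gap.

```python
import gymnasium as gym
import numpy as np

def scripted_policy(observation) -> np.ndarray:
    # Placeholder expert (e.g. a motion planner with privileged simulator state).
    raise NotImplementedError

def collect_episodes(env_id: str, num_episodes: int) -> list[dict]:
    """Roll out the scripted expert and store (observation, action) trajectories."""
    env = gym.make(env_id)
    dataset = []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        episode = {"observations": [], "actions": []}
        done = False
        while not done:
            action = scripted_policy(obs)
            episode["observations"].append(obs)
            episode["actions"].append(action)
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
        dataset.append(episode)
    env.close()
    return dataset
```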

Future Directions in Foundation Models and Robotics

Looking ahead, several areas are ripe for exploration:

  • Enhanced Grounding: Developing a tighter connection between model outputs and physical robotic actions remains a fruitful avenue for research (one common grounding strategy is sketched after this list).
  • Continual Learning: Adapting to changing environments and tasks without forgetting past learning is a frontier yet to be fully conquered by robotic foundation models.
  • Hardware Innovations: Complementary hardware innovations are necessary to enrich the data available for training foundation models and to expand the conceptual learning space.
  • Cross-Embodiment Adaptability: Learning control policies that are adaptable to diverse physical embodiments is a critical step toward creating more universal robotic systems.
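
As one deliberately simple example of the grounding direction above, each step of a language-model plan can be scored against a library of executable skills using text embeddings, and rejected when nothing matches well. The skill list and the embed function below are hypothetical placeholders for whatever skill repertoire and sentence-embedding model a given system uses.

```python
import numpy as np

SKILL_LIBRARY = ["pick up the mug", "open the drawer", "navigate to the kitchen"]

def embed(text: str) -> np.ndarray:
    # Placeholder for any sentence-embedding model.
    raise NotImplementedError

def ground_step(plan_step: str, threshold: float = 0.7):
    """Map a free-form plan step to the closest executable skill, or None."""
    step_vec = embed(plan_step)
    scores = []
    for skill in SKILL_LIBRARY:
        skill_vec = embed(skill)
        cosine = float(
            np.dot(step_vec, skill_vec)
            / (np.linalg.norm(step_vec) * np.linalg.norm(skill_vec))
        )
        scores.append(cosine)
    best = int(np.argmax(scores))
    return SKILL_LIBRARY[best] if scores[best] >= threshold else None
```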

The application of foundation models to robotics holds the promise of achieving a higher level of autonomy, adaptability, and intelligence in robotic systems. As the field progresses, the blend of robust AI models and robotics could usher in a new era of smart, versatile machines ready to meet the complexities and unpredictability of the real world.
