Octo: An Open-Source Generalist Robot Policy (2405.12213v2)

Published 20 May 2024 in cs.RO and cs.LG

Abstract: Large policies pretrained on diverse robot datasets have the potential to transform robotic learning: instead of training new policies from scratch, such generalist robot policies may be finetuned with only a little in-domain data, yet generalize broadly. However, to be widely applicable across a range of robotic learning scenarios, environments, and tasks, such policies need to handle diverse sensors and action spaces, accommodate a variety of commonly used robotic platforms, and finetune readily and efficiently to new domains. In this work, we aim to lay the groundwork for developing open-source, widely applicable, generalist policies for robotic manipulation. As a first step, we introduce Octo, a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset, the largest robot manipulation dataset to date. It can be instructed via language commands or goal images and can be effectively finetuned to robot setups with new sensory inputs and action spaces within a few hours on standard consumer GPUs. In experiments across 9 robotic platforms, we demonstrate that Octo serves as a versatile policy initialization that can be effectively finetuned to new observation and action spaces. We also perform detailed ablations of design decisions for the Octo model, from architecture to training data, to guide future research on building generalist robot models.

Introduction

Robotics is an expanding field, and the hope of versatile robots that can perform a multitude of tasks out of the box is becoming more realistic. The paper on Octo, an open-source generalist robot policy, makes strides toward this vision by introducing a transformer-based policy pre-trained on a massive dataset. The policy is designed to be both adaptable and robust, making it suitable for a wide range of robotic learning scenarios. Let's break down what makes it interesting and what the implications could be.

What is Octo?

Octo is a large, transformer-based policy designed for robot manipulation. It is pre-trained on 800,000 robot trajectories from the Open X-Embodiment dataset, the largest robot manipulation dataset to date. The policy accepts language commands or goal images as task specifications, and it is flexible: it can be adapted to robots with different sensory inputs and action spaces.

Key Features:

  1. Versatility: Designed to work with multiple types of robots and sensors.
  2. Flexibility: Can be fine-tuned for new tasks and environments quickly.
  3. Scalability: Built using a transformer architecture, it scales well with data.
  4. Open Source: Fully open source, including weights, scripts, and dataset.

Architectural Overview

At its core, Octo uses a transformer to map task specifications (like language instructions or goal images) and observations (like camera streams) to actions. The architecture is divided into three main parts:

  1. Task and Observation Tokenizers: These convert task descriptions and observations into tokens.
  2. Transformer Backbone: Processes these tokens to produce embeddings.
  3. Readout Heads: Convert embeddings into actions.

The key here is flexibility: by using a sequence of tokenized inputs, Octo can be adapted to different sensors and robot configurations without retraining large parts of the model.
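To make this token-based design concrete, below is a minimal, self-contained sketch of how task and observation tokens might flow through a transformer backbone into a readout head. It illustrates the idea only and is not the actual Octo implementation: all module names, widths, and the single-attention-pass "backbone" are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256  # token embedding width (illustrative)

def tokenize_task(language_embedding):
    # Task tokenizer: project a (pre-computed) language embedding into token space.
    W = rng.normal(scale=0.02, size=(language_embedding.shape[-1], D))
    return language_embedding @ W  # (n_task_tokens, D)

def tokenize_observation(image_patches):
    # Observation tokenizer: project flattened image patches, ViT-style.
    W = rng.normal(scale=0.02, size=(image_patches.shape[-1], D))
    return image_patches @ W  # (n_obs_tokens, D)

def transformer_backbone(tokens):
    # Stand-in for the transformer: a single self-attention pass.
    scores = tokens @ tokens.T / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens  # contextualized embeddings, same shape as input

def readout_head(embeddings, action_dim=7):
    # Readout head: pool the embeddings and map them to an action vector.
    W = rng.normal(scale=0.02, size=(D, action_dim))
    return embeddings.mean(axis=0) @ W  # e.g. a 7-DoF end-effector action

# Toy inputs: a "language instruction" embedding and 16 image patches.
task_tokens = tokenize_task(rng.normal(size=(4, 384)))
obs_tokens = tokenize_observation(rng.normal(size=(16, 768)))
sequence = np.concatenate([task_tokens, obs_tokens], axis=0)
action = readout_head(transformer_backbone(sequence))
print(action.shape)  # (7,)
```

Because everything downstream of the tokenizers consumes a flat token sequence, supporting a new sensor mostly means adding a new tokenizer; the backbone itself is left alone.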

Training Details

Let's talk training. Octo was trained on a mixture of 25 datasets from the Open X-Embodiment collection, weighted to balance dataset size against diversity. The model predicts actions with a diffusion-based head, which allows it to generate precise actions while capturing multimodal action distributions.
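To illustrate what a diffusion-based action head does at inference time, here is a schematic DDPM-style denoising loop: starting from Gaussian noise, a learned noise predictor (conditioned on the transformer's readout embedding) is applied repeatedly until a clean action emerges. The step count, noise schedule, and the random "noise predictor" below are placeholders, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_DIM, STEPS = 7, 20  # illustrative, not the paper's values

# Simple linear beta schedule and the derived DDPM quantities.
betas = np.linspace(1e-4, 0.02, STEPS)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(noisy_action, step, readout_embedding):
    # Placeholder for the learned denoising network, which in Octo's setup
    # would condition on the readout embedding. Here: a fixed random map.
    x = np.concatenate([noisy_action, readout_embedding, [step / STEPS]])
    W = np.random.default_rng(1).normal(scale=0.02, size=(x.shape[0], ACTION_DIM))
    return x @ W

def sample_action(readout_embedding):
    # Start from Gaussian noise and iteratively denoise into an action.
    a = rng.normal(size=ACTION_DIM)
    for t in reversed(range(STEPS)):
        eps = predict_noise(a, t, readout_embedding)
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            a += np.sqrt(betas[t]) * rng.normal(size=ACTION_DIM)  # add sampling noise
    return a

print(sample_action(rng.normal(size=32)))
```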

Hyperparameters and Specs:

  • Transformer Type: Similar to ViT (Vision Transformer)
  • Training Dataset: 800k trajectories
  • Batch Size: 2048
  • Training Time: 14 hours on a TPU v4-128 pod
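The training mix mentioned above has to reconcile datasets of very different sizes. One common recipe, sketched below, is to raise each dataset's size to a temperature below 1 before normalizing, so that small but diverse datasets still contribute meaningfully. The dataset names and counts here are hypothetical, and this is not necessarily the exact weighting scheme used for Octo.

```python
import numpy as np

# Trajectory counts for a few hypothetical component datasets (illustrative numbers).
dataset_sizes = {"bridge": 25000, "rt1_kitchen": 70000, "jaco_play": 1000, "toto": 900}

def mixture_weights(sizes, temperature=0.5):
    # Temperature < 1 flattens the distribution so the largest datasets do not
    # drown out smaller ones; temperature = 1 recovers size-proportional sampling.
    scaled = {name: n ** temperature for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}

weights = mixture_weights(dataset_sizes)
for name, w in weights.items():
    print(f"{name}: {w:.3f}")

# During training, each batch element's source dataset is drawn from these weights.
names = list(weights)
batch_sources = np.random.default_rng(0).choice(
    names, size=8, p=[weights[n] for n in names]
)
print(batch_sources)
```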

Experimental Results

Zero-Shot Performance

Octo was tested on several tasks across different robot setups immediately after pre-training, without any additional task-specific training:

  • Success Rate: On average, Octo had a 29% higher success rate compared to RT-1-X, a previous state-of-the-art openly available generalist policy.
  • Task Examples: Tasks varied from tabletop picking and placing to more complex tasks like opening drawers.

Octo was also compared against RT-2-X, a 55-billion-parameter generalist policy, and performed comparably despite being a far smaller model, underscoring the efficiency of its architecture.

Finetuning Performance

A major capability of Octo is how readily it can be finetuned to new setups. Octo was finetuned to a range of robotic tasks using roughly 100 target demonstrations per domain:

  • Finetuning Time: Less than 5 hours on an NVIDIA A5000 GPU.
  • Success Rates: On average, Octo outperformed baselines trained from scratch or initialized with pre-trained visual representations by 52%.

Tasks included, but were not limited to:

  • Precision Handling: Tasks like peg insertion requiring force/torque inputs.
  • Novel Robot Control: Adapting to new robots not included in pre-training.
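As a rough sketch of what such finetuning can look like, the snippet below keeps a (stand-in) pretrained backbone frozen, attaches a fresh tokenizer for a new force/torque sensor plus a new head for a different action dimension, and fits them on about 100 demonstrations. All names, shapes, and the hand-rolled gradient step are illustrative; the released finetuning scripts will differ.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256  # backbone token width (illustrative)

# Stand-in for the pretrained transformer backbone: kept frozen during finetuning.
pretrained_backbone = rng.normal(scale=0.02, size=(D, D))

# New, randomly initialized modules for the target robot:
# a tokenizer for a 6-axis force/torque sensor and a head for an 8-dim action space.
params = {
    "tokenizer": rng.normal(scale=0.02, size=(6, D)),
    "head": rng.normal(scale=0.02, size=(D, 8)),
}

# ~100 target demonstrations: (force/torque reading, expert action) pairs, random here.
demos = [(rng.normal(size=6), rng.normal(size=8)) for _ in range(100)]

lr = 1e-3
for ft_reading, expert_action in demos:
    embedding = (ft_reading @ params["tokenizer"]) @ pretrained_backbone  # frozen backbone
    pred = embedding @ params["head"]                                     # new readout head
    # Squared-error behavior cloning loss; for brevity only the new head is updated.
    grad_head = np.outer(embedding, 2 * (pred - expert_action))
    params["head"] -= lr * grad_head
```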

Implications and Speculations

The introduction of Octo presents several potential impacts on both practical and theoretical fronts:

  1. Practical Applications:
    • Reduced Data Collection Needs: Fine-tuning large pre-trained models can significantly cut down the amount of new data needed for training.
    • Efficient Multitasking: Versatile models like Octo could be deployed in scenarios requiring a multitude of tasks without needing extensive reconfiguration or retraining.
  2. Theoretical Insights:
    • Scalable Training: The success of transformer architectures in robotic policies could inspire re-evaluation of traditional policy architectures.
    • Generalization: This work demonstrates how large-scale pre-training can help in generalizing to new tasks and setups, an area to be further explored.

Future Directions

While Octo is a substantial step forward, there are areas ripe for further development:

  • Enhanced Modalities: Improving wrist camera and proprioceptive input integration.
  • Larger Datasets: Incorporating more diverse and larger datasets could yield even more robust policies.
  • Broader Robot Varieties: Expanding beyond single and dual-arm manipulators to encompass mobile robots and other configurations.

Conclusion

Octo represents a significant advancement in creating versatile, adaptable robot policies. With its open-source nature, it provides a valuable resource for the robotics community to build upon, fostering further innovation and practical applications.

For more details and to access the Octo model and resources, you can visit their website.

Authors (19)
  1. Octo Model Team
  2. Dibya Ghosh
  3. Homer Walke
  4. Karl Pertsch
  5. Kevin Black
  6. Oier Mees
  7. Sudeep Dasari
  8. Joey Hejna
  9. Tobias Kreiman
  10. Charles Xu
  11. Jianlan Luo
  12. You Liang Tan
  13. Pannag Sanketi
  14. Quan Vuong
  15. Ted Xiao
  16. Dorsa Sadigh
  17. Chelsea Finn
  18. Sergey Levine
  19. Lawrence Yunliang Chen