Octo: Open-Source Generalist Robot Policy
- The paper introduces Octo, an open-source, transformer-based policy trained on 800K real robot trajectories to enable versatile robotic manipulation.
- Its diffusion-based action head and transformer backbone deliver superior performance, achieving up to 72% success on multi-robot benchmarks.
- Designed for efficient transfer, Octo supports zero-shot and few-shot finetuning with minimal in-domain data across various robot platforms.
Octo is an open-source, transformer-based generalist robot policy trained on 800,000 real robot trajectories from the Open X-Embodiment dataset, supporting a range of input modalities, robot platforms, and action spaces. Designed for broad applicability, Octo serves as a versatile, efficient foundation for robotic manipulation, enabling zero-shot and few-shot transfer across diverse robotic domains and hardware. Its architecture, empirical results, and open-source release mark a significant advance in scalable generalist robot learning (Team et al., 2024).
1. Motivation and Objectives
Robotic learning has traditionally required training new policies from scratch on specific hardware and tasks, resulting in limited generalization and significant data-collection effort. In fields such as vision and language, foundation models pretrained on large, diverse datasets (e.g., GPT-4, ViTs) have achieved broad transfer and sample efficiency, but robotics has lagged due to the challenges of aggregating large, multi-robot datasets and the prevalence of closed, inflexible models.
Octo is motivated by the need for an open-source, generalist policy that:
- Accommodates multi-view and multi-modal sensory inputs (RGB cameras, proprioception, force/torque, language, and goal images),
- Supports a variety of robotic embodiments and heterogeneous action spaces,
- Admits efficient, fully trainable finetuning to new domains with minimal in-domain data and off-the-shelf GPUs,
- Enables broad sharing, reproducibility, and rapid iteration via open code, checkpoints, and data loaders.
2. Model Architecture
Octo defines a policy that maps a history of observations and a task description (a language instruction or a goal image) to a chunk of future actions.
2.1 Tokenization and Input Handling
- Language: Tokenized and embedded by a frozen T5-Base encoder (16 tokens).
- Images (wrist, third-person): Downsampled (wrist: $128 \times 128$, third-person: $256 \times 256$), then split into patches ($16 \times 16$, via a shallow CNN), producing visual tokens.
- Proprioception/Force-Torque: Encoded by a small MLP.
- Sequence construction: Task tokens, followed by observation tokens for each timestep, are concatenated with learnable positional embeddings.
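To make the sequence construction concrete, the sketch below counts the tokens the transformer would see. The image resolutions (256×256 third-person, 128×128 wrist) and the 16×16 patch size are treated as assumptions for illustration, and readout and proprioception tokens are left out for simplicity; `num_patch_tokens` and `sequence_length` are hypothetical helper names, not part of the Octo codebase.

```python
def num_patch_tokens(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for an image of the given size."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


# Illustrative assumptions: 256x256 third-person view, 128x128 wrist view,
# 16x16 patches. The 16 language tokens come from the T5-Base tokenizer above.
LANGUAGE_TOKENS = 16
third_person = num_patch_tokens(256, 256, 16)  # 16 * 16 patches
wrist = num_patch_tokens(128, 128, 16)         # 8 * 8 patches


def sequence_length(history: int) -> int:
    """Task tokens followed by the image tokens of every timestep in the history."""
    return LANGUAGE_TOKENS + history * (third_person + wrist)
```

Under these assumptions each timestep contributes 320 image tokens, so the third-person camera dominates the sequence length and lengthening the history grows the context linearly.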
2.2 Transformer Backbone
Octo is implemented in two scale variants:
- Octo-Small: 12 layers, $d_{\text{model}} = 384$ (27M parameters)
- Octo-Base: 12 layers, $d_{\text{model}} = 768$ (93M parameters)
Causal attention with blockwise masking ensures only current and past observations are visible to each token. "Passive readout" tokens are used for action decoding, analogous to a BERT [CLS] token.
The backbone’s layer operations:
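- Self-Attention:
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right) V$$
where $M$ is the blockwise causal mask (0 for visible positions, $-\infty$ for masked ones).
- Feedforward:
$$\mathrm{FFN}(x) = W_2\,\mathrm{GELU}(W_1 x + b_1) + b_2$$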
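The blockwise causal masking with passive readout tokens can be sketched as a boolean attention mask. This is a minimal illustration, not the released implementation: the token layout, the function name `blockwise_causal_mask`, and the choice to let each readout token attend only to itself among readouts are simplifying assumptions.

```python
def blockwise_causal_mask(n_task, n_obs, n_readout, horizon):
    """Return (mask, labels); mask[i][j] is True if token i may attend to token j.

    Assumed layout: [task tokens], then per timestep [obs tokens][readout tokens].
    """
    # Label every token with (kind, timestep); task tokens use timestep -1.
    labels = [("task", -1)] * n_task
    for t in range(horizon):
        labels += [("obs", t)] * n_obs + [("readout", t)] * n_readout

    size = len(labels)
    mask = [[False] * size for _ in range(size)]
    for i, (kind_i, t_i) in enumerate(labels):
        for j, (kind_j, t_j) in enumerate(labels):
            if kind_j == "readout":
                # Passive readout: no other token may attend to it.
                mask[i][j] = i == j
            elif kind_i == "task":
                # Task tokens see only other task tokens.
                mask[i][j] = kind_j == "task"
            else:
                # Observation and readout tokens see the task tokens and the
                # observation tokens of the current or earlier timesteps.
                mask[i][j] = kind_j == "task" or t_j <= t_i
    return mask, labels
```

Because nothing attends to a readout token, adding or removing readouts does not change the embeddings of the task and observation tokens, which is what makes them "passive".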
2.3 Diffusion-Based Action Head
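Action chunks of length $H$ are predicted by a small MLP acting as a denoising diffusion decoder $\epsilon_\theta(x^k, e, k)$, where $x^k$ is the noised action chunk at diffusion step $k$ and $e$ is the readout embedding produced by the transformer. Generation proceeds as follows:
- Sample $x^K \sim \mathcal{N}(0, I)$,
- For $k = K, \dots, 1$:
$$x^{k-1} = \alpha\left(x^k - \gamma\,\epsilon_\theta(x^k, e, k) + \mathcal{N}(0, \sigma^2 I)\right)$$
- Output $a = x^0$.
Training uses the DDPM noise prediction objective: $$\mathcal{L} = \mathbb{E}_{x^0,\, k,\, \epsilon \sim \mathcal{N}(0, I)}\left[\left\| \epsilon - \epsilon_\theta(x^k, e, k) \right\|^2\right]$$ where $x^k$ is obtained by adding the noise $\epsilon$ to the clean action chunk $x^0$ according to the noise schedule, which also sets $\alpha$, $\gamma$, and $\sigma$.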
Diffusion outperforms both direct MSE decoders and discrete heads.
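The sampling loop of a diffusion action head can be sketched in a few lines. The version below assumes an update of the form x^{k-1} = α(x^k − γ·ε_θ(x^k, e, k) + N(0, σ²I)) with constant α, γ, and σ, whereas a real noise schedule varies them per step. `toy_denoiser` is a hypothetical stand-in for the learned MLP, so this shows the control flow only, not Octo's actual head.

```python
import random


def sample_action_chunk(denoiser, embedding, dim, num_steps, alpha, gamma, sigma, rng):
    """Reverse diffusion: start at x^K ~ N(0, I) and denoise down to x^0."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for k in range(num_steps, 0, -1):
        eps = denoiser(x, embedding, k)  # predicted noise for step k
        # x^{k-1} = alpha * (x^k - gamma * eps + N(0, sigma^2 I))
        x = [
            alpha * (x_i - gamma * eps_i + rng.gauss(0.0, sigma))
            for x_i, eps_i in zip(x, eps)
        ]
    return x


def toy_denoiser(x, embedding, k):
    """Hypothetical stand-in for the learned MLP: it treats the embedding as the
    clean action, so the predicted noise is simply x minus the embedding."""
    return [x_i - e_i for x_i, e_i in zip(x, embedding)]


target = [0.1, -0.2, 0.3, 0.0]
chunk = sample_action_chunk(
    toy_denoiser, target, dim=len(target), num_steps=20,
    alpha=1.0, gamma=0.5, sigma=0.0, rng=random.Random(0),
)
```

With `sigma=0.0` and this toy denoiser, every step halves the distance to `target`, so the sample converges to it. A trained head would instead condition on the readout embedding and inject noise at each step, which lets it represent multimodal action distributions.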
3. Data, Pretraining, and Objectives
3.1 Pretraining Mixture
Octo is pretrained on a curated set of 800,000 episodes drawn from the Open X-Embodiment collection. This includes 25 datasets encompassing single-arm and dual-arm manipulation, friction-rich, pick-and-place, and furniture-interaction tasks. The five largest sources are Fractal (17%), Kuka QT-Opt (17%), Bridge Data (17%), BC-Z (9%), and Stanford Hydra (6%).
3.2 Training Methodology
- Hindsight goal relabeling: With a fixed probability, the language goal is replaced by a future observation frame from the same trajectory, which then serves as the goal image.
- Data augmentations: Random crops, color jitter, and normalization for images.
- Loss modes: Random dropout of language or goal-image conditioning enforces robustness to both modes.
- AdamW optimizer: learning rate $3 \times 10^{-4}$, inverse-square-root decay, 2k warmup steps, weight decay 0.1, gradient clipping.
- Batch size: 2048 on a TPU v4-128 pod; the Base model is trained for 300k steps (about 14 hours).
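An inverse-square-root learning-rate schedule with warmup, as named in the optimizer recipe above, can be sketched as follows. The exact parameterization used for Octo is not given here, so this common form (linear warmup, then decay proportional to 1/√step) and the name `rsqrt_schedule` are assumptions; the peak rate and warmup length are passed in as parameters.

```python
import math


def rsqrt_schedule(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup to peak_lr, then decay proportional to 1 / sqrt(step)."""
    if warmup_steps <= 0:
        raise ValueError("warmup_steps must be positive")
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warmup period.
        return peak_lr * step / warmup_steps
    # Continuous at step == warmup_steps, where the factor equals 1.
    return peak_lr * math.sqrt(warmup_steps / step)
```

For example, with a hypothetical peak of `3e-4` and 2,000 warmup steps, the rate reaches its peak at step 2,000 and falls to half of it by step 8,000. The decay is slow enough that training can be extended without redesigning the schedule.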
4. Finetuning and Domain Adaptation
When transferring Octo to a new robot or domain, roughly 100 in-domain demonstrations suffice. If the sensory or action spaces differ, new tokenizers and output heads can be added without repeating pretraining. Full-model finetuning (without freezing any backbone layers) is performed for 50k steps using the AdamW recipe with a cosine learning rate decay on a single NVIDIA A5000 GPU (24GB), taking about 5 hours per domain. Octo has handled:
- Additional proprioception/force-torque for insertion tasks,
- Joint-space action heads for dual-arm robots,
- Unseen robots (e.g., ViperX, ALOHA bimanual).
5. Empirical Performance
5.1 Zero-Shot Multi-Robot Benchmark
Evaluation over three platforms (WidowX BridgeV2, UR5, RT-1) with 10-trial tasks yields:
| Model | Avg. @ 3 Robots | Δ vs. RT-1-X |
|---|---|---|
| RT-1-X | 43% | – |
| RT-2-X | 65% | n/a (closed) |
| Octo | 72% | +29 pp |
Goal-image conditioning provides up to 25 percentage points higher success than language-only on specific tasks.
5.2 Few-Shot Finetuning
Across six new domains (100 demos each):
| Model | Ins. | Coffee | Baking | Pick | Coke | Bi | Avg. |
|---|---|---|---|---|---|---|---|
| Scratch | 10% | 45% | 25% | 0% | 20% | 20% | 20% |
| VC-1 | 5% | 0% | 30% | 0% | 10% | 50% | 15% |
| Octo | 70% | 75% | 50% | 60% | 100% | 80% | 72% |
Octo outperforms next-best baselines by 52 percentage points on average.
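As a quick arithmetic check, the averages and the margin can be recomputed from the per-task success rates in the table. The snippet is only a sanity check on the reported numbers; the variable names are illustrative.

```python
# Per-task success rates (%) copied from the few-shot finetuning table.
RESULTS = {
    "Scratch": [10, 45, 25, 0, 20, 20],
    "VC-1": [5, 0, 30, 0, 10, 50],
    "Octo": [70, 75, 50, 60, 100, 80],
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


averages = {name: mean(rates) for name, rates in RESULTS.items()}
best_baseline = max(avg for name, avg in averages.items() if name != "Octo")
margin = averages["Octo"] - best_baseline
```

The unrounded means are 20.0 (Scratch), about 15.8 (VC-1), and 72.5 (Octo), giving a margin of 52.5 points over the best baseline. The table's 72% and 15% are these means rounded down, which yields the quoted 52-point gap.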
6. Design Analysis and Ablations
6.1 Architecture and Modality Choices
- ViT-first transformer (Octo-Small) yields 83% success vs. 70% for a ResNet+Transformer.
- Using the full 25-dataset mixture achieves 83% vs. 60% for a restricted 11-dataset mix or 43% for single-robot data.
- Policy head: diffusion (83%) greatly outperforms MSE (35%) and discrete (18%) heads.
- Scaling: Octo-Tiny (10M): 50%, Octo-Small (27M): 65%, Octo-Base (93M): 72% (averaged on WidowX).
6.2 Other Findings
- History: A single previous frame is sufficient; additional frames add marginal benefit.
- Patch size: 5 visual tokens outperform 6 for fine manipulation.
- Pretrained ResNet vision or MSE heads tend to produce overly conservative actions.
- Visual grounding is essential; purely proprioceptive models suffer causal confusion.
7. Broader Implications, Limitations, and Recommendations
Octo establishes that large, flexible, transformer-based policies can serve as powerful generalist robot models, enabling data-efficient transfer to new domains. However, the fraction of wrist-camera and language-annotated data is a bottleneck (27% and 56%, respectively), suggesting that additional multimodal data collection would further strengthen generalization. Incorporating interactive or perturbed data for robustness (e.g., recovery from failure) and extending to broader settings like mobile manipulation are recommended. Model scaling, better end-to-end visuo-lingual grounding, and hybrid reinforcement/imitation objectives are noted as promising avenues.
By releasing code, training pipelines, and pretrained models (Octo-Small, Octo-Base) at https://octo-models.github.io, Octo supports transparent and reproducible research in generalist robotic manipulation (Team et al., 2024). Contemporary research (e.g., Dita (Hou et al., 25 Mar 2025)) points to the potential of further unifying diffusion with in-context transformer architectures, facilitating precision and efficiency at greater parameter scales. This suggests that Octo’s modular, open, and empirically validated approach forms a foundation for the next generation of generalist robot policies.