Octo: Open-Source Generalist Robot Policy

Updated 23 June 2026

The paper introduces Octo, an open-source, transformer-based policy trained on 800K real robot trajectories to enable versatile robotic manipulation.
Its transformer backbone and diffusion-based action head reach about 72% average success in the reported multi-robot evaluations.
Designed for efficient transfer, Octo supports zero-shot use and few-shot finetuning with minimal in-domain data across various robot platforms.

Octo is an open-source, transformer-based generalist robot policy trained on 800,000 real robot trajectories from the Open X-Embodiment dataset, supporting a range of input modalities, robot platforms, and action spaces. Designed for broad applicability, Octo serves as a versatile, efficient foundation for robotic manipulation, enabling zero-shot and few-shot transfer across diverse robotic domains and hardware. Its architecture, empirical results, and open-source release mark a significant advance in scalable generalist robot learning (Team et al., 2024).

1. Motivation and Objectives

Robotic learning has traditionally required training new policies from scratch on specific hardware and tasks, resulting in limited generalization and significant data-collection effort. In fields such as vision and language, foundation models pretrained on large, diverse datasets (e.g., GPT-4, ViTs) have achieved broad transfer and sample efficiency, but robotics has lagged due to the challenges of aggregating large, multi-robot datasets and the prevalence of closed, inflexible models.

Octo is motivated by the need for an open-source, generalist policy that:

Accommodates multi-view and multi-modal sensory inputs (RGB cameras, proprioception, force/torque, language, and goal images),
Supports a variety of robotic embodiments and heterogeneous action spaces,
Admits efficient, fully trainable finetuning to new domains with minimal in-domain data and off-the-shelf GPUs,
Enables broad sharing, reproducibility, and rapid iteration via open code, checkpoints, and data loaders.

2. Model Architecture

Octo defines a policy $\pi$ mapping a history of observations $o_{1:H}$ and a task description (language $\ell$ or goal image $g$) to a chunk of future actions $a_{t:t+L}$.

2.1 Tokenization and Input Handling

Language ($\ell$): Tokenized by a frozen T5-Base encoder ($\sim$16 tokens).
Images (wrist, third-person): Downsampled (wrist: $128\times128$, third-person: $256\times256$), then split into $16\times16$ patches by a CNN, producing visual tokens.
Proprioception/Force-Torque: Encoded by a small MLP.
Sequence construction: Task tokens, followed by observation tokens for each timestep, are concatenated with learnable positional embeddings.
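The token budget implied by the list above can be worked out with a short sketch. It assumes non-overlapping square patches, one readout token per timestep, and a history of two frames; `CameraSpec`, `tokens_per_sequence`, and those two settings are hypothetical names and values chosen for illustration, and proprioception tokens are left out.

```python
from dataclasses import dataclass


@dataclass
class CameraSpec:
    name: str
    resolution: int  # side length of the (square) downsampled image, in pixels
    patch: int = 16  # side length of each square patch, in pixels

    def num_tokens(self) -> int:
        # A non-overlapping patch grid has (resolution / patch)^2 cells.
        if self.resolution % self.patch != 0:
            raise ValueError("resolution must be divisible by the patch size")
        side = self.resolution // self.patch
        return side * side


LANGUAGE_TOKENS = 16  # approximate count quoted in the text


def tokens_per_sequence(cameras, history, readouts_per_step=1):
    """Task tokens appear once; observation and readout tokens repeat per timestep."""
    obs_per_step = sum(camera.num_tokens() for camera in cameras)
    return LANGUAGE_TOKENS + history * (obs_per_step + readouts_per_step)


wrist = CameraSpec("wrist", 128)
third_person = CameraSpec("third_person", 256)
print(wrist.num_tokens(), third_person.num_tokens())  # 64 256
print(tokens_per_sequence([wrist, third_person], history=2))  # 658
```

Under these assumptions the third-person camera dominates the sequence length, so the choice of image resolution and patch size largely sets the cost of each forward pass.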

2.2 Transformer Backbone

Octo is implemented in two scale variants:

Octo-Small: 12 layers, hidden dimension $d = 384$ (27M parameters)
Octo-Base: 12 layers, hidden dimension $d = 768$ (93M parameters)

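Causal attention with blockwise masking ensures that each token sees only observations from the current and earlier timesteps. "Passive readout" tokens are used for action decoding, analogous to a BERT [CLS] token: they attend to the task and observation tokens before them, but no task or observation token attends to them.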
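One way to picture the masking is to tag every token with its kind and timestep and derive the mask from those tags. The sketch below is an illustration under assumed rules, not Octo's implementation: the layout (task tokens first, then observation and readout tokens per timestep) and the helper names `build_layout`, `can_attend`, and `build_mask` are hypothetical.

```python
def build_layout(num_task, history, obs_per_step, readouts_per_step):
    """Return one (kind, timestep) tag per token, in sequence order."""
    layout = [("task", -1)] * num_task
    for t in range(history):
        layout += [("obs", t)] * obs_per_step
        layout += [("readout", t)] * readouts_per_step
    return layout


def can_attend(query, key):
    """Assumed visibility rule for a query token looking at a key token."""
    q_kind, q_t = query
    k_kind, k_t = key
    if k_kind == "task":
        return True  # task tokens are visible to every token
    if q_kind == "task":
        return False  # task tokens never look at observations or readouts
    if k_kind == "obs":
        return k_t <= q_t  # blockwise causal: current and past timesteps only
    # The key is a readout: only readouts of the same timestep may see it,
    # so readouts stay passive with respect to everything else.
    return q_kind == "readout" and k_t == q_t


def build_mask(layout):
    """mask[i][j] is True when token i may attend to token j."""
    return [[can_attend(q, k) for k in layout] for q in layout]


layout = build_layout(num_task=2, history=2, obs_per_step=3, readouts_per_step=1)
mask = build_mask(layout)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because no observation or task token can see a readout, readout tokens can be added or removed without changing what the rest of the sequence computes.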

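The backbone’s layer operations follow the standard transformer form, with $M$ the blockwise causal mask ($0$ for visible positions, $-\infty$ for masked ones):

Self-Attention:

$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$

Feedforward:

$\mathrm{FFN}(x) = W_2\,\sigma(W_1 x + b_1) + b_2$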

2.3 Diffusion-Based Action Head

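Action chunks of length $L$ are predicted by a small MLP acting as a denoising diffusion decoder $\epsilon_\theta(x^k, e, k)$, conditioned on the readout embedding $e$ and the diffusion step $k$. Generation proceeds as follows:

Sample $x^K \sim \mathcal{N}(0, I)$,
For $k = K, \dots, 1$:

$x^{k-1} = \alpha\left(x^k - \gamma\,\epsilon_\theta(x^k, e, k) + \mathcal{N}(0, \sigma^2 I)\right)$

Output $x^0$.

Here $\alpha$, $\gamma$, and $\sigma$ are set by the noise schedule. Training uses the DDPM noise prediction objective: $\mathcal{L} = \mathbb{E}_{a, k, \epsilon}\left[\lVert \epsilon - \epsilon_\theta(\sqrt{\bar\alpha_k}\,a + \sqrt{1 - \bar\alpha_k}\,\epsilon,\ e,\ k) \rVert^2\right]$, where $a$ is the ground-truth action chunk, $\epsilon \sim \mathcal{N}(0, I)$, and $\bar\alpha_k$ is the cumulative signal fraction at step $k$.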

Diffusion outperforms both direct MSE decoders and discrete heads.
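The sampling loop and training loss of such a head can be sketched in plain Python. This is a minimal illustration that uses the standard DDPM parameterization with a cosine noise schedule and 20 steps, both assumptions rather than confirmed Octo settings. The names `make_schedule`, `sample_action_chunk`, `ddpm_loss`, and `make_oracle` are hypothetical. The learned MLP is replaced by an oracle noise predictor for a single known target chunk; it ignores the embedding, so the example checks the loop itself rather than a trained model.

```python
import math
import random


def make_schedule(num_steps, s=0.008, max_beta=0.999):
    """Cosine noise schedule; index k-1 holds the value for diffusion step k."""
    def f(k):
        return math.cos((k / num_steps + s) / (1 + s) * math.pi / 2) ** 2

    betas = [min(1 - f(k) / f(k - 1), max_beta) for k in range(1, num_steps + 1)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def sample_action_chunk(eps_model, embedding, dim, num_steps=20, rng=None):
    """Ancestral DDPM sampling: start from Gaussian noise, denoise for k = K..1."""
    rng = random.Random(0) if rng is None else rng
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for k in range(num_steps, 0, -1):
        beta, alpha_bar = betas[k - 1], alpha_bars[k - 1]
        eps = eps_model(x, embedding, k)
        sigma = math.sqrt(beta) if k > 1 else 0.0  # no noise on the final step
        x = [(xi - beta / math.sqrt(1 - alpha_bar) * ei) / math.sqrt(1 - beta)
             + sigma * rng.gauss(0.0, 1.0)
             for xi, ei in zip(x, eps)]
    return x


def ddpm_loss(eps_model, action, embedding, k, alpha_bars, rng):
    """Noise-prediction MSE for one flattened action chunk at diffusion step k."""
    eps = [rng.gauss(0.0, 1.0) for _ in action]
    ab = alpha_bars[k - 1]
    noisy = [math.sqrt(ab) * a + math.sqrt(1 - ab) * e for a, e in zip(action, eps)]
    pred = eps_model(noisy, embedding, k)
    return sum((p - e) ** 2 for p, e in zip(pred, eps)) / len(action)


def make_oracle(target, num_steps):
    """Exact noise predictor when the data distribution is the single chunk `target`."""
    _, alpha_bars = make_schedule(num_steps)

    def eps_model(x, embedding, k):
        ab = alpha_bars[k - 1]
        return [(xi - math.sqrt(ab) * ti) / math.sqrt(1 - ab)
                for xi, ti in zip(x, target)]

    return eps_model


target = [0.1, -0.2, 0.3, 0.05]  # a flattened toy action chunk
oracle = make_oracle(target, num_steps=20)
chunk = sample_action_chunk(oracle, embedding=None, dim=len(target))
_, alpha_bars = make_schedule(20)
loss = ddpm_loss(oracle, target, None, k=10, alpha_bars=alpha_bars,
                 rng=random.Random(1))
print(chunk, loss)
```

With the oracle predictor the sampler recovers the target chunk up to floating-point error and the loss is essentially zero. A trained head would instead learn the noise predictor from data and condition it on the transformer's readout embedding.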

3. Data, Pretraining, and Objectives

3.1 Pretraining Mixture

Octo is pretrained on a curated set of 800,000 episodes drawn from the Open X-Embodiment collection. This includes 25 datasets encompassing single-arm and dual-arm manipulation, friction-rich, pick-and-place, and furniture-interaction tasks. The five largest sources are Fractal (17%), Kuka QT-Opt (17%), Bridge Data (17%), BC-Z (9%), and Stanford Hydra (6%).

3.2 Training Methodology

Hindsight goal relabeling: With some probability, the language goal $\ell$ is replaced by a future observation frame from the same trajectory, which then serves as the goal image $g$.
Data augmentations: Random crops, color jitter, and normalization for images.
Loss modes: Random dropout of language or goal-image conditioning enforces robustness to both modes.
AdamW optimizer: learning rate $3\times10^{-4}$, inverse-square-root decay, $2$k warmup steps, weight decay $0.1$, gradient clipping.
Batch size: $2048$ on TPU v4-128 pods, Base model for 300k steps ($\sim$14 hours).
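Two of the ingredients above, the warmup plus inverse-square-root learning-rate schedule and the choice of task conditioning per training example, can be sketched as follows. The function names and default values are illustrative assumptions. In particular, `p_goal=0.5` is a placeholder rather than a reported number, and the sketch simplifies conditioning dropout to keeping exactly one of the two modes.

```python
import math
import random


def inverse_sqrt_lr(step, peak_lr=3e-4, warmup_steps=2000):
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)


def make_task_conditioning(frames, t, instruction, rng, p_goal=0.5):
    """Choose the task conditioning for timestep t of one trajectory.

    With probability p_goal, a uniformly sampled future frame becomes the goal
    image and the language instruction is dropped; otherwise the instruction
    is kept and no goal image is provided.
    """
    if t >= len(frames) - 1:
        raise ValueError("t must leave at least one future frame")
    if rng.random() < p_goal:
        goal_index = rng.randrange(t + 1, len(frames))  # hindsight goal
        return {"language": None, "goal_image": frames[goal_index]}
    return {"language": instruction, "goal_image": None}


rng = random.Random(0)
frames = [f"frame_{i}" for i in range(10)]  # stand-ins for image observations
print(inverse_sqrt_lr(1000), inverse_sqrt_lr(2000), inverse_sqrt_lr(8000))
print(make_task_conditioning(frames, t=3, instruction="pick up the spoon", rng=rng))
```

Because the goal image is drawn from the same trajectory, every episode provides a valid goal-conditioned example even when it carries no language annotation.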

4. Finetuning and Domain Adaptation

When transferring Octo to a new robot or domain, $\sim$100 in-domain demonstrations suffice. If the sensory or action spaces differ, new tokenizers and output heads can be appended without restarting pretraining. Full-model finetuning (without freezing any backbone layers) is performed for 50k steps using the AdamW recipe with a cosine learning rate decay on a single NVIDIA A5000 GPU (24GB), taking $\sim$5 hours per domain. Octo has handled:

Additional proprioception/force-torque for insertion tasks,
Joint-space action heads for dual-arm robots,
Unseen robots (e.g., ViperX, ALOHA bimanual).

5. Empirical Performance

5.1 Zero-Shot Multi-Robot Benchmark

Evaluation over three platforms (WidowX BridgeV2, UR5, RT-1) with 10 trials per task yields:

Model	Avg. success (3 robots)	Δ vs. RT-1-X
RT-1-X	$\sim$43%	–
RT-2-X	$\sim$65%	n/a (closed)
Octo	72%	+29 pp

Goal-image conditioning provides up to 25 percentage points higher success than language-only on specific tasks.

5.2 Few-Shot Finetuning

Across six new domains (100 demos each):

Model	Insertion	Coffee	Baking	Pick-up	Coke	Bimanual	Avg.
Scratch	10%	45%	25%	0%	20%	20%	20%
VC-1	5%	0%	30%	0%	10%	50%	15%
Octo	70%	75%	50%	60%	100%	80%	72%

Octo outperforms the next-best baseline (training from scratch, 20% average) by 52 percentage points on average.
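The averages and the reported gap can be re-derived from the per-task columns of the table. The snippet below is only an arithmetic check on the numbers shown above.

```python
# Per-task success rates (%) copied from the few-shot finetuning table.
results = {
    "Scratch": [10, 45, 25, 0, 20, 20],
    "VC-1": [5, 0, 30, 0, 10, 50],
    "Octo": [70, 75, 50, 60, 100, 80],
}

# Unweighted mean over the six domains for each model.
means = {model: sum(rates) / len(rates) for model, rates in results.items()}

# Gap between Octo and the strongest baseline, in percentage points.
best_baseline = max(mean for model, mean in means.items() if model != "Octo")
gap = means["Octo"] - best_baseline

print(means)
print(gap)  # 52.5
```

The recomputed means are 20.0 for Scratch, about 15.8 for VC-1, and 72.5 for Octo, so the table's 15% and 72% entries and the 52-point gap are consistent with these values being rounded down.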

6. Design Analysis and Ablations

6.1 Architecture and Modality Choices

ViT-first transformer (Octo-Small) yields 83% success vs. 70% for a ResNet+Transformer.
Using the full 25-dataset mixture achieves 83% vs. 60% for a restricted 11-dataset mix or 43% for single-robot data.
Policy head: diffusion (83%) greatly outperforms MSE (35%) and discrete (18%) heads.
Scaling: Octo-Tiny (10M): $\sim$50%, Octo-Small (27M): $\sim$65%, Octo-Base (93M): $\sim$72% (averaged on WidowX).

6.2 Other Findings

History: A single previous frame is sufficient; additional frames add marginal benefit.
Patch size: $g$ 5 visual tokens outperform $g$ 6 for fine manipulation.
Pretrained ResNet vision or MSE heads tend to produce overly conservative actions.
Visual grounding is essential; purely proprioceptive models suffer causal confusion.

7. Broader Implications, Limitations, and Recommendations

Octo establishes that large, flexible, transformer-based policies can serve as powerful generalist robot models, enabling data-efficient transfer to new domains. However, the fraction of wrist-camera and language-annotated data is a bottleneck (27% and 56%, respectively), suggesting that additional multimodal data collection would further strengthen generalization. Incorporating interactive or perturbed data for robustness (e.g., recovery from failure) and extending to broader settings like mobile manipulation are recommended. Model scaling, better end-to-end visuo-lingual grounding, and hybrid reinforcement/imitation objectives are noted as promising avenues.

By releasing code, training pipelines, and pretrained models (Octo-Small, Octo-Base) at https://octo-models.github.io, Octo supports transparent and reproducible research in generalist robotic manipulation (Team et al., 2024). Contemporary research (e.g., Dita (Hou et al., 25 Mar 2025)) points to the potential of further unifying diffusion with in-context transformer architectures, facilitating precision and efficiency at greater parameter scales. This suggests that Octo’s modular, open, and empirically validated approach forms a foundation for the next generation of generalist robot policies.

References (2)

Octo: An Open-Source Generalist Robot Policy (2024)

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy (2025)
