Octo: Open-Source Generalist Robot Policy

Updated 23 June 2026
  • The paper introduces Octo, an open-source, transformer-based policy trained on 800K real robot trajectories to enable versatile robotic manipulation.
  • Its transformer backbone and diffusion-based action head reach up to 72% average success on multi-robot benchmarks.
  • Designed for efficient transfer, Octo supports zero-shot use and few-shot finetuning with minimal in-domain data across various robot platforms.

Octo is an open-source, transformer-based generalist robot policy trained on 800,000 real robot trajectories from the Open X-Embodiment dataset, supporting a range of input modalities, robot platforms, and action spaces. Designed for broad applicability, Octo serves as a versatile, efficient foundation for robotic manipulation, enabling zero-shot and few-shot transfer across diverse robotic domains and hardware. Its architecture, empirical results, and open-source release mark a significant advance in scalable generalist robot learning (Team et al., 2024).

1. Motivation and Objectives

Robotic learning has traditionally required training new policies from scratch on specific hardware and tasks, resulting in limited generalization and significant data-collection effort. In fields such as vision and language, foundation models pretrained on large, diverse datasets (e.g., GPT-4, ViTs) have achieved broad transfer and sample efficiency, but robotics has lagged due to the challenges of aggregating large, multi-robot datasets and the prevalence of closed, inflexible models.

Octo is motivated by the need for an open-source, generalist policy that:

  • Accommodates multi-view and multi-modal sensory inputs (RGB cameras, proprioception, force/torque, language, and goal images),
  • Supports a variety of robotic embodiments and heterogeneous action spaces,
  • Admits efficient, fully trainable finetuning to new domains with minimal in-domain data and off-the-shelf GPUs,
  • Enables broad sharing, reproducibility, and rapid iteration via open code, checkpoints, and data loaders.

2. Model Architecture

Octo defines a policy $\pi$ mapping a history of observations $o_{1:H}$ and a task description (language $\ell$ or goal image $g$) to a chunk of future actions $a_{t:t+L}$.

2.1 Tokenization and Input Handling

  • Language ($\ell$): Tokenized by a frozen T5-Base encoder ($\sim$16 tokens).
  • Images (wrist, third-person): Downsampled (wrist: $128\times128$, third-person: $256\times256$), then split into $16\times16$ patches by a CNN, producing visual tokens.
  • Proprioception/Force-Torque: Encoded by a small MLP.
  • Sequence construction: Task tokens, followed by observation tokens for each timestep, are concatenated with learnable positional embeddings.
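
The token budget implied by these settings can be worked out directly. The sketch below is illustrative arithmetic only: the helper names are hypothetical, proprioception tokens are left out, and the language length uses the approximate figure of 16 from the list above.

```python
def num_patch_tokens(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size
    return per_side * per_side


# Resolutions and patch size as stated in the list above.
wrist_tokens = num_patch_tokens(128)         # 8 x 8 patches
third_person_tokens = num_patch_tokens(256)  # 16 x 16 patches

LANGUAGE_TOKENS = 16  # approximate figure from the text


def sequence_length(history: int, language_tokens: int = LANGUAGE_TOKENS) -> int:
    """Task tokens plus image tokens per timestep (proprioception omitted)."""
    per_step = wrist_tokens + third_person_tokens
    return language_tokens + history * per_step


print(wrist_tokens, third_person_tokens, sequence_length(history=2))
# 64 256 656
```

With a two-frame history, image patches account for 640 of the 656 tokens, so the sequence length is dominated by the visual inputs.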

2.2 Transformer Backbone

Octo is implemented in two scale variants:

  • Octo-Small: 12 layers, hidden dimension $d = 384$ (27M parameters)
  • Octo-Base: 12 layers, hidden dimension $d = 768$ (93M parameters)

Causal attention with blockwise masking ensures only current and past observations are visible to each token. "Passive readout" tokens are used for action decoding, analogous to a BERT [CLS] token.
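
One way to picture this masking is as a boolean matrix over token groups. The sketch below is a minimal illustration, not Octo's implementation: the token layout, the helper names, and the rule that readout tokens are visible only to readout tokens of the same timestep are assumptions chosen to match the description above.

```python
from typing import List, Tuple

# Each token is (kind, timestep); task tokens use timestep -1.
Token = Tuple[str, int]


def build_layout(n_task: int, n_obs: int, n_readout: int, horizon: int) -> List[Token]:
    """Task tokens first, then observation and readout tokens per timestep."""
    layout: List[Token] = [("task", -1)] * n_task
    for t in range(horizon):
        layout += [("obs", t)] * n_obs
        layout += [("readout", t)] * n_readout
    return layout


def can_attend(query: Token, key: Token) -> bool:
    """True if the query token is allowed to attend to the key token."""
    q_kind, q_t = query
    k_kind, k_t = key
    if k_kind == "readout":
        # Passive readouts: never visible to task or observation tokens.
        return q_kind == "readout" and q_t == k_t
    if k_kind == "task":
        return True  # every token may read the task description
    if q_kind == "task":
        return False  # task tokens do not look at observations
    return k_t <= q_t  # observations: current and past timesteps only


def blockwise_causal_mask(layout: List[Token]) -> List[List[bool]]:
    return [[can_attend(q, k) for k in layout] for q in layout]


layout = build_layout(n_task=2, n_obs=3, n_readout=1, horizon=2)
mask = blockwise_causal_mask(layout)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because no task or observation token attends to a readout token, readouts can be added or removed without changing the representations of the other tokens.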

The backbone’s layer operations:

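  • Self-Attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$

  • Feedforward:

$$\mathrm{FFN}(x) = W_2\,\sigma(W_1 x + b_1) + b_2$$

Here $Q$, $K$, and $V$ are the query, key, and value projections of the token sequence, $d_k$ is the per-head key dimension, $M$ is the blockwise causal mask (zero where attention is allowed, $-\infty$ elsewhere), and $\sigma$ is a pointwise nonlinearity.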

2.3 Diffusion-Based Action Head

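Action chunks of length $L$ are predicted by a small MLP acting as a denoising diffusion decoder $\epsilon_\theta(x^k, e, k)$, conditioned on the current noisy sample $x^k$, the readout embedding $e$ from the transformer, and the step index $k$. Generation proceeds as follows:

  1. Sample $x^K \sim \mathcal{N}(0, I)$,
  2. For $k = K, \dots, 1$:

$$x^{k-1} = \alpha\left(x^k - \gamma\,\epsilon_\theta(x^k, e, k) + \mathcal{N}(0, \sigma^2 I)\right)$$

  3. Output $a_{t:t+L} = x^0$.

The coefficients $\alpha$, $\gamma$, and $\sigma$ are set by the noise schedule. Training uses the DDPM noise prediction objective: $\mathcal{L} = \mathbb{E}_{x^0, \epsilon, k}\left[\lVert \epsilon - \epsilon_\theta(x^k, e, k) \rVert^2\right]$, where $x^k$ is the clean action chunk $x^0$ corrupted with Gaussian noise $\epsilon$ at step $k$.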

Diffusion outperforms both direct MSE decoders and discrete heads.
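
The iterative denoising loop can be sketched in a few lines. This is a toy illustration under stated assumptions: the update form x ← alpha * (x − gamma * eps + sigma * z), the constant coefficients, the choice to skip noise on the final step, and `toy_denoiser` (a stand-in for the learned MLP) are all hypothetical and chosen only so the loop visibly converges.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def sample_action_chunk(
    denoiser: Callable[[Vector, Sequence[float], int], Vector],
    embedding: Sequence[float],
    dim: int,
    num_steps: int = 20,
    alpha: float = 1.0,   # assumed constant; real schedules vary per step
    gamma: float = 0.1,   # assumed constant step size
    sigma: float = 0.0,   # assumed noise scale (0 gives a deterministic loop)
    seed: int = 0,
) -> Vector:
    """Start from Gaussian noise and apply num_steps denoising updates."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for k in range(num_steps, 0, -1):
        eps = denoiser(x, embedding, k)
        # Assumption: no fresh noise is injected on the final step.
        noise = [rng.gauss(0.0, 1.0) if k > 1 else 0.0 for _ in range(dim)]
        x = [
            alpha * (xi - gamma * ei + sigma * zi)
            for xi, ei, zi in zip(x, eps, noise)
        ]
    return x


def toy_denoiser(x: Vector, embedding: Sequence[float], k: int) -> Vector:
    """Stand-in for the learned network: treats (x - embedding) as the noise."""
    return [xi - ei for xi, ei in zip(x, embedding)]


target = [0.5, -0.25, 0.0, 1.0]  # plays the role of the readout embedding
chunk = sample_action_chunk(toy_denoiser, target, dim=4, num_steps=100)
print([round(c, 3) for c in chunk])
```

With the toy denoiser, each step moves the sample a fraction `gamma` of the way toward `target`, so the output settles near it. A trained network would instead predict the noise from data.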

3. Data, Pretraining, and Objectives

3.1 Pretraining Mixture

Octo is pretrained on a curated set of 800,000 episodes drawn from the Open X-Embodiment collection. This includes 25 datasets encompassing single-arm and dual-arm manipulation, friction-rich, pick-and-place, and furniture-interaction tasks. The five largest sources are Fractal (17%), Kuka QT-Opt (17%), Bridge Data (17%), BC-Z (9%), and Stanford Hydra (6%).

3.2 Training Methodology

  • Hindsight goal relabeling: With some probability $p$, the language goal $\ell$ is replaced by a future observation frame from the same trajectory, which then serves as the goal image.
  • Data augmentations: Random crops, color jitter, and normalization for images.
  • Loss modes: Random dropout of language or goal-image conditioning enforces robustness to both modes.
  • AdamW optimizer: learning rate $3\times10^{-4}$, inverse-square-root decay, $2$k warmup steps, weight decay $0.1$, gradient clipping.
  • Batch size: $2048$ on a TPU v4-128 pod; the Base model trains for 300k steps ($\sim$14 hours).
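
Two of the ingredients listed above, the warmup plus inverse-square-root schedule and hindsight goal relabeling, can be sketched as follows. The function names, the linear warmup shape, the default peak rate and warmup length, and the relabeling probability `p_goal` are illustrative assumptions rather than values confirmed here.

```python
import random
from typing import List, Optional, Tuple


def learning_rate(step: int, peak_lr: float = 3e-4, warmup_steps: int = 2000) -> float:
    """Linear warmup to peak_lr, then inverse-square-root decay (assumed form)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5


def relabel_task(
    frames: List[str],
    t: int,
    instruction: str,
    rng: random.Random,
    p_goal: float = 0.5,  # hypothetical relabeling probability
) -> Tuple[Optional[str], Optional[str]]:
    """Return (language, goal_frame) for timestep t; exactly one is not None."""
    future = frames[t + 1:]
    if future and rng.random() < p_goal:
        # Hindsight relabeling: a later frame of the same episode is the goal.
        return None, rng.choice(future)
    return instruction, None


rng = random.Random(0)
frames = ["frame0", "frame1", "frame2", "frame3"]
print(learning_rate(1000), learning_rate(2000), learning_rate(8000))
print(relabel_task(frames, t=1, instruction="pick up the cup", rng=rng))
```

Returning exactly one of the two conditioning signals per sample also mimics the conditioning dropout described above, since the policy sees language-only and goal-only examples during training.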

4. Finetuning and Domain Adaptation

When transferring Octo to a new robot or domain, $\sim$100 in-domain demonstrations suffice. If the sensory or action spaces differ, new tokenizers and output heads can be appended without restarting pretraining. Full-model finetuning (without freezing any backbone layers) is performed for 50k steps using the AdamW recipe with a cosine learning rate decay on a single NVIDIA A5000 GPU (24GB), taking $\sim$5 hours per domain. Octo has handled:

  • Additional proprioception/force-torque for insertion tasks,
  • Joint-space action heads for dual-arm robots,
  • Unseen robots (e.g., ViperX, ALOHA bimanual).
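
For reference, a cosine learning rate decay over the 50k finetuning steps can be sketched as below. The peak and final rates are hypothetical placeholders, since the finetuning learning rate is not given here, and no warmup phase is modeled.

```python
import math


def cosine_lr(
    step: int,
    total_steps: int = 50_000,  # finetuning length stated in the text
    peak_lr: float = 3e-4,      # hypothetical starting rate
    final_lr: float = 0.0,      # hypothetical final rate
) -> float:
    """Cosine decay from peak_lr at step 0 to final_lr at total_steps."""
    progress = min(max(step, 0), total_steps) / total_steps
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 12_500, 25_000, 37_500, 50_000):
    print(step, f"{cosine_lr(step):.2e}")
```

The schedule starts at the peak rate, passes through half of it at the midpoint, and flattens out toward the end, which gives the last finetuning steps small, stable updates.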

5. Empirical Performance

5.1 Zero-Shot Multi-Robot Benchmark

Evaluation over three platforms (WidowX BridgeV2, UR5, RT-1) with 10-trial tasks yields:

| Model | Avg. success (3 robots) | Δ vs. RT-1-X |
|--------|-------------------------|--------------|
| RT-1-X | $\sim$43% | – |
| RT-2-X | $\sim$65% | n/a (closed) |
| Octo | 72% | +29 pp |

Goal-image conditioning provides up to 25 percentage points higher success than language-only on specific tasks.

5.2 Few-Shot Finetuning

Across six new domains (100 demos each):

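| Model | Insertion | Coffee | Baking | Pick | Coke | Bimanual | Avg. |
|---------|-----------|--------|--------|------|------|----------|------|
| Scratch | 10% | 45% | 25% | 0% | 20% | 20% | 20% |
| VC-1 | 5% | 0% | 30% | 0% | 10% | 50% | 15% |
| Octo | 70% | 75% | 50% | 60% | 100% | 80% | 72% |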

Octo outperforms next-best baselines by 52 percentage points on average.
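
The averages and the stated gap can be rechecked from the per-domain numbers in the table above. The snippet below is plain arithmetic on those figures; only the variable names are new.

```python
# Per-domain success rates (%) copied from the few-shot table, in column order.
results = {
    "Scratch": [10, 45, 25, 0, 20, 20],
    "VC-1": [5, 0, 30, 0, 10, 50],
    "Octo": [70, 75, 50, 60, 100, 80],
}

averages = {model: sum(vals) / len(vals) for model, vals in results.items()}
best_baseline = max(avg for model, avg in averages.items() if model != "Octo")
gap = averages["Octo"] - best_baseline

for model, avg in averages.items():
    print(f"{model}: {avg:.1f}%")
print(f"gap over best baseline: {gap:.1f} pp")
# Scratch: 20.0%
# VC-1: 15.8%
# Octo: 72.5%
# gap over best baseline: 52.5 pp
```

The recomputed means (15.8% and 72.5%) suggest the table's 15% and 72% are rounded down, and the 52 pp figure corresponds to 72% minus the 20% of the from-scratch baseline.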

6. Design Analysis and Ablations

6.1 Architecture and Modality Choices

  • ViT-first transformer (Octo-Small) yields 83% success vs. 70% for a ResNet+Transformer.
  • Using the full 25-dataset mixture achieves 83% vs. 60% for a restricted 11-dataset mix or 43% for single-robot data.
  • Policy head: diffusion (83%) greatly outperforms MSE (35%) and discrete (18%) heads.
  • Scaling: Octo-Tiny (10M): $\sim$50%, Octo-Small (27M): $\sim$65%, Octo-Base (93M): $\sim$72% (averaged on WidowX).

6.2 Other Findings

  • History: A single previous frame is sufficient; additional frames add marginal benefit.
  • Patch size: The patch size used to form visual tokens matters for fine manipulation, with one of the two tested sizes performing better.
  • Pretrained ResNet vision or MSE heads tend to produce overly conservative actions.
  • Visual grounding is essential; purely proprioceptive models suffer causal confusion.

7. Broader Implications, Limitations, and Recommendations

Octo establishes that large, flexible, transformer-based policies can serve as powerful generalist robot models, enabling data-efficient transfer to new domains. However, the fraction of wrist-camera and language-annotated data is a bottleneck (27% and 56%, respectively), suggesting that additional multimodal data collection would further strengthen generalization. Incorporating interactive or perturbed data for robustness (e.g., recovery from failure) and extending to broader settings like mobile manipulation are recommended. Model scaling, better end-to-end visuo-lingual grounding, and hybrid reinforcement/imitation objectives are noted as promising avenues.

By releasing code, training pipelines, and pretrained models (Octo-Small, Octo-Base) at https://octo-models.github.io, Octo supports transparent and reproducible research in generalist robotic manipulation (Team et al., 2024). Contemporary research (e.g., Dita (Hou et al., 25 Mar 2025)) points to the potential of further unifying diffusion with in-context transformer architectures, facilitating precision and efficiency at greater parameter scales. This suggests that Octo’s modular, open, and empirically validated approach forms a foundation for the next generation of generalist robot policies.
