
Galaxea: Open-World Robotics Dataset

Updated 6 September 2025
  • Galaxea Open-World Dataset is a comprehensive collection of 100,000 high-resolution robot trajectories from 11 diverse human-centric environments, paired with subtask-level annotations.
  • It integrates synchronized multimodal streams—including egocentric views, external observations, and proprioceptive data—to enable precise vision-language-action grounding.
  • The dataset underpins the G0 dual-system model, which separates high-level planning from low-level control to achieve robust performance in complex, dynamic robotic tasks.

The Galaxea Open-World Dataset is a large-scale, high-resolution resource of robot demonstration trajectories acquired in authentic human-centric environments. Designed to support the development and evaluation of open-world robotic agents, Galaxea records all trajectories using a consistent robotic embodiment and pairs each with detailed subtask-level language annotations for precise visual-language-action alignment. This dataset provides the empirical foundation for the G0 dual-system Vision-Language-Action (VLA) model, which leverages a curriculum-based training pipeline for robust multimodal planning and fine-grained execution in complex, dynamic tasks.

1. Composition and Scope

The Galaxea Open-World Dataset comprises data recorded from 11 diverse physical environments, including Residential, Retail, Catering, and Office locations. These environments together yield 50 distinct scenes reflecting practical, unstructured settings such as cluttered desks, household kitchens, and bedrooms requiring whole-body mobile manipulation. The dataset incorporates 100,000 demonstration trajectories spanning over 150 task categories. Each trajectory is paired with structured subtask-level language annotations, supporting fine alignment across visual inputs, natural language, and action labels.

Key aspects include:

| Attribute | Value | Detail |
| --- | --- | --- |
| Sites | 11 | Residential, Retail, Catering, Office |
| Scenes | 50 | E.g., desk, kitchen, bedroom |
| Trajectories | 100,000 | High-fidelity demonstrations |
| Tasks | >150 | Fine- and gross-motor manipulation |
| Skills | 58 | Pick, stack, appliance use, whole-body |
| Objects | >1,600 | Diverse categories, domain generality |

The skill set encompasses atomic pick-and-place actions, dual-arm coordination, and multi-step whole-body operations, enabling the study of task compositionality and long-horizon control.

2. Dataset Structure and Annotation Protocol

Each demonstration includes synchronized multimodal streams: egocentric and external camera observations, proprioceptive robot states, and subtask-level natural language instructions. Language annotations demarcate both the overall behavioral objective and the boundaries of atomic action segments, providing a fine-grained bridge between perception, symbolic language, and continuous control. This protocol is critical for precise vision-language-action grounding, supporting downstream model alignment and evaluation.

Object and skill diversity is achieved by curated selection and placement of over 1,600 unique objects across environments, with scenarios structured to reflect both common and edge-case affordances in human environments. All data is collected using a single, consistent robotic platform, minimizing confounding factors due to embodiment mismatch and ensuring reproducible low-level action semantics.
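
The paper does not specify a release file format here; the following is a minimal sketch of how one synchronized, subtask-annotated trajectory could be represented in Python. All field names and shapes are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class SubtaskSegment:
    """One atomic action segment with its language annotation (hypothetical layout)."""
    instruction: str   # e.g. "pick up the mug with the left gripper"
    start_frame: int   # first synchronized frame of the segment
    end_frame: int     # last synchronized frame (inclusive)

@dataclass
class Trajectory:
    """One demonstration with synchronized multimodal streams (hypothetical layout)."""
    task_instruction: str        # overall behavioral objective
    egocentric_rgb: np.ndarray   # (T, H, W, 3) head-mounted camera
    external_rgb: np.ndarray     # (T, H, W, 3) third-person camera
    proprioception: np.ndarray   # (T, D) joint positions / gripper state
    actions: np.ndarray          # (T, A) commanded low-level actions
    subtasks: List[SubtaskSegment] = field(default_factory=list)

def segment_frames(traj: Trajectory, seg: SubtaskSegment) -> np.ndarray:
    """Return the egocentric frames covered by one annotated subtask."""
    return traj.egocentric_rgb[seg.start_frame : seg.end_frame + 1]
```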

3. The G0 Dual-System VLA Model

The dataset underpins the G0 dual-system framework for open-world robotic manipulation. G0 consists of two asynchronously coupled systems:

  • System 2: A Vision-Language Model (VLM) serving as the high-level planner. It interprets free-form language commands, processes visual observations, and decomposes complex objectives into sequences of subtask instructions.
  • System 1: A Vision-Language-Action (VLA) model functioning as a low-level controller. It consumes subtask instructions, current observations, and proprioceptive signals to generate continuous robot actions.

This dual-system structure enables separation of deliberative planning and reactive execution, facilitating complex, multi-stage behaviors with reliable low-level stability.
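
As a rough illustration, the sketch below wires such a loop in a single thread, re-invoking the planner at a lower rate than the controller. All interface names (plan, act, execute, get_image, get_state) are illustrative assumptions, and the single-threaded loop only approximates the asynchronous coupling described above.

```python
import time

def run_g0_episode(planner, controller, robot, goal_text, replan_hz=1.0, control_hz=30.0):
    """Single-threaded approximation of the asynchronously coupled dual-system loop.

    planner    - System 2: maps (image, goal text) -> current subtask instruction
    controller - System 1: maps (image, subtask, proprioceptive state) -> low-level action
    robot      - supplies observations/state and executes commanded actions
    """
    subtask = planner.plan(robot.get_image(), goal_text)  # slow, deliberative planning
    last_plan = time.monotonic()

    while not robot.done():
        # System 2 re-plans at a low rate, updating the active subtask instruction.
        if time.monotonic() - last_plan > 1.0 / replan_hz:
            subtask = planner.plan(robot.get_image(), goal_text)
            last_plan = time.monotonic()

        # System 1 runs at the control rate, conditioned on the latest subtask.
        action = controller.act(robot.get_image(), subtask, robot.get_state())
        robot.execute(action)
        time.sleep(1.0 / control_hz)
```

Running the planner and controller at separate rates is what lets deliberative re-planning coexist with stable, high-frequency low-level control.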

4. Curriculum-Based Model Training

G0’s VLA component is trained using a three-stage curriculum, leveraging both cross-embodiment and single-embodiment data:

  1. Cross-Embodiment Pre-training: The model is initialized with diverse robot demonstration data drawn from multiple platforms and sources. All continuous action trajectories are tokenized (FAST tokenizer) and autoregressively predicted, conditioned on visual ($o_t$), linguistic ($l_t$), and state ($s_t$) information:

$$p(\mathbf{A}^D_t) = \prod_{i=1}^{N} p(a^D_i \mid a^D_{<i}, o_t, l_t, s_t)$$

where $\mathbf{A}^D_t$ is the discrete tokenized action sequence for step $t$.

  2. Single-Embodiment Pre-training: Model specialization on Galaxea demonstrations recorded exclusively with a consistent robot embodiment. This stage ensures stability and action fidelity specific to the physical platform and is pivotal for grounded language-action learning at the subtask level.
  3. Task-Specific Post-training: Final fine-tuning using a small number of high-quality demonstrations tailored to specific tasks, e.g., bed making or microwave operation, further sharpening manipulation precision and language-command following.

The staged curriculum allows the model to acquire generic manipulation priors and subsequently fine-tune for highly specific embodiment details.
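
A minimal sketch of the Stage-1 objective as next-token cross-entropy over tokenized actions follows. The model interface, embedding names, and tokenizer details are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def autoregressive_action_loss(model, action_tokens, obs_emb, lang_emb, state_emb):
    """Next-token cross-entropy over discretized action tokens, conditioned on o_t, l_t, s_t.

    action_tokens: (B, N) integer tokens from an action tokenizer such as FAST
    obs_emb, lang_emb, state_emb: conditioning embeddings for the current timestep
    `model` is assumed to return logits of shape (B, N - 1, vocab_size)
    when given the first N - 1 tokens (teacher forcing).
    """
    logits = model(action_tokens[:, :-1], obs_emb, lang_emb, state_emb)
    targets = action_tokens[:, 1:]  # predict token i from tokens < i and the conditioning
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```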

5. Benchmarking and Empirical Evaluation

A comprehensive evaluation protocol leverages task diversity in Galaxea:

| Task | Focus | Metric/Outcome |
| --- | --- | --- |
| Table Bussing | Pick-place, dual-arm coordination | Precision, stability |
| Microwave Operation | Appliance use, multi-step manipulation | Step-wise action sequence generation |
| Bed Making | Whole-body control | Progress, trajectory smoothness |
| Blocks Stacking | Language-grounded stacking | Language following, precision |

Key findings:

  • Full pipeline models (“G0 (Full)”: cross- plus single-embodiment pre-training) demonstrate the highest progress and stability across all tasks.
  • Single-embodiment pre-training is essential for precise multi-limb coordination and for exploiting the detailed structure of Galaxea.
  • In few-shot settings (e.g., 20 demonstrations), single-embodiment pre-training yields smoother, more stable action generation than models trained on cross-embodiment data alone.

Stage-1 pre-training facilitates generalization to novel actions, but models that rely heavily on it can underperform on tasks demanding precise, embodiment-specific control, underscoring the importance of Galaxea's consistent demonstration platform.

6. Technical Formulations

Two core loss functions are employed during training:

  • Autoregressive Action Prediction:

$$p(\mathbf{A}^D_t) = \prod_{i=1}^{N} p(a^D_i \mid a^D_{<i}, o_t, l_t, s_t)$$

The model autoregressively predicts the discretized action sequence $\mathbf{A}^D_t$ given the multimodal input at each timestep.

  • Flow-Matching Action Loss:

$$\mathcal{L}_{\text{flow}}(\theta) = \mathbb{E}_{p(\mathbf{A}^\tau_t \mid o_t, l_t, s_t)} \left[ \left\| v_\theta(\mathbf{A}^\tau_t, \tau, o_t, l_t, s_t) - u(\mathbf{A}^\tau_t \mid \mathbf{A}_t) \right\|^2 \right]$$

where $\mathbf{A}^\tau_t$ is a noisy interpolation of the target action chunk $\mathbf{A}_t$, $v_\theta$ is the predicted flow, and $u$ is the target flow derived from ground-truth actions. This loss encourages predicted actions to follow the true action trajectory, improving fine-grained execution stability.
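
A minimal sketch of a conditional flow-matching loss consistent with the formula above, assuming a simple linear interpolation path between Gaussian noise and the target action chunk; the paper's exact noise schedule, network interface, and conditioning details may differ.

```python
import torch

def flow_matching_loss(v_theta, actions, obs_emb, lang_emb, state_emb):
    """Conditional flow-matching loss for a continuous action chunk (illustrative).

    actions: (B, H, A) ground-truth action chunk A_t
    v_theta: callable (A_tau, tau, o, l, s) -> predicted flow of shape (B, H, A)
    Assumes a linear interpolation path between Gaussian noise and the target chunk.
    """
    noise = torch.randn_like(actions)                               # epsilon ~ N(0, I)
    tau = torch.rand(actions.size(0), 1, 1, device=actions.device)  # per-sample time in [0, 1)
    a_tau = tau * actions + (1.0 - tau) * noise                     # noisy interpolation A_t^tau
    target_flow = actions - noise                                   # u(A_t^tau | A_t) for this path
    pred_flow = v_theta(a_tau, tau, obs_emb, lang_emb, state_emb)
    return ((pred_flow - target_flow) ** 2).mean()
```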

7. Significance, Applications, and Prospects

The Galaxea Open-World Dataset and associated G0 model support reliable training and evaluation of vision-language-action agents in settings characterized by task diversity, ambiguous instructions, and dynamic open-world phenomena. The protocol's single-embodiment focus enables generalization within real-world environmental variation while preserving fine motor and cognitive capabilities, as needed for domestic assistance and advanced service robotics.

This suggests that advancements in language-action grounding and low-level control made possible by Galaxea are directly translatable to practical, high-stakes automation domains. The authors indicate plans to expand the dataset to encompass further sensory modalities and embodiment variations. A plausible implication is that future work may seek to optimize the synergy between cross- and single-embodiment training, and to extend transferability of language-action models across heterogeneous robotic platforms (Jiang et al., 30 Aug 2025).
