Galaxea: Open-World Robotics Dataset
- Galaxea Open-World Dataset is a comprehensive collection of 100,000 high-resolution robot trajectories from 11 diverse human-centric environments, paired with subtask-level annotations.
- It integrates synchronized multimodal streams—including egocentric views, external observations, and proprioceptive data—to enable precise vision-language-action grounding.
- The dataset underpins the G0 dual-system model, which separates high-level planning from low-level control to achieve robust performance in complex, dynamic robotic tasks.
The Galaxea Open-World Dataset is a large-scale, high-resolution resource of robot demonstration trajectories acquired in authentic human-centric environments. Designed to support the development and evaluation of open-world robotic agents, Galaxea records all trajectories using a consistent robotic embodiment and pairs each with detailed subtask-level language annotations for precise visual-language-action alignment. This dataset provides the empirical foundation for the G0 dual-system Vision-Language-Action (VLA) model, which leverages a curriculum-based training pipeline for robust multimodal planning and fine-grained execution in complex, dynamic tasks.
1. Composition and Scope
The Galaxea Open-World Dataset comprises data recorded from 11 diverse physical environments, including Residential, Retail, Catering, and Office locations. These environments together yield 50 distinct scenes reflecting practical, unstructured settings such as cluttered desks, household kitchens, and bedrooms requiring whole-body mobile manipulation. The dataset incorporates 100,000 demonstration trajectories spanning over 150 task categories. Each trajectory is paired with structured subtask-level language annotations, supporting fine alignment across visual inputs, natural language, and action labels.
Key aspects include:
| Attribute | Value | Detail |
|---|---|---|
| Sites | 11 | Residential, Retail, Catering, Office |
| Scenes | 50 | E.g., desk, kitchen, bedroom |
| Trajectories | 100,000 | High-fidelity demonstrations |
| Tasks | >150 | Fine- and gross-motor manipulation |
| Skills | 58 | Pick, stack, appliance use, whole-body |
| Objects | >1,600 | Diverse categories, domain generality |
The skill set encompasses atomic pick-and-place actions, dual-arm coordination, and multi-step whole-body operations, enabling the study of task compositionality and long-horizon control.
2. Dataset Structure and Annotation Protocol
Each demonstration includes synchronized multimodal streams: egocentric and external camera observations, proprioceptive robot states, and subtask-level natural language instructions. Language annotations demarcate both the overall behavioral objective and the boundaries of atomic action segments, providing a fine-grained bridge between perception, symbolic instruction, and continuous control. This protocol is critical for precise vision-language-action grounding, supporting downstream model alignment and evaluation.
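To make the annotation protocol concrete, the following is a minimal sketch of how a single synchronized trajectory record with subtask-level annotations could be represented; the field names (`ego_rgb`, `external_rgb`, `proprio`, `subtasks`, and so on) are illustrative assumptions rather than the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class SubtaskAnnotation:
    """One atomic action segment with its language instruction."""
    instruction: str   # e.g. "pick up the mug with the right gripper"
    start_step: int    # index of the first frame of the segment
    end_step: int      # index of the last frame (inclusive)


@dataclass
class TrajectoryRecord:
    """A single synchronized demonstration (hypothetical schema)."""
    task_goal: str                 # overall behavioral objective in language
    ego_rgb: np.ndarray            # (T, H, W, 3) egocentric camera frames
    external_rgb: np.ndarray       # (T, H, W, 3) external camera frames
    proprio: np.ndarray            # (T, D_q) joint positions / gripper state
    actions: np.ndarray            # (T, D_a) commanded continuous actions
    subtasks: List[SubtaskAnnotation]  # subtask-level segmentation

    def segment(self, k: int) -> "TrajectoryRecord":
        """Slice out the k-th subtask as its own aligned mini-trajectory."""
        s = self.subtasks[k]
        return TrajectoryRecord(
            task_goal=s.instruction,
            ego_rgb=self.ego_rgb[s.start_step:s.end_step + 1],
            external_rgb=self.external_rgb[s.start_step:s.end_step + 1],
            proprio=self.proprio[s.start_step:s.end_step + 1],
            actions=self.actions[s.start_step:s.end_step + 1],
            subtasks=[s],
        )
```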
Object and skill diversity is achieved by curated selection and placement of over 1,600 unique objects across environments, with scenarios structured to reflect both common and edge-case affordances in human environments. All data is collected using a single, consistent robotic platform, minimizing confounding factors due to embodiment mismatch and ensuring reproducible low-level action semantics.
3. The G0 Dual-System VLA Model
The dataset underpins the G0 dual-system framework for open-world robotic manipulation. G0 consists of two asynchronously coupled systems:
- System 2: A Vision-Language Model (VLM) serving as the high-level planner. It interprets free-form language commands, processes visual observations, and decomposes complex objectives into sequences of subtask instructions.
- System 1: A Vision-Language-Action (VLA) model functioning as a low-level controller. It consumes subtask instructions, current observations, and proprioceptive signals to generate continuous robot actions.
This dual-system structure enables separation of deliberative planning and reactive execution, facilitating complex, multi-stage behaviors with reliable low-level stability.
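The division of labor between the two systems can be illustrated with a simplified control loop; the class and method names below (`System2Planner.plan`, `System1Controller.act`, the replanning interval) are hypothetical and only meant to show how a slow, deliberative planner and a fast, reactive controller could be asynchronously coupled.

```python
import time


class System2Planner:
    """High-level VLM planner: runs slowly, emits subtask instructions."""

    def plan(self, goal: str, observation) -> str:
        # In practice: query the VLM with the goal and current images,
        # decompose the objective, and return the next subtask instruction.
        return f"next subtask for goal: {goal}"


class System1Controller:
    """Low-level VLA controller: runs at high rate, emits robot actions."""

    def act(self, subtask: str, observation, proprio):
        # In practice: run the VLA policy to produce a continuous action
        # (or action chunk) conditioned on subtask, images, and robot state.
        return [0.0] * 7  # placeholder 7-DoF action


def control_loop(goal, get_obs, get_proprio, send_action,
                 replan_every_s=2.0, control_hz=50):
    """Asynchronous coupling: System 2 replans occasionally,
    System 1 acts at the control frequency in between."""
    planner, controller = System2Planner(), System1Controller()
    subtask, last_plan = None, 0.0
    while True:
        obs = get_obs()
        if subtask is None or time.time() - last_plan > replan_every_s:
            subtask = planner.plan(goal, obs)                 # slow, deliberative
            last_plan = time.time()
        action = controller.act(subtask, obs, get_proprio())  # fast, reactive
        send_action(action)
        time.sleep(1.0 / control_hz)
```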
4. Curriculum-Based Model Training
G0’s VLA component is trained using a three-stage curriculum, leveraging both cross-embodiment and single-embodiment data:
- Cross-Embodiment Pre-training: The model is initialized with diverse robot demonstration data drawn from multiple platforms and sources. All continuous action trajectories are tokenized with the FAST tokenizer and autoregressively predicted, conditioned on visual ($o_t$), linguistic ($\ell_t$), and state ($q_t$) information:

  $$\mathcal{L}_{\text{AR}} = -\sum_{k=1}^{K} \log p_\theta\!\left(a_t^{(k)} \mid a_t^{(<k)}, o_t, \ell_t, q_t\right),$$

  where $a_t = (a_t^{(1)}, \dots, a_t^{(K)})$ is the discrete tokenized action sequence for step $t$.
- Single-Embodiment Pre-training: Model specialization on Galaxea demonstrations recorded exclusively with a consistent robot embodiment. This stage ensures stability and action fidelity specific to the physical platform and is pivotal for grounded language-action learning at the subtask level.
- Task-Specific Post-training: Final fine-tuning using a small number of high-quality demonstrations tailored to specific tasks, e.g., bed making or microwave operation, further sharpening manipulation precision and language-command following.
The staged curriculum allows the model to acquire generic manipulation priors and subsequently fine-tune for highly specific embodiment details.
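As a rough illustration of the staged curriculum described above, the configuration below sketches how the three stages might be specified; the stage names, data mixture labels, and objective tags are assumptions for exposition, not the authors' actual training configuration.

```python
# Hypothetical three-stage curriculum specification (illustrative only).
CURRICULUM = [
    {
        "stage": "cross_embodiment_pretraining",
        "data": ["multi_robot_mixture"],                  # assumed mixture name
        "objective": "autoregressive_fast_tokens",        # FAST-tokenized actions
        "purpose": "acquire generic manipulation priors",
    },
    {
        "stage": "single_embodiment_pretraining",
        "data": ["galaxea_open_world"],                   # consistent embodiment
        "objective": "flow_matching_action_chunks",
        "purpose": "ground subtask language in platform-specific actions",
    },
    {
        "stage": "task_specific_posttraining",
        "data": ["bed_making_demos", "microwave_demos"],  # small curated sets
        "objective": "flow_matching_action_chunks",       # assumed: same as stage 2
        "purpose": "sharpen precision and instruction following per task",
    },
]


def run_curriculum(model, train_fn):
    """Train the same model through each stage in order (sketch)."""
    for stage_cfg in CURRICULUM:
        model = train_fn(model, stage_cfg)
    return model
```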
5. Benchmarking and Empirical Evaluation
A comprehensive evaluation protocol leverages task diversity in Galaxea:
| Task | Focus | Metric/Outcome |
|---|---|---|
| Table Bussing | Pick-place, dual-arm coordination | Precision, stability |
| Microwave Operation | Appliance use, multi-step manipulation | Step-wise action sequence generation |
| Bed Making | Whole-body control | Progress, trajectory smoothness |
| Blocks Stacking | Language-grounded stacking | Language following, precision |
Key findings:
- Full pipeline models (“G0 (Full)”: cross- plus single-embodiment pre-training) achieve the highest progress and stability on all tasks.
- Single-embodiment pre-training is essential for precise multi-limb coordination and for exploiting the detailed structure of Galaxea.
- In few-shot settings (e.g., 20 demonstrations), single-embodiment pre-training yields smoother, more stable action generation than models trained on cross-embodiment data alone.
Stage-1 (cross-embodiment) pre-training facilitates generalization to novel actions, but models that rely on it alone underperform on tasks demanding precise, embodiment-specific control, underscoring the importance of Galaxea’s consistent demonstration platform.
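To make the progress-style comparisons concrete, here is a minimal sketch of how a per-rollout task-progress score could be computed from subtask completions and averaged per model variant; the scoring scheme and the example numbers are assumptions for illustration, not the paper's exact evaluation protocol or results.

```python
from typing import Dict, List


def progress_score(completed_subtasks: List[bool]) -> float:
    """Fraction of subtasks completed in a single rollout (assumed metric)."""
    if not completed_subtasks:
        return 0.0
    return sum(completed_subtasks) / len(completed_subtasks)


def compare_models(rollouts: Dict[str, List[List[bool]]]) -> Dict[str, float]:
    """Average progress per model variant, e.g. 'G0 (Full)' vs. ablations."""
    return {
        name: sum(progress_score(r) for r in runs) / len(runs)
        for name, runs in rollouts.items()
    }


# Example with made-up rollout outcomes (not results from the paper):
if __name__ == "__main__":
    demo = {
        "G0 (Full)": [[True, True, True], [True, True, False]],
        "Cross-embodiment only": [[True, False, False], [True, True, False]],
    }
    print(compare_models(demo))
```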
6. Technical Formulations
Two core loss functions are employed during training:
- Autoregressive Action Prediction:

  $$\mathcal{L}_{\text{AR}} = -\sum_{k=1}^{K} \log p_\theta\!\left(a_t^{(k)} \mid a_t^{(<k)}, o_t, \ell_t, q_t\right)$$

  The model autoregressively predicts the discretized action sequence $a_t$ given the multimodal input ($o_t$, $\ell_t$, $q_t$) at each timestep.
- Flow Matching Loss (Stage-2):

  $$\mathcal{L}_{\text{FM}} = \mathbb{E}_{\tau,\,\epsilon}\!\left[\left\| v_\theta\!\left(A_t^{\tau}, o_t, \ell_t, q_t\right) - \left(\epsilon - A_t\right)\right\|^2\right]$$

  where $A_t^{\tau} = \tau A_t + (1 - \tau)\,\epsilon$ is a noisy interpolation of the target action chunk $A_t$ (with Gaussian noise $\epsilon$ and interpolation time $\tau \in [0,1]$); $v_\theta$ is the predicted flow; and $\epsilon - A_t$ is the target flow derived from ground-truth actions. This loss encourages predicted actions to follow the true action trajectory, improving fine-grained execution stability. Both objectives are sketched in code below.
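The following PyTorch-style sketch shows how both objectives could be implemented, assuming a FAST-style tokenizer that maps action chunks to discrete token IDs and a flow-matching action head that predicts a velocity field; the tensor shapes, function names, and conditioning interface are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F


def autoregressive_action_loss(logits, action_tokens, pad_id=-100):
    """Cross-entropy over discretized (e.g. FAST-tokenized) action tokens.

    logits:        (B, K, V) predicted distribution over the token vocabulary
    action_tokens: (B, K)    ground-truth discrete action tokens
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        action_tokens.reshape(-1),
        ignore_index=pad_id,
    )


def flow_matching_loss(flow_net, obs_emb, actions):
    """Flow-matching objective on continuous action chunks.

    flow_net: callable(noisy_actions, tau, obs_emb) -> predicted flow v_theta
    obs_emb:  (B, D)    fused visual / language / state conditioning
    actions:  (B, H, A) ground-truth action chunk A_t
    """
    b = actions.size(0)
    tau = torch.rand(b, 1, 1, device=actions.device)  # interpolation time
    eps = torch.randn_like(actions)                   # Gaussian noise
    noisy = tau * actions + (1.0 - tau) * eps         # noisy interpolation A_t^tau
    target_flow = eps - actions                       # assumed target flow direction
    pred_flow = flow_net(noisy, tau.squeeze(-1).squeeze(-1), obs_emb)
    return F.mse_loss(pred_flow, target_flow)
```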
7. Significance, Applications, and Prospects
The Galaxea Open-World Dataset and associated G0 model support reliable training and evaluation of vision-language-action agents in settings characterized by task diversity, ambiguous instructions, and dynamic open-world phenomena. The protocol's single-embodiment focus enables generalization across real-world environmental variation while preserving the fine motor control and high-level reasoning needed for domestic assistance and advanced service robotics.
This suggests that advancements in language-action grounding and low-level control made possible by Galaxea are directly translatable to practical, high-stakes automation domains. The authors indicate plans to expand the dataset to encompass further sensory modalities and embodiment variations. A plausible implication is that future work may seek to optimize the synergy between cross- and single-embodiment training, and to extend transferability of language-action models across heterogeneous robotic platforms (Jiang et al., 30 Aug 2025).