EO-Data1.5M Multimodal Dataset
- EO-Data1.5M is a large-scale, high-quality multimodal dataset integrating visual, language, and action tokens for comprehensive embodied AI research.
- The dataset is meticulously curated using diverse robot platforms and rigorous pipelines to ensure rich, temporally-structured episodes for long-horizon tasks.
- Its unified interleaved design facilitates the training of embodied foundation models by aligning perception, linguistic reasoning, and robot control in a single framework.
EO-Data1.5M is a large-scale, high-quality multimodal dataset designed to enable advanced research in embodied AI, particularly in domains requiring interleaved vision, language, and robot action comprehension. Developed as a primary pillar in the EO-Robotics framework, EO-Data1.5M contains over 1.5 million temporally-structured samples that tightly couple visual observations, linguistic reasoning (via task planning QAs, subtask descriptions, and physical common sense queries), and explicit robot control commands. Its construction strategically leverages diverse robot platforms and rigorous curation pipelines to ensure rich coverage of long-horizon tasks, dexterous manipulation, and open-world scenario adaptability. The dataset is expressly structured for the pre-training of embodied foundation models using unified architectures capable of ingesting and aligning large-scale interleaved sequences of perceptual, linguistic, and motor data.
1. Dataset Composition and Structure
EO-Data1.5M consists of approximately 1.5 million episodes (samples), reported to contain roughly 1.0 billion tokens after tokenization. Each sample represents an interleaved multimodal segment, concatenating visual, textual, and action information in strict temporal order. The modalities present in every sample are:
- Visual tokens: These are processed image patches extracted from egocentric and auxiliary robot camera views (e.g., left, right, wrist).
- Text tokens: Comprehensive natural language annotations, including direct task instructions, question–answer pairs on task planning, subtask descriptors, physical/spatial reasoning queries, and process verification statements.
- Action tokens: Continuous representations encoding robot motor control commands.
In each episode, the modalities are not segmented into separate blocks but interleaved such that, for example, an image token sequence is directly followed by a planning QA, then the corresponding action token block. This structure ensures the model learns fine-grained temporal and causal dependencies linking perception, reasoning, and control.
| Modality | Detail | Representation |
|---|---|---|
| Visual | Egocentric and auxiliary camera views | Image tokens |
| Language | Instructions, QAs, captions, common-sense/planning text | Text tokens |
| Action | Robot control signals for manipulation and movement | Continuous tokens |
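To make this interleaved layout concrete, the following minimal sketch represents one such episode as an ordered list of modality-tagged segments; the class and field names are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Literal, Union

import numpy as np

# Illustrative types only -- not the dataset's actual schema.
Modality = Literal["vision", "text", "action"]

@dataclass
class Segment:
    """One modality-tagged block inside an interleaved episode."""
    modality: Modality
    # Image patches, raw text, or continuous action chunks, by modality.
    payload: Union[np.ndarray, str]

@dataclass
class Episode:
    """A temporally ordered interleaved vision-text-action sample."""
    segments: List[Segment]

# A toy episode mirroring the pattern described above:
# scene images -> planning QA -> continuous action chunk.
episode = Episode(segments=[
    Segment("vision", np.zeros((3, 224, 224), dtype=np.float32)),  # wrist-camera frame
    Segment("text", "Q: What should be done next? A: Grasp the yellow box."),
    Segment("action", np.zeros((16, 7), dtype=np.float32)),        # 16-step, 7-DoF action chunk
])
```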
2. Data Collection and Curation Pipeline
The data acquisition leverages episode logs from multiple robot platforms, including AgiBot, RoboMind, SO100, and IPEC-Franka. Each raw robot video is subjected to feature extraction using a pretrained vision backbone, followed by unsupervised K-means clustering of video features. A fixed number of representative videos per cluster are selected and manually reviewed to remove redundancy.
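As a hedged sketch of this de-duplication step, the snippet below clusters per-video feature vectors with scikit-learn's KMeans and keeps the videos closest to each centroid for manual review; the backbone, feature dimensionality, cluster count, and nearest-to-centroid selection criterion are all assumptions, since the source does not specify them.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representative_videos(video_features: np.ndarray,
                                 n_clusters: int = 50,
                                 per_cluster: int = 5) -> list:
    """Cluster per-video feature vectors and keep a fixed number of
    videos closest to each cluster centroid (indices into the input)."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(video_features)

    selected = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        # Distance of each member to its centroid; keep the closest few.
        dists = np.linalg.norm(video_features[members] - kmeans.cluster_centers_[c], axis=1)
        selected.extend(members[np.argsort(dists)[:per_cluster]].tolist())
    return selected

# Example: 1,000 videos embedded into 512-d vectors by a pretrained backbone.
feats = np.random.randn(1000, 512).astype(np.float32)
keep = select_representative_videos(feats)
```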
Subsequently, videos are segmented into short clips aligned with single subtasks, and descriptive captions are generated by a combination of human annotators and pretrained vision-language models (VLMs). These clips and captions serve as the source for automated QA generation, which follows two principal schemes:
- Temporal reasoning QA probes planning, episode sequencing, process verification, and outcome prediction.
- Spatial reasoning QA targets trajectory prediction, object referencing, multiview correspondence, and manipulation planning.
Final QA pairs are processed through rule-based cleaning and LLM-based rewriting to ensure format uniformity, linguistic diversity, and context alignment, reflecting a multi-tiered approach to sample quality and diversity.
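The exact cleaning rules and rewriting prompts are not documented here, so the following is only a rough illustration of what a rule-based QA filter could look like before the LLM rewriting pass; the thresholds and patterns are assumptions.

```python
import re
from typing import Optional, Tuple

def clean_qa_pair(question: str, answer: str,
                  min_len: int = 8, max_len: int = 400) -> Optional[Tuple[str, str]]:
    """Apply simple rule-based filters to a generated QA pair.
    Returns the normalized pair, or None if the pair is rejected."""
    q, a = question.strip(), answer.strip()

    # Reject empty, truncated, or degenerate generations.
    if not q.endswith("?") or len(a) < min_len or len(a) > max_len:
        return None
    if re.search(r"as an ai (language )?model", a, flags=re.IGNORECASE):
        return None

    # Normalize whitespace before handing the pair to an LLM rewriter
    # for linguistic diversity and context alignment.
    q = re.sub(r"\s+", " ", q)
    a = re.sub(r"\s+", " ", a)
    return q, a

assert clean_qa_pair("What should the robot do next?",
                     "Grasp the yellow box and move it toward the shelf.") is not None
```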
3. Interleaved Vision-Text-Action Comprehension
EO-Data1.5M is distinguished by its interleaved design, where each sample acts as a compact, temporally-linked fusion of visual, language, and control modalities. For instance, a typical segment might open with a set of image tokens encoding the observed scene, immediately followed by a task-focused QA (e.g., “Q: Based on the current scene, what should be done next? A: Grasp the yellow box and move it toward the shelf”), then a sequence of continuous action tokens representing the real-time control signals issued by the robot.
Spatial reasoning is also interleaved: after a question regarding predicted trajectory or object location, a set of movement instructions and their corresponding action tokens are presented. This tightly coupled format encodes direct correlations between perception, reasoning, and action, which is essential for training models intended to generalize to open-world settings, dexterous manipulation, and embodied reasoning tasks.
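A minimal sketch of how such a segment might be flattened into a single model-ready token stream, assuming placeholder special-token ids for image and action boundaries and a generic text tokenizer; none of these identifiers or conventions come from the source.

```python
from typing import List, Tuple

def flatten_segment(image_token_ids: List[int],
                    qa_text_ids: List[int],
                    num_action_steps: int,
                    boi: int, eoi: int, boa: int, eoa: int) -> Tuple[List[int], List[bool]]:
    """Concatenate vision, text, and action pieces in temporal order.

    Returns the token id stream plus a parallel mask marking which
    positions are continuous action placeholders (supervised by a
    denoising objective rather than cross-entropy).
    """
    ids: List[int] = []
    is_action: List[bool] = []

    # 1) scene observation, wrapped in begin/end-of-image markers
    ids += [boi] + image_token_ids + [eoi]
    is_action += [False] * (len(image_token_ids) + 2)

    # 2) task-focused QA text
    ids += qa_text_ids
    is_action += [False] * len(qa_text_ids)

    # 3) continuous action chunk, one placeholder id per action step
    ids += [boa] + [0] * num_action_steps + [eoa]
    is_action += [False] + [True] * num_action_steps + [False]
    return ids, is_action
```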
4. Model Training: Unified Architecture and Loss Functions
Models trained on EO-Data1.5M (notably the EO-1 embodied foundation model) employ a unified decoder-only transformer, capable of supporting both discrete (text) and continuous (action) token outputs. The training process is predicated on processing interleaved multimodal token sequences:
- Discrete tokens (text): Generated auto-regressively, following a next-token prediction regime with cross-entropy loss.
- Continuous tokens (actions): Modeled through a flow matching denoising process, in which noisy action inputs are denoised into control commands by integrating a learned vector field $v_\theta$ via Euler's rule.
In standard form, the key mathematical components are:
- Autoregressive modeling of the interleaved sequence: $p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$, where each $x_t$ may be a visual, text, or action token.
- Flow matching for actions: with Gaussian noise $a^0 \sim \mathcal{N}(0, I)$ and interpolant $a^\tau = \tau a^1 + (1 - \tau)\, a^0$, the vector field is trained with $\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{\tau, a^0}\,\big\lVert v_\theta(a^\tau, \tau, c) - (a^1 - a^0) \big\rVert^2$, and actions are recovered at inference by Euler integration of $v_\theta$.
- Overall loss: $\mathcal{L} = \mathcal{L}_{\mathrm{NTP}} + \lambda\, \mathcal{L}_{\mathrm{FM}}$, combining next-token cross-entropy on discrete tokens with the flow-matching objective on continuous action tokens.
This training regime directly aligns semantic understanding from language and vision with fine-grained robot control; a plausible implication is improved real-world transferability in open-world and long-horizon task settings.
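The following compact PyTorch sketch shows how the two objectives could be combined in training and how actions could be sampled with Euler integration at inference; the model interfaces (`text_logits`, `action_vector_field`) and the loss weight are placeholders rather than the EO-1 implementation.

```python
import torch
import torch.nn.functional as F

def combined_loss(model, batch, lam: float = 1.0) -> torch.Tensor:
    """Next-token cross-entropy over discrete tokens plus a
    flow-matching loss over continuous action chunks."""
    # --- discrete branch: autoregressive next-token prediction -------
    logits = model.text_logits(batch["input_ids"])            # (B, T, V)
    ntp = F.cross_entropy(logits[:, :-1].flatten(0, 1),
                          batch["input_ids"][:, 1:].flatten())

    # --- continuous branch: flow matching on action chunks -----------
    a1 = batch["actions"]                                      # (B, H, D) target chunk
    a0 = torch.randn_like(a1)                                  # Gaussian noise sample
    tau = torch.rand(a1.shape[0], 1, 1, device=a1.device)      # interpolation time
    a_tau = tau * a1 + (1.0 - tau) * a0                        # noisy action input
    v_pred = model.action_vector_field(a_tau, tau, batch["context"])
    fm = F.mse_loss(v_pred, a1 - a0)                           # regress the target velocity

    return ntp + lam * fm

@torch.no_grad()
def sample_actions(model, context, horizon: int, dim: int, steps: int = 10):
    """Euler integration of the learned vector field from noise to actions."""
    a = torch.randn(1, horizon, dim)
    for i in range(steps):
        tau = torch.full((1, 1, 1), i / steps)
        a = a + (1.0 / steps) * model.action_vector_field(a, tau, context)
    return a
```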
5. Benchmark Evaluation and Empirical Findings
EO-Data1.5M enables the training of generalist vision-language-action (VLA) models evaluated on a suite of embodied reasoning and manipulation benchmarks. Tests include RoboVQA and ERQA for visuospatial and physical common sense inference, as well as LIBERO and SimplerEnv for evaluating dexterous robot control.
Strong empirical results demonstrate:
- Superior generalization on reasoning tasks, including complex planning and spatial inference.
- Increased success rates in manipulation and long-horizon planning tasks across multiple robot embodiments.
- The unified interleaved training regime efficiently bridges the perception-control gap commonly observed in hierarchical architectures.
This suggests that EO-Data1.5M’s tight multimodal coupling is a principal driver of its utility in embodied agent training.
6. Broader Context and Interoperability
EO-Data1.5M represents a class of purpose-designed datasets intended for large-scale embodied intelligence research, complementary to catalogue platforms such as the IEEE GRSS Earth Observation Database (EOD) (2209.12480). While EOD focuses on cataloguing annotated remote sensing datasets across diverse imaging modalities and geographies, EO-Data1.5M specializes in interleaved embodied reasoning data for robot learning. Notably, platforms such as EOD provide standardized metadata schemas, dataset comparison, and modular search, facilitating the cross-dataset analysis and integration efforts essential for reproducibility and benchmarking in embodied AI.
7. Significance in Embodied AI Research
EO-Data1.5M marks a substantive advance in dataset design for embodied foundation model training, aligning multimodal perception, language-based reasoning, and motor control within cohesive, temporally-structured data samples. The dataset and the associated interleaved training regime address a central challenge in embodied AI—enabling agents to perform seamless, human-like reasoning and dexterous action in unstructured, open-world scenarios. Models pre-trained on EO-Data1.5M constitute a critical step toward general robot intelligence, as empirically validated on demanding embodied VLA benchmarks (Qu et al., 28 Aug 2025).
In sum, EO-Data1.5M provides a robust, empirically grounded resource for advancing research at the intersection of multimodal learning, embodied reasoning, and robot control, offering both data infrastructure and methodological blueprints for the next generation of foundation models in open-world embodied AI.