- The paper introduces a unified VLA model that interleaves vision, text, and action data for general robot control.
- It employs a hybrid decoding strategy combining autoregressive prediction and flow matching to improve both language and action outputs.
- Extensive benchmarks show superior performance in embodied reasoning, dexterous manipulation, and open-world generalization.
EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control
Introduction
EmbodiedOneVision introduces a unified Vision-Language-Action (VLA) model and a large-scale interleaved multimodal dataset for generalist robot control. The central innovation is the integration of interleaved vision, text, and action data into a single transformer-based architecture, enabling seamless embodied reasoning and dexterous manipulation across diverse real-world scenarios. The model is trained with a hybrid objective combining autoregressive next-token prediction for discrete modalities and flow-matching denoising for continuous robot actions, leveraging a curated dataset that emphasizes temporal and causal relationships among modalities.
Unified Model Architecture
The EmbodiedOneVision model adopts a single decoder-only transformer backbone initialized from Qwen2.5-VL, supporting both discrete (language) and continuous (action) outputs. All modalities (language instructions, image observations, robot states, and noisy actions) are encoded into an interleaved token sequence and processed by the shared transformer. Discrete language modeling is handled by a standard logits head, while continuous actions are generated by a flow-matching head, with actions denoised through forward Euler integration of the predicted vector field.
Figure 1: Model architecture showing unified transformer backbone with discrete language head and continuous flow-matching head for robot action generation.
This design circumvents the need for action-specific modules, facilitating direct cross-modal knowledge transfer and alignment. The architecture supports causal attention across the entire interleaved sequence, capturing dependencies between reasoning and acting, and enabling the model to alternate between multimodal reasoning and physical control.
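To make the dual-head design concrete, here is a minimal PyTorch sketch of a shared decoder-only backbone feeding both a logits head and a flow-matching head. The class name, module sizes, and the use of a generic transformer encoder with a causal mask in place of Qwen2.5-VL are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class UnifiedVLASketch(nn.Module):
    """Shared decoder-only backbone with a discrete logits head and a continuous flow-matching head."""

    def __init__(self, d_model=1024, n_heads=8, n_layers=4, vocab_size=32000, action_dim=7):
        super().__init__()
        # Stand-in for the Qwen2.5-VL-initialized backbone; causal masking emulates decoder-only attention.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)    # next-token logits for language positions
        self.flow_head = nn.Linear(d_model, action_dim)  # predicted vector field for noisy-action positions

    def forward(self, interleaved_embeds):
        # interleaved_embeds: (B, T, d_model) embeddings of text, image, state, and noisy-action tokens
        T = interleaved_embeds.size(1)
        causal_mask = torch.triu(
            torch.full((T, T), float("-inf"), device=interleaved_embeds.device), diagonal=1
        )
        h = self.backbone(interleaved_embeds, mask=causal_mask)
        # Each head is supervised only at its own positions (language vs. noisy-action tokens).
        return self.lm_head(h), self.flow_head(h)
```

Because both heads read the same hidden states, gradients from language supervision and action denoising flow through one set of shared parameters, which is the mechanism the paper credits for cross-modal knowledge transfer.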
Interleaved Data Construction and Sampling
A scalable data curation pipeline is employed to construct the EO-Robotics dataset, integrating web-based vision-language data with real robot episodes. Human annotators and VLMs generate diverse QA pairs for embodied temporal and spatial reasoning, including physical commonsense, task planning, object localization, affordance pointing, and multiview correspondence. These are concatenated with robot control actions in temporal order, forming interleaved vision-text-action sequences.
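As a rough illustration of the concatenation format, QA segments and robot segments can be merged by episode timestamp so that reasoning text appears in causal order with the observations and actions it refers to. The `Segment` fields and helper below are hypothetical and do not reflect the actual EO-Robotics schema.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    t: float         # episode timestamp the segment is anchored to
    kind: str        # "image" | "text_qa" | "state" | "action"
    payload: object  # raw frame, QA string, proprioceptive state, or action chunk

def build_interleaved_sequence(qa_segments, robot_segments):
    """Merge embodied-reasoning QA with robot observations and actions in temporal order."""
    return sorted(qa_segments + robot_segments, key=lambda s: s.t)

# Example: a spatial-reasoning QA pair lands between the frame it refers to
# and the action chunk it motivates.
seq = build_interleaved_sequence(
    qa_segments=[Segment(1.0, "text_qa", "Q: Which bin is empty? A: The left one.")],
    robot_segments=[Segment(0.0, "image", "frame_000"), Segment(1.5, "action", "chunk_000")],
)
print([s.kind for s in seq])  # ['image', 'text_qa', 'action']
```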
Figure 2: Interleaved rectifying sampling strategy for efficient mixed-modality training while preserving causal relationships.
To address the disruption of causal relationships during action denoising, a rectifying sampling strategy is introduced. Variable-length subsequences are sampled from action generation segments, with noisy action tokens replaced by clean ones in intermediate segments, ensuring proper gradient flow and causal alignment during training.
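A hedged sketch of how such rectified sampling could look in code follows. The segment format, the choice of a single randomly sampled target segment, and the function name `rectify_sample` are assumptions made for illustration rather than the paper's exact procedure.

```python
import random

def rectify_sample(segments):
    """segments: temporally ordered dicts like {"kind": "action" | "text_qa" | "image", ...}.

    Pick one action segment as the denoising target, keep all earlier action
    segments as clean (context-only) tokens, and truncate the sequence at the
    target so every token attends to a causally valid, noise-free history.
    """
    action_idx = [i for i, s in enumerate(segments) if s["kind"] == "action"]
    if not action_idx:
        return segments
    target = random.choice(action_idx)            # endpoint of the variable-length subsequence
    out = []
    for i, s in enumerate(segments[: target + 1]):
        if s["kind"] == "action":
            role = "noisy_target" if i == target else "clean_context"
            out.append({**s, "role": role})       # only noisy_target receives the flow-matching loss
        else:
            out.append(s)
    return out
```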
Dataset Composition and Benchmarking
The EO-Robotics dataset comprises over 1.5M interleaved samples, 1.2M robot control episodes, and 5.7M web multimodal samples, totaling 135B tokens. The interleaved data is annotated for both temporal and spatial reasoning, supporting fine-grained geometric and dynamic understanding.
Figure 3: Dataset statistics, curation pipeline, and interleaved data examples illustrating multimodal concatenation formats.
A comprehensive benchmark is constructed to evaluate embodied reasoning, covering spatial understanding, physical commonsense, task reasoning, and state estimation. Each QA instance targets a specific reasoning aspect, enabling interpretable and disentangled evaluation.
Figure 4: Benchmark examples including multiview pointing, physical commonsense, trajectory prediction, process verification, task planning, and affordance.
Training Paradigm
Training is performed on the full multimodal corpus for five epochs using Flash-Attention and variable-length packing. The model is optimized with a loss that balances autoregressive next-token prediction on language tokens against flow-matching regression on noisy action tokens. At inference, multi-view camera observations and sub-task instructions condition the policy, which predicts 16-step action chunks via 10 denoising iterations, supporting real-time deployment on a single RTX 4090 GPU.
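A minimal sketch of these two ingredients is shown below, assuming a mean-squared flow-matching target and an abstract `predict_vector_field(context, actions, t)` callable; the loss weighting, mask handling, and model interface are illustrative, not the released training code.

```python
import torch
import torch.nn.functional as F

def balanced_loss(logits, vector_field, text_targets, target_field, text_mask, action_mask, action_weight=1.0):
    """Next-token cross-entropy on language positions plus flow-matching regression on noisy-action positions."""
    lm_loss = F.cross_entropy(logits[text_mask], text_targets[text_mask])
    fm_loss = F.mse_loss(vector_field[action_mask], target_field[action_mask])
    return lm_loss + action_weight * fm_loss

@torch.no_grad()
def sample_action_chunk(predict_vector_field, context, chunk_len=16, action_dim=7, steps=10):
    """Forward-Euler integration of the learned vector field from Gaussian noise to an action chunk."""
    actions = torch.randn(1, chunk_len, action_dim)      # start from noise
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((1,), k * dt)
        v = predict_vector_field(context, actions, t)    # assumed call: returns (1, chunk_len, action_dim)
        actions = actions + dt * v                        # one Euler step along the flow
    return actions
```

With ten Euler steps per 16-step chunk, each control decision costs only ten backbone passes over the action tokens, which is consistent with the single-GPU real-time deployment claim.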
Experimental Results
Embodied Reasoning
EmbodiedOneVision is evaluated on RoboVQA, ERQA, and the self-constructed benchmark, outperforming both open-source and closed-source VLMs and VLA models. On RoboVQA, it achieves a BLEU-4 score of 58.5, surpassing GPT-4o (47.2). On ERQA, it reaches 45.5 accuracy, exceeding InternVL2.5 8B (45.2). On the self-constructed benchmark, it demonstrates strong spatial and temporal reasoning, with an overall score of 44.8.
Robot Control
On LIBERO, EmbodiedOneVision attains a 98.2% average success rate, outperforming OpenVLA-OFT, π0, and GR00T-N1 by 1.1–4.3%. On SimplerEnv, it achieves the highest success rates across all settings (WidowX: 72.7%; Google Robot visual matching: 76.5%; Google Robot variant aggregation: 63.0%), demonstrating robust generalization and data efficiency.
Real-World Manipulation
The model is tested on 28 real-world tasks across Franka Panda, WidowX 250 S, Agibot G-1, and Lerobot SO100, including pick-and-place, articulated manipulation, long-horizon dexterity, and embodied reasoning control.
Figure 5: Real-world evaluation tasks on diverse robots, illustrating long-horizon dexterity, pick-and-place, out-of-box, and embodied reasoning control.
EmbodiedOneVision consistently outperforms baselines, achieving 86.0% overall, 81.0% on long-horizon Agibot G-1 tasks, and 94.0% on Franka Panda pick-and-place. On reasoning-control tasks, it demonstrates superior integration of planning and execution, avoiding plan–act mismatches common in hierarchical pipelines.
Figure 6: Performance comparison across robot platforms and task categories.
Figure 7: Long-horizon dexterity completion rate comparison on Agibot G-1.
Open-World Generalization
Generalization is assessed along visual, action, and language axes. EmbodiedOneVision achieves 73.0% overall, outperforming GR00T-N1.5 (60.0%) and π0 (51.0%). The largest gains are in language robustness, confirming the efficacy of interleaved training.
Figure 8: Instruction-following in open-world settings and success rates over 18 tasks.
Figure 9: Generalization axes—visual, action, and language variations.
Unified Reasoning and Control
A reasoning-control benchmark evaluates the model's ability to integrate high-level reasoning with low-level control. EmbodiedOneVision outperforms hierarchical baselines on all tasks, with the largest margin in Tic-Tac-Toe (+40 points).
Figure 10: Qualitative rollouts for unified reasoning and control, showing perception, reasoning, and precise action execution.
Figure 11: Success rates on the reasoning-control benchmark, with EmbodiedOneVision outperforming hierarchical baselines.
Ablation and Scaling Analysis
Hybrid decoding (autoregression + flow matching) consistently outperforms pure autoregression, especially in action and language generalization. Scaling interleaved data further boosts generalization, while generic instruction-following data can degrade physical grounding. The unified architecture enables stable, high-performance outcomes under distribution shift.
Figure 12: WidowX generalization performance across data scales for base, interleaved, and fast models.
Implementation Considerations
- Computational Requirements: Training requires high memory bandwidth, efficient attention (Flash-Attention), and optimizer-state sharding (DeepSpeed ZeRO-1). Inference is feasible on consumer-grade GPUs (e.g., a single RTX 4090, using roughly 6 GB of VRAM).
- Data Curation: Interleaved data construction is critical; annotation pipelines must ensure diversity and causal alignment.
- Deployment: The unified model supports real-time control and reasoning, suitable for both simulation and real-world robots.
- Limitations: Generalization to navigation, human-robot interaction, and failure analysis remains challenging; future work should expand data modalities and asynchronous reasoning-action pipelines.
Implications and Future Directions
EmbodiedOneVision demonstrates that unified, interleaved multimodal pretraining substantially improves open-world generalization, reasoning, and dexterous control in robotics. The approach sets a precedent for integrating perception, planning, and action in a single model, with strong empirical results across benchmarks and real-world tasks. Future research should focus on scaling to more complex scenarios, enhancing simultaneous reasoning and action, and incorporating broader data sources (e.g., human demonstrations, navigation tasks).
Conclusion
EmbodiedOneVision establishes a robust framework for generalist robot policies, leveraging interleaved vision-text-action pretraining and a unified transformer architecture. The model achieves state-of-the-art performance in embodied reasoning and manipulation, with strong generalization across modalities and environments. The open-source release of model weights, code, and dataset provides a valuable resource for advancing embodied AI and autonomous robotics.