Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation (2508.05635v1)
Abstract: We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.
Summary
- The paper presents a unified platform that integrates video diffusion, action decoding, and simulation for robotic manipulation.
- It introduces GE-Base, a multi-view, instruction-conditioned video diffusion model, and GE-Act, a lightweight flow-matching action decoder, enabling accurate, temporally coherent task execution.
- Empirical evaluations demonstrate superior performance and cross-embodiment generalization compared to state-of-the-art baselines.
Introduction and Motivation
Genie Envisioner (GE) introduces a unified, video-generative world foundation platform for robotic manipulation, integrating policy learning, evaluation, and simulation within a single architecture. The platform addresses the fragmentation in current robotic learning pipelines, which typically separate data collection, policy learning, and evaluation, leading to inefficiencies and limited scalability. GE's core innovation is the consolidation of these stages into a closed-loop, vision-centric framework, leveraging large-scale, instruction-conditioned video diffusion models to encode the spatial, temporal, and semantic structure of real-world robotic interactions.
Figure 1: Overview of the Genie Envisioner World Foundation Platform, highlighting the integration of GE-Base, GE-Act, GE-Sim, and EWMBench into a unified system for robotic manipulation.
GE-Base: World Foundation Model
GE-Base is a large-scale, instruction-conditioned video diffusion transformer (DiT) model, trained on the AgiBot-World-Beta dataset comprising over one million real-world, multi-view, instruction-aligned robotic manipulation episodes. The model formulates robotic world modeling as a text-and-image-to-video generation problem, where, given a language instruction and initial visual observation, it autoregressively predicts future video segments reflecting plausible robotic behaviors.
Key architectural features, illustrated by a code sketch after Figure 2, include:
- Multi-view video generation: Simultaneous synthesis from head-mounted and dual wrist-mounted cameras, ensuring spatial consistency via cross-view self-attention.
- Sparse memory mechanism: Augments current visual input with long-term historical context, enabling extended temporal reasoning.
- Instruction grounding: Integrates T5-XXL-encoded language instructions via cross-attention, aligning video generation with task semantics.
Figure 2: GE-Base autoregressive video generation process with multi-view conditioning and cross-view causal blocks for spatial consistency.
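The listed mechanisms can be pictured as a standard DiT block whose self-attention runs jointly over the tokens of all camera views and whose cross-attention consumes T5 text embeddings. The PyTorch sketch below is an illustrative reconstruction under assumed layer names and dimensions, not the released GE-Base code; diffusion timestep conditioning is omitted, and sparse-memory frames would simply appear as extra tokens in the visual sequence.

```python
import torch
import torch.nn as nn


class MultiViewDiTBlock(nn.Module):
    """One transformer block: joint self-attention over all camera views,
    then cross-attention to T5 text embeddings, then an MLP.
    (Diffusion timestep conditioning, e.g. AdaLN, is omitted for brevity.)"""

    def __init__(self, dim: int = 512, heads: int = 8, text_dim: int = 4096):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, view_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # view_tokens: (B, views * frames * patches, dim). Flattening all views into one
        # sequence lets self-attention mix information across views (cross-view attention).
        h = self.norm1(view_tokens)
        x = view_tokens + self.self_attn(h, h, h, need_weights=False)[0]
        # Instruction grounding: cross-attend to the T5-encoded language instruction.
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_tokens, text_tokens, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))


if __name__ == "__main__":
    B, views, frames, patches, dim = 1, 3, 4, 16, 512
    video_tokens = torch.randn(B, views * frames * patches, dim)
    text_tokens = torch.randn(B, 20, 4096)  # T5-XXL hidden size
    print(MultiViewDiTBlock(dim)(video_tokens, text_tokens).shape)  # (1, 192, 512)
```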
The pretraining pipeline involves a two-stage process (a brief sketch follows Figure 3):
- Multi-resolution temporal adaptation: Exposes the model to variable frame rates and motion speeds, enhancing robustness to temporal variation.
- Low-frequency policy alignment: Fine-tunes the model at lower frame rates to match the temporal abstraction required for downstream action policy learning.
Figure 3: GE-Base training process, including domain adaptation and fine-tuning on large-scale, multi-view, instruction-aligned data.
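The two temporal stages amount to controlling which frame rates the model sees during training. The snippet below illustrates that idea only; the specific frame rates and the sampling strategy are assumptions, not the paper's hyperparameters.

```python
import random

import torch


def subsample_clip(frames: torch.Tensor, src_fps: int, tgt_fps: int) -> torch.Tensor:
    # frames: (T, C, H, W) recorded at src_fps; keep every (src_fps // tgt_fps)-th frame.
    stride = max(1, src_fps // tgt_fps)
    return frames[::stride]


def stage_fps(stage: str) -> int:
    if stage == "multi_resolution_adaptation":
        return random.choice([5, 10, 15, 30])  # assumed mix exposing varied motion speeds
    if stage == "low_frequency_alignment":
        return 5                                # assumed low rate matching policy control
    raise ValueError(f"unknown stage: {stage}")


if __name__ == "__main__":
    clip = torch.randn(60, 3, 224, 224)         # 2 s of toy video at 30 fps
    print(subsample_clip(clip, 30, stage_fps("low_frequency_alignment")).shape)
```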
GE-Base demonstrates strong generalization in generating temporally coherent, instruction-aligned multi-view videos across diverse manipulation tasks and embodiments.
Figure 4: Multi-view robotic manipulation videos generated by GE-Base, illustrating spatial and temporal consistency across tasks and environments.
GE-Act: World Action Model
GE-Act extends GE-Base with a lightweight, 160M-parameter autoregressive action decoder, mapping latent visual representations to temporally structured action policies. The architecture mirrors the DiT block depth of GE-Base but reduces hidden dimensions for efficiency. Visual latent features are integrated into the action pathway via cross-attention, and final action predictions are produced by an iterative flow-matching denoising procedure.
Figure 5: GE-Act architecture, showing the parallel action branch and cross-attention integration with visual latents.
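As a rough illustration of how a flow-matching action head can sit on top of visual latents, the sketch below initialises an action chunk from noise and integrates a learned velocity field with a few Euler steps, cross-attending to GE-Base latents at each step. The module names, the 14-dimensional action space, and the number of integration steps are assumptions for illustration; only the 54-step horizon comes from the paper.

```python
import torch
import torch.nn as nn


class FlowMatchingActionHead(nn.Module):
    def __init__(self, action_dim: int = 14, horizon: int = 54,
                 dim: int = 256, latent_dim: int = 512, heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(action_dim, dim)
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=latent_dim,
                                                vdim=latent_dim, batch_first=True)
        self.backbone = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.to_velocity = nn.Linear(dim, action_dim)
        self.horizon = horizon
        self.action_dim = action_dim

    def velocity(self, actions, t, visual_latents):
        # actions: (B, horizon, action_dim); visual_latents: (B, N, latent_dim)
        h = self.embed(actions) + self.time_mlp(
            t.view(-1, 1, 1).expand(-1, actions.size(1), 1))
        h = h + self.cross_attn(h, visual_latents, visual_latents, need_weights=False)[0]
        return self.to_velocity(self.backbone(h))

    @torch.no_grad()
    def sample(self, visual_latents, steps: int = 10):
        # Euler integration of the learned flow from noise (t=0) toward the action chunk (t=1).
        B = visual_latents.size(0)
        x = torch.randn(B, self.horizon, self.action_dim)
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((B,), i * dt)
            x = x + self.velocity(x, t, visual_latents) * dt
        return x  # (B, 54, action_dim) action trajectory


if __name__ == "__main__":
    head = FlowMatchingActionHead()
    latents = torch.randn(2, 128, 512)  # placeholder GE-Base latent features
    print(head.sample(latents).shape)   # torch.Size([2, 54, 14])
```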
The training pipeline consists of the following phases (a staged-freezing sketch follows Figure 6):
- Action-space pretraining: Optimizes the action decoder on the AgiBot-World-Beta dataset, with the visual backbone frozen.
- Task-specific adaptation: Two-stage fine-tuning—first adapting the video encoder to new visual domains, then fine-tuning the action head on task-specific control signals.
Figure 6: GE-Act training pipeline, illustrating pretraining and adaptation stages using text–video–policy triplets.
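A minimal sketch of the staged parameter freezing this implies is given below; the attribute names (`visual_backbone`, `action_head`), the stage labels, and the optimizer settings are assumptions, not the released training code.

```python
import torch
import torch.nn as nn


class ToyGEAct(nn.Module):
    def __init__(self):
        super().__init__()
        self.visual_backbone = nn.Linear(512, 512)  # stand-in for the GE-Base video model
        self.action_head = nn.Linear(512, 14)       # stand-in for the action decoder


def configure_stage(model: nn.Module, stage: str) -> torch.optim.Optimizer:
    # Stage 1 trains only the action decoder with the visual backbone frozen;
    # domain adaptation unfreezes the video encoder; task fine-tuning returns to the head.
    train_backbone = stage == "video_domain_adaptation"
    train_head = stage in ("action_pretraining", "task_finetuning")
    for p in model.visual_backbone.parameters():
        p.requires_grad = train_backbone
    for p in model.action_head.parameters():
        p.requires_grad = train_head
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)    # assumed learning rate


if __name__ == "__main__":
    model = ToyGEAct()
    for stage in ("action_pretraining", "video_domain_adaptation", "task_finetuning"):
        opt = configure_stage(model, stage)
        print(stage, sum(p.numel() for g in opt.param_groups for p in g["params"]))
```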
A notable inference optimization is the Slow-Fast Asynchronous Inference mode, which decouples the frequency and denoising complexity of video and action generation, enabling 54-step action-trajectory inference in 200 ms on commodity GPUs.
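The decoupling can be pictured as two loops running at different rates: a slow loop that refreshes the video-latent context, and a fast loop that re-samples short action chunks against the cached latents. The sketch below is a schematic of that pattern with placeholder models and made-up timing constants, not the released inference code.

```python
import time

import torch


def slow_fast_control_loop(world_model, action_head, get_observation, send_actions,
                           video_period_s: float = 1.0, control_period_s: float = 0.2,
                           duration_s: float = 5.0):
    """Slow branch: refresh visual latents every `video_period_s`.
    Fast branch: re-sample an action chunk from cached latents every `control_period_s`."""
    cached_latents = None
    last_video_update = float("-inf")
    start = time.monotonic()
    while time.monotonic() - start < duration_s:
        tick = time.monotonic()
        if cached_latents is None or tick - last_video_update >= video_period_s:
            cached_latents = world_model(get_observation())  # expensive video-latent rollout
            last_video_update = tick
        actions = action_head.sample(cached_latents)         # cheap: e.g. a 54-step chunk
        send_actions(actions)
        time.sleep(max(0.0, control_period_s - (time.monotonic() - tick)))


if __name__ == "__main__":
    # Toy stand-ins so the loop runs without real models or robot hardware.
    class _ToyActionHead:
        def sample(self, latents):
            return torch.randn(1, 54, 14)

    slow_fast_control_loop(world_model=lambda obs: torch.randn(1, 128, 512),
                           action_head=_ToyActionHead(),
                           get_observation=lambda: None,
                           send_actions=lambda a: None,
                           duration_s=1.0)
    print("ran slow-fast loop")
```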
Real-World and Cross-Embodiment Performance
GE-Act is evaluated on a suite of real-world dual-arm manipulation tasks (e.g., sandwich assembly, tea pouring, table cleaning, microwave operation, conveyor-based packing) using both step-wise and end-to-end success metrics. It consistently outperforms state-of-the-art VLA baselines (UniVLA, GR00T N1) in both precision and robustness.
Figure 7: Task-specific real-world manipulation performance comparison on AgiBot G1, showing GE-Act's superior step-wise success rate (SR) and end-to-end (E2E) metrics.
Qualitative results confirm reliable, contextually appropriate execution of complex tasks.
Figure 8: Real-world manipulation on AgiBot G1 via GE-Act, demonstrating instruction-conditioned policy execution.
For cross-embodiment generalization, GE-Act is adapted to novel platforms (Agilex Cobot Magic, Dual Franka) using only one hour of teleoperated demonstrations. In complex deformable object tasks (cloth/box folding), GE-Act outperforms all baselines, including π0, UniVLA, and GR00T N1, which fail to generalize to fine-grained manipulation.
Figure 9: Real-world demonstration of GE-Act on Agilex Cobot Magic, showcasing generalization, deformable object handling, and memory-based decision making.

Figure 10: Multi-view video generation on Agilex Cobot Magic by GE-Base for complex folding tasks.

Figure 11: Real-world demonstrations with GE-Act on Agilex Cobot Magic, including cloth- and box-folding.

Figure 12: Robotic video generation and real-world manipulation on Dual Franka via GE.
GE-Sim: Video-Based World Simulator
GE-Sim repurposes GE-Base as an action-conditioned video generator, enabling closed-loop policy evaluation and controllable data generation. The simulator fuses spatial pose conditions and temporal motion deltas into the video generation process, supporting high-fidelity, action-aligned video rollouts.
Figure 13: GE-Sim world simulator architecture, illustrating action-conditioned video generation and closed-loop simulation.
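In closed-loop use, the policy never needs a physical robot: it reads generated frames, emits an action chunk, and GE-Sim renders the visual consequence, which becomes the next observation. The sketch below is a schematic of that loop with placeholder policy and simulator callables, not the actual GE-Sim API.

```python
import torch


def closed_loop_rollout(policy, simulator, initial_frames, instruction, max_steps: int = 20):
    # policy(frames, instruction) -> action chunk; simulator(frames, actions) -> next frames.
    frames = initial_frames
    trajectory = []
    for _ in range(max_steps):
        action_chunk = policy(frames, instruction)   # (horizon, action_dim)
        frames = simulator(frames, action_chunk)     # action-conditioned video rollout
        trajectory.append((action_chunk, frames))
    return trajectory


if __name__ == "__main__":
    # Toy stand-ins: random "frames" and near-identity dynamics keep the loop runnable.
    policy = lambda frames, instr: torch.randn(54, 14)
    simulator = lambda frames, acts: frames + 0.01 * torch.randn_like(frames)
    rollout = closed_loop_rollout(policy, simulator,
                                  initial_frames=torch.randn(3, 8, 3, 64, 64),
                                  instruction="fold the cloth")
    print(len(rollout))  # 20
```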
Action-conditioned video generation demonstrates precise spatial alignment between control signals and predicted visual outcomes.
Figure 14: Action-conditioned video generation by GE-Sim, visualizing spatial alignment with intended control signals.
GE-Sim supports scalable, distributed simulation, enabling thousands of policy rollouts per hour, and serves as a data engine for generating diverse manipulation sequences under varied contexts.
EWMBench: Comprehensive Evaluation Suite
EWMBench is a domain-specific benchmark for video-based world models in robotic manipulation, measuring visual fidelity, physical consistency, and instruction-action alignment. The benchmark dataset comprises 10 representative tasks, each decomposed into atomic sub-actions with fine-grained annotations.
Evaluation metrics include:
- Scene consistency: Patch-level feature similarity using DINOv2.
- Action trajectory quality: Symmetric Hausdorff distance, normalized DTW, and dynamic consistency (Wasserstein distance on velocity/acceleration); a minimal sketch of the first two metrics appears after this list.
- Motion semantics: VLM-based global/stepwise alignment, logical correctness, and semantic diversity (CLIP-based).
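As a concrete reference for the trajectory metrics, the sketch below computes a symmetric Hausdorff distance and a length-normalized DTW between a generated and a reference end-effector trajectory; the exact normalization and preprocessing used by EWMBench may differ.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff


def symmetric_hausdorff(traj_a: np.ndarray, traj_b: np.ndarray) -> float:
    # traj_*: (T, D) end-effector positions; take the max of both directed distances.
    return max(directed_hausdorff(traj_a, traj_b)[0],
               directed_hausdorff(traj_b, traj_a)[0])


def normalized_dtw(traj_a: np.ndarray, traj_b: np.ndarray) -> float:
    # Classic O(T_a * T_b) dynamic-programming DTW, normalized by total path length.
    Ta, Tb = len(traj_a), len(traj_b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(traj_a[i - 1] - traj_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[Ta, Tb] / (Ta + Tb)


if __name__ == "__main__":
    t = np.linspace(0, 1, 50)[:, None]
    generated = np.hstack([t, np.sin(2 * np.pi * t)])         # toy generated trajectory
    reference = np.hstack([t, np.sin(2 * np.pi * t) + 0.05])  # toy ground-truth trajectory
    print(symmetric_hausdorff(generated, reference))
    print(normalized_dtw(generated, reference))
```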
Figure 15: Comprehensive evaluation of video world models for robotic manipulation using EWMBench, comparing GE-Base to state-of-the-art baselines.
GE-Base achieves superior temporal alignment and dynamic consistency, outperforming general video generation models (Kling, Hailuo, COSMOS, LTX-Video, OpenSora) in control-aware generation fidelity. GE-Sim demonstrates high spatial, temporal, and semantic alignment in action-conditioned settings, with low diversity under fixed actions indicating precise control.
Metric-human consistency analysis shows EWMBench rankings align closely with human judgments, unlike generic video benchmarks.
Figure 16: Consistency and validity analysis of evaluation metrics, comparing EWMBench and VBench against human preference.
Limitations
- Data diversity: Pretraining is limited to a single real-world dataset (AgiBot-World-Beta), restricting embodiment and scene diversity.
- Embodiment scope: Current focus is on upper-body, parallel-jaw gripper manipulation; dexterous hands and full-body behaviors are not addressed.
- Evaluation methodology: EWMBench, while comprehensive, still relies on proxy metrics and partial human validation; fully automated, robust task success assessment remains an open challenge.
Conclusion
Genie Envisioner establishes a unified, scalable foundation for instruction-driven, general-purpose embodied intelligence in robotic manipulation. By integrating high-fidelity video world modeling (GE-Base), efficient action policy inference (GE-Act), closed-loop simulation (GE-Sim), and comprehensive evaluation (EWMBench), the platform demonstrates strong real-world performance, cross-embodiment generalization, and robust policy evaluation. The release of code, models, and benchmarks is expected to accelerate research in scalable, vision-centric embodied AI. Future work should address data and embodiment diversity, and further refine evaluation protocols for broader applicability and reliability.
Follow-up Questions
- How does the use of video diffusion models enhance robotic manipulation in Genie Envisioner?
- What architectural benefits does GE-Act’s lightweight action decoder offer over traditional methods?
- In what ways does Genie Envisioner ensure spatial and temporal consistency in multi-view video generation?
- How might the platform be adapted to include full-body robotic manipulation beyond upper-body tasks?
Related Papers
- Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation (2023)
- Transferring Foundation Models for Generalizable Robotic Manipulation (2023)
- Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation (2024)
- EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation (2025)
- Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation (2025)
- GR00T N1: An Open Foundation Model for Generalist Humanoid Robots (2025)
- UniVLA: Learning to Act Anywhere with Task-centric Latent Actions (2025)
- EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models (2025)
- DreamGen: Unlocking Generalization in Robot Learning through Video World Models (2025)
- MolmoAct: Action Reasoning Models that can Reason in Space (2025)