EmbodiedGen: Towards a Generative 3D World Engine for Embodied Intelligence (2506.10600v2)

Published 12 Jun 2025 in cs.RO and cs.CV

Abstract: Constructing a physically realistic and accurately scaled simulated 3D world is crucial for the training and evaluation of embodied intelligence tasks. The diversity, realism, low cost accessibility and affordability of 3D data assets are critical for achieving generalization and scalability in embodied AI. However, most current embodied intelligence tasks still rely heavily on traditional 3D computer graphics assets manually created and annotated, which suffer from high production costs and limited realism. These limitations significantly hinder the scalability of data driven approaches. We present EmbodiedGen, a foundational platform for interactive 3D world generation. It enables the scalable generation of high-quality, controllable and photorealistic 3D assets with accurate physical properties and real-world scale in the Unified Robotics Description Format (URDF) at low cost. These assets can be directly imported into various physics simulation engines for fine-grained physical control, supporting downstream tasks in training and evaluation. EmbodiedGen is an easy-to-use, full-featured toolkit composed of six key modules: Image-to-3D, Text-to-3D, Texture Generation, Articulated Object Generation, Scene Generation and Layout Generation. EmbodiedGen generates diverse and interactive 3D worlds composed of generative 3D assets, leveraging generative AI to address the challenges of generalization and evaluation to the needs of embodied intelligence related research. Code is available at https://horizonrobotics.github.io/robot_lab/embodied_gen/index.html.

Summary

  • The paper presents a generative platform that produces scalable, photorealistic 3D assets with accurate physical properties for embodied AI research.
  • It details innovative pipelines including image-to-3D, text-to-3D, and articulated object generation that ensure interactive and simulator-ready environments.
  • The framework’s open-source ecosystem fosters community-driven progress in digital twinning, data augmentation, and advanced simulation techniques.

EmbodiedGen: Generative Platform for 3D World Simulation

The paper presents EmbodiedGen, an advanced generative platform for constructing interactive 3D worlds designed to fulfill the needs of embodied intelligence research. This initiative seeks to address the prevailing limitations in scalability and realism of traditional 3D assets and proposes a novel solution using generative models to produce diverse and accurate 3D assets. The focus is on enabling large-scale, low-cost generation of photorealistic 3D objects with verifiable physical properties, applicable directly within various simulation environments.

Core Contributions

EmbodiedGen introduces a comprehensive toolkit providing capabilities central to 3D world generation:

  1. Toolkit for Interactive 3D World Generation: EmbodiedGen is positioned as the foundational toolkit for creating virtual environments tailored for embodied AI research. It supports the generation of diverse and interactive 3D assets and scenes, enhancing applications such as digital twinning, data augmentation, and embodied intelligence simulations.
  2. Simulator-Ready, Physically Accurate Assets: The framework combines high visual fidelity with physical realism, producing watertight, true-to-scale assets with dual representations in both 3D Gaussian Splatting (3DGS) and mesh formats. This dual representation supports reliable simulation and robust downstream tasks.
  3. Accessibility and Open-Source Ecosystem: EmbodiedGen is released as an open-source resource, bolstering community engagement and encouraging innovations in the field of embodied intelligence through flexible and scalable pipelines.
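The paper highlights that assets are exported in URDF with real-world scale and physical properties. As a rough illustration of what such a single-object description looks like, the sketch below builds a minimal one-link URDF file; the tag layout is standard URDF, but the mesh filenames, mass, and inertia values are made-up placeholders (EmbodiedGen estimates such quantities automatically).

```python
import xml.etree.ElementTree as ET

def make_urdf(name, mesh_visual, mesh_collision, mass_kg, scale=1.0):
    """Build a minimal single-link URDF string for a rigid asset.

    Paths and physical values here are placeholders; any URDF-consuming
    simulator expects this same tag structure (link -> inertial /
    visual / collision).
    """
    robot = ET.Element("robot", name=name)
    link = ET.SubElement(robot, "link", name="base_link")

    inertial = ET.SubElement(link, "inertial")
    ET.SubElement(inertial, "mass", value=str(mass_kg))
    # Placeholder diagonal inertia; a real pipeline derives it from geometry.
    ET.SubElement(inertial, "inertia", ixx="1e-3", ixy="0", ixz="0",
                  iyy="1e-3", iyz="0", izz="1e-3")

    s = f"{scale} {scale} {scale}"
    visual = ET.SubElement(link, "visual")
    ET.SubElement(ET.SubElement(visual, "geometry"),
                  "mesh", filename=mesh_visual, scale=s)

    collision = ET.SubElement(link, "collision")
    ET.SubElement(ET.SubElement(collision, "geometry"),
                  "mesh", filename=mesh_collision, scale=s)

    return ET.tostring(robot, encoding="unicode")

urdf = make_urdf("mug", "mug_visual.obj", "mug_collision.obj", mass_kg=0.35)
```

Separate visual and collision meshes matter in practice: rendering wants the high-fidelity mesh, while physics engines want a simplified, watertight one.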

Methodology

The EmbodiedGen framework is organized into key modules:

  • Image-to-3D: Utilizes large models, including Trellis, to convert single images into detailed 3D objects. It incorporates automated quality inspection to ensure asset fidelity while optimizing texture resolution.
  • Text-to-3D: Operates through a two-stage process, first generating high-quality images via text prompts and then converting these into 3D models using established image-to-3D services. The modular nature of this process enables early error-checking and scalability in asset generation.
  • Articulated Object Generation: Designed for the creation of mechanically complex assets, including articulated models from dual-state image inputs. This is particularly beneficial for simulation environments requiring interaction with complex entities.
  • Texture Generation: Applies the GeoLifter module, which uses geometry-aware conditioning of diffusion models to produce view-consistent textures, enhancing the visual quality of 3D meshes.
  • Scene Generation: Generates panoramic views from text or image inputs, employing tools such as Pano2Room, and restores real-world scale to ensure practical applicability in simulations.
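Taken together, these modules form a staged pipeline. The sketch below is a hypothetical outline of that flow, not the paper's actual API: `generate_image`, `passes_quality_check`, and `image_to_3d` are stand-ins for the text-to-image stage, the automated inspection gate, and an image-to-3D backbone such as Trellis, and the final step illustrates real-world scale restoration by rescaling vertices to a known target height.

```python
from dataclasses import dataclass

@dataclass
class Mesh:
    vertices: list  # [(x, y, z), ...] in arbitrary model units

def generate_image(prompt: str) -> str:
    # Stand-in for a text-to-image model call (stage 1 of Text-to-3D).
    return f"image_for::{prompt}"

def passes_quality_check(image: str) -> bool:
    # Stand-in for the automated quality inspection; here we only
    # verify that an image was produced at all.
    return image.startswith("image_for::")

def image_to_3d(image: str) -> Mesh:
    # Stand-in for an image-to-3D backbone; returns a placeholder
    # mesh that is 1 model-unit tall.
    return Mesh(vertices=[(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)])

def restore_real_scale(mesh: Mesh, target_height_m: float) -> Mesh:
    # Uniformly rescale so the vertical extent matches the estimated
    # real-world height, as required for true-to-scale simulation assets.
    zs = [v[2] for v in mesh.vertices]
    s = target_height_m / (max(zs) - min(zs))
    return Mesh(vertices=[(x * s, y * s, z * s) for x, y, z in mesh.vertices])

def text_to_3d(prompt: str, target_height_m: float) -> Mesh:
    image = generate_image(prompt)       # stage 1: text -> image
    if not passes_quality_check(image):  # early error-checking
        raise ValueError("generated image rejected by quality gate")
    mesh = image_to_3d(image)            # stage 2: image -> 3D
    return restore_real_scale(mesh, target_height_m)

mug = text_to_3d("a ceramic coffee mug", target_height_m=0.10)
```

Placing the quality gate between the two stages is what makes the modular design attractive: a bad image is rejected before the comparatively expensive 3D reconstruction runs.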

Numerical Results and Claims

Despite the detailed modularity and extensive functional coverage, the paper does not report quantitative benchmarks. The emphasis is on demonstrating improvements in visual fidelity and physical accuracy qualitatively; claims regarding scalability, diversity, and physical realism are supported by architectural and procedural design rather than numerical results.

Implications and Future Perspectives

The implications of the EmbodiedGen system are multifaceted, aiming to reshape experimental paradigms in embodied intelligence by reducing barriers posed by traditional asset generation methods. Its capabilities for digital twinning and data augmentation have immediate practical applications in robotics, simulation-based training, and autonomous systems.

Looking ahead, as generative models become even more refined, EmbodiedGen is positioned to harness these developments, potentially integrating emerging techniques like advanced diffusion models to further enhance the realism and efficiency of 3D asset creation. This ongoing evolution signals significant advancements in embodied AI applications, paving the way for more sophisticated simulations and robust machine interactions within virtual environments.
