HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels (2507.21809v1)
Abstract: Creating immersive and playable 3D worlds from texts or images remains a fundamental challenge in computer vision and graphics. Existing world generation approaches typically fall into two categories: video-based methods that offer rich diversity but lack 3D consistency and rendering efficiency, and 3D-based methods that provide geometric consistency but struggle with limited training data and memory-inefficient representations. To address these limitations, we present HunyuanWorld 1.0, a novel framework that combines the best of both worlds for generating immersive, explorable, and interactive 3D scenes from text and image conditions. Our approach features three key advantages: 1) 360° immersive experiences via panoramic world proxies; 2) mesh export capabilities for seamless compatibility with existing computer graphics pipelines; 3) disentangled object representations for augmented interactivity. The core of our framework is a semantically layered 3D mesh representation that leverages panoramic images as 360° world proxies for semantic-aware world decomposition and reconstruction, enabling the generation of diverse 3D worlds. Extensive experiments demonstrate that our method achieves state-of-the-art performance in generating coherent, explorable, and interactive 3D worlds while enabling versatile applications in virtual reality, physical simulation, game development, and interactive content creation.
Summary
- The paper introduces a hybrid pipeline that generates 360° panoramic proxies from text or image input, combining the diversity of 2D generative models with 3D geometric consistency.
- The methodology combines semantic layering, depth estimation, and mesh reconstruction to produce high-fidelity, explorable, and interactive 3D environments.
- Key results demonstrate state-of-the-art performance in image-to-panorama and text-to-world generation, supporting VR, simulation, and game development applications.
HunyuanWorld 1.0: A Framework for Immersive, Explorable, and Interactive 3D World Generation from Text or Images
Introduction and Motivation
HunyuanWorld 1.0 addresses the challenge of generating high-fidelity, explorable, and interactive 3D worlds from either textual descriptions or single images. The framework is motivated by the limitations of existing approaches: video-based methods offer visual diversity but lack true 3D consistency and are computationally expensive, while 3D-based methods provide geometric consistency but are constrained by limited data and inefficient scene representations. HunyuanWorld 1.0 proposes a hybrid solution, leveraging panoramic images as 360° world proxies and introducing a semantically layered 3D mesh representation to enable structured, interactive, and exportable 3D worlds.
Figure 1: HunyuanWorld 1.0 generates immersive, explorable, and interactive 3D worlds from text or image inputs, supporting mesh export and object-level interactivity.
System Architecture and Pipeline
The core architecture of HunyuanWorld 1.0 is a staged generative pipeline that integrates 2D generative models with 3D reconstruction. The process begins with the generation of a panoramic image (serving as a world proxy) conditioned on either text or image input. This panorama is then decomposed into semantic layers (sky, background, and object layers) using a combination of vision-language models (VLMs) and segmentation techniques. Each layer undergoes depth estimation and alignment, followed by mesh-based 3D reconstruction. The resulting layered mesh representation supports object disentanglement, enabling downstream interactivity and compatibility with standard graphics pipelines.
Figure 2: The HunyuanWorld 1.0 architecture: from input (text/image) to panoramic proxy, semantic layering, depth alignment, and mesh-based 3D world reconstruction.
Panoramic Proxy Generation
- Text-to-Panorama: Utilizes a diffusion transformer (Panorama-DiT) conditioned on LLM-enhanced prompts to generate high-fidelity 360° panoramas.
- Image-to-Panorama: Projects the input image into panoramic space using estimated camera intrinsics, then completes the panorama via diffusion, guided by scene-aware prompts to avoid object duplication (a minimal projection sketch follows this list).
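To make the projection step concrete, here is a minimal NumPy sketch of placing a pinhole image into an equirectangular canvas given estimated intrinsics. The y-up/z-forward camera convention, the nearest-pixel (floor) sampling, and the function name are illustrative assumptions, not the paper's implementation; the unfilled region is what the diffusion model would then complete.

```python
import numpy as np

def pinhole_to_equirect(img: np.ndarray, K: np.ndarray,
                        pano_h: int = 1024, pano_w: int = 2048):
    """Project a pinhole image into an equirectangular canvas (sketch).

    Assumes the camera sits at the panorama center looking down +z with
    y up; `K` is a 3x3 intrinsic matrix (e.g. from an intrinsics
    estimator). Returns the partial panorama and a coverage mask.
    """
    h, w = img.shape[:2]
    # Longitude/latitude for every panorama pixel.
    lon = (np.arange(pano_w) / pano_w - 0.5) * 2 * np.pi   # [-pi, pi)
    lat = (0.5 - np.arange(pano_h) / pano_h) * np.pi       # top = +pi/2
    lon, lat = np.meshgrid(lon, lat)
    # Unit ray directions on the sphere.
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    # Keep rays in front of the camera, then project with K
    # (image v grows downward, hence the -y).
    front = z > 1e-6
    u = K[0, 0] * (x / np.clip(z, 1e-6, None)) + K[0, 2]
    v = K[1, 1] * (-y / np.clip(z, 1e-6, None)) + K[1, 2]
    valid = front & (u >= 0) & (u < w - 1) & (v >= 0) & (v < h - 1)
    pano = np.zeros((pano_h, pano_w, 3), dtype=img.dtype)
    ui, vi = u[valid].astype(int), v[valid].astype(int)  # floor sampling
    pano[valid] = img[vi, ui]
    return pano, valid
```

The returned mask marks the pixels covered by the input view; everything outside it is left for the panorama completion model to synthesize.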
Semantic Layering and Object Decomposition
- Instance Recognition: Employs VLMs for semantic object identification, followed by object detection (Grounding DINO) and segmentation (ZIM) with circular padding to handle equirectangular discontinuities (see the padding sketch after this list).
- Layer Decomposition: Iteratively removes objects and inpaints occluded regions to separate background and sky layers, using a fine-tuned diffusion model for layer completion.
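The circular padding trick can be illustrated with a short sketch: a panorama is periodic in longitude, so wrapping a margin from each side around before running segmentation lets instances that straddle the image seam be recovered as single masks. The `segment_fn` callable and the padding fraction below are placeholders, not the paper's exact settings.

```python
import numpy as np

def run_with_circular_padding(pano: np.ndarray, segment_fn,
                              pad_frac: float = 0.25) -> np.ndarray:
    """Run a HxWx3 -> HxW mask predictor on an equirectangular panorama
    with horizontal wrap-around padding, then fold predictions back."""
    h, w = pano.shape[:2]
    pad = int(w * pad_frac)
    # Wrap left/right borders around (the panorama is periodic in x).
    padded = np.concatenate([pano[:, -pad:], pano, pano[:, :pad]], axis=1)
    mask_padded = segment_fn(padded)
    # Central region maps back directly; padded margins wrap around.
    mask = mask_padded[:, pad:pad + w].astype(bool)
    mask[:, :pad] |= mask_padded[:, pad + w:].astype(bool)  # right pad -> left edge
    mask[:, -pad:] |= mask_padded[:, :pad].astype(bool)     # left pad -> right edge
    return mask
```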
Layer-wise 3D Reconstruction
- Depth Estimation and Alignment: Predicts depth for each layer and aligns the predictions across layers to ensure geometric consistency.
- Mesh Generation: Warps each layer into 3D using grid mesh representations (WorldSheet), with special handling for polar regions and mesh boundaries; a minimal sketch of both steps follows this list. Foreground objects can be reconstructed via direct projection or by leveraging image-to-3D models (e.g., Hunyuan3D).
- Sky and Background: The sky is rendered as a distant mesh or HDRI map; background layers undergo adaptive depth compression before mesh conversion.
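A minimal sketch of the two core steps above, under simplifying assumptions: a least-squares scale/shift fit of one layer's depth to a reference (one common way to enforce cross-layer consistency; the paper's exact scheme may differ), and the lifting of an equirectangular depth map into a WorldSheet-style grid mesh. The special polar-region and boundary handling mentioned above is omitted.

```python
import numpy as np

def align_scale_shift(d_layer, d_ref, mask):
    """Fit scale s and shift t minimizing ||s*d_layer + t - d_ref|| over
    pixels where both depths are observed, then apply them."""
    a, b = d_layer[mask].ravel(), d_ref[mask].ravel()
    A = np.stack([a, np.ones_like(a)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, b, rcond=None)
    return s * d_layer + t

def equirect_depth_to_mesh(depth):
    """Lift an equirectangular depth map to a spherical grid mesh:
    per-pixel rays scaled by depth become vertices, grid cells become
    triangle pairs, and the longitudinal seam is closed by wrapping."""
    h, w = depth.shape
    lon = (np.arange(w) / w - 0.5) * 2 * np.pi
    lat = (0.5 - np.arange(h) / h) * np.pi
    lon, lat = np.meshgrid(lon, lat)
    verts = np.stack([np.cos(lat) * np.sin(lon),
                      np.sin(lat),
                      np.cos(lat) * np.cos(lon)], axis=-1) * depth[..., None]
    verts = verts.reshape(-1, 3)
    # Two triangles per grid cell; repeat the first column so the mesh
    # closes at the longitudinal seam.
    idx = np.arange(h * w).reshape(h, w)
    idx = np.concatenate([idx, idx[:, :1]], axis=1)
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate([np.stack([tl, bl, tr], 1),
                            np.stack([tr, bl, br], 1)], axis=0)
    return verts, faces
```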
Long-Range World Extension
To enable exploration beyond the initial proxy, HunyuanWorld 1.0 incorporates a video-based view completion model (Voyager) that synthesizes spatially coherent RGB-D videos along arbitrary camera trajectories. A world caching mechanism maintains and updates a 3D point cloud cache, ensuring spatial consistency and supporting long-range navigation.
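Conceptually, the world cache can be pictured as an incrementally fused, voxel-deduplicated point cloud that conditions subsequent view completion; the sketch below is a simplified stand-in for whatever structure Voyager actually maintains, and the class and method names are hypothetical.

```python
import numpy as np

class WorldCache:
    """Minimal point-cloud world cache: each synthesized RGB-D frame is
    unprojected with its camera pose and fused into a global cloud,
    keeping one point per voxel (newest wins)."""

    def __init__(self, voxel_size: float = 0.05):
        self.voxel_size = voxel_size
        self.points = {}  # voxel index -> (xyz, rgb)

    def integrate(self, rgb, depth, K, cam_to_world):
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        z = depth.ravel()
        valid = z > 0
        # Unproject pixels to camera space, then transform to world space.
        x = (u.ravel() - K[0, 2]) / K[0, 0] * z
        y = (v.ravel() - K[1, 2]) / K[1, 1] * z
        pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]
        pts = (cam_to_world @ pts_cam.T).T[:, :3]
        cols = rgb.reshape(-1, 3)[valid]
        # Voxel hashing deduplicates overlapping observations.
        keys = np.floor(pts / self.voxel_size).astype(np.int64)
        for k, p, c in zip(map(tuple, keys), pts, cols):
            self.points[k] = (p, c)

    def as_arrays(self):
        xyz = np.array([p for p, _ in self.points.values()])
        rgb = np.array([c for _, c in self.points.values()])
        return xyz, rgb
```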
Data Curation and Training
A dedicated panoramic data curation pipeline sources images from commercial, open, and synthetic (Unreal Engine) datasets. Quality is enforced via automated and manual filtering. Captioning leverages a three-stage pipeline: re-captioning for detail, LLM-based distillation for hierarchical prompts, and human verification for semantic fidelity. For image-to-panorama, scene-aware prompts are generated to avoid object duplication and encourage holistic scene synthesis.
Figure 3: The panoramic data curation pipeline ensures high-quality, richly annotated training data for both text and image conditions.
System Optimization
- Mesh Storage: Dual compression strategies are employed: mesh decimation with XAtlas-based UV parameterization for offline use, and Draco compression for web deployment, achieving up to 90% size reduction (see the decimation sketch after this list).
- Inference Acceleration: TensorRT-based optimization, intelligent caching, and multi-GPU parallelization enable efficient model inference, supporting real-time or near-real-time 3D world generation.
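For the decimation half of the storage strategy, a sketch using Open3D's quadric decimation is shown below. This is an illustrative substitute, not the paper's tooling: XAtlas UV unwrapping and Draco bitstream encoding are part of the described pipeline but are not reproduced here, and the size-reduction figure refers to the combined strategy rather than this step alone.

```python
import open3d as o3d

def compress_for_deployment(mesh_path: str, out_path: str,
                            keep_ratio: float = 0.1) -> None:
    """Decimate a generated world mesh for lightweight deployment,
    keeping roughly `keep_ratio` of the original triangles."""
    mesh = o3d.io.read_triangle_mesh(mesh_path)
    target = max(1, int(len(mesh.triangles) * keep_ratio))
    simplified = mesh.simplify_quadric_decimation(
        target_number_of_triangles=target)
    simplified.remove_unreferenced_vertices()
    o3d.io.write_triangle_mesh(out_path, simplified)
```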
Experimental Results
Panorama Generation
HunyuanWorld 1.0 demonstrates superior performance in both image-to-panorama and text-to-panorama tasks, outperforming Diffusion360, MVDiffusion, PanFusion, and LayerPano3D in BRISQUE, NIQE, Q-Align, and CLIP-based alignment metrics. Qualitative results show improved visual coherence, reduced geometric distortion, and better semantic alignment.
Figure 4: Image-to-panorama generation results by HunyuanWorld 1.0.

Figure 5: Text-to-panorama generation results by HunyuanWorld 1.0.

Figure 6: Qualitative comparison for image-to-panorama generation (World Labs).

Figure 7: Qualitative comparison for image-to-panorama generation (Tanks and Temples).

Figure 8: Qualitative comparison for text-to-panorama generation (case 1).

Figure 9: Qualitative comparison for text-to-panorama generation (case 2).
3D World Generation
For both image-to-world and text-to-world tasks, HunyuanWorld 1.0 achieves state-of-the-art results, surpassing WonderJourney, DimensionX, LayerPano3D, and Director3D in all quantitative metrics. The generated 3D worlds exhibit high visual fidelity, geometric consistency, and strong semantic alignment with the input.
Figure 10: Visual results of text-to-world generation by HunyuanWorld 1.0.

Figure 11: Visual results of image-to-world generation by HunyuanWorld 1.0.

Figure 12: Qualitative comparisons for image-to-world generation.

Figure 13: Qualitative comparisons for text-to-world generation.
Applications and Implications
Figure 14: HunyuanWorld 1.0 enables applications in VR, physical simulation, game development, and interactive object manipulation.
- Virtual Reality: 360° panoramic proxies and mesh-based worlds enable seamless VR experiences with full spatial coverage.
- Physical Simulation: Exportable meshes and disentangled objects facilitate integration with physics engines for simulation tasks.
- Game Development: Diverse, style-rich 3D worlds can be directly imported into Unity, Unreal Engine, and other industry-standard platforms.
- Object Interaction: Instance-level object modeling supports precise manipulation and interaction within generated scenes.
Theoretical and Practical Implications
HunyuanWorld 1.0 demonstrates that panoramic proxies, when combined with semantic layering and mesh-based reconstruction, can bridge the gap between 2D generative diversity and 3D geometric consistency. The disentanglement of objects at the mesh level is a notable advancement, enabling interactive applications and compatibility with existing graphics pipelines. The integration of video-based world extension further addresses the challenge of long-range exploration, a limitation in prior 3D world generation frameworks.
Future Directions
Potential future developments include:
- Scaling to larger, more diverse datasets for improved generalization.
- Enhancing the fidelity of object-level reconstruction, especially for complex or occluded objects.
- Integrating real-time user feedback for interactive world editing.
- Extending the framework to support dynamic scenes and physics-based interactions.
- Exploring multi-modal conditioning (e.g., audio, haptics) for richer immersive experiences.
Conclusion
HunyuanWorld 1.0 presents a comprehensive framework for generating immersive, explorable, and interactive 3D worlds from text or image inputs. By leveraging panoramic proxies, semantic layering, and mesh-based reconstruction, it achieves state-of-the-art performance in both panorama and 3D world generation tasks. The system's support for mesh export and object disentanglement enables a wide range of applications in VR, simulation, and game development. The approach sets a strong baseline for future research in world-level 3D content creation and interactive AI-driven environments.
Follow-up Questions
- How does HunyuanWorld 1.0 improve 3D geometric consistency compared to previous approaches?
- What advantages do panoramic proxies offer in bridging 2D generative diversity with 3D reconstruction?
- In what ways does the semantic layering process enhance object disentanglement and interactive capabilities?
- What potential challenges could arise when scaling this framework for dynamic scenes or real-time applications?
Related Papers
- DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting (2024)
- WonderWorld: Interactive 3D Scene Generation from a Single Image (2024)
- HoloDreamer: Holistic 3D Panoramic World Generation from Text Descriptions (2024)
- LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation (2024)
- SynCity: Training-Free Generation of 3D Worlds (2025)
- A Recipe for Generating 3D Worlds From a Single Image (2025)
- Scene4U: Hierarchical Layered 3D Scene Reconstruction from Single Panoramic Image for Your Immerse Exploration (2025)
- EmbodiedGen: Towards a Generative 3D World Engine for Embodied Intelligence (2025)
- ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies (2025)
- Yume: An Interactive World Generation Model (2025)