HY-World 2.0: Multi-Modal 3D World Model

Updated 4 July 2026
  • HY-World 2.0 is a unified multi-modal framework that reconstructs and generates 3D worlds from texts, images, and videos using a four-stage offline pipeline.
  • The system integrates panorama synthesis, heuristic trajectory planning, keyframe-based view generation, and feed-forward 3D prediction to produce high-fidelity 3D representations.
  • It leverages diffusion models, transformer-based tokenization, and 3D Gaussian Splatting to achieve state-of-the-art performance in 3D reconstruction and interactive rendering.

HY-World 2.0 is a multi-modal world model framework for reconstructing, generating, and simulating 3D worlds from text prompts, single-view images, multi-view images, and videos. It extends HY-World 1.0 by unifying sparse-input generation with dense-input reconstruction in a four-stage offline pipeline that produces 3D Gaussian Splatting (3DGS) scenes, extracted meshes, and an interactive rendering environment through WorldLens (HY-World et al., 15 Apr 2026). The system is organized around panorama synthesis, trajectory planning, keyframe-based view generation, and feed-forward 3D prediction, with the stated goal of supporting both imaginative world generation and accurate multi-view reconstruction.

1. System definition and architectural scope

At the architectural level, HY-World 2.0 is described as a four-stage offline 3D world model. For text or single-view image inputs, it performs world generation; for multi-view images or videos, it performs world reconstruction. The end-to-end data flow is centered on four named components—HY-Pano 2.0, WorldNav, WorldStereo 2.0, and WorldMirror 2.0—followed by interactive rendering in WorldLens (HY-World et al., 15 Apr 2026).

Stage | Component              | Output
------|------------------------|------------------
I     | HY-Pano 2.0            | 360° panorama
II    | WorldNav               | Trajectories
III   | WorldStereo 2.0        | Synthesized views
IV    | WorldMirror 2.0 + 3DGS | 3DGS + mesh

Stage I turns text or a single perspective image into a high-fidelity equirectangular panorama. Stage II parses that panorama into a point cloud, mesh, and NavMesh, then plans informative camera trajectories. Stage III generates consistent novel keyframes along each trajectory using a camera-conditioned diffusion transformer with spatial-stereo and geometric memories. Stage IV reconstructs per-view depth and normals, aligns and fuses them into an expanded point cloud, then optimizes a 3D Gaussian Splatting representation and extracts a navigable mesh.

This organization makes HY-World 2.0 notable less as a single monolithic model than as a coordinated world-model stack. A plausible implication is that its contribution lies as much in system integration—generation, planning, 3D prediction, and rendering—as in any individual submodule.

2. Multi-modal inputs and any-modal tokenization

HY-World 2.0 ingests four modalities: text prompts, single-view images, multi-view images, and videos. These inputs are first tokenized into a unified latent feature space. Text is embedded by a frozen CLIP text encoder. Single images or video keyframes are encoded by a Multi-Modal Diffusion Transformer (MMDiT) encoder or by the Keyframe-VAE encoder. Camera poses and intrinsics are projected into 7-D pose tokens, and monocular depth is optionally added as a channel or a token (HY-World et al., 15 Apr 2026).

Within the transformer backbones—MMDiT, Video-DiT, and WorldMirror—heterogeneous tokens are concatenated and fused via cross-attention:

$$\mathrm{Attn}(Q,K,V)=\mathrm{Softmax}\!\bigl(QK^\top/\sqrt{d}\bigr)\,V.$$

The paper characterizes this as “any-modal tokenization,” with the explicit effect that the network can adapt dynamically to whichever subset of modalities is present at inference time.
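
To make the attention formula and the idea of adapting to whichever modalities are present concrete, here is a minimal sketch in plain Python. It concatenates the tokens of the available modalities and applies scaled dot-product attention. The function names, the dictionary layout, and the toy 2-D tokens are assumptions for illustration only; the paper's backbones are far larger and are not reproduced here.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def fuse(queries, modality_tokens):
    """Concatenate the tokens of the modalities that are present, then attend."""
    context = [tok for toks in modality_tokens.values() if toks for tok in toks]
    return attention(queries, context, context)

# Hypothetical toy tokens; a missing modality is simply None.
tokens = {"text": [[1.0, 0.0]], "image": [[0.0, 1.0]], "pose": None}
fused = fuse([[1.0, 1.0]], tokens)
text_only = fuse([[1.0, 1.0]], {"text": [[1.0, 0.0]], "image": None})
```

Because absent modalities contribute no tokens, the same attention call serves text-only, image-only, or mixed conditioning without a separate code path.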

The significance of this design is methodological. Rather than maintaining separate pipelines for text-conditioned generation, image-conditioned generation, and multi-view reconstruction, HY-World 2.0 places these cases into a common token interface. This suggests an attempt to treat modality variation primarily as a conditioning problem inside transformer backbones rather than as an architectural bifurcation.

3. Panorama synthesis and heuristic trajectory planning

HY-Pano 2.0 is the system’s panorama-generation module. It is a latent-diffusion transformer that synthesizes a $360^\circ \times 180^\circ$ equirectangular panorama conditioned either on a CLIP-embedded text prompt or on one or more perspective image latents. Internally, it uses a VAE to map images to and from latents and a U-Net-style DiT backbone. During training it minimizes the standard denoising diffusion loss

$$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{t,x_0,\epsilon}\bigl\|\epsilon-\epsilon_\theta(x_t,t,\mathrm{cond})\bigr\|^2,$$

with

$$x_t=\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\epsilon.$$
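
The two equations above can be sketched as a single training step. The following is a minimal, standard-library illustration rather than the HY-Pano 2.0 training code: the linear beta schedule, the dummy `predict_noise` callable standing in for $\epsilon_\theta$, and the omission of the conditioning input are all assumptions made to keep the example short.

```python
import math
import random

def q_sample(x0, alpha_bar_t, eps):
    """Forward noising: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def diffusion_loss(eps, eps_pred):
    """Squared L2 error between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

def training_step(x0, alpha_bars, predict_noise, rng):
    """One Monte Carlo sample of the expectation over t and eps."""
    t = rng.randrange(len(alpha_bars))
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = q_sample(x0, alpha_bars[t], eps)
    return diffusion_loss(eps, predict_noise(x_t, t))

# Hypothetical linear beta schedule; alpha_bar_t is the running product of (1 - beta).
betas = [1e-4 + i * (0.02 - 1e-4) / 999 for i in range(1000)]
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

# Dummy predictor that always outputs zeros (stands in for a trained network).
loss = training_step([0.5, -0.2, 0.1], alpha_bars,
                     lambda x_t, t: [0.0] * len(x_t), random.Random(0))
```

In a real model the loss would be averaged over a batch and back-propagated through the noise predictor; here it is only evaluated once.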

To ensure seamless wrap-around at the left/right boundary, it applies latent-space circular padding plus pixel-space linear blending at the ERP seams (HY-World et al., 15 Apr 2026).
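
A minimal one-dimensional sketch of these two seam-handling steps follows. It treats one row of a latent (or image) as a Python list, pads it circularly, and then cross-fades the border strips of the decoded result with their wrapped copies. The padding width, the linear weight ramp, and the function names are assumptions for illustration; the paper does not specify them in the text summarized here.

```python
def circular_pad(row, pad):
    """Wrap-around padding along the width axis of one row."""
    if pad <= 0:
        return list(row)
    return row[-pad:] + row + row[:pad]

def blend_seam(decoded, pad):
    """Crop a padded, decoded row to its original width and cross-fade each
    border strip with its wrapped copy using linear weights."""
    width = len(decoded) - 2 * pad
    core = decoded[pad:pad + width]
    left_copy = decoded[:pad]            # second estimate of the last `pad` columns
    right_copy = decoded[pad + width:]   # second estimate of the first `pad` columns
    out = list(core)
    for j in range(pad):
        w = 0.5 + 0.5 * j / pad          # weight of core: 0.5 at the seam, rising inward
        out[j] = w * core[j] + (1.0 - w) * right_copy[j]
        k = width - 1 - j
        out[k] = w * core[k] + (1.0 - w) * left_copy[pad - 1 - j]
    return out

row = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
padded = circular_pad(row, 2)        # [5, 6, 1, 2, 3, 4, 5, 6, 1, 2]
restored = blend_seam(padded, 2)     # identity "decoder": blending changes nothing
```

With an identity decoder the wrapped copies agree with the core, so blending is a no-op. With a real decoder the two estimates differ slightly near the border, and the ramp averages them so that the left and right edges meet smoothly.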

WorldNav operates on the generated panorama. Given $I^{\mathrm{pan}}$, it recovers a dense panoramic point cloud via MoGe2 alignment of monocular depth, then builds a low-resolution mesh and a NavMesh for collision avoidance, and runs SAM3/Qwen3-VL for semantic masks. The planner defines five heuristic path modes:

  • Regular: orbit three $120^\circ$ sectors at $\pm 45^\circ$ pitch.
  • Surrounding: circle around each major object segment.
  • Reconstruction-aware: detect stretched mesh faces, create keypoints, and orbit to fill holes.
  • Wandering: partition NavMesh into eight sectors and walk to the farthest reachable node.
  • Aerial: pitch up $+45^\circ$ on other paths, dynamically clipped to avoid collisions.

No single global cost function is optimized. Instead, each mode is generated by ray-casting plus graph-search—specifically Dijkstra on the NavMesh—or by greedy arc connections. Table I reports the maximum counts for each mode, including regular up to 9 and surrounding up to 5 (HY-World et al., 15 Apr 2026).
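
The graph-search step can be illustrated with a small Dijkstra sketch that picks the farthest reachable node among a set of candidates, in the spirit of the Wandering mode. The toy graph, node names, and the `farthest_reachable` helper are hypothetical; the actual NavMesh construction and sector partitioning are not modeled.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source over a weighted adjacency dict."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

def farthest_reachable(graph, source, candidates):
    """Among candidate nodes (e.g. one sector), return the reachable node
    with the largest shortest-path distance, or None if none is reachable."""
    dist = dijkstra(graph, source)
    reachable = [n for n in candidates if n in dist]
    return max(reachable, key=lambda n: dist[n]) if reachable else None

# Hypothetical toy NavMesh graph; "E" is disconnected (e.g. blocked by an obstacle).
navmesh = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"A": 1.0, "C": 1.5, "D": 5.0},
    "C": {"A": 4.0, "B": 1.5, "D": 1.0},
    "D": {"B": 5.0, "C": 1.0},
    "E": {},
}
goal = farthest_reachable(navmesh, "A", ["C", "D", "E"])
```

Ranking by path length rather than straight-line distance keeps the walk on traversable surface, which is why unreachable nodes such as "E" are never selected.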

A recurrent misconception would be to treat WorldNav as a learned optimal planner. The paper states the opposite: the planning logic is heuristic, with explicit mode construction and no single global objective. Its contribution is therefore coverage-oriented trajectory generation rather than end-to-end optimal control.

4. Keyframe-based expansion and universal 3D prediction

WorldStereo 2.0 extends camera-guided video diffusion into a keyframe latent space. Its first core change is Keyframe-VAE, in which each keyframe is encoded independently with no spatio-temporal compression; the stated effect is reduced motion blur and better preservation of high-frequency detail. The second is Global-Geometric Memory (GGM), which during middle training renders an extended point cloud from $T_g$ novel views and augments depth with random downsampling and floaters to make the model more robust. The third is Spatial-Stereo Memory++ (SSM++), which retrieves the most relevant keyframe or keyframes for each target latent, stitches them horizontally, inserts camera tokens, and processes them via full self-attention in the main DiT branch; this removes a separate memory branch and uses implicit positional embeddings. The fourth is Post-Train Distillation (DMD), which distills the multi-step DiT into a 4-step student $G_\theta$ by minimizing a KL divergence whose gradient is estimated from real and fake score functions, allowing fast inference (HY-World et al., 15 Apr 2026).
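
The retrieve-and-stitch step of SSM++ can be sketched as follows. The relevance score used here (cosine similarity between viewing directions) is an assumption, since the summary above does not state how relevance is measured, and the helper names and toy latents are hypothetical. Camera-token insertion and the self-attention pass are omitted.

```python
def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def retrieve(target_dir, memory, k=1):
    """Return the k memory entries whose view direction is closest to the
    target direction (assumed relevance criterion)."""
    ranked = sorted(memory, key=lambda m: cosine(target_dir, m["dir"]), reverse=True)
    return ranked[:k]

def stitch_horizontally(target_latent, retrieved_latents):
    """Concatenate 2-D latents (lists of rows) along the width axis."""
    stitched = [list(row) for row in target_latent]
    for latent in retrieved_latents:
        for r, row in enumerate(latent):
            stitched[r].extend(row)
    return stitched

# Hypothetical memory bank of previously generated keyframe latents.
memory = [
    {"dir": [0.0, 0.0, 1.0], "latent": [[1, 2], [3, 4]]},
    {"dir": [1.0, 0.0, 0.0], "latent": [[9, 9], [9, 9]]},
]
target_latent = [[0, 0], [0, 0]]
best = retrieve([0.1, 0.0, 0.9], memory, k=1)
stitched = stitch_horizontally(target_latent, [m["latent"] for m in best])
```

Placing the retrieved latent beside the target in one wider canvas lets ordinary self-attention relate the two, which is consistent with the description of a single main branch rather than a separate memory branch.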

The training objective for WorldStereo is the denoising diffusion loss augmented by memory regularizations. At inference, the model samples latents along the planned camera poses, decodes keyframes, and rasterizes them for 3D fusion.

WorldMirror 2.0, by contrast, is a unified feed-forward Transformer for universal 3D prediction. Given a set of input views, it predicts dense depth maps, surface normals, per-view confidence masks, camera intrinsics and extrinsics corrections, and 3D Gaussian Splatting parameters. Its listed model improvements are Normalized Rotary PE, Depth-to-Normal loss, a depth mask head with binary cross-entropy, token-budget dynamic batching plus multi-stage curriculum, and inference acceleration via sequence-parallelism, BF16, and FSDP. After running WorldMirror on a subset of generated keyframes, the system obtains aligned depths and normals (HY-World et al., 15 Apr 2026).
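
One way to read the Depth-to-Normal loss is as a consistency term between normals derived from the predicted depth and the predicted normal map. The sketch below illustrates that idea under strong simplifying assumptions (orthographic camera, unit pixel spacing, central finite differences, a cosine-based penalty); the paper's exact formulation is not given in the text above, and all names are hypothetical.

```python
import math

def normals_from_depth(depth):
    """Finite-difference normals for the interior pixels of a depth grid.
    Simplifying assumption: orthographic camera with unit pixel spacing."""
    h, w = len(depth), len(depth[0])
    normals = {}
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dzdx = (depth[y][x + 1] - depth[y][x - 1]) / 2.0
            dzdy = (depth[y + 1][x] - depth[y - 1][x]) / 2.0
            n = (-dzdx, -dzdy, 1.0)
            norm = math.sqrt(sum(c * c for c in n))
            normals[(y, x)] = tuple(c / norm for c in n)
    return normals

def depth_to_normal_loss(depth, predicted_normals):
    """Mean (1 - cosine similarity) between depth-derived normals and
    predicted unit normals, over interior pixels."""
    derived = normals_from_depth(depth)
    total = 0.0
    for key, n in derived.items():
        p = predicted_normals[key]
        total += 1.0 - sum(a * b for a, b in zip(n, p))
    return total / len(derived)

# A tilted plane: depth grows by 0.5 per pixel along x.
depth = [[0.5 * x for x in range(4)] for _ in range(4)]
matching = normals_from_depth(depth)
frontal = {key: (0.0, 0.0, 1.0) for key in matching}
loss_match = depth_to_normal_loss(depth, matching)
loss_frontal = depth_to_normal_loss(depth, frontal)
```

The loss vanishes when the predicted normals agree with the depth map and grows as the two heads disagree, which is how such a term can couple otherwise independent outputs.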

Taken together, WorldStereo 2.0 and WorldMirror 2.0 divide expansion and reconstruction into two distinct roles: diffusion-based novel-view synthesis and feed-forward geometric consolidation. This suggests a deliberate separation between appearance completion and geometric estimation.

5. Final representation and the WorldLens rendering platform

The final 3D world is represented as a 3D Gaussian Splatting model plus an extracted mesh for physics. WorldLens is the rendering platform attached to this representation. Its listed features are an engine-agnostic C++/CUDA back end spanning OpenGL, Vulkan, Unreal, and Unity; automatic image-based lighting using an HDR panorama; real-time collision detection via the NavMesh and extracted mesh; and training-rendering co-design for efficient optimization and interactive exploration of 3D worlds with character support (HY-World et al., 15 Apr 2026).

The training-rendering co-design includes MaskGaussian, which prunes redundant Gaussians; regularization on scale; photometric losses together with geometric depth/normal losses; and mesh extraction via marching cubes on a TSDF volume. The paper also gives a rendering equation for 3DGS with MaskGaussian.
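
For orientation, standard 3DGS colors a pixel by front-to-back alpha compositing, $C=\sum_i c_i\,\alpha_i\,T_i$ with $T_i=\prod_{j<i}(1-\alpha_j)$. The minimal sketch below adds a per-Gaussian mask to that sum to show how pruned Gaussians can drop out of rendering. Where the mask enters is an assumption made for illustration, and the function and variable names are hypothetical; it should not be read as the paper's exact equation.

```python
def composite(colors, alphas, masks):
    """Front-to-back alpha compositing along one ray with a per-Gaussian mask.
    A masked-out Gaussian (mask 0) adds no color and leaves the
    transmittance untouched."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for color, alpha, mask in zip(colors, alphas, masks):
        weight = mask * alpha * transmittance
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - mask * alpha
    return pixel, transmittance

# Two Gaussians sorted front to back: red, then green.
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
alphas = [0.5, 0.5]
full_pixel, full_t = composite(colors, alphas, [1, 1])
pruned_pixel, pruned_t = composite(colors, alphas, [0, 1])
```

With both Gaussians active the green one is attenuated by the red one in front of it; masking the red Gaussian lets the green one contribute at full weight, which is the behavior a pruning mask needs.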

WorldLens therefore serves two functions simultaneously: a runtime environment for interactive inspection and a tightly coupled optimization target for the final 3DGS world representation. The emphasis on engine-agnostic deployment also places HY-World 2.0 within a broader simulation-and-graphics context rather than restricting it to offline reconstruction benchmarks.

6. Evaluation, runtime, and comparative position

The empirical evaluation is divided across panorama generation, trajectory planning, view generation, world composition, 3DGS ablation, overall world generation, runtime, and stand-alone WorldMirror experiments (HY-World et al., 15 Apr 2026). On T2P and I2P benchmarks, the reported panorama metrics are CLIP-T/I and Q-Align Qual/Aes, and HY-Pano 2.0 leads on 7/8 metrics. Progressive ablation of trajectory planning shows each trajectory mode filling in blind spots.

For WorldStereo 2.0, single-view reconstruction on Tanks-and-Temples and MipNeRF360 improves F1 from 36% to 41% and AUC from 51% to 58%. Camera control metrics—RotErr, TransErr, and ATE—as well as Q-Align and CLIP-IQA are also reported as improving. In ablations, GGM plus SSM++ improves PSNR/SSIM, consistency, and camera precision, while DMD yields a 4-step model with no loss in consistency.

In world composition, the paper highlights linear depth alignment via rendered guidance and reports that, compared to video2world (ICCV ’26), its 3DGS pipeline removes floaters and avoids a 5 h ICP step. In the 3D Gaussian Splatting ablation, MaskGaussian plus non-sky densification preserves PSNR at 25.02 dB while reducing the number of Gaussians from 6 M to 1.38 M, a reduction of about 77%. For overall world generation against Marble, the reported qualitative outcome is that HY-World 2.0 more faithfully matches the input and exhibits sharper textures and fewer artefacts.

Runtime is reported end-to-end on an NVIDIA H20 GPU as 15 s for panorama generation, 182 s for navigation, 286 s for expansion, 102 s for reconstruction and alignment, and 127 s for 3DGS, totaling 712 s, or approximately 12 min. Stand-alone WorldMirror 2.0 results include point map evaluation on 7-Scenes, NRGBD, and DTU; pose, depth, and NVS evaluation; normals; strong resolution generalization across L/M/H attributed to normalized RoPE; gains from prior injection; and inference scaling to 256 views via SP+BF16+FSDP.
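
The runtime breakdown can be checked with a few lines of arithmetic. The snippet below uses only the per-stage figures quoted above; the dictionary keys are shorthand labels chosen for this example.

```python
# Per-stage runtimes in seconds, as quoted in the text.
stage_seconds = {
    "panorama generation": 15,
    "navigation": 182,
    "expansion": 286,
    "reconstruction and alignment": 102,
    "3DGS": 127,
}

total = sum(stage_seconds.values())
shares = {name: secs / total for name, secs in stage_seconds.items()}
slowest = max(stage_seconds, key=stage_seconds.get)

print(f"total: {total} s ({total / 60:.1f} min)")
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

The stages sum to 712 s, and keyframe expansion is the largest single cost at roughly 40% of the total, while panorama generation accounts for only about 2%.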

Within the scope stated by the paper, these results position HY-World 2.0 as state of the art on several benchmarks among open-source approaches and as comparable in output quality to the closed-source model Marble.

7. Release status, limitations, and future directions

HY-World 2.0 is released with model weights, code, and technical details intended to facilitate reproducibility and further research (HY-World et al., 15 Apr 2026). The released artifacts include HY-Pano 2.0 weights, WorldNav C++/Python code, a WorldStereo 2.0 checkpoint plus distillation scripts, WorldMirror 2.0 code plus all priors, 3DGS training and rendering tools through WorldLens, full training configs and data links, and detailed tutorials, Colab notebooks, and Docker images. The project page is listed as https://3d-models.hunyuan.tencent.com/world/, and the GitHub repository as github.com/Tencent-Hunyuan/HY-World 2.0.

The limitations are explicit. The system is offline only, with no real-time video-in to world capability. Diffusion expansion can hallucinate physically implausible geometry in heavily occluded regions. Monocular depth guidance remains imperfect outdoors, and alignment can fail if guidance is too sparse. Trajectory heuristics are described as strong but not globally optimal; reinforcement learning–based planners are identified as a possible route to improved coverage.

The future directions listed in the paper are 4D or temporal dynamics, online interactive world updates with user or robot actions, tighter integration with physical simulators and embodied agents, multi-agent exploration and planning, and differentiable NavMesh and collision gradients for end-to-end learning. These directions clarify the current boundary of the framework: HY-World 2.0 is a comprehensive offline foundation for generation and reconstruction, but not yet an online, temporally adaptive, or fully embodied world model.
