WorldMirror 2.0: Unified 3D Reconstruction
- The paper presents WorldMirror 2.0, a reconstruction backbone that reliably recovers detailed 3D scene geometry from multi-view images with optional geometric priors.
- It employs a Transformer backbone with normalized RoPE and explicit depth-to-normal coupling to overcome resolution and consistency limitations of its predecessor.
- The model enhances scalability and efficiency through token-budget dynamic batching and precise alignment techniques, supporting rapid high-view reconstructions in HY-World 2.0.
WorldMirror 2.0 is the reconstruction backbone of HY-World 2.0: a unified, feed-forward multi-view 3D prediction model that takes a set of images together with optional geometric priors and predicts a consistent geometric description of the scene, including point maps, depth, normals, camera parameters, and per-pixel 3D Gaussian Splatting attributes. Within HY-World 2.0 it serves both as a standalone world reconstruction model for multi-view images or videos and as the geometry extractor in the final “World Composition” stage, where generated keyframes are converted into geometry that can be aligned, fused, and optimized into a navigable 3DGS world (HY-World et al., 15 Apr 2026).
1. System role and problem scope
WorldMirror 2.0 occupies the reconstructive component of HY-World 2.0. It is not the module that initializes panoramas, plans exploration trajectories, or synthesizes new views. Those roles are assigned to HY-Pano 2.0, WorldNav, and WorldStereo 2.0, respectively. WorldMirror 2.0 instead recovers 3D structure from existing observations, either in standalone reconstruction from multi-view images or videos, or inside the generative pipeline during world composition (HY-World et al., 15 Apr 2026).
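In the world-composition stage, the paper summarizes its role as a single mapping, which can be written as

$$\{D_i, N_i\} = \mathcal{F}_{\mathrm{WM}}\big(\{I^{\mathrm{pano}}, C^{\mathrm{pano}}\},\ \{I^{\mathrm{gen}}, C^{\mathrm{gen}}\}\big),$$

where $\mathcal{F}_{\mathrm{WM}}$ is WorldMirror 2.0, $\{I^{\mathrm{pano}}, C^{\mathrm{pano}}\}$ are perspective views subdivided from the input panorama and their cameras, and $\{I^{\mathrm{gen}}, C^{\mathrm{gen}}\}$ are the selected generated keyframes and their cameras. The outputs $D_i$ and $N_i$ are per-frame depth and normals (HY-World et al., 15 Apr 2026).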
A central design claim is that WorldMirror 2.0 is reconstructive rather than generative. It does not hallucinate new views, and the paper does not state that it accepts text. Text conditioning exists upstream in HY-Pano 2.0, while WorldMirror 2.0 operates on multi-view observations and optional geometric priors. This distinction is important because it places the model at the interface between 2D image generation and explicit 3D scene composition rather than among multimodal image synthesizers (HY-World et al., 15 Apr 2026).
2. Inputs, outputs, and scene representation
A defining feature inherited from the earlier WorldMirror formulation is “Any-Modal Tokenization”: all available input modalities (images, camera poses, intrinsics, and depth maps) are tokenized into a single unified sequence. The paper states that each prior modality is dropped independently with probability $0.5$ during training so that the model can operate under arbitrary missing-modality conditions (HY-World et al., 15 Apr 2026).
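The independent prior dropout described above can be illustrated with a minimal sketch. The dictionary keys and the function name `drop_priors` are hypothetical and are not taken from the paper; only the per-modality drop probability of $0.5$ comes from the text.

```python
import random

# Hypothetical key names for the optional geometric priors.
PRIOR_MODALITIES = ("camera_pose", "intrinsics", "depth")


def drop_priors(sample, p_drop=0.5, rng=random):
    """Return a copy of `sample` in which each prior is removed
    independently with probability `p_drop`. Images are always kept."""
    kept = {"images": sample["images"]}
    for name in PRIOR_MODALITIES:
        # rng.random() is uniform on [0, 1), so the prior survives
        # with probability 1 - p_drop.
        if name in sample and rng.random() >= p_drop:
            kept[name] = sample[name]
    return kept
```

Because each prior is sampled separately, all eight present/absent combinations of the three priors occur during training, which is what lets a single model run with any subset of priors at inference time.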
The supported inputs are multi-view RGB images, video frames treated as multiple views, camera poses, camera intrinsics, and depth maps as optional priors. The paper’s figure caption explicitly describes WorldMirror 2.0 as taking “multi-view images with optional geometric priors (camera poses, intrinsics, depth maps) as input.” The paper does not state that WorldMirror 2.0 accepts text; that functionality belongs to other HY-World 2.0 modules (HY-World et al., 15 Apr 2026).
The model is multi-task. It predicts dense point maps or point clouds, multi-view depth maps, surface normals, camera parameters, and pixel-wise 3D Gaussian Splatting attributes. In WorldMirror 2.0 an additional depth mask head is added for invalid-pixel modeling. The paper distinguishes between primary geometric outputs—point maps, depth, normals, and camera estimation—and 3DGS-related outputs. In the broader HY-World 2.0 pipeline, the final world is not produced by directly taking the feed-forward Gaussian prediction as the final scene; instead, WorldMirror 2.0 provides geometry that initializes and supervises a separate 3DGS optimization stage (HY-World et al., 15 Apr 2026).
The treatment of video is also explicit: frames are handled as multi-view observations. The paper does not describe a dedicated temporal module, temporal loss, or recurrent memory. Temporal consistency is therefore implicit in multi-view processing rather than enforced by a video-specific mechanism. Likewise, the paper does not describe an explicit occlusion module inside WorldMirror 2.0; occlusion handling is partly absorbed by improved invalid-pixel prediction and by downstream alignment filters in world composition (HY-World et al., 15 Apr 2026).
3. Architecture and the transition from WorldMirror 1.0
WorldMirror 2.0 uses a Transformer backbone with global-local attention mechanisms, inherited from WorldMirror 1.0, followed by task-specific DPT decoder heads. A major architectural change is the replacement of standard absolute-index 2D RoPE with normalized RoPE. For a patch grid of $H_p \times W_p$ patches, with row index $i \in \{0, \dots, H_p - 1\}$ and column index $j \in \{0, \dots, W_p - 1\}$, each index is rescaled by the grid size into a normalized patch coordinate. The stated purpose is to map all resolutions into the same coordinate range so that test-time extrapolation becomes interpolation (HY-World et al., 15 Apr 2026).
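A small sketch may help show why normalization turns extrapolation into interpolation. The mapping used here (patch centres rescaled to the open interval $(-1, 1)$) is an assumption of this example; the paper's exact normalization may differ.

```python
import numpy as np


def normalized_patch_coords(h_patches, w_patches):
    """Map integer patch indices to cell-centre coordinates in (-1, 1).

    The (-1, 1) range and the cell-centre convention are assumptions
    of this sketch, not the paper's stated formula.
    """
    ys = (np.arange(h_patches) + 0.5) / h_patches * 2.0 - 1.0
    xs = (np.arange(w_patches) + 0.5) / w_patches * 2.0 - 1.0
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([grid_y, grid_x], axis=-1)  # shape (H_p, W_p, 2)


# A 16x16 training grid and a 64x64 test grid share the same range, so
# the rotary angles seen at test time stay inside the training range.
low = normalized_patch_coords(16, 16)
high = normalized_patch_coords(64, 64)
```

With absolute indices, the 64x64 grid would produce positions up to 63 that a model trained on a 16x16 grid never observed; with normalized coordinates both grids stay inside the same interval.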
The paper motivates the 2.0 upgrade by three shortcomings of WorldMirror 1.0: poor generalization to inference resolutions different from training resolution, limited geometric consistency between depth and normals, and memory or latency bottlenecks when scaling to many views. WorldMirror 2.0 addresses these through positional encoding changes, explicit geometric coupling, improved invalid-pixel handling, revised data, a new curriculum, and distributed inference strategies (HY-World et al., 15 Apr 2026).
The main differences are concise enough to tabulate.
| Aspect | WorldMirror 1.0 | WorldMirror 2.0 |
|---|---|---|
| Position encoding | absolute RoPE | normalized RoPE |
| Depth supervision | GT depth only | GT depth + GT/pseudo normal through explicit depth-to-normal coupling |
| Invalid-pixel modeling | confidence only | confidence + dedicated depth-mask head |
| Acceleration | none | token/frame sequence parallelism, BF16, FSDP |
| Data | open-source datasets | plus internal Unreal Engine synthetic renderings |
| Pseudo-label enhancement | not normal-focused | pseudo normal labels added |
| Training strategy | independent sampling of view count and resolution; 2-stage curriculum | token-budget dynamic batching; 3-stage curriculum; broader resolution range |
The model preserves the shared-backbone, multi-head philosophy of the earlier “WorldMirror: Universal 3D World Reconstruction with Any-Prior Prompting,” which is described as an all-in-one, feed-forward model for versatile 3D geometric prediction tasks that flexibly integrates camera poses, intrinsics, and depth maps while jointly predicting dense point clouds, multi-view depth maps, camera parameters, surface normals, and 3D Gaussians (Liu et al., 12 Oct 2025). WorldMirror 2.0 specializes that prior-aware formulation for the HY-World 2.0 stack and adds explicit machinery for resolution robustness, depth-normal consistency, and scalable multi-view inference (HY-World et al., 15 Apr 2026).
A practical misconception addressed by the paper is that WorldMirror 2.0 might behave like WorldStereo 2.0. It does not. The paper explicitly states that it has no memory bank, retrieval module, or temporal recurrence; its consistency derives from jointly processing all views in one Transformer pass, aided by optional geometric priors (HY-World et al., 15 Apr 2026).
4. Learning strategy, geometric coupling, and invalid-pixel modeling
WorldMirror 2.0 uses a three-stage curriculum. In Stage 1, all geometry heads are trained using native annotations, without pseudo-label enhancement and without depth-to-normal loss. In Stage 2, the depth-to-normal loss is introduced and the synthetic-data proportion is increased. In Stage 3, the backbone and geometry heads are frozen, and only the 3DGS head initialized from the depth head weights is trained. This extends the two-stage strategy of WorldMirror 1.0 into a geometry-first, geometry-coupled, then 3DGS-specialized schedule (HY-World et al., 15 Apr 2026).
The paper also introduces token-budget dynamic batching. For a per-GPU token budget $T$, after sampling an image resolution $H \times W$ and a patch size $p$, the per-image token count is

$$n = \frac{H}{p} \cdot \frac{W}{p}.$$

The maximum number of views is then

$$N_{\max} = \left\lfloor \frac{T}{n} \right\rfloor,$$

and the sampled view count $N$ is constrained so that the total stays within the per-GPU token budget:

$$N \cdot n \le T.$$

This permits many low-resolution views or fewer high-resolution views under a fixed token budget (HY-World et al., 15 Apr 2026).
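The budget rule above can be sketched in a few lines. The budget of 20,000 tokens and the patch size of 16 are illustrative values chosen for this example, not settings reported in the paper, and any extra per-view tokens (such as camera or prior tokens) are ignored.

```python
def tokens_per_image(height, width, patch_size):
    """Number of patch tokens produced by one image."""
    if height % patch_size or width % patch_size:
        raise ValueError("resolution must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)


def max_views(token_budget, height, width, patch_size):
    """Largest view count whose total token count fits the budget."""
    return token_budget // tokens_per_image(height, width, patch_size)


# Illustrative numbers only: the same budget admits many low-resolution
# views or far fewer high-resolution ones.
BUDGET = 20_000
views_low = max_views(BUDGET, 256, 256, 16)    # 256 tokens per image
views_high = max_views(BUDGET, 512, 512, 16)   # 1024 tokens per image
```

Doubling the side length quadruples the tokens per image, so the admissible view count drops by roughly a factor of four under the same budget.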
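One of the central additions is explicit depth-to-normal coupling. Given predicted depth $D$ and intrinsics $K$, the model back-projects each pixel $(u, v)$ into 3D as

$$P(u, v) = D(u, v)\, K^{-1} [u, v, 1]^{\top}$$

and derives a normal from the cross product of the local spatial derivatives of the back-projected points,

$$n(u, v) = \frac{\partial_u P \times \partial_v P}{\lVert \partial_u P \times \partial_v P \rVert}.$$

The angular supervision loss penalizes the angle between this depth-derived normal and a target normal, with targets derived from GT depth in synthetic data and from pseudo normals produced by a monocular normal teacher on real data. The paper argues against pseudo-depth labels for multi-view training because per-view pseudo-depth is often globally inconsistent, whereas normals encode local orientation more robustly (HY-World et al., 15 Apr 2026).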
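The depth-to-normal coupling can be illustrated with a NumPy sketch. The central-difference stencil, the sign convention of the normal, and the use of `arccos` for the angular error are assumptions of this example; the paper's discretization and exact loss form may differ.

```python
import numpy as np


def backproject(depth, K):
    """Lift a depth map (H, W) to camera-space points (H, W, 3)."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # row-vector form of K^{-1} [u, v, 1]^T
    return rays * depth[..., None]


def normals_from_depth(depth, K):
    """Unit normals on the interior pixels, shape (H-2, W-2, 3)."""
    pts = backproject(depth, K)
    du = pts[1:-1, 2:] - pts[1:-1, :-2]      # central difference along u
    dv = pts[2:, 1:-1] - pts[:-2, 1:-1]      # central difference along v
    n = np.cross(du, dv)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.maximum(norm, 1e-12)       # sign convention is an assumption


def angular_error(n_pred, n_target):
    """Per-pixel angle in radians between two unit-normal maps."""
    cos = np.clip(np.sum(n_pred * n_target, axis=-1), -1.0, 1.0)
    return np.arccos(cos)
```

For a fronto-parallel plane at constant depth, the two difference vectors lie along the image axes and the derived normal is aligned with the optical axis, which gives a quick sanity check of the back-projection.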
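Invalid depth pixels are handled by a dedicated depth mask head. It predicts a per-pixel validity logit $\ell_p$ with BCE supervision:

$$\mathcal{L}_{\mathrm{mask}} = -\frac{1}{|\Omega|} \sum_{p \in \Omega} \Big[ m_p \log \sigma(\ell_p) + (1 - m_p) \log\big(1 - \sigma(\ell_p)\big) \Big],$$

where $m_p \in \{0, 1\}$ is the ground-truth validity label of pixel $p$, $\sigma$ is the sigmoid function, and $\Omega$ is the set of pixels.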
This makes invalid-pixel prediction explicit rather than relying only on confidence weighting, which was the WorldMirror 1.0 design (HY-World et al., 15 Apr 2026).
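A numerically stable form of binary cross-entropy on logits, as commonly used for this kind of mask head, is sketched below. The function name and the mean reduction over pixels are assumptions of this example rather than details given in the paper.

```python
import numpy as np


def bce_with_logits(logits, valid):
    """Mean binary cross-entropy between validity logits and 0/1 labels.

    Uses the identity  max(l, 0) - l * m + log(1 + exp(-|l|)),
    which avoids overflow for large-magnitude logits.
    """
    logits = np.asarray(logits, dtype=np.float64)
    valid = np.asarray(valid, dtype=np.float64)
    loss = (np.maximum(logits, 0.0) - logits * valid
            + np.log1p(np.exp(-np.abs(logits))))
    return float(loss.mean())
```

At inference time the predicted logit can be thresholded (for example at zero, which corresponds to a probability of 0.5) to obtain a binary validity mask; the threshold choice is likewise an assumption here.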
The training data combine public multi-view reconstruction datasets, high-quality synthetic Unreal Engine scenes, and pseudo normals on real data. The paper does not specify an epipolar-attention mechanism, explicit reprojection loss, or bundle-adjustment-style optimization inside WorldMirror 2.0. Its consistency is learned from multi-view supervision and shared Transformer processing rather than from iterative geometric optimization at inference time (HY-World et al., 15 Apr 2026).
5. World composition, depth alignment, and integration with 3DGS
Inside HY-World 2.0, world composition begins from a panorama, a panoramic point cloud, a set of generated keyframes, and their corresponding cameras. A subset of generated frames is sampled, and WorldMirror 2.0 predicts depth and normals for them using both panorama-derived perspective views and selected generated views. The next step is alignment, because WorldMirror depth has scale ambiguity and may not match the panoramic point-cloud frame (HY-World et al., 15 Apr 2026).
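The paper renders the panoramic point cloud into each target camera to obtain a guidance depth $D^{g}$, then aligns the predicted depth $D$ to it using only the pixels inside a valid mask, the intersection of five per-pixel masks:

$$M = M_{\mathrm{wm}} \cap M_{\mathrm{pano}} \cap M_{\mathrm{normal}} \cap M_{\mathrm{pct}} \cap M_{\mathrm{nonsky}}.$$

Here $M_{\mathrm{wm}}$ denotes valid WorldMirror regions, $M_{\mathrm{pano}}$ valid panorama-guidance regions, $M_{\mathrm{normal}}$ normal-consistent regions, $M_{\mathrm{pct}}$ a percentile-based depth-discrepancy filter, and $M_{\mathrm{nonsky}}$ the non-sky mask (HY-World et al., 15 Apr 2026).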
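The mask composition can be sketched as a chain of boolean intersections. Several details here are assumptions of this example and not statements from the paper: unrendered guidance pixels are encoded as zero depth, the discrepancy is measured on median-normalized depths to sidestep the scale ambiguity, and the percentile is set to 90.

```python
import numpy as np


def alignment_mask(pred_depth, guide_depth, wm_valid, normal_ok, non_sky,
                   pct=90.0):
    """Intersect the five validity cues into one boolean mask (H, W)."""
    guide_valid = guide_depth > 0            # assumption: 0 marks no guidance
    base = wm_valid & guide_valid & normal_ok & non_sky
    if not base.any():
        return base
    # Remove the global scale before comparing (assumption of this sketch).
    pred_n = pred_depth / np.median(pred_depth[base])
    guide_n = guide_depth / np.median(guide_depth[base])
    disc = np.abs(pred_n - guide_n)
    # Percentile filter: drop the pixels with the largest discrepancy.
    keep = disc <= np.percentile(disc[base], pct)
    return base & keep
```

Only pixels that survive every cue contribute to the subsequent depth alignment, which is why sparse guidance can make the fit unreliable.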
The aligned depth is modeled as a per-frame linear transform,

$$\tilde{D}_i = s_i D_i + t_i,$$

with per-frame coefficients $s_i$ and $t_i$, implemented in disparity space in practice. To reject poor alignments, the method defines a set of anchor values, maps them through each frame's transform, aggregates medians of the transformed anchors across frames, and replaces outlier coefficients when the maximum relative deviation exceeds the 90th percentile. This outlier-rejection procedure is a notable component of the system’s synthetic-to-world alignment logic (HY-World et al., 15 Apr 2026).
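The per-frame fit and the anchor-based rejection step can be sketched as follows. This is one possible reading under stated assumptions: the fit is an ordinary least-squares scale and shift in disparity space, the anchor values are arbitrary, and outlier coefficients are replaced with the median of the inlier coefficients. None of these specifics are confirmed by the paper.

```python
import numpy as np


def fit_scale_shift(pred_disp, guide_disp, mask):
    """Least-squares (s, t) such that s * pred + t ~ guide on masked pixels."""
    x = pred_disp[mask]
    y = guide_disp[mask]
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(s), float(t)


def reject_outlier_frames(coeffs, anchors, pct=90.0):
    """Flag frames whose transformed anchors deviate most from the
    cross-frame median and replace their coefficients.

    coeffs: (F, 2) array of per-frame (s, t); anchors: (A,) nonzero values.
    """
    coeffs = np.asarray(coeffs, dtype=np.float64)
    anchors = np.asarray(anchors, dtype=np.float64)
    mapped = coeffs[:, :1] * anchors[None, :] + coeffs[:, 1:]    # (F, A)
    median = np.median(mapped, axis=0)                           # (A,)
    dev = np.max(np.abs(mapped - median) / np.abs(median), axis=1)
    bad = dev > np.percentile(dev, pct)
    fixed = coeffs.copy()
    # Assumption: outliers inherit the median of the inlier coefficients.
    fixed[bad] = np.median(coeffs[~bad], axis=0)
    return fixed, bad
```

Comparing transformed anchors rather than raw coefficients makes scale and shift commensurable, since a large shift with a small scale can produce the same depths as the reverse.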
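After alignment, the depths are back-projected to obtain an extension point cloud $\mathcal{P}_{\mathrm{ext}}$, and its union with the panorama points $\mathcal{P}_{\mathrm{pano}}$ is voxel-downsampled:

$$\mathcal{P}_{\mathrm{init}} = \mathrm{VoxelDownsample}\big(\mathcal{P}_{\mathrm{pano}} \cup \mathcal{P}_{\mathrm{ext}}\big).$$

This initializes the final 3DGS scene. Each Gaussian is parameterized by an opacity $\alpha$, a center $\mu$, a covariance $\Sigma$, and a color $c$. The final renderer is optimized with photometric and geometric losses, and the geometric term explicitly uses aligned WorldMirror depth and normals.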
In this sense, WorldMirror 2.0 is not merely a predictor of geometric intermediates; it also supplies initialization geometry, aligned supervision, and stabilization signals for the downstream 3DGS optimization (HY-World et al., 15 Apr 2026).
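The fusion step that produces the initialization points can be sketched with a simple centroid-based voxel grid. The voxel size of 0.05 and the choice of averaging points within each voxel are assumptions of this example; the paper does not specify the downsampling variant in the text summarized here.

```python
import numpy as np


def voxel_downsample(points, voxel_size):
    """Replace all points that fall in the same voxel by their centroid."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse, counts = np.unique(
        keys, axis=0, return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)            # shape differs across NumPy versions
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inverse, points)         # accumulate points per voxel
    return sums / counts[:, None]


def fuse_point_clouds(pano_points, extension_points, voxel_size=0.05):
    """Union of panorama and extension points, then voxel downsampling."""
    merged = np.concatenate([pano_points, extension_points], axis=0)
    return voxel_downsample(merged, voxel_size)
```

Downsampling after the union, rather than before, removes the duplicated geometry where the aligned extension depth overlaps the panoramic point cloud, so each occupied voxel seeds one Gaussian center.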
6. Empirical profile, scaling behavior, and limitations
The paper attributes strong benchmark gains to WorldMirror 2.0. On point-map reconstruction across 7-Scenes, NRGBD, and DTU, it improves over WorldMirror 1.0 at all resolutions and, in particular, resolves the high-resolution collapse of the earlier model. Reported examples include a 7-Scenes medium-resolution accuracy mean that improves from 0.043 to 0.033, a 7-Scenes high-resolution accuracy mean that improves from 0.079 to 0.037, a 7-Scenes high + all priors result of 0.012, and a DTU high + all priors result of 0.554 acc mean / 0.771 comp mean (HY-World et al., 15 Apr 2026).
On RealEstate10K, the reported camera AUC@30 improves from 66.29 to 86.89 at high resolution; depth AbsRel improves from 0.195 to 0.162 at high resolution; and the accompanying depth metric reaches 0.815 at high resolution. On novel-view synthesis, WorldMirror 1.0 is reported to collapse at high resolution, with PSNR falling from 21.34 (M) to 17.78 (H), whereas WorldMirror 2.0 remains stable at 20.14 / 20.07 / 19.98 across low, medium, and high resolution, with high-resolution SSIM reaching 0.726. On surface normals, the paper reports ScanNet mean errors of 12.3 (M) and 12.5 (H) for WorldMirror 2.0, while WorldMirror 1.0 degrades from 13.8 (M) to 17.6 (H) (HY-World et al., 15 Apr 2026).
The scaling strategy is equally prominent. The model is evaluated up to 256 views and uses token-level sequence parallelism in the Transformer, frame-level sequence parallelism in the DPT decoders, BF16 mixed precision for most parameters, and FSDP sharding. On NVIDIA H20 GPUs, the paper reports that an FP32 single-GPU baseline runs out of memory at 256 views, whereas the full configuration (sequence parallelism, BF16, and FSDP on 4 GPUs) processes 128 views in 5.60 s at 42.71 GB per GPU and 256 views in 17.52 s at 78.78 GB per GPU. In the full HY-World 2.0 pipeline, the “Recon and Align” stage, which includes WorldMirror 2.0 plus depth alignment, takes 102 s. The paper also compares its linear alignment pipeline against video2world and claims similar quality with a runtime of under 2 minutes, versus about 5 hours per scene for feature-matched ICP in video2world (HY-World et al., 15 Apr 2026).
The limitations are also explicit. WorldMirror 2.0 is not explicitly tailored for panoramic reconstruction. Its predicted depth has scale ambiguity and must be aligned to panorama coordinates. The paper states that it still struggles in highly challenging outdoor scenes, that bad alignments can occur when guidance masks are sparse or trajectories are difficult, and that it has no explicit temporal memory or recurrent mechanism. It also does not provide exact loss weights, optimizer settings, or internal 3DGS-head parameterization in the reported text. These limitations are consistent with its design emphasis: a feed-forward, prior-aware reconstruction engine that improves substantially over the prior WorldMirror formulation, but that still relies on downstream alignment and optimization to produce the final world representation (HY-World et al., 15 Apr 2026).