WorldStereo 2.0: Keyframe Diffusion Model

Updated 4 July 2026
  • WorldStereo 2.0 is a keyframe-based generative model that transforms panoramas into camera-aligned multi-view keyframes using a camera-guided video diffusion transformer.
  • It leverages Global-Geometric Memory (GGM) and an upgraded SSM++ to enforce geometric consistency and high visual fidelity across generated views.
  • The system refines latent space modeling with a Keyframe-VAE and memory-aware distillation to reduce motion blur and enhance camera control under large motions.

WorldStereo 2.0 is a camera-guided video diffusion transformer in a keyframe latent space with consistent memory, introduced as the world-expansion stage of HY-World 2.0 and designed to transform an initial panorama into multi-trajectory, multi-view keyframes that remain camera-aligned and mutually consistent across a large region (HY-World et al., 15 Apr 2026). Conceptually, it extends the earlier WorldStereo framework, which bridged camera-guided video generation and 3D reconstruction via 3D geometric memories, by shifting from dense video-latent generation to keyframe generation, retaining Global-Geometric Memory (GGM), replacing the original Spatial-Stereo Memory (SSM) with SSM++, and performing memory-aware Distribution Matching Distillation (DMD) (Zhang et al., 2 Mar 2026).

1. Definition and system role

Within HY-World 2.0, WorldStereo 2.0 occupies the third stage of a four-stage offline world-generation pipeline: panorama generation by HY-Pano 2.0, trajectory planning by WorldNav, world expansion by WorldStereo 2.0, and world composition by WorldMirror 2.0 plus 3D Gaussian Splatting rendered through WorldLens (HY-World et al., 15 Apr 2026). Its direct inputs are the panorama and its perspective subdivisions $\{\mathbf{V}^{pan}_j, \mathbf{C}^{pan}_j\}_{j=1}^{T_{pan}}$, the planned trajectories $\{\mathbf{C}_i\}_{i=1}^{T_{ex}}$, and the panoramic point cloud $\mathbf{P}^{pan}$ as global geometry. Its outputs are generated keyframe images $\{\mathbf{V}_i\}$ that adhere to the prescribed camera poses, maintain high per-frame visual quality through the Keyframe-VAE, and preserve multi-view consistency across trajectories through GGM and SSM++ (HY-World et al., 15 Apr 2026).

The earlier WorldStereo formulation defined the underlying problem more broadly: given a single input image, perspective or panorama, generate multiple camera-controlled videos along different trajectories and use them to reconstruct a high-quality 3D scene (Zhang et al., 2 Mar 2026). WorldStereo 2.0 preserves that “world model” orientation but specializes it for keyframe-based view generation inside the larger HY-World 2.0 stack. A common source of confusion is nomenclature: the standalone paper is titled “WorldStereo” rather than “WorldStereo 2.0,” whereas the latter designation is used in HY-World 2.0 for the upgraded keyframe-based model with consistent memory (Zhang et al., 2 Mar 2026).

This positioning is technically important. WorldStereo 2.0 does not itself output a final 3D representation; rather, it generates 3D-friendly keyframes that are subsequently consumed by WorldMirror 2.0 for depth and point-map reconstruction and by the 3DGS stage for optimization of the final navigable world (HY-World et al., 15 Apr 2026). This suggests that its primary function is not end-to-end reconstruction, but controlled expansion of view coverage under geometric constraints.

2. Architectural formulation in keyframe latent space

WorldStereo 2.0 is built on a camera-conditioned Video Diffusion Transformer (DiT) backbone, but its most consequential architectural change relative to WorldStereo is the replacement of Video-VAE latent modeling with a Keyframe-VAE that operates purely spatially, without temporal compression (HY-World et al., 15 Apr 2026). Standard latent video diffusion models use a Video-VAE with spatio-temporal compression, which in the HY-World 2.0 description introduces motion blur, geometric distortions and “melting” under strong camera movement, thereby degrading 3D reconstruction fidelity. The Keyframe-VAE instead treats each keyframe independently:

$$\{\mathbf{V}_i\}_{i=1}^{1+T_{kf}} \xrightarrow{\text{Keyframe-VAE}} \{\mathbf{F}_i\}_{i=1}^{1+T_{kf}}, \quad \mathbf{F}_i \in \mathbb{R}^{1\times \frac{H}{8}\times \frac{W}{8}\times C}.$$

Keyframes are sparsely sampled with larger temporal intervals, so the model covers the same camera motion span as dense videos while avoiding redundant intermediate frames (HY-World et al., 15 Apr 2026). In the original WorldStereo, diffusion operated in a video latent space $z \in \mathbb{R}^{B \times F \times H \times W \times C}$ through a Wan-based DiT using self-attention over space and time (Zhang et al., 2 Mar 2026). The transition to keyframe latents is therefore not merely an efficiency adjustment; it redefines the temporal granularity at which geometry is preserved.

Camera control is implemented through a camera-adapter branch. Starting from a reference image $\mathbf{I}^{ref}$ and its point cloud $\mathbf{P}^{ref}$, target-view geometry is formed by back-projection:

$$\mathbf{P}^{tar}_i(x) \simeq \mathbf{R}^{c \rightarrow w}_i \mathrm{D}(x)\mathbf{K}^{-1}_i \hat{x},$$

where $\hat{x}$ is the homogeneous pixel coordinate (HY-World et al., 15 Apr 2026). The resulting point-cloud renders are encoded with the Keyframe-VAE and fed through a lightweight transformer camera adapter and cross-attention into the DiT. In the final domain-adaptation setting, cross-attention and FFN layers are frozen while other blocks adapt to the keyframe latent; in the reported ablation, this setting is scored on RotErr, TransErr, and ATE and receives the highest user-rated quality (HY-World et al., 15 Apr 2026).
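The back-projection formula above can be illustrated with a minimal NumPy sketch. It lifts a depth map to world-space points by applying the inverse intrinsics to homogeneous pixel coordinates, scaling by depth, and rotating into the world frame. The function name `backproject_to_world` and the optional translation argument `t_c2w` are assumptions made for illustration (the formula as quoted shows only the rotation); this is not the authors' implementation.

```python
import numpy as np

def backproject_to_world(depth, K, R_c2w, t_c2w=None):
    """Lift a depth map to world points: R_c2w @ (D(x) * K^-1 @ x_hat) [+ t_c2w].

    depth : (H, W) per-pixel depth D(x)
    K     : (3, 3) camera intrinsics
    R_c2w : (3, 3) camera-to-world rotation
    t_c2w : optional (3,) camera-to-world translation (assumed, not in the quoted formula)
    Returns an (H, W, 3) array of world-space points.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))           # u: column index, v: row index
    x_hat = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = x_hat @ np.linalg.inv(K).T                         # K^-1 x_hat for every pixel
    cam_points = rays * depth[..., None]                      # scale each ray by its depth
    world = cam_points @ R_c2w.T                              # rotate into the world frame
    if t_c2w is not None:
        world = world + t_c2w
    return world
```

Rendering the resulting points into a target camera would then give the sparse geometry images that the text describes as guidance for the camera adapter.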

The predecessor framework already relied on explicit camera conditioning through Uni3C, a ControlNet-style wrapper over Wan2.1-14B-I2V, using Plücker rays and local point-cloud guidance while leaving the large backbone frozen (Zhang et al., 2 Mar 2026). WorldStereo 2.0 retains the camera-conditioned DiT logic but alters the latent substrate and the fine-tuning regime to stabilize view generation under large camera motions.

3. Geometric memory: GGM and SSM++

The expression “consistent memory” in WorldStereo 2.0 refers to two complementary mechanisms: Global-Geometric Memory (GGM) and Improved Spatial-Stereo Memory (SSM++) (HY-World et al., 15 Apr 2026). GGM provides a global 3D prior through extended point clouds. Beginning from a reference point cloud $\mathbf{P}^{ref}$ and additional points sampled from novel views, the model constructs an extended global point cloud.

Rendered videos from this extended point cloud are then used during training so that the model is forced to respect the underlying 3D geometry, rather than treating the point cloud as a weak hint (HY-World et al., 15 Apr 2026). At inference, the panoramic point cloud $\mathbf{P}^{pan}$ serves as the global point cloud, supplying structural coverage from the outset.

This extends the original WorldStereo notion of a global-geometric memory, where a 3D cache was incrementally updated by generating a trajectory, reconstructing a point cloud with WorldMirror, aligning it to the cache through Umeyama alignment, and merging it back into memory (Zhang et al., 2 Mar 2026). In that framework, GGM was explicitly described as injecting coarse structural priors through an incrementally updated global point cloud. WorldStereo 2.0 retains the same conceptual role of global memory while recasting it around a panoramic point cloud and rendered geometric guidance.

SSM++ revises the local consistency mechanism even more substantially. WorldStereo 1.0 used SSM with retrieved reference frames, pointmaps encoding normalized 3D coordinates, horizontally stitched target-reference latents, and constrained self-attention operating only within each two-view pair (Zhang et al., 2 Mar 2026). WorldStereo 2.0 upgrades this to a formulation in which retrieved historical reference keyframes are horizontally stitched with the target along the width dimension and then fed directly into the main DiT as additional tokens, enabling full self-attention across target and reference features (HY-World et al., 15 Apr 2026). Rotary positional encoding is modified on the stitched spatial grid so that the left and right halves receive different spatial coordinates while sharing the same temporal index. Camera poses are normalized into a 7D vector, consisting of quaternion and translation, and processed by a 3-layer MLP to form camera tokens that are added to both target and reference features.
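The stitching and positional layout described above can be sketched in NumPy as follows. The sketch concatenates target and reference latents along width, builds position indices in which the right half continues the horizontal coordinate while sharing the temporal index, and adds camera tokens from a 3-layer MLP over 7D poses. All names, tensor sizes, the ReLU activation, the random weights, and the choice to offset the reference half by `W` are illustrative assumptions; the source does not specify these details.

```python
import numpy as np

def stitch_along_width(target, reference):
    """Concatenate (T, H, W, C) target and reference latents into (T, H, 2W, C)."""
    assert target.shape == reference.shape
    return np.concatenate([target, reference], axis=2)

def stitched_position_ids(T, H, W):
    """(t, y, x) indices on the stitched grid: x runs over [0, 2W), t is shared by both halves."""
    t, y, x = np.meshgrid(np.arange(T), np.arange(H), np.arange(2 * W), indexing="ij")
    return np.stack([t, y, x], axis=-1)

def camera_tokens(pose7, weights):
    """3-layer MLP mapping (T, 7) poses (quaternion + translation) to (T, C) tokens."""
    h = pose7
    for i, (W_i, b_i) in enumerate(weights):
        h = h @ W_i + b_i
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)  # ReLU is an assumption; the source names no activation
    return h

rng = np.random.default_rng(0)
T, H, W, C = 2, 4, 6, 8                      # toy sizes, chosen for illustration only
target = rng.normal(size=(T, H, W, C))
reference = rng.normal(size=(T, H, W, C))
dims = [7, 16, 16, C]                        # hidden width 16 is hypothetical
weights = [(rng.normal(size=(a, b)) * 0.1, np.zeros(b)) for a, b in zip(dims[:-1], dims[1:])]
pose_target = rng.normal(size=(T, 7))        # placeholder poses, not normalized quaternions
pose_reference = rng.normal(size=(T, 7))

# Camera tokens are broadcast over the spatial grid and added to each view's features.
target_feat = target + camera_tokens(pose_target, weights)[:, None, None, :]
reference_feat = reference + camera_tokens(pose_reference, weights)[:, None, None, :]
stitched = stitch_along_width(target_feat, reference_feat)
pos = stitched_position_ids(T, H, W)
```

In this layout, a full self-attention layer sees target and reference tokens in one sequence, and the position indices are what distinguish the two halves.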

Unlike the original SSM, SSM++ does not use a separate memory branch and does not rely on explicit pointmaps. Instead, it uses selective retrieval based on 3D FoV similarity, with a capped number of references per clip, and camera embeddings provide the geometric cues needed for cross-view disambiguation (HY-World et al., 15 Apr 2026). The HY-World 2.0 ablation states that replacing spatial stitching with temporal concatenation severely degrades both consistency and camera metrics, which is presented as confirmation that the spatial-stereo design is critical (HY-World et al., 15 Apr 2026).
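The source does not define how 3D FoV similarity is computed, so the following is only one plausible proxy, stated as an assumption: score two cameras by the intersection-over-union of the scene points each can see, then keep the top-scoring memory entries. The helper names `visible_mask`, `fov_similarity`, and `retrieve_references` are hypothetical.

```python
import numpy as np

def visible_mask(points_w, R_w2c, t_w2c, K, width, height, near=1e-3):
    """Boolean mask of world points that land inside the image with positive depth."""
    cam = points_w @ R_w2c.T + t_w2c              # world -> camera coordinates
    z = cam[:, 2]
    in_front = z > near
    uv = cam @ K.T                                # unnormalized pixel coordinates
    z_safe = np.where(in_front, z, 1.0)           # avoid dividing by non-positive depth
    u, v = uv[:, 0] / z_safe, uv[:, 1] / z_safe
    return in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)

def fov_similarity(mask_a, mask_b):
    """IoU of the point sets visible from two cameras (an assumed proxy metric)."""
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(mask_a, mask_b).sum() / union)

def retrieve_references(target_mask, memory_masks, k):
    """Indices of up to k memory entries with the highest non-zero similarity."""
    scores = np.array([fov_similarity(target_mask, m) for m in memory_masks])
    order = np.argsort(-scores, kind="stable")[:k]
    return [int(i) for i in order if scores[i] > 0]
```

A visibility-based score of this kind prefers references that overlap the target frustum, which matches the intent of selective retrieval even if the exact criterion in the paper differs.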

4. Training stages and memory-aware distillation

WorldStereo 2.0 is trained in three stages: domain-adaptation, middle training with memory, and post-training distillation (HY-World et al., 15 Apr 2026). Domain-adaptation converts the original video VDM into a camera-controlled keyframe generator. Middle training adds GGM and SSM++ to enforce multi-trajectory consistency. Post-training applies DMD to obtain a 4-step student for fast inference.

The middle-training stage requires multi-view trajectories and cross-trajectory correspondences. The reported data sources include real datasets such as DL3DV-10k, TartanAir, MapFree, and RGBD-objects, together with synthetic Unreal Engine scenes containing multiple trajectories per asset (HY-World et al., 15 Apr 2026). Training-time robustness is improved through memory augmentation: depth downsampling, blurring, raw noisy monocular depth for point-cloud conditioning, and motion blur, color jitter, and random cropping for retrieved frames (HY-World et al., 15 Apr 2026). The original WorldStereo paper described related robustness measures, including point-cloud masking by randomly dropping 30–70% of points and masking 20–70% of image area before back-projection, so that global point clouds are treated as coarse structural priors rather than exact ground truth (Zhang et al., 2 Mar 2026).
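As an illustration of the point-cloud masking attributed to the original WorldStereo paper, the sketch below drops a random 30–70% of points from a cloud. The function name, the per-point Bernoulli sampling, and the uniform choice of drop ratio are assumptions; the paper's exact sampling scheme is not given in this article.

```python
import numpy as np

def random_point_dropout(points, rng, min_drop=0.3, max_drop=0.7):
    """Randomly drop a fraction of points, with the fraction drawn from [min_drop, max_drop].

    points : (N, 3) array of 3D points
    rng    : numpy.random.Generator
    Returns the kept points and the sampled drop ratio.
    """
    drop_ratio = rng.uniform(min_drop, max_drop)
    keep = rng.random(len(points)) >= drop_ratio   # each point survives with prob. 1 - drop_ratio
    return points[keep], drop_ratio

rng = np.random.default_rng(0)
cloud = rng.normal(size=(10_000, 3))
masked_cloud, ratio = random_point_dropout(cloud, rng)
```

Training on such degraded clouds is what lets the model treat global geometry as a coarse prior rather than exact ground truth.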

The explicit training formula emphasized in both descriptions is the DMD objective.

In WorldStereo 1.0, DMD distilled only the camera-control behavior while keeping the memory branches uninvolved during distillation, producing a 4-step student that is faster than the 40-step teacher (Zhang et al., 2 Mar 2026). In WorldStereo 2.0, by contrast, distillation is memory-aware: GGM and SSM++ remain active, and synthetic UE multi-trajectory data make this feasible (HY-World et al., 15 Apr 2026). This is one of the sharpest distinctions between the two versions.
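For orientation, the generator gradient of Distribution Matching Distillation is usually written in the form below. This is the generic expression from the DMD literature, not necessarily the exact notation of either paper. Here $G_\theta$ is the few-step student, $s_{\mathrm{real}}$ and $s_{\mathrm{fake}}$ are the score functions of the teacher and of an auxiliary model tracking the student's outputs, and $w_t$ is a time-dependent weight. The conditioning variable $c$, standing for camera and memory inputs, is an assumption added here to indicate where memory-aware distillation would enter.

```latex
\[
\nabla_\theta \mathcal{L}_{\mathrm{DMD}}
\simeq
\mathbb{E}_{z,\,t,\,x_t}\!\left[
  w_t\,\alpha_t\,
  \bigl( s_{\mathrm{fake}}(x_t, t \mid c) - s_{\mathrm{real}}(x_t, t \mid c) \bigr)\,
  \frac{\partial G_\theta(z \mid c)}{\partial \theta}
\right],
\qquad
x_t = \alpha_t\, G_\theta(z \mid c) + \sigma_t\, \epsilon,\quad \epsilon \sim \mathcal{N}(0, I).
\]
```

Under this reading, keeping GGM and SSM++ active during distillation means that $c$ includes the rendered geometry and retrieved references for the student and for both score models.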

| Aspect | WorldStereo | WorldStereo 2.0 |
| --- | --- | --- |
| Latent formulation | Video-VAE with spatio-temporal compression | Keyframe-VAE with per-image compression only |
| Local memory | SSM with separate memory branch, constrained attention, explicit pointmaps | SSM++ in main DiT, full self-attention, camera embeddings |
| Distillation | Camera-control only; memory frozen | Distillation with full memory active |

The table summarizes explicit architectural changes. A plausible implication is that WorldStereo 2.0 trades dense temporal continuity for higher per-view geometric fidelity and more direct integration of retrieval-based memory into the main generative backbone.

5. Inference, trajectory conditioning, and 3D world construction

At inference time, WorldStereo 2.0 assumes a panorama, a panoramic point cloud $\mathbf{P}^{pan}$, and explicit trajectories from WorldNav (HY-World et al., 15 Apr 2026). The memory bank is initialized by subdividing the panorama into $T_{pan}$ perspective views $\{\mathbf{V}^{pan}_j\}_{j=1}^{T_{pan}}$, encoding them with the Keyframe-VAE, and inserting them into the SSM++ memory bank. GGM is initialized by setting $\mathbf{P}^{pan}$ as the global point cloud and rendering sparse geometry images from it as guidance for the camera adapter. Each trajectory is split into clips of a fixed number of frames, and for each clip the model retrieves a capped number of relevant keyframes through 3D FoV similarity, spatially stitches them as SSM++ inputs, runs the 4-step distilled DiT, and decodes the resulting keyframe latents (HY-World et al., 15 Apr 2026). Newly generated keyframes and poses are then appended to memory.
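The control flow of this inference procedure can be sketched in plain Python. The function `expand_world` and its `retrieve` and `generate_clip` callables are hypothetical stand-ins for FoV-based retrieval and for the distilled DiT with Keyframe-VAE decoding; the stubs in the usage portion exist only so the loop runs, and the clip length and reference cap are arbitrary toy values.

```python
def expand_world(trajectories, panorama_views, clip_len, max_refs, retrieve, generate_clip):
    """Generate keyframes clip by clip while growing a shared memory bank.

    trajectories   : list of trajectories, each a list of camera poses
    panorama_views : list of (keyframe, pose) pairs that seed the memory bank
    retrieve       : callable(clip_poses, memory, max_refs) -> list of memory entries
    generate_clip  : callable(clip_poses, references) -> list of keyframes, one per pose
    """
    memory = list(panorama_views)                 # memory starts from the panorama subdivisions
    outputs = []
    for trajectory in trajectories:
        for start in range(0, len(trajectory), clip_len):
            clip_poses = trajectory[start:start + clip_len]
            references = retrieve(clip_poses, memory, max_refs)
            keyframes = generate_clip(clip_poses, references)
            new_entries = list(zip(keyframes, clip_poses))
            memory.extend(new_entries)            # later clips can retrieve earlier outputs
            outputs.extend(new_entries)
    return outputs, memory

# Toy usage with stub components (all hypothetical).
seen_refs = []

def retrieve_latest(clip_poses, memory, max_refs):
    refs = memory[-max_refs:]                     # stand-in for FoV-similarity retrieval
    seen_refs.append(len(refs))
    return refs

def fake_generate(clip_poses, references):
    return [f"keyframe@{pose}" for pose in clip_poses]

outputs, memory = expand_world(
    trajectories=[[0, 1, 2, 3, 4], [10, 11]],
    panorama_views=[("pan0", "cam0"), ("pan1", "cam1")],
    clip_len=2,
    max_refs=3,
    retrieve=retrieve_latest,
    generate_clip=fake_generate,
)
```

Because memory is shared across trajectories, keyframes generated along the first trajectory are available as references when the second one is processed, which is the mechanism the text credits for cross-trajectory consistency.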

This preserves the generate-first, then-reconstruct logic already present in WorldStereo, where multiple trajectories are generated and then reconstructed into a point cloud by WorldMirror (Zhang et al., 2 Mar 2026). In HY-World 2.0, however, this reconstruction pathway is extended into a depth-alignment and 3DGS pipeline. WorldMirror 2.0 predicts depths and normals, after which the predicted depths are aligned to the panoramic point cloud by first defining a reliability mask and then fitting a linear depth transform with RANSAC on the valid region (HY-World et al., 15 Apr 2026). The aligned depths are back-projected and fused with the panoramic point cloud to form an expanded point cloud, which initializes the subsequent 3DGS optimization (HY-World et al., 15 Apr 2026).
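A linear depth alignment with RANSAC can be sketched as below. The sketch fits a scale and shift so that `ref ≈ a * pred + b` on pixels marked valid, using two-point hypotheses and a least-squares refit on the best inlier set. The function name, the iteration count, the inlier threshold, and the choice to fit in depth rather than inverse depth are all assumptions; the reliability mask is taken as a given input because its definition is not available here.

```python
import numpy as np

def ransac_linear_depth(pred, ref, valid, iters=200, thresh=0.05, rng=None):
    """Fit ref ≈ a * pred + b on valid pixels with RANSAC; returns (a, b)."""
    rng = np.random.default_rng(0) if rng is None else rng
    x, y = pred[valid].ravel(), ref[valid].ravel()
    best_inliers = None
    for _ in range(iters):
        i, j = rng.choice(len(x), size=2, replace=False)    # minimal sample: two pixels
        if np.isclose(x[i], x[j]):
            continue                                         # degenerate pair, no unique line
        a = (y[i] - y[j]) / (x[i] - x[j])
        b = y[i] - a * x[i]
        inliers = np.abs(a * x + b - y) < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers is None:
        raise ValueError("no non-degenerate sample found")
    # Refit by least squares on the largest consensus set.
    A = np.stack([x[best_inliers], np.ones(int(best_inliers.sum()))], axis=1)
    a, b = np.linalg.lstsq(A, y[best_inliers], rcond=None)[0]
    return float(a), float(b)
```

The aligned depth `a * pred + b` could then be back-projected and merged with the existing point cloud, as the paragraph above describes.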

WorldStereo 2.0 therefore functions as the view-synthesis mechanism that makes downstream point-cloud expansion and Gaussian scene optimization feasible. The papers explicitly state that inconsistent generated views would make alignment unreliable, so camera fidelity and cross-trajectory consistency are not secondary perceptual qualities; they are structural preconditions for the later 3D stages (HY-World et al., 15 Apr 2026).

6. Empirical performance, comparisons, and limitations

HY-World 2.0 reports that WorldStereo 2.0 improves camera control relative to WorldStereo 1.0 even before memory is considered. The camera-control comparison of WorldStereo 2.0* against WorldStereo* covers RotErr, TransErr, ATE, and CLIP-I (HY-World et al., 15 Apr 2026). For single-view generative 3D reconstruction, WorldStereo 2.0 (DMD) is evaluated on Tanks-and-Temples by Precision, Recall, F1, and AUC, with Gen3C and Lyra supplying the best reported baseline F1 values. On MipNeRF360, it is evaluated by F1 and AUC against SEVA, Gen3C, and Lyra (HY-World et al., 15 Apr 2026).

The memory ablation is especially diagnostic. It compares a baseline camera-control-only configuration, scored on PSNR and SSIM, with a full-memory configuration trained with large batch size, which is additionally scored on a second PSNR variant and a second SSIM variant. After distillation, PSNR and both variant metrics rise while camera metrics remain good (HY-World et al., 15 Apr 2026). In the original WorldStereo study, the analogous conclusion was that GGM improved coarse structure and SSM improved fine-detail consistency, with PSNR and LPIPS gains on a dedicated memory benchmark (Zhang et al., 2 Mar 2026). The two reports are therefore consistent in attributing geometric stability to global memory and high-frequency consistency to stereo-style local retrieval.

Several limitations are stated or directly implied. WorldStereo 2.0 assumes mostly static scenes; dynamic objects are not explicitly modeled and may cause ghosting or inconsistent geometry (HY-World et al., 15 Apr 2026). It depends on monocular depth, especially MoGe2, so failures in outdoor scenes, sky estimation, or global scale propagate into GGM and can degrade camera guidance (HY-World et al., 15 Apr 2026). Memory bank size and computation scale with the number of keyframes, and very long or dense trajectories may become expensive despite selective retrieval (HY-World et al., 15 Apr 2026). Extreme occlusions and highly cluttered environments remain difficult, and lighting changes are not explicitly modeled (HY-World et al., 15 Apr 2026). The earlier WorldStereo paper adds a closely related limitation: reconstruction is outside the diffusion model, so there is no end-to-end gradient connecting 3D quality back to video generation (Zhang et al., 2 Mar 2026).

Taken together, these sources characterize WorldStereo 2.0 as a geometry-aware, keyframe-based world-expansion model that upgrades the original WorldStereo memory design for panorama-conditioned, trajectory-driven view synthesis at scale. Its central contribution is not a new 3D representation by itself, but a generative interface between camera planning, persistent geometric memory, and downstream reconstruction systems that require camera-accurate, cross-view-consistent imagery (HY-World et al., 15 Apr 2026).
