WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation (2510.07313v1)

Published 8 Oct 2025 in cs.CV and cs.RO

Abstract: Wrist-view observations are crucial for VLA models as they capture fine-grained hand-object interactions that directly enhance manipulation performance. Yet large-scale datasets rarely include such recordings, resulting in a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap, as they require a wrist-view first frame and thus fail to generate wrist-view videos from anchor views alone. Amid this gap, recent visual geometry models such as VGGT emerge with geometric and cross-view priors that make it possible to address extreme viewpoint shifts. Inspired by these insights, we propose WristWorld, the first 4D world model that generates wrist-view videos solely from anchor views. WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and incorporates our Spatial Projection Consistency (SPC) Loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (ii) Generation, which employs our video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Experiments on Droid, Calvin, and Franka Panda demonstrate state-of-the-art video generation with superior spatial consistency, while also improving VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor-wrist view gap.

Summary

  • The paper introduces a two-stage 4D generative world model that synthesizes wrist-view videos solely from anchor views, bridging a critical data gap.
  • The methodology leverages a VGGT-based wrist head, SPC loss, and a diffusion-based video generator to ensure geometric and temporal consistency.
  • Experimental results demonstrate state-of-the-art performance across datasets, significantly improving task completion and bridging the exocentric–egocentric gap.

WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation

Introduction and Motivation

WristWorld addresses a critical bottleneck in vision–language–action (VLA) robotics: the scarcity of wrist-view data in large-scale manipulation datasets. Wrist-view observations are essential for capturing fine-grained hand–object interactions, which are pivotal for precise manipulation and robust policy learning. However, most datasets predominantly provide anchor (third-person) views, resulting in a substantial anchor–wrist domain gap. Existing world models are unable to synthesize wrist-view sequences from anchor views alone, as they typically require a wrist-view first frame for conditioning. WristWorld introduces a two-stage 4D generative world model that bridges this gap, enabling the synthesis of geometrically and temporally consistent wrist-view videos solely from anchor views (Figure 1).

Figure 1: WristWorld synthesizes realistic wrist-view videos from anchor views via a two-stage process, expanding training data and improving downstream VLA model performance.

Methodology

Two-Stage 4D Generative World Model

WristWorld comprises two distinct stages: reconstruction and generation.

Reconstruction Stage:

This stage extends the Visual Geometry Grounded Transformer (VGGT) with a dedicated wrist head. The wrist head regresses wrist camera extrinsics (rotation and translation) from aggregated multi-view features using transformer-based attention. A novel Spatial Projection Consistency (SPC) loss is introduced, which enforces geometric alignment between 2D correspondences and reconstructed 3D/4D geometry using only RGB data, obviating the need for explicit depth or extrinsic supervision. The output is a sequence of wrist-view condition maps, generated by projecting reconstructed 3D scenes into the wrist-view image plane.
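
To make this concrete, the following is a minimal PyTorch sketch of what an SPC-style objective could look like, assuming dense 2D–2D correspondences have already been lifted to 3D points via the reconstructed geometry; the tensor names, camera convention, and equal weighting of the two terms are illustrative assumptions rather than the paper's exact implementation.

    import torch

    def spc_loss(points_3d, uv_wrist, K_wrist, R_wrist, t_wrist, depth_margin=0.0):
        # points_3d: (N, 3) world-frame points lifted from anchor-view correspondences
        # uv_wrist:  (N, 2) matching pixel locations observed in the wrist view
        # K_wrist:   (3, 3) wrist camera intrinsics
        # R_wrist, t_wrist: predicted wrist extrinsics (world -> camera)

        # Transform world points into the predicted wrist camera frame.
        p_cam = points_3d @ R_wrist.T + t_wrist          # (N, 3)
        z = p_cam[:, 2]

        # Pinhole projection into the wrist image plane.
        uv_proj = (p_cam @ K_wrist.T)[:, :2] / z.clamp(min=1e-6).unsqueeze(-1)

        # Reprojection term: MSE between projected and observed correspondences.
        reproj = torch.mean((uv_proj - uv_wrist) ** 2)

        # Penalize geometrically infeasible (non-positive) depths.
        neg_depth = torch.mean(torch.relu(depth_margin - z))

        return reproj + neg_depth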

Generation Stage:

A diffusion-based video generator (DiT) synthesizes temporally coherent wrist-view videos conditioned on the wrist-view projections and CLIP-encoded anchor-view features. The conditioning stream concatenates VAE-encoded wrist-view projections with noisy wrist-view latents, and incorporates global semantics via CLIP and text embeddings. This design ensures that the generated wrist-view sequences are both geometrically consistent and semantically aligned with the anchor views (Figure 2).

Figure 2: Overview of the two-stage 4D generative world model: reconstruction (VGGT + wrist head + SPC loss) and generation (DiT conditioned on wrist projections and CLIP features).
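
The exact DiT architecture is not reproduced here, but the sketch below illustrates the core conditioning mechanism described above: channel-wise concatenation of VAE-encoded wrist-view projections with the noisy wrist-view latents ahead of patch embedding. Channel counts and dimensions are assumptions for illustration.

    import torch
    import torch.nn as nn

    class ConditionedPatchEmbed(nn.Module):
        # Patch embedding widened to take noisy wrist-view latents concatenated
        # with VAE-encoded wrist-view projection latents. Channel counts are
        # illustrative, not the paper's exact configuration.
        def __init__(self, latent_ch=16, cond_ch=16, dim=1024, patch=2):
            super().__init__()
            self.proj = nn.Conv2d(latent_ch + cond_ch, dim,
                                  kernel_size=patch, stride=patch)

        def forward(self, noisy_latent, cond_latent):
            # (B, 16, H, W) + (B, 16, H, W) -> (B, 32, H, W)
            x = torch.cat([noisy_latent, cond_latent], dim=1)
            x = self.proj(x)                       # (B, dim, H/p, W/p)
            return x.flatten(2).transpose(1, 2)    # (B, num_patches, dim)

    # CLIP-encoded anchor-view features and text embeddings would then enter
    # the DiT blocks as cross-attention context alongside these patch tokens.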

Key Components

  • Wrist Head: Transformer decoder with learnable wrist queries, regressing wrist camera pose from multi-view features.
  • SPC Loss: Enforces geometric consistency by penalizing reprojection errors and infeasible depth values, using dense 2D–2D correspondences lifted to 3D–2D pairs.
  • Condition Map Generation: Projects reconstructed 3D scenes into wrist-view frames, providing temporally aligned structural guidance (see the projection sketch after this list).
  • Video Generation Model: DiT with expanded patch embedding to accommodate concatenated latents, conditioned on CLIP and text features for semantic fidelity.
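
As a rough illustration of the condition-map step, the sketch below projects a reconstructed colored point cloud into the wrist camera using a plain pinhole model, assuming known intrinsics and the extrinsics estimated by the wrist head; the paper's actual rasterization (e.g., occlusion handling) may differ.

    import torch

    def render_condition_map(points, colors, K, R, t, H=480, W=640):
        # points: (N, 3) world-frame point cloud from the stage-one reconstruction
        # colors: (N, 3) per-point RGB values in [0, 1]
        # K, R, t: wrist intrinsics and the extrinsics estimated by the wrist head

        p_cam = points @ R.T + t
        z = p_cam[:, 2]
        front = z > 1e-6                       # keep points in front of the camera

        uv = (p_cam[front] @ K.T)[:, :2] / z[front].unsqueeze(-1)
        u = uv[:, 0].round().long()
        v = uv[:, 1].round().long()
        inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)

        # Splat colors onto an empty canvas (no z-buffering, for brevity).
        canvas = torch.zeros(H, W, 3)
        canvas[v[inside], u[inside]] = colors[front][inside]
        return canvas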

Experimental Results

Quantitative and Qualitative Evaluation

WristWorld is evaluated on Droid, Calvin, and Franka Panda datasets. It achieves state-of-the-art performance across all video generation metrics (FVD, LPIPS, SSIM, PSNR) without requiring wrist-view first-frame guidance. Notably, on the Franka Panda dataset, WristWorld demonstrates robust generalization to real-world manipulation scenarios with significant viewpoint variation (Figure 3).

Figure 3: Generated condition maps exhibit superior viewpoint consistency compared to 3D Base and WoW 14B baselines.

Figure 4: On the Calvin benchmark, WristWorld produces wrist-view videos with improved spatial and viewpoint consistency over SVD.

Figure 5: On Franka real-robot data, WristWorld generates wrist-view sequences closely matching ground-truth observations, demonstrating strong third-to-wrist generalization.

Figure 6: Qualitative visualization on Franka Panda: WristWorld maintains superior geometric consistency and wrist-following behavior compared to prior models.

Downstream VLA Performance

Augmenting VLA training data with synthesized wrist-view videos from WristWorld yields measurable improvements. On Calvin, the average task completion length increases by 3.81%, and the anchor–wrist gap is closed by 42.4%. On Franka Panda, task success rates improve by 15.5% on average when using generated wrist views. These results confirm that synthetic wrist-view data provides effective supervisory signals for VLA models, enhancing both perception and control.

Plug-and-Play Extension

WristWorld can be integrated as a plug-in module into existing single-view world models, enabling post-hoc synthesis of wrist-view videos from anchor-view rollouts without requiring wrist-view initialization. This design facilitates scalable multi-view extension and enriches the observation space for downstream tasks.
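
A hypothetical integration sketch is shown below; the AnchorWorldModel and WristWorldLike interfaces are illustrative placeholders for how such a plug-in could be wired, not the authors' released API.

    from typing import Protocol, Sequence, Tuple

    class AnchorWorldModel(Protocol):
        # Any single-view world model that rolls out anchor-view frames.
        def rollout(self, first_frame, actions) -> Sequence: ...

    class WristWorldLike(Protocol):
        # Hypothetical two-stage interface mirroring the paper's pipeline.
        def reconstruct(self, anchor_frames) -> Tuple: ...
        def generate(self, anchor_frames, wrist_poses, point_clouds): ...

    def synthesize_wrist_rollout(wm: AnchorWorldModel, ww: WristWorldLike,
                                 first_frame, actions):
        anchor_frames = wm.rollout(first_frame, actions)           # anchor rollout
        wrist_poses, point_clouds = ww.reconstruct(anchor_frames)  # stage 1
        return ww.generate(anchor_frames, wrist_poses, point_clouds)  # stage 2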

Ablation Study

Ablation experiments reveal that wrist-view projection and SPC loss are critical for high-quality video generation. Removing either component leads to substantial degradation in temporal coherence and perceptual fidelity, underscoring the necessity of projection-based conditioning and geometric supervision.

Implementation Details

  • Hardware: 8 × A800 GPUs for both reconstruction and generation stages.
  • Training: Reconstruction stage (12h pretrain, 6h finetune), generation stage (24h pretrain, 12h finetune), batch size 4, resolution 640×480.
  • Model Architecture: VGGT backbone with wrist head (3 layers, 8 heads, 1024-dim), DiT with expanded patch embedding (32 channels; see the sketch after this list), VAE for latent encoding, CLIP for semantic conditioning.
  • Loss Functions: SPC loss for geometric consistency, MSE for reprojection error, negative depth penalty for infeasible projections.

Figure 7: Experimental setup with Franka Panda manipulator and multi-view RealSense cameras, enabling robust perception and manipulation.
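
One common way to realize the expanded 32-channel patch embedding when starting from a pretrained DiT is to copy the pretrained kernel and zero-initialize the new input channels; whether WristWorld initializes this way is an assumption, as the summary only states that the embedding is expanded. A sketch:

    import torch
    import torch.nn as nn

    def expand_patch_embed(old: nn.Conv2d, new_in_ch: int) -> nn.Conv2d:
        # Widen a pretrained patch-embedding conv to accept extra input channels.
        # The pretrained kernel is copied into the first channels and the new
        # channels are zero-initialized, so the widened model initially ignores
        # the condition latents. (Assumed initialization, not confirmed above.)
        new = nn.Conv2d(new_in_ch, old.out_channels,
                        kernel_size=old.kernel_size, stride=old.stride,
                        bias=old.bias is not None)
        with torch.no_grad():
            new.weight.zero_()
            new.weight[:, :old.in_channels] = old.weight
            if old.bias is not None:
                new.bias.copy_(old.bias)
        return new

    # Example: widen a 16-channel latent embedding to the 32 channels needed
    # for concatenated noisy + condition latents.
    pretrained = nn.Conv2d(16, 1024, kernel_size=2, stride=2)
    widened = expand_patch_embed(pretrained, 32)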

Implications and Future Directions

WristWorld provides a scalable solution for bridging the exocentric–egocentric gap in robotic datasets, enabling the synthesis of wrist-centric views from abundant anchor-view data. This approach reduces the need for costly wrist-view data collection and calibration, and facilitates the development of more robust VLA policies. The framework's plug-and-play design allows seamless integration with existing world models, promoting multi-view generalization and improved manipulation performance.

Future research may explore extending WristWorld to other egocentric perspectives (e.g., finger-tip or tool-mounted cameras), incorporating more advanced geometric priors, and leveraging self-supervised or unsupervised learning for further data efficiency. Additionally, investigating the impact of synthetic wrist-view data on policy transfer and sim-to-real adaptation remains a promising direction.

Conclusion

WristWorld introduces a principled two-stage 4D generative framework for synthesizing wrist-view videos from anchor views, leveraging geometric transformers, projection-based conditioning, and diffusion-based video generation. Empirical results demonstrate strong video generation quality and significant gains in downstream VLA performance. By augmenting datasets with wrist-centric views, WristWorld offers a practical and scalable solution for advancing robotic manipulation and embodied AI.
