Generated Reality: World Simulation via Interactive Video with Hand & Camera Control

This presentation explores a breakthrough in immersive XR experiences that replaces traditional 3D rendering with real-time generative video. The researchers develop a video diffusion system that responds simultaneously to tracked head pose and detailed hand movements, enabling users to manipulate virtual environments naturally without pre-built assets. Through a hybrid hand-conditioning strategy and real-time distillation, the system achieves photorealistic, embodied simulation suitable for interactive VR applications, demonstrating a 68-point jump in task-completion rate when precise hand control is incorporated.
Script
Creating immersive virtual worlds has always demanded armies of 3D artists, massive compute budgets, and months of development time. But what if a headset could generate photorealistic environments in real time, responding instantly to every flick of your wrist and turn of your head?
The researchers identified a fundamental mismatch. Video world models can generate stunning footage, but they lack the fine-grained embodied control that makes virtual reality feel real. Without accurate hand tracking, users can't naturally pick up objects, turn knobs, or interact with their surroundings.
The authors propose a hybrid conditioning strategy that bridges this gap.
They split hand representation into two complementary streams. A 2D skeleton video provides pixel-aligned spatial cues, while 3D hand pose parameters encode precise joint articulation. Token addition in the latent space merges these modalities, resolving depth ambiguity without sacrificing real-time performance.
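To make the token-addition idea concrete, here is a minimal sketch of how the two hand streams could be merged in latent space. All module names, tensor sizes, and the per-frame single pose token are illustrative assumptions, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class HybridHandConditioner(nn.Module):
    """Illustrative sketch: pixel-aligned tokens from a 2D skeleton-video
    latent are merged with a projected 3D hand-pose token by addition.
    Dimensions and layer choices are assumptions for this example."""

    def __init__(self, latent_dim=128, pose_dim=2 * 21 * 3):
        super().__init__()
        # 2D stream: patchify the per-frame skeleton-video latent into tokens.
        self.skeleton_proj = nn.Conv2d(4, latent_dim, kernel_size=2, stride=2)
        # 3D stream: one token per frame from flattened joint parameters
        # (here: 2 hands x 21 joints x 3 values, a common convention).
        self.pose_proj = nn.Linear(pose_dim, latent_dim)

    def forward(self, skeleton_latent, hand_pose):
        # skeleton_latent: (B, 4, H, W) latent of the rendered skeleton video
        # hand_pose:       (B, pose_dim) 3D joint parameters for both hands
        tokens = self.skeleton_proj(skeleton_latent)         # (B, D, H/2, W/2)
        tokens = tokens.flatten(2).transpose(1, 2)           # (B, N, D)
        pose_token = self.pose_proj(hand_pose).unsqueeze(1)  # (B, 1, D)
        # Token addition: the 3D pose token is broadcast onto every
        # spatial token, injecting articulation without extra tokens.
        return tokens + pose_token
```

Addition (rather than concatenation) keeps the sequence length fixed, which is one way to avoid extra attention cost at inference time.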
The full pipeline starts with a commercial VR headset tracking both head and hand poses. Camera pose is encoded via Plücker-ray embeddings, while hand data flows through the hybrid conditioning pathway. These signals converge in a transformer-based diffusion model that autoregressively generates video frames, achieving up to 11 frames per second with 1.4 seconds of latency on a single GPU.
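For readers unfamiliar with Plücker-ray embeddings, the sketch below shows one common formulation (not necessarily the paper's exact variant): each pixel is mapped to the 6D Plücker coordinates (d, o × d) of its viewing ray, where o is the camera center and d the unit ray direction in world space. The intrinsics/extrinsics convention here is an assumption:

```python
import numpy as np

def plucker_embedding(K, R, t, height, width):
    """Per-pixel Plücker-ray embedding for a camera with intrinsics K and
    world-to-camera extrinsics [R | t]. Returns an (H, W, 6) map of
    (direction, moment) pairs. Conventions are illustrative."""
    # Pixel grid sampled at pixel centers.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3)
    # Back-project pixels to camera-space rays, rotate into world space.
    dirs = pix @ np.linalg.inv(K).T @ R                     # (H, W, 3)
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Camera center in world coordinates for world-to-camera [R | t].
    o = (-R.T @ t).reshape(1, 1, 3)
    moment = np.cross(o, dirs)                              # o x d
    return np.concatenate([dirs, moment], axis=-1)          # (H, W, 6)
```

The appeal of this parameterization is that the 6D value changes smoothly with head motion and is invariant to where along the ray a point sits, which makes it a convenient dense conditioning signal for a diffusion model.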
To enable real-time interaction, the authors distilled a slow, bidirectional teacher model into a fast, autoregressive student using a self-forcing technique. This compression preserves visual fidelity while slashing inference time, making the system practical for live VR deployment.
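The core of self-forcing is that the student trains on rollouts conditioned on its own generations rather than ground-truth frames, so train-time and test-time distributions match. The toy loop below sketches that idea only; the frame models, the MSE objective, and the teacher rollout used as the target are stand-in assumptions, not the paper's actual distillation recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFrameModel(nn.Module):
    """Stand-in for a video model: maps (condition, previous frame) to the
    next frame. Purely illustrative, not the paper's architecture."""
    def __init__(self, dim=8):
        super().__init__()
        self.net = nn.Linear(2 * dim, dim)

    def forward(self, cond, prev):
        return self.net(torch.cat([cond, prev], dim=-1))

def self_forcing_step(student, teacher, cond, num_frames, optimizer):
    # Student rolls out autoregressively on its OWN previous outputs,
    # which is the defining trait of self-forcing distillation.
    prev = torch.zeros_like(cond)
    rollout = []
    for _ in range(num_frames):
        prev = student(cond, prev)
        rollout.append(prev)
    rollout = torch.stack(rollout, dim=1)                   # (B, T, D)
    # A frozen teacher supplies targets for the same clip length.
    with torch.no_grad():
        prev_t = torch.zeros_like(cond)
        targets = []
        for _ in range(num_frames):
            prev_t = teacher(cond, prev_t)
            targets.append(prev_t)
        targets = torch.stack(targets, dim=1)
    loss = F.mse_loss(rollout, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because gradients flow through the student's whole rollout, errors the student makes early in a clip are penalized by their downstream effects, which is what stabilizes long autoregressive generation after distillation.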
Qualitative results show the system handling challenging scenarios: hands partially occluded by objects, fingers curled around handles, palms pressed against surfaces. The hybrid approach outperforms both 2D-only and 3D-only conditioning, especially when depth ambiguity or complex articulation is present.
In a controlled user study, participants wearing VR headsets attempted manipulation tasks in generated environments. Text-only baselines yielded a dismal 3% success rate. But when hand and head tracking were enabled, task completion soared to 71%, and users reported feeling dramatically more in control—a 2.5-point increase on a 7-point scale.
The system is not yet ready for consumer deployment. Frame rates, latency, and long-term stability all fall short of commercial VR expectations. Resolution constraints and the lack of true stereo rendering limit immersion. But these are engineering challenges, not fundamental roadblocks, and the authors lay out a clear path forward.
This work proves that asset-free, embodied XR is no longer science fiction—it's a tractable engineering problem. When video models learn to see through your eyes and respond to your hands, virtual worlds stop being stages and start becoming places you can touch. Visit EmergentMind.com to dive deeper into this research.