The paper presents Gen3C, a video generation model that integrates explicit three-dimensional awareness to improve consistency and camera precision in video synthesis. It targets common failure modes of existing video generation methods, namely temporal inconsistency and imprecise camera control, which arise when the network generates frames without adequate 3D information to guide it.
Overview and Methodology
Gen3C maintains a 3D cache, structured as point clouds obtained by estimating the depth of seed images or initially generated frames, to keep the generated video consistent with a single underlying world. The cache also strengthens camera control by giving the model an explicit structure to follow. As new frames are rendered, the cache lets the system concentrate its generative capacity on previously unseen regions, and it is updated with the newly generated content so that consistency is preserved over time.
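To make the caching idea concrete, the minimal sketch below (Python with NumPy; the function and class names are hypothetical, as the paper does not publish this interface) lifts a seed frame with an estimated depth map into a colored point cloud and accumulates it in a simple cache:

```python
import numpy as np

def unproject_to_point_cloud(rgb, depth, K):
    """Lift an H x W x 3 RGB image and per-pixel depth map into a colored
    3D point cloud using pinhole intrinsics K (3 x 3)."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    x = (us - K[0, 2]) * depth / K[0, 0]
    y = (vs - K[1, 2]) * depth / K[1, 1]
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    return points, colors

class PointCloudCache:
    """Accumulates points from every processed frame so that later frames
    can be conditioned on everything observed so far."""
    def __init__(self):
        self.points = np.empty((0, 3))
        self.colors = np.empty((0, 3))

    def append(self, points, colors):
        self.points = np.concatenate([self.points, points], axis=0)
        self.colors = np.concatenate([self.colors, colors], axis=0)
```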
This design largely removes the need for the model to memorize earlier generations or to infer scene structure from camera poses alone. Instead, the 3D cache is rendered to 2D along the user-specified camera trajectory, providing the video diffusion model with strong geometric conditioning. With this conditioning, Gen3C outperforms the prior state of the art across a range of settings, including sparse-view novel view synthesis, complex scenes such as driving environments, and monocular dynamic videos.
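As a rough illustration of that conditioning step, the sketch below projects the cached points into a target camera pose and returns both the rendered image and a coverage mask; in a pipeline of this kind, such a rendering (with holes at disocclusions) conditions the diffusion model. The simple point splatting and the function signature are assumptions for exposition, not Gen3C's actual renderer:

```python
import numpy as np

def render_cache(points, colors, K, R, t, h, w):
    """Project world-space points into an h x w image for a camera with
    rotation R (3 x 3), translation t (3,), and intrinsics K; the nearest
    point wins per pixel. Returns the image and a mask of covered pixels."""
    cam = points @ R.T + t                       # world -> camera frame
    front = cam[:, 2] > 1e-6                     # keep points in front of the camera
    cam, cols = cam[front], colors[front]
    uv = cam @ K.T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, cols = u[inside], v[inside], cam[inside, 2], cols[inside]

    image = np.zeros((h, w, 3))
    mask = np.zeros((h, w), dtype=bool)
    order = np.argsort(-z)                       # draw far points first, near points overwrite
    image[v[order], u[order]] = cols[order]
    mask[v[order], u[order]] = True
    return image, mask                           # unmasked pixels are disocclusions left to the generator
```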
Numerical Results and Claims
In the reported evaluations, Gen3C achieves stronger numerical results than existing models across multiple tasks: higher PSNR and SSIM and lower LPIPS on datasets such as Tanks-and-Temples and RE10K, along with improved TSED scores indicating better temporal and spatial consistency. These results reflect the model's ability to fill disocclusions and to render high-quality detail that stays aligned with the requested camera motion.
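For reference, these metrics follow their standard definitions; a one-function PSNR computation is shown below as an illustration rather than the paper's evaluation code (SSIM and LPIPS require dedicated implementations, e.g. skimage.metrics.structural_similarity and the lpips package):

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images scaled to
    [0, max_val]; higher values indicate closer agreement."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```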
Implications and Future Directions
The practical implications of Gen3C are substantial given its ability to render consistent, precisely controlled video across applications ranging from entertainment and digital content creation to virtual reality and autonomous driving simulation. The explicit 3D cache not only provides fine-grained control over video synthesis but also enables straightforward scene editing, such as object removal or motion alteration, through direct manipulation of the 3D point cloud.
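As a hypothetical example of such editing, the sketch below removes an object by deleting the cached points inside a user-chosen axis-aligned box, leaving the diffusion model to inpaint the vacated region when the edited cache is re-rendered; the box-based selection is an assumption for this sketch, not the paper's editing interface:

```python
import numpy as np

def remove_points_in_box(points, colors, box_min, box_max):
    """Drop every cached point whose coordinates lie inside the axis-aligned
    box [box_min, box_max] (both length-3 arrays); the generator then fills
    in the removed region when the edited cache is rendered."""
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    return points[~inside], colors[~inside]
```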
The paper also discusses the theoretical implications of combining 3D modeling with video diffusion models, laying groundwork that could inspire further research on merging explicit geometric representations with generative models. Future directions might include text-based conditioning to prompt motion, or extending the approach to broader classes of dynamic scenes, potentially benefiting self-supervised or zero-shot learning.
In summary, Gen3C sets a strong baseline for 3D-consistent, precisely controllable video generation, marking a step toward efficient, high-quality video content creation. The flexibility of its 3D cache suggests applicability across domains that require realistic video with coherent motion and precise viewpoint control.