- The paper proposes a novel self-supervised framework using multiview inpainting and a latent Diffusion Transformer to achieve 3D-aware video synthesis.
- The paper demonstrates superior geometric consistency and aesthetic quality over existing methods, validated through extensive experiments and user studies.
- The paper shows that its approach enhances downstream applications like SfM reconstruction and 3D Gaussian Splatting, enabling scalable learning from unlabeled data.
An Overview of the Methodology for Generating 3D-Consistent Videos from Unposed Internet Photos
The paper presents a novel approach for generating 3D-consistent videos from a small set of unposed internet photos. The work is distinctive in using unlabeled internet photos as keyframes and interpolating realistic camera paths with consistent geometric structure between them. The method relies on self-supervised learning to address the limitations of existing models, such as the Luma Dream Machine, which often fail to maintain geometric consistency in video generation.
Methodological Framework
The core methodological innovation revolves around two training objectives aimed at capturing strong 3D priors without direct 3D supervision: multiview inpainting and view interpolation. These objectives are implemented using a latent Diffusion Transformer (DiT), which facilitates the learning of complex spatial-temporal representations crucial for 3D-aware video synthesis.
- Multiview Inpainting: The model is conditioned on several views of a scene, and roughly 80% of the target frame is masked. The model learns to inpaint these masked regions by leveraging structural information from the conditioning views (see the sketch after this list). Because this task draws predominantly on internet photographs, it encourages adaptability to the varying viewpoints and lighting conditions typical of uncurated content.
- View Interpolation: The model generates the intermediate frames of a video clip given its start and end frames. It is conditioned on these keyframes together with additional context, such as CLIP embeddings that help control illumination, enabling temporally coherent and realistic camera motion.
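A minimal PyTorch-style sketch of how the conditioning inputs for these two objectives might be assembled for a latent DiT denoiser is shown below. The function names, tensor layout, and the `dit(...)` interface are illustrative assumptions rather than the paper's implementation; only the ~80% mask ratio, the start/end-frame conditioning, and the CLIP embedding come from the description above.

```python
import torch

def build_inpainting_batch(latents, mask_ratio=0.8):
    """Multiview inpainting: mask ~80% of each target frame's latent patches.

    latents: (B, T, N, D) latent patch tokens for T views of a scene (layout assumed).
    Returns masked tokens plus a binary mask telling the denoiser which
    patches it must reconstruct from the visible context views.
    """
    B, T, N, D = latents.shape
    mask = (torch.rand(B, T, N, 1, device=latents.device) < mask_ratio).float()
    masked_latents = latents * (1.0 - mask)          # zero out masked patches
    return masked_latents, mask

def build_interpolation_batch(video_latents, clip_embed):
    """View interpolation: condition on the first and last frames of a clip
    and ask the model to generate the intermediate frames.

    video_latents: (B, T, N, D) latent tokens for T consecutive video frames.
    clip_embed:    (B, C) per-clip CLIP embedding, e.g. to modulate illumination.
    """
    B, T, N, D = video_latents.shape
    cond_mask = torch.zeros(B, T, 1, 1, device=video_latents.device)
    cond_mask[:, 0] = 1.0                            # start keyframe is visible
    cond_mask[:, -1] = 1.0                           # end keyframe is visible
    cond_latents = video_latents * cond_mask         # only keyframes remain
    return cond_latents, cond_mask, clip_embed

# A (hypothetical) latent DiT denoiser would receive the noisy target latents
# together with these conditioning tensors at each diffusion step:
#   pred_noise = dit(noisy_latents, t, cond_latents, cond_mask, clip_embed)
```

Under this reading, both objectives can share the same denoiser weights and differ only in how the conditioning mask is constructed, which is what allows the model to learn 3D priors from photos and motion priors from video with a single architecture.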
Experimental Results and Analysis
Extensive experiments demonstrate that the proposed method surpasses existing state-of-the-art techniques at generating geometrically coherent videos, even when the inputs are widely spaced web photos. Performance was validated through a user study in which respondents consistently preferred this approach over alternatives for consistency, camera motion, and aesthetic quality. Notably, the method outperformed commercial solutions such as the Luma Dream Machine, particularly on the large baselines and scene artifacts common in internet-sourced imagery.
The paper also details the model's efficacy in downstream applications such as structure-from-motion (SfM) reconstruction and 3D Gaussian Splatting (3DGS), where it significantly enhanced geometric consistency and improved rendering metrics. These results underscore the scalability and application potential of this self-supervised approach.
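As an illustration of the SfM use case, frames extracted from a generated video can be fed to a standard reconstruction pipeline. The sketch below is one way such an evaluation could be wired up, assuming a local COLMAP installation and placeholder paths; it is not the authors' evaluation code.

```python
import subprocess
from pathlib import Path

def reconstruct_with_colmap(frames_dir: str, workspace: str) -> None:
    """Run a standard COLMAP SfM pipeline on frames from a generated video.
    Paths are placeholders; the `colmap` binary must be on PATH."""
    ws = Path(workspace)
    ws.mkdir(parents=True, exist_ok=True)
    db = ws / "database.db"
    sparse = ws / "sparse"
    sparse.mkdir(exist_ok=True)

    # 1. Detect and describe keypoints in every frame.
    subprocess.run(["colmap", "feature_extractor",
                    "--database_path", str(db),
                    "--image_path", frames_dir], check=True)

    # 2. Match features across all frame pairs.
    subprocess.run(["colmap", "exhaustive_matcher",
                    "--database_path", str(db)], check=True)

    # 3. Incremental mapping: recover camera poses and a sparse point cloud.
    subprocess.run(["colmap", "mapper",
                    "--database_path", str(db),
                    "--image_path", frames_dir,
                    "--output_path", str(sparse)], check=True)

if __name__ == "__main__":
    reconstruct_with_colmap("generated_frames/", "colmap_workspace/")
```

On a run like this, a higher number of registered frames and lower reprojection error would indicate that the generated video is geometrically consistent enough to be treated as a conventional multi-view capture, which is the property the paper's downstream experiments measure.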
Implications and Future Directions
This research suggests pathways for scaling 3D learning by leveraging vast amounts of unlabeled 2D data, with potential impact on domains such as autonomous navigation and virtual reality. The success of self-supervised objectives in 3D video synthesis points toward a paradigm in which robust geometric understanding can be attained without extensive annotations such as camera poses, easing a major data-preparation bottleneck.
Future work might focus on handling dynamic objects, refining fine-grained illumination control, and exploring broader scene-understanding models that can extrapolate efficiently beyond observed viewpoints. Such advances could significantly improve the practicality and robustness of AI-driven visual content generation in real-world applications.
In conclusion, the methodology and results demonstrate the efficacy of self-supervised learning for 3D-aware video modeling, and show how unstructured internet content can be harnessed to produce contextually rich, geometrically consistent visual narratives.