
JOG3R: Towards 3D-Consistent Video Generators (2501.01409v2)

Published 2 Jan 2025 in cs.CV and cs.AI

Abstract: Emergent capabilities of image generators have led to many impactful zero- or few-shot applications. Inspired by this success, we investigate whether video generators similarly exhibit 3D-awareness. Using structure-from-motion as a 3D-aware task, we test if intermediate features of a video generator - OpenSora in our case - can support camera pose estimation. Surprisingly, at first, we only find a weak correlation between the two tasks. Deeper investigation reveals that although the video generator produces plausible video frames, the frames themselves are not truly 3D-consistent. Instead, we propose to jointly train for the two tasks, using photometric generation and 3D-aware errors. Specifically, we find that SoTA video generation and camera pose estimation (i.e., DUSt3R [79]) networks share common structures, and propose an architecture that unifies the two. The proposed unified model, named JOG3R, produces camera pose estimates with competitive quality while producing 3D-consistent videos. In summary, we propose the first unified video generator that is 3D-consistent, generates realistic video frames, and can potentially be repurposed for other 3D-aware tasks.

Summary

  • The paper introduces a unified model, JOint Generation and 3D camera Reconstruction (JOG3R), which integrates video generation and 3D camera pose estimation within a single architecture.
  • Researchers evaluated the 3D spatial awareness capability of video generation features by routing intermediate outputs to a state-of-the-art camera pose estimation method.
  • The proposed model achieves performance comparable to specialized camera pose estimators while maintaining high-quality video generation, suggesting potential for efficient architectures in applications like AR or robotics.

On Unifying Video Generation and Camera Pose Estimation

The paper "On Unifying Video Generation and Camera Pose Estimation" explores the integration of video generation and 3D camera pose estimation into a unified framework, termed JOint Generation and 3D camera Reconstruction (). The research is motivated by the need to evaluate the 3D awareness capabilities of video generation models, inspired by similar investigations in image generation models. The novelty of this work lies in its approach to leveraging the features of video generators for tasks that require 3D spatial reasoning, specifically through the lens of camera pose estimation.

Key Contributions

  1. Unified Model Design: The proposed model, JOint Generation and 3D camera Reconstruction (JOG3R), fuses video generation with 3D camera pose estimation in a single architecture. This enables simultaneous handling of three tasks: text-to-video generation (T2V), video-to-camera estimation (V2C), and joint video-and-camera generation (T2V+C). The implementation builds on OpenSora, a Diffusion Transformer-based video generation model, and integrates it with DUSt3R, a state-of-the-art multi-view stereo reconstruction method, for camera pose estimation.
  2. Evaluation of 3D Awareness: The paper assesses the inherent 3D awareness of video generation features by routing intermediate outputs from OpenSora's denoising network to DUSt3R decoders for camera pose estimation, providing insight into how well the model understands 3D spatial relationships (a probing sketch follows this list).
  3. Performance and Versatility: The unified model achieves camera pose estimation performance comparable to specialized models such as DUSt3R and GLOMAP while retaining high-quality video generation. The paper reports quantitative metrics such as rotation and translation errors, along with qualitative assessments of video generation fidelity using FID and FVD scores, demonstrating the trade-offs and advantages of the integrated system.

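The feature-routing evaluation in contribution 2 can be pictured as a probing setup: freeze the video generator, tap an intermediate block of its denoising transformer, and train only a pose head on those features. Below is a minimal PyTorch sketch; the `.blocks` attribute and the `pose_decoder` head are hypothetical stand-ins for the OpenSora and DUSt3R components the paper actually wires together.

```python
import torch
import torch.nn as nn

class FeatureProbe(nn.Module):
    """Route intermediate denoiser features to a camera-pose head.

    `video_dit` and `pose_decoder` are hypothetical stand-ins for the
    OpenSora denoising transformer and a DUSt3R-style decoder.
    """

    def __init__(self, video_dit: nn.Module, pose_decoder: nn.Module, tap_layer: int):
        super().__init__()
        self.video_dit = video_dit.eval()  # frozen generator
        for p in self.video_dit.parameters():
            p.requires_grad_(False)
        self.pose_decoder = pose_decoder   # only the probe is trained
        self.tap_layer = tap_layer

    def forward(self, noisy_latents, timestep, text_emb):
        feats = {}

        def hook(_module, _inputs, output):
            feats["tap"] = output  # capture the intermediate activation

        handle = self.video_dit.blocks[self.tap_layer].register_forward_hook(hook)
        with torch.no_grad():
            self.video_dit(noisy_latents, timestep, text_emb)
        handle.remove()
        # Per-frame camera pose (or pointmap) prediction from tapped features.
        return self.pose_decoder(feats["tap"])
```

A weak probe result here is exactly the paper's first observation: plausible frames do not guarantee features that encode consistent camera geometry.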
Experimental Setup

Experiments were conducted on two datasets: RealEstate10K and DL3DV10K. RealEstate10K served as both the training set and the primary evaluation set, while DL3DV10K was used to evaluate generalization. Evaluation metrics included rotational and translational errors for camera estimation, and FID and FVD for video quality. Comparisons against both pretrained and fine-tuned baselines highlighted the robustness and adaptability of JOG3R.
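The camera metrics are standard pose-benchmark quantities: rotation error as the geodesic angle between predicted and ground-truth rotations, and translation error as the angle between translation directions (scale is ambiguous in monocular settings). A self-contained sketch of both follows; the paper's exact evaluation protocol may differ in details such as pose alignment.

```python
import numpy as np

def rotation_error_deg(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    R_rel = R_pred.T @ R_gt
    cos = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)))

def translation_error_deg(t_pred: np.ndarray, t_gt: np.ndarray) -> float:
    """Angle (degrees) between translation directions (scale-ambiguous)."""
    t_pred = t_pred / (np.linalg.norm(t_pred) + 1e-8)
    t_gt = t_gt / (np.linalg.norm(t_gt) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(np.dot(t_pred, t_gt), -1.0, 1.0))))

# Sanity check: identical poses give (near-)zero error.
R = np.eye(3)
t = np.array([0.0, 0.0, 1.0])
assert rotation_error_deg(R, R) < 1e-3
assert translation_error_deg(t, t) < 1e-3
```

FID and FVD, by contrast, compare feature statistics of generated versus real frames and videos, and are typically computed with standard off-the-shelf implementations.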

Findings and Implications

The paper finds that native video generation features provide some degree of 3D awareness, which can be significantly enhanced through task-specific fine-tuning for camera pose estimation. The unified architecture does not show a synergistic gain on both tasks simultaneously, but it balances the two effectively without significant detriment to either.
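Concretely, the joint training that produces this balance combines the generator's photometric (denoising) loss with a 3D-aware term on the pose branch, as described in the abstract. Below is a hedged sketch of one such training step; `model.add_noise`, the two-headed forward pass, and the loss weight are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def joint_training_step(model, batch, lambda_3d: float = 1.0):
    """One step mixing a generative loss with a 3D-aware pose loss.

    `model` is assumed to expose a diffusion denoiser and a pose head
    sharing one trunk, as in the unified JOG3R-style design; the exact
    interfaces here are illustrative.
    """
    latents = batch["latents"]
    noise = torch.randn_like(latents)
    t = torch.randint(0, 1000, (latents.shape[0],), device=latents.device)
    noisy = model.add_noise(latents, noise, t)

    # Shared trunk, two heads: predicted noise and predicted camera poses.
    pred_noise, pose_pred = model(noisy, t, batch["text_emb"])

    loss_gen = F.mse_loss(pred_noise, noise)            # photometric/denoising term
    loss_3d = F.mse_loss(pose_pred, batch["poses_gt"])  # 3D-aware term

    return loss_gen + lambda_3d * loss_3d
```

Weighting the two terms trades generation fidelity against pose accuracy, which is consistent with the reported absence of a free synergistic gain.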

The integration of video generation and 3D camera tracking in a single model suggests practical applications in fields requiring efficient, compact architectures, such as augmented reality or robotics, where understanding both temporal dynamics and spatial arrangements is critical. Future work could explore improving the synergy between video generation and camera estimation, or refining feature fusion across different stages of the network to enhance both video quality and spatial coherence.

In summary, the paper presents a pioneering attempt to harmonize video generation with spatial reasoning tasks, providing a foundation for multifaceted video synthesis and analysis in artificial intelligence.