
GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors (2504.01016v1)

Published 1 Apr 2025 in cs.GR, cs.AI, and cs.CV

Abstract: Despite remarkable advancements in video depth estimation, existing methods exhibit inherent limitations in achieving geometric fidelity through the affine-invariant predictions, limiting their applicability in reconstruction and other metrically grounded downstream tasks. We propose GeometryCrafter, a novel framework that recovers high-fidelity point map sequences with temporal coherence from open-world videos, enabling accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. At the core of our approach lies a point map Variational Autoencoder (VAE) that learns a latent space agnostic to video latent distributions for effective point map encoding and decoding. Leveraging the VAE, we train a video diffusion model to model the distribution of point map sequences conditioned on the input videos. Extensive evaluations on diverse datasets demonstrate that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.

Summary

Overview of GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors

The paper presents GeometryCrafter, a novel solution for estimating temporally consistent geometry from diverse open-world videos. This is achieved through high-fidelity point map sequences, thereby advancing applications such as 3D/4D reconstruction and depth-based video editing. The approach addresses limitations observed in traditional video depth estimation methods, especially concerning geometric fidelity and temporal consistency.

Core Methodology

Central to GeometryCrafter is a point map Variational Autoencoder (VAE) that encodes and decodes point map sequences without the compression errors typical of prior methods. The VAE learns a latent space agnostic to video latent distributions, which improves robustness and generalization. Building on this VAE, the authors train a video diffusion model to capture the distribution of point map sequences conditioned on the input video, allowing the framework to handle unbounded depth values efficiently.
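The conditioning setup can be illustrated schematically. The sketch below is a hypothetical minimal illustration of conditioned latent diffusion, not the paper's actual architecture: all shapes, the noise schedule value, and the channel-wise concatenation are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a clip of 8 frames with 4-channel, 32x32 latents.
point_latent = rng.standard_normal((8, 4, 32, 32))  # from the point map VAE
video_latent = rng.standard_normal((8, 4, 32, 32))  # from the video VAE

# One forward-diffusion step at an assumed noise level alpha_bar_t.
alpha_bar_t = 0.5
noise = rng.standard_normal(point_latent.shape)
noisy = np.sqrt(alpha_bar_t) * point_latent + np.sqrt(1 - alpha_bar_t) * noise

# The denoiser would see the noisy point map latents together with the
# clean video latents as conditioning (here concatenated channel-wise).
denoiser_input = np.concatenate([noisy, video_latent], axis=1)  # (8, 8, 32, 32)
```

The key idea this illustrates is that only the point map latents are noised; the video latents stay clean and steer the denoiser toward geometry consistent with the input frames.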

The framework's ability to predict 3D geometry, depth maps, and camera parameters stems from its representation strategy: instead of a cuboid-based point map representation, it decomposes each point map into log-space depth and a diagonal field of view. This decomposition significantly improves geometric fidelity and maintains temporal coherence across diverse video settings.
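To make the representation concrete, the sketch below shows how a log-space depth map plus a diagonal field of view can be unprojected into a 3D point map. This is a hypothetical illustration assuming a pinhole camera with square pixels and a centered principal point, not the paper's exact formulation.

```python
import numpy as np

def point_map_from_log_depth(log_depth, diag_fov_rad):
    """Unproject a per-pixel log-depth map into an (H, W, 3) point map.

    Assumes a pinhole camera with square pixels and a principal point
    at the image center (a sketch, not the paper's exact code).
    """
    H, W = log_depth.shape
    z = np.exp(log_depth)  # log-space depth back to positive depth
    # Recover the focal length from the diagonal field of view.
    diag = np.sqrt(W**2 + H**2)
    f = (diag / 2.0) / np.tan(diag_fov_rad / 2.0)
    cx, cy = (W - 1) / 2.0, (H - 1) / 2.0
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - cx) / f * z
    y = (v - cy) / f * z
    return np.stack([x, y, z], axis=-1)
```

Working in log-space keeps unbounded depth values numerically well-behaved, and predicting a single diagonal FoV per video lets camera intrinsics be recovered alongside the geometry.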

Strong Numerical Results

The paper includes extensive evaluations demonstrating GeometryCrafter's state-of-the-art performance. It consistently outperforms existing methods in point map estimation across diverse datasets spanning dynamic and static scenes, indoor and outdoor environments, and varying resolutions and styles. The numerical results underscore the efficacy of the method, with notable improvements in metrics such as relative point error and depth accuracy.
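For readers unfamiliar with these metrics, the sketch below implements two standard depth-evaluation quantities, absolute relative error (AbsRel) and the δ<1.25 accuracy; these are common conventions in the depth-estimation literature, not necessarily the paper's exact evaluation code.

```python
import numpy as np

def abs_rel(pred, gt):
    """Absolute relative depth error: mean of |pred - gt| / gt."""
    return np.mean(np.abs(pred - gt) / gt)

def delta1(pred, gt, thresh=1.25):
    """Fraction of pixels whose depth ratio max(pred/gt, gt/pred) < thresh."""
    ratio = np.maximum(pred / gt, gt / pred)
    return np.mean(ratio < thresh)
```

Lower AbsRel and higher δ1 indicate better predictions; relative point error is computed analogously over full 3D point maps rather than depth alone.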

Implications and Speculations

Practically, GeometryCrafter has significant implications for computer vision tasks requiring reliable 3D reconstruction and video editing, particularly in real-time navigation and virtual reality applications. Theoretically, the model contributes to the discourse on integrating diffusion models with traditional video processing techniques, promoting advances in generative modeling for video content.

The foundational elements of GeometryCrafter could catalyze further developments in AI, potentially influencing future systems that require enhanced geometric processing capabilities. Anticipated advancements may include refined architectures that further exploit spatial-temporal interactions and advanced generative approaches for complex scene understanding.

Limitations

One noted limitation is the computational overhead inherent in its model size, which can hinder real-time applications or those requiring rapid inference. Addressing model efficiency while preserving output fidelity remains a key challenge for future iterations of the approach.

In summary, the paper presents a comprehensive framework for high-precision geometry estimation in video sequences, addressing notable challenges in the domain and paving the way for future exploration and enhancements in AI-driven video processing systems.