- The paper introduces ReconX, a novel framework that transforms sparse-view 3D reconstruction into a temporal generation task using video diffusion models.
- It integrates a global point cloud for 3D structure guidance, ensuring 3D consistent frame synthesis and mitigating artifacts from limited input views.
- Empirical results show higher PSNR and SSIM, lower LPIPS, and stronger generalizability than state-of-the-art methods across diverse datasets.
ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model
The paper "ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model" addresses one of the critical challenges in the domain of 3D scene reconstruction—effectively rendering high-quality, detailed 3D scenes from a limited number of 2D images, known as sparse-view reconstruction. Despite the advancements provided by methods such as NeRF and 3D Gaussian Splatting (3DGS), these approaches often suffer from artifacts and distortions when presented with insufficient input views. This paper proposes a novel paradigm, termed ReconX, which tackles this problem by leveraging the strong generative capabilities of large pre-trained video diffusion models.
Key Approach
The core innovation of the paper is to recast sparse-view reconstruction as a temporal generation task. The authors' hypothesis is that the generative priors embedded in large video diffusion models can be harnessed to synthesize realistic video frames from sparse views, thereby creating additional observations that aid the reconstruction process.
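To make this reframing concrete, the sketch below treats two sparse input views as the boundary frames of a short clip and asks a video diffusion sampler to fill in the intermediate frames. `VideoDiffusionSampler` and its `sample` interface are illustrative placeholders (here faked with linear blending so the snippet runs), not the paper's actual model or API.

```python
# Hypothetical sketch: sparse-view reconstruction recast as temporal generation.
# `VideoDiffusionSampler` is a stand-in for a pre-trained, image-conditioned
# video diffusion model; its interface is assumed for illustration only.
import numpy as np

class VideoDiffusionSampler:
    """Placeholder for a pre-trained image-conditioned video diffusion model."""
    def sample(self, first_frame: np.ndarray, last_frame: np.ndarray,
               num_frames: int) -> np.ndarray:
        # A real model would run iterative denoising here; we fake it with
        # linear blending so the sketch is runnable end to end.
        alphas = np.linspace(0.0, 1.0, num_frames)[:, None, None, None]
        return (1.0 - alphas) * first_frame + alphas * last_frame

def views_to_clip(view_a: np.ndarray, view_b: np.ndarray, num_frames: int = 16) -> np.ndarray:
    """Turn two sparse views into a clip of in-between 'observations'."""
    sampler = VideoDiffusionSampler()
    return sampler.sample(view_a, view_b, num_frames)

if __name__ == "__main__":
    h, w = 64, 64
    view_a = np.random.rand(h, w, 3).astype(np.float32)
    view_b = np.random.rand(h, w, 3).astype(np.float32)
    clip = views_to_clip(view_a, view_b, num_frames=16)
    print(clip.shape)  # (16, 64, 64, 3): extra observations for reconstruction
```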
To maintain 3D view consistency, a common failure mode when repurposing pre-trained generative models, ReconX builds a global point cloud from the limited input views to encode the 3D structure of the scene. This 3D structure guidance is then injected into the video diffusion process so that the generated frames remain coherent and consistent across viewpoints.
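One plausible way to read this guidance mechanism is as a set of context tokens, derived from the encoded global point cloud, that the denoiser's video tokens attend to via cross-attention. The snippet below is a minimal sketch under that assumption; the module name, token shapes, and residual injection are illustrative, not the authors' implementation.

```python
# Minimal cross-attention sketch: video latent tokens attend to point-cloud
# context tokens. Shapes and module names are assumptions for illustration.
import torch
import torch.nn as nn

class StructureGuidedBlock(nn.Module):
    def __init__(self, dim: int = 320, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, structure_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens:     (B, T*H*W, dim)  flattened spatio-temporal latents
        # structure_tokens: (B, N, dim)      encoded global point-cloud context
        q = self.norm(video_tokens)
        attended, _ = self.cross_attn(q, structure_tokens, structure_tokens)
        return video_tokens + attended  # residual injection of 3D guidance

block = StructureGuidedBlock()
video_tokens = torch.randn(2, 16 * 8 * 8, 320)  # 16 frames of 8x8 latents
structure_tokens = torch.randn(2, 512, 320)     # 512 point-cloud context tokens
out = block(video_tokens, structure_tokens)
print(out.shape)  # torch.Size([2, 1024, 320])
```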
The workflow of ReconX involves three pivotal steps:
- 3D Structure Guidance - Build a global point cloud from the sparse input views with a pose-free stereo reconstruction method and encode it into a context representation space.
- 3D Consistent Frame Generation - Utilize the video diffusion model to generate detail-preserved, 3D consistent video frames, guided by the 3D structure representation.
- 3D Scene Reconstruction - Employ a confidence-aware 3D Gaussian Splatting optimization scheme to reconstruct the final 3D scene from the generated video frames (a minimal loss sketch follows this list).
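The third step can be pictured as ordinary per-frame photometric optimization of the Gaussians, except that pixels from generated frames are down-weighted by a per-pixel confidence map so unreliable regions contribute less. The loss below is a minimal sketch of that idea, assuming a confidence in [0, 1] per pixel; it is not the paper's exact objective or confidence definition.

```python
# Hedged sketch of a confidence-aware photometric loss for 3DGS optimization.
# `confidence` is an assumed per-pixel weight in [0, 1] for each generated frame;
# the paper's exact confidence estimate and loss terms may differ.
import torch

def confidence_weighted_loss(rendered: torch.Tensor,
                             generated: torch.Tensor,
                             confidence: torch.Tensor) -> torch.Tensor:
    # rendered, generated: (B, 3, H, W) images from the 3DGS renderer / diffusion
    # confidence:          (B, 1, H, W) per-pixel reliability of generated frames
    per_pixel_l1 = (rendered - generated).abs().mean(dim=1, keepdim=True)
    return (confidence * per_pixel_l1).sum() / confidence.sum().clamp(min=1e-6)

rendered = torch.rand(4, 3, 64, 64, requires_grad=True)
generated = torch.rand(4, 3, 64, 64)
confidence = torch.rand(4, 1, 64, 64)
loss = confidence_weighted_loss(rendered, generated, confidence)
loss.backward()  # gradients would flow to the Gaussian parameters via the renderer
print(float(loss))
```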
Numerical Results and Experimental Validation
The paper provides extensive empirical evidence of ReconX's superiority over state-of-the-art methods through experiments on real-world datasets, including RealEstate10K and ACID. Quantitatively, ReconX consistently achieves higher PSNR and SSIM and lower LPIPS scores across multiple scenarios, particularly excelling when the input views exhibit large angular variation. For example, ReconX achieves a PSNR of 28.31 and an SSIM of 0.912 on the RealEstate10K dataset, significantly surpassing existing approaches such as pixelSplat and MVSplat.
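For reference, PSNR is computed from the mean squared error between a rendered view and the held-out ground-truth view; the small sketch below assumes float images scaled to [0, 1]. SSIM and LPIPS follow their standard definitions and are typically taken from library implementations rather than computed by hand.

```python
# PSNR between a rendered view and the ground-truth view,
# assuming float images in [0, 1].
import numpy as np

def psnr(rendered: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    mse = np.mean((rendered.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)

rendered = np.clip(np.random.rand(256, 256, 3), 0, 1)
target = np.clip(rendered + 0.01 * np.random.randn(256, 256, 3), 0, 1)
print(round(psnr(rendered, target), 2))  # higher is better; roughly 40 dB here
```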
Furthermore, ReconX demonstrates strong generalization to out-of-distribution data, a critical requirement for practical applications. When tested on datasets not seen during training, such as NeRF-LLFF and DTU, ReconX maintains high-fidelity reconstructions. This robustness matters in real-world settings where input views are sparsely captured and widely varied.
Theoretical Implications and Future Prospects
The theoretical contribution of this paper lies in the integration of 3D structure guidance into the video diffusion process. The authors argue that incorporating native 3D priors constrains the solution space of the reconstruction task, and they formalize this with a proof showing that the divergence between the true distribution of rendered 2D images and the distribution approximated by the diffusion model is reduced more effectively when 3D guidance is used.
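One way to phrase the intuition behind this argument, in our own notation rather than the paper's theorem, is that a conditional model family subsumes the unconditional one (the condition can always be ignored), so the best achievable divergence can only shrink once the 3D condition is added.

```latex
% Paraphrase of the intuition (our notation, not the paper's statement):
% q(x): true distribution of 2D renderings of the scene,
% p_theta(x): unconditional video diffusion model,
% p_theta(x | c_3D): the same model conditioned on 3D structure guidance.
\[
\min_{\theta} \, D_{\mathrm{KL}}\bigl(q(x)\,\|\,p_{\theta}(x \mid c_{3\mathrm{D}})\bigr)
\;\le\;
\min_{\theta} \, D_{\mathrm{KL}}\bigl(q(x)\,\|\,p_{\theta}(x)\bigr)
\]
% since the conditional family contains the unconditional one, adding native
% 3D priors can only tighten the fit to the true rendering distribution.
```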
Practically, this research has broad implications for fields such as virtual reality, autonomous navigation, and any domain requiring high-quality 3D reconstructions from limited data. Future research could further enhance ReconX by integrating it with larger, more robust video diffusion models or exploring its application in dynamic scene reconstruction involving temporal changes.
Conclusion
The ReconX framework presents a significant step forward in the field of sparse-view 3D reconstruction by innovatively reframing the problem as a temporal generation task. By leveraging the generative power of video diffusion models and incorporating 3D structure guidance, ReconX achieves state-of-the-art performance in terms of quality and generalizability. The empirical evidence and theoretical foundations laid out in this paper will likely inspire further research into more efficient and robust methods for 3D scene reconstruction from sparse views.