- The paper introduces ReconX, a novel framework that transforms sparse-view 3D reconstruction into a temporal generation task using video diffusion models.
- It integrates a global point cloud for 3D structure guidance, ensuring 3D consistent frame synthesis and mitigating artifacts from limited input views.
- Empirical results show higher PSNR and SSIM, lower LPIPS, and stronger generalizability than state-of-the-art methods across diverse datasets.
ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model
The paper "ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model" addresses one of the critical challenges in the domain of 3D scene reconstruction—effectively rendering high-quality, detailed 3D scenes from a limited number of 2D images, known as sparse-view reconstruction. Despite the advancements provided by methods such as NeRF and 3D Gaussian Splatting (3DGS), these approaches often suffer from artifacts and distortions when presented with insufficient input views. This paper proposes a novel paradigm, termed ReconX, which tackles this problem by leveraging the strong generative capabilities of large pre-trained video diffusion models.
Key Approach
The core innovation of the paper is to recast sparse-view reconstruction as a temporal generation task. The authors' hypothesis is that the generative priors embedded in large video diffusion models can be harnessed to synthesize realistic video frames from sparse views, thereby creating additional observations that aid the reconstruction process.
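To make this reframing concrete, the sketch below treats two sparse input views as the boundary frames of a short clip and asks a video diffusion sampler to fill in the intermediate frames. `VideoDiffusionSampler` and its `sample` interface are illustrative placeholders (here faked with linear blending so the snippet runs), not the paper's actual model or API.

```python
# Hypothetical sketch: sparse-view reconstruction recast as temporal generation.
# `VideoDiffusionSampler` is a stand-in for a pre-trained, image-conditioned
# video diffusion model; its interface is assumed for illustration only.
import numpy as np

class VideoDiffusionSampler:
    """Placeholder for a pre-trained image-conditioned video diffusion model."""
    def sample(self, first_frame: np.ndarray, last_frame: np.ndarray,
               num_frames: int) -> np.ndarray:
        # A real model would run iterative denoising here; we fake it with
        # linear blending so the sketch is runnable end to end.
        alphas = np.linspace(0.0, 1.0, num_frames)[:, None, None, None]
        return (1.0 - alphas) * first_frame + alphas * last_frame

def views_to_clip(view_a: np.ndarray, view_b: np.ndarray, num_frames: int = 16) -> np.ndarray:
    """Turn two sparse views into a clip of in-between 'observations'."""
    sampler = VideoDiffusionSampler()
    return sampler.sample(view_a, view_b, num_frames)

if __name__ == "__main__":
    h, w = 64, 64
    view_a = np.random.rand(h, w, 3).astype(np.float32)
    view_b = np.random.rand(h, w, 3).astype(np.float32)
    clip = views_to_clip(view_a, view_b, num_frames=16)
    print(clip.shape)  # (16, 64, 64, 3): extra observations for reconstruction
```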
To maintain 3D view consistency, a common failure mode when repurposing pre-trained generative models, ReconX builds a global point cloud from the limited input views to encode the 3D structure of the scene. This 3D structure guidance is then injected into the video diffusion process so that the generated frames remain coherent and consistent across viewpoints.
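One plausible way to read this guidance mechanism is as a set of context tokens, derived from the encoded global point cloud, that the denoiser's video tokens attend to via cross-attention. The snippet below is a minimal sketch under that assumption; the module name, token shapes, and residual injection are illustrative, not the authors' implementation.

```python
# Minimal cross-attention sketch: video latent tokens attend to point-cloud
# context tokens. Shapes and module names are assumptions for illustration.
import torch
import torch.nn as nn

class StructureGuidedBlock(nn.Module):
    def __init__(self, dim: int = 320, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, structure_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens:     (B, T*H*W, dim)  flattened spatio-temporal latents
        # structure_tokens: (B, N, dim)      encoded global point-cloud context
        q = self.norm(video_tokens)
        attended, _ = self.cross_attn(q, structure_tokens, structure_tokens)
        return video_tokens + attended  # residual injection of 3D guidance

block = StructureGuidedBlock()
video_tokens = torch.randn(2, 16 * 8 * 8, 320)  # 16 frames of 8x8 latents
structure_tokens = torch.randn(2, 512, 320)     # 512 point-cloud context tokens
out = block(video_tokens, structure_tokens)
print(out.shape)  # torch.Size([2, 1024, 320])
```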
The workflow of ReconX involves three pivotal steps:
- 3D Structure Guidance - Build a global point cloud from the sparse input views with a pose-free stereo reconstruction method and encode it into a context representation space.
- 3D Consistent Frame Generation - Utilize the video diffusion model to generate detail-preserved, 3D consistent video frames, guided by the 3D structure representation.
- 3D Scene Reconstruction - Employ a confidence-aware 3D Gaussian Splatting optimization scheme to reconstruct the final 3D scene from the generated video frames (a minimal loss sketch follows this list).
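The third step can be pictured as ordinary per-frame photometric optimization of the Gaussians, except that pixels from generated frames are down-weighted by a per-pixel confidence map so unreliable regions contribute less. The loss below is a minimal sketch of that idea, assuming a confidence in [0, 1] per pixel; it is not the paper's exact objective or confidence definition.

```python
# Hedged sketch of a confidence-aware photometric loss for 3DGS optimization.
# `confidence` is an assumed per-pixel weight in [0, 1] for each generated frame;
# the paper's exact confidence estimate and loss terms may differ.
import torch

def confidence_weighted_loss(rendered: torch.Tensor,
                             generated: torch.Tensor,
                             confidence: torch.Tensor) -> torch.Tensor:
    # rendered, generated: (B, 3, H, W) images from the 3DGS renderer / diffusion
    # confidence:          (B, 1, H, W) per-pixel reliability of generated frames
    per_pixel_l1 = (rendered - generated).abs().mean(dim=1, keepdim=True)
    return (confidence * per_pixel_l1).sum() / confidence.sum().clamp(min=1e-6)

rendered = torch.rand(4, 3, 64, 64, requires_grad=True)
generated = torch.rand(4, 3, 64, 64)
confidence = torch.rand(4, 1, 64, 64)
loss = confidence_weighted_loss(rendered, generated, confidence)
loss.backward()  # gradients would flow to the Gaussian parameters via the renderer
print(float(loss))
```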
Numerical Results and Experimental Validation
The paper provides extensive empirical evidence of ReconX's superiority over state-of-the-art methods through experiments on real-world datasets, including RealEstate10K and ACID. Quantitatively, ReconX consistently achieves higher PSNR and SSIM and lower LPIPS scores across multiple scenarios, particularly excelling when the input views exhibit large angular variation. For example, ReconX achieves a PSNR of 28.31 and an SSIM of 0.912 on the RealEstate10K dataset, significantly surpassing existing approaches such as pixelSplat and MVSplat.
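For reference, PSNR is computed from the mean squared error between a rendered view and the held-out ground-truth view; the small sketch below assumes float images scaled to [0, 1]. SSIM and LPIPS follow their standard definitions and are typically taken from library implementations rather than computed by hand.

```python
# PSNR between a rendered view and the ground-truth view,
# assuming float images in [0, 1].
import numpy as np

def psnr(rendered: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    mse = np.mean((rendered.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)

rendered = np.clip(np.random.rand(256, 256, 3), 0, 1)
target = np.clip(rendered + 0.01 * np.random.randn(256, 256, 3), 0, 1)
print(round(psnr(rendered, target), 2))  # higher is better; roughly 40 dB here
```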
Furthermore, ReconX demonstrates strong generalization to out-of-distribution data, a critical requirement for practical applications. When tested on datasets not seen during training, such as NeRF-LLFF and DTU, ReconX maintains high-fidelity reconstructions. This robustness matters in real-world settings where input views are sparsely captured and widely varied.
Theoretical Implications and Future Prospects
The theoretical contribution of this paper lies in the integration of 3D structure guidance into the video diffusion process. The authors argue that incorporating native 3D priors constrains the solution space of the reconstruction task, and they formalize this with a proof showing that the divergence between the true distribution of rendered 2D images and the distribution approximated by the diffusion model is reduced more effectively when 3D guidance is used.
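One way to phrase the intuition behind this argument, in our own notation rather than the paper's theorem, is that a conditional model family subsumes the unconditional one (the condition can always be ignored), so the best achievable divergence can only shrink once the 3D condition is added.

```latex
% Paraphrase of the intuition (our notation, not the paper's statement):
% q(x): true distribution of 2D renderings of the scene,
% p_theta(x): unconditional video diffusion model,
% p_theta(x | c_3D): the same model conditioned on 3D structure guidance.
\[
\min_{\theta} \, D_{\mathrm{KL}}\bigl(q(x)\,\|\,p_{\theta}(x \mid c_{3\mathrm{D}})\bigr)
\;\le\;
\min_{\theta} \, D_{\mathrm{KL}}\bigl(q(x)\,\|\,p_{\theta}(x)\bigr)
\]
% since the conditional family contains the unconditional one, adding native
% 3D priors can only tighten the fit to the true rendering distribution.
```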
Practically, this research has broad implications for fields such as virtual reality, autonomous navigation, and any domain requiring high-quality 3D reconstructions from limited data. Future research could further enhance ReconX by integrating it with larger, more robust video diffusion models or exploring its application in dynamic scene reconstruction involving temporal changes.
Conclusion
The ReconX framework presents a significant step forward in the field of sparse-view 3D reconstruction by innovatively reframing the problem as a temporal generation task. By leveraging the generative power of video diffusion models and incorporating 3D structure guidance, ReconX achieves state-of-the-art performance in terms of quality and generalizability. The empirical evidence and theoretical foundations laid out in this paper will likely inspire further research into more efficient and robust methods for 3D scene reconstruction from sparse views.