Decompositional Neural Scene Reconstruction with Generative Diffusion Prior
This paper addresses a challenging task in computer vision: reconstructing 3D scenes from sparse input views. Existing methods for decompositional reconstruction, which aim to represent each object in a scene as a separate entity, often struggle in regions with limited visibility or heavy occlusion. The work introduces a method that leverages pre-trained generative diffusion models to supply the missing information in such under-constrained areas.
Methodology
The proposed pipeline incorporates diffusion priors, specifically Score Distillation Sampling (SDS), into neural scene reconstruction: the neural representation is rendered from novel views and optimized under SDS guidance, which helps recover both geometry and appearance in poorly observed regions. Because this generative guidance can conflict with the input observations, the authors propose two components to reconcile them:
- Visibility-Guided Strategy: a visibility-guided strategy dynamically adjusts the per-pixel SDS loss weight according to how well each point is observed in the captured views. This balances fidelity to the input views against plausible completion of occluded regions (see the code sketch after this list).
- Decompositional Reconstruction Framework: the scene is reconstructed with neural implicit surfaces in a decompositional manner, so individual objects are represented and can be manipulated separately, supporting downstream applications such as flexible text-based editing of objects.
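The per-pixel, visibility-weighted SDS loss can be illustrated with the minimal sketch below. This is not the authors' implementation: `NoisePredictor` is a stand-in for a pre-trained diffusion model, `visibility_weighted_sds_loss` and the weighting floor `w_min` are hypothetical names, and the rendered image and visibility map are assumed to be produced by the neural implicit scene representation.

```python
# Minimal sketch of a visibility-weighted, per-pixel SDS loss (PyTorch).
# All names and hyperparameters here are illustrative stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisePredictor(nn.Module):
    """Stand-in for a frozen, pre-trained epsilon-prediction diffusion model."""

    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x_noisy: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # A real model would also condition on the timestep and a text prompt.
        return self.net(x_noisy)


def visibility_weighted_sds_loss(
    rendered: torch.Tensor,        # (B, 3, H, W) image rendered from a novel view
    visibility: torch.Tensor,      # (B, 1, H, W) in [0, 1]; 1 = well observed in inputs
    diffusion: nn.Module,          # frozen diffusion prior (epsilon-prediction)
    alphas_cumprod: torch.Tensor,  # (T,) cumulative DDPM noise schedule
    w_min: float = 0.1,            # floor so observed regions still get mild guidance
) -> torch.Tensor:
    """Down-weight the SDS gradient where input views already constrain the
    scene; up-weight it in occluded or unobserved regions."""
    B = rendered.shape[0]
    T = alphas_cumprod.shape[0]

    # Sample a diffusion timestep and add the corresponding noise.
    t = torch.randint(0, T, (B,), device=rendered.device)
    a_t = alphas_cumprod[t].view(B, 1, 1, 1)
    noise = torch.randn_like(rendered)
    noisy = a_t.sqrt() * rendered + (1.0 - a_t).sqrt() * noise

    with torch.no_grad():
        pred_noise = diffusion(noisy, t)

    # Standard SDS gradient, kept per pixel instead of averaged.
    grad = (1.0 - a_t) * (pred_noise - noise)

    # Visibility-guided weighting: low visibility -> strong generative guidance.
    weight = w_min + (1.0 - w_min) * (1.0 - visibility)
    grad = weight * grad

    # Inject the custom gradient through an MSE-style surrogate objective.
    target = (rendered - grad).detach()
    return 0.5 * F.mse_loss(rendered, target, reduction="sum") / B


if __name__ == "__main__":
    # Toy usage with random tensors standing in for renderer outputs.
    diffusion = NoisePredictor()
    alphas_cumprod = torch.cumprod(1.0 - torch.linspace(1e-4, 2e-2, 1000), dim=0)
    rendered = torch.rand(1, 3, 64, 64, requires_grad=True)
    visibility = torch.rand(1, 1, 64, 64)
    loss = visibility_weighted_sds_loss(rendered, visibility, diffusion, alphas_cumprod)
    loss.backward()
    print(float(loss))
```

The design intent mirrored in this sketch is that well-observed pixels are governed mainly by the photometric loss on the real input views, while the generative prior dominates only where observations are missing, preventing the diffusion guidance from overwriting details the cameras actually captured.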
Results
Extensive evaluations show that the approach significantly outperforms state-of-the-art methods on challenging datasets, including Replica and ScanNet++. Notably, it achieves better object reconstruction from as few as 10 input views than baselines using 100 views. These results point to strong data efficiency as well as improved reconstruction quality, especially in partially observed or heavily occluded scene regions.
Implications and Future Work
Practically, the ability to edit scene geometry and appearance through text-based commands broadens the method's utility in virtual reality, augmented reality, and other visual effects applications. Theoretically, this use of generative models for sparse-view reconstruction lays a foundation for further work on integrating neural priors with scene understanding tasks.
Looking forward, promising directions include refining the visibility-guided weighting, developing training regimes that scale to larger scenes, and evaluating the method in highly cluttered or dynamic environments. Improving the capture of fine-grained detail and diverse materials would further push the boundaries of this domain.
In conclusion, this paper offers a promising step toward comprehensive and efficient 3D scene reconstruction from minimal input by incorporating generative diffusion priors. It underscores the versatility and robustness of neural implicit representations augmented with generative models and opens up a wealth of opportunities for future research in AI-driven 3D scene understanding and manipulation.