An Expert Overview of "GenFusion: Closing the Loop between Reconstruction and Generation via Videos"
3D reconstruction and generation methodologies have advanced significantly, as exemplified by the "GenFusion" approach. The paper delineates an ambitious strategy for bridging the divide between the 3D reconstruction and generation domains by leveraging a reconstruction-driven video diffusion model. The central aim is to address the misalignment between 3D constraints and generative priors that has historically impeded the scalability and utility of 3D scene reconstruction and generation applications.
Core Concept and Methodology
GenFusion proposes a cyclical fusion framework that alternates between reconstruction and generation to progressively enhance and expand the 3D scene representation. The approach is built on the observation that traditional 3D reconstruction demands extensive view coverage, a requirement fundamentally at odds with generative models, which operate from sparse or single-modality inputs.
The GenFusion methodology comprises two phases, sketched in code after this list:
- Reconstruction-driven Generation: A video diffusion model conditioned on artifact-prone RGB-D renderings is employed to generate novel videos that are view-consistent and of high quality. This involves fine-tuning existing generative models to accommodate depth information (via an RGB-D VAE), thus providing an enhanced understanding of scene geometry.
- Cyclical Fusion: This phase iteratively enhances the 3D representation by rendering novel views and feeding them back to refine the reconstruction model, effectively correcting artifacts and generating new content in under-observed areas.
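Concretely, the closed loop can be pictured as alternating optimization and generation passes over the same scene. The following Python sketch is purely illustrative: `optimize_splats`, `sample_novel_trajectory`, `render_rgbd`, and `video_diffusion_repair` are hypothetical stand-ins for the paper's reconstruction backbone, trajectory sampling, renderer, and fine-tuned diffusion model, not the authors' actual API.

```python
# Hypothetical sketch of the cyclical fusion loop; every helper
# function here is an illustrative stand-in, not GenFusion's code.

def cyclical_fusion(scene, train_views, num_cycles=3):
    """Alternate between 3D reconstruction and video generation."""
    for _ in range(num_cycles):
        # Phase 1 (reconstruction): fit the 3D representation to all
        # currently available views, real and generated alike.
        scene = optimize_splats(scene, train_views)

        # Render artifact-prone RGB-D sequences along camera paths
        # that sweep into under-observed regions of the scene.
        cameras = sample_novel_trajectory(scene)
        rgbd_video = [render_rgbd(scene, cam) for cam in cameras]

        # Phase 2 (generation): the video diffusion model, conditioned
        # on the degraded RGB-D frames, outputs a clean, view-consistent
        # video that repairs artifacts and fills in missing content.
        repaired_frames = video_diffusion_repair(rgbd_video)

        # Close the loop: treat the repaired frames as additional
        # pseudo-observations for the next reconstruction pass.
        train_views = train_views + list(zip(cameras, repaired_frames))
    return scene
```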
Empirical Evaluation
The authors employ a robust evaluation framework across various datasets, including DL3DV and Tanks and Temples. The evaluation emphasizes view synthesis from sparse and masked inputs, demonstrating progressively greater artifact resilience. Strong empirical results underscore GenFusion's efficacy, with significant improvements in key metrics such as PSNR, SSIM, and LPIPS over traditional methods, particularly in scenarios with minimal input views.
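For readers reproducing such numbers, all three metrics are standard and available off the shelf. Below is a minimal sketch using the torchmetrics package (a tooling choice assumed here, not one prescribed by the paper) on renders normalized to [0, 1]:

```python
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)

# pred / target: (N, 3, H, W) image batches with values in [0, 1];
# random tensors stand in for rendered and ground-truth views.
pred = torch.rand(4, 3, 256, 256)
target = torch.rand(4, 3, 256, 256)

print(f"PSNR:  {psnr(pred, target):.2f} dB")  # higher is better
print(f"SSIM:  {ssim(pred, target):.4f}")     # higher is better
print(f"LPIPS: {lpips(pred, target):.4f}")    # lower is better
```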
Contribution and Implications
GenFusion stands out for its principled, cyclic approach that harnesses the strengths of both reconstruction and generative models, offering a compelling solution to the artifact issues that plague 3D reconstruction from sparse views. This work suggests a promising trajectory for future research in 3D reconstruction and novel view synthesis, particularly in areas such as augmented reality and autonomous navigation, which demand high-fidelity, scalable 3D data generation.
Future Directions
While GenFusion achieves a commendable alignment between the reconstruction and generation domains, potential areas for development include reducing the computational overhead of the iterative diffusion steps and further enhancing the spatial resolution of generated content. Additionally, resolving blurriness in large extrapolated regions via more sophisticated sequence handling could significantly improve view consistency.
In summary, GenFusion presents a notable advancement in closing the gap between 3D reconstruction and generation, setting the stage for broader applications and deeper integration of these two paradigms in real-world environments.