Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models (2503.01774v1)

Published 3 Mar 2025 in cs.CV

Abstract: Neural Radiance Fields and 3D Gaussian Splatting have revolutionized 3D reconstruction and novel-view synthesis tasks. However, achieving photorealistic rendering from extreme novel viewpoints remains challenging, as artifacts persist across representations. In this work, we introduce Difix3D+, a novel pipeline designed to enhance 3D reconstruction and novel-view synthesis through single-step diffusion models. At the core of our approach is Difix, a single-step image diffusion model trained to enhance and remove artifacts in rendered novel views caused by underconstrained regions of the 3D representation. Difix serves two critical roles in our pipeline. First, it is used during the reconstruction phase to clean up pseudo-training views that are rendered from the reconstruction and then distilled back into 3D. This greatly enhances underconstrained regions and improves the overall 3D representation quality. More importantly, Difix also acts as a neural enhancer during inference, effectively removing residual artifacts arising from imperfect 3D supervision and the limited capacity of current reconstruction models. Difix3D+ is a general solution, a single model compatible with both NeRF and 3DGS representations, and it achieves an average 2× improvement in FID score over baselines while maintaining 3D consistency.

Summary

  • The paper introduces Difix3D+, a pipeline leveraging a single-step diffusion model named Difix to enhance 3D reconstructions and novel-view synthesis by addressing rendering artifacts.
  • Difix improves the 3D representation by cleaning intermediate views for distillation during reconstruction and acts as a real-time post-processor for inference, compatible with both NeRF and 3DGS.
  • The method achieves state-of-the-art results, demonstrating an average 2x improvement in FID and over 1dB PSNR gain while maintaining 3D consistency across novel views.

The paper introduces Difix3D+, a pipeline designed to enhance 3D reconstruction and novel-view synthesis using single-step diffusion models. The approach aims to address the challenge of generating photorealistic renderings from novel viewpoints, where artifacts often persist in existing Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) representations. Difix3D+ leverages a single-step image diffusion model, named Difix, to refine rendered novel views by removing artifacts caused by under-constrained regions in the 3D representation.

Difix plays two main roles:

  • During reconstruction, it cleans up pseudo-training views rendered from the reconstruction, which are then distilled back into the 3D representation.
  • During inference, it acts as a neural enhancer, removing residual artifacts.

The paper highlights that Difix3D+ is compatible with both NeRF and 3DGS representations and achieves an average 2× improvement in Fréchet Inception Distance (FID) score over baselines while maintaining 3D consistency.

Introduction

Recent advancements in neural rendering, such as NeRF and 3DGS, have shown promise in novel-view synthesis. However, these methods still encounter challenges, particularly in rendering less observed areas or extreme novel views. The underlying limitation of NeRF and 3DGS approaches is their per-scene optimization framework, which is susceptible to shape-radiance ambiguity.

The paper addresses the challenge of using 2D diffusion priors to efficiently improve 3D reconstruction of large scenes. It builds upon single-step diffusion techniques and adapts them to "fix" artifacts in NeRF/3DGS renderings. The fine-tuned model, Difix, cleans pseudo-training views rendered during the reconstruction phase, which are then distilled back into 3D to enhance quality in under-constrained regions. Additionally, Difix is applied as a near real-time post-processing step to further improve quality.

The key contributions of the paper are:

  • Adapting 2D diffusion models to remove artifacts from rendered 3D neural representations.
  • Proposing an update pipeline that refines the 3D representation by distilling back improved novel views.
  • Demonstrating how single-step diffusion models enable near real-time post-processing.
  • Achieving state-of-the-art (SoTA) results, with improvements of over 1dB in PSNR and a 2× improvement in FID.

Related Work

The paper reviews prior work in scene reconstruction and novel-view synthesis, focusing on approaches for improving 3D reconstruction discrepancies, using priors for novel view synthesis, and applying generative priors. It discusses methods that improve NeRF's robustness to noisy camera inputs by optimizing camera poses and addressing lighting variations. The paper also highlights the use of geometric priors and generative priors from GANs and diffusion models to enhance novel view synthesis.

Background

The paper provides background information on 3D scene reconstruction, novel-view synthesis, NeRFs, 3DGS, and Diffusion Models (DMs).

NeRF represents scenes as an emissive volume encoded within the weights of a coordinate-based multilayer perceptron (MLP). The color of a ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ is rendered using:

$$\mathcal{C}(\mathbf{p}) = \sum_{i=1}^{N} \alpha_i \mathbf{c}_i \prod_{j=1}^{i-1} (1 - \alpha_j)$$

where

  • $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$ is the per-sample opacity
  • $N$ denotes the number of samples along the ray
  • $\delta_i$ is the step size.
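For concreteness, here is a minimal NumPy sketch of this compositing rule; the function name and array shapes are illustrative, not from the paper:

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Alpha-composite N samples along one ray (the equation above).

    sigmas: (N,) volume densities at the samples
    colors: (N, 3) RGB colors at the samples
    deltas: (N,) step sizes between consecutive samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)            # alpha_i = 1 - exp(-sigma_i * delta_i)
    # transmittance T_i = prod_{j<i} (1 - alpha_j); shift so T_1 = 1
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    weights = alphas * trans                           # per-sample contribution
    return (weights[:, None] * colors).sum(axis=0)     # pixel color C(p)

# usage: 64 samples along a single ray
pixel = composite_ray(np.random.rand(64), np.random.rand(64, 3), np.full(64, 0.05))
```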

3DGS uses volumetric particles parameterized by position $\boldsymbol{\mu} \in \mathbb{R}^3$, rotation $\mathbf{r} \in \mathbb{R}^4$, scale $\mathbf{s} \in \mathbb{R}^3$, opacity $\eta \in \mathbb{R}$, and color $\mathbf{c}_i$.
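As a reference for the data structure, a minimal sketch of this per-Gaussian parameterization (the field names are hypothetical, not from the 3DGS codebase):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    mu: np.ndarray     # position in R^3
    rot: np.ndarray    # rotation as a unit quaternion in R^4
    scale: np.ndarray  # per-axis scale in R^3
    eta: float         # opacity
    color: np.ndarray  # RGB color (often spherical-harmonics coefficients in practice)
```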

DMs learn to model the data distribution $p_{\text{data}}(\mathbf{x})$ through iterative denoising. The learnable parameters $\theta$ of the denoiser model $\mathbf{F}_\theta$ are optimized with the denoising score matching objective:

$$\mathbb{E}_{\mathbf{x} \sim p_{\text{data}},\; \tau \sim p_{\tau},\; \epsilon \sim \mathcal{N}(\mathbf{0}, I)} \left[ \Vert \epsilon - \mathbf{F}_\theta(\mathbf{x}_\tau; \mathbf{c}, \tau) \Vert_2^2 \right]$$

where $\mathbf{x}_\tau$ is the input noised to level $\tau$ and $\mathbf{c}$ is the conditioning.
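A schematic PyTorch training step for this objective might look as follows; the denoiser signature F_theta(x_tau, c, tau) and the noise schedule are placeholders, not the paper's implementation:

```python
import torch

def dsm_step(F_theta, x, c, alphas_cumprod, optimizer):
    """One denoising-score-matching step: predict the noise added to x.

    x: clean images (b, ch, h, w); c: conditioning; alphas_cumprod: 1-D noise
    schedule tensor on the same device as x.
    """
    b = x.shape[0]
    tau = torch.randint(0, len(alphas_cumprod), (b,), device=x.device)  # random noise level
    eps = torch.randn_like(x)                                           # Gaussian noise
    a = alphas_cumprod[tau].view(b, 1, 1, 1)
    x_tau = a.sqrt() * x + (1 - a).sqrt() * eps                         # noised input x_tau
    loss = (eps - F_theta(x_tau, c, tau)).pow(2).mean()                 # ||eps - F(x_tau; c, tau)||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```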

Boosting 3D Reconstruction with DM Priors

The paper details the approach to reconstruct a 3D representation that enables realistic novel view synthesis, particularly in under-constrained regions. The approach leverages the generative priors of a pre-trained diffusion model during optimization and inference.

Difix: From a Pretrained Diffusion Model to a 3D Artifact Fixer

The model predicts a refined novel view $\hat{I}$ from a noisy rendered novel view $\tilde{I}$ and a set of clean reference views $I_{\text{ref}}$. It is built on top of SD-Turbo, a single-step diffusion model.

To capture cross-view dependencies, the self-attention layers are adapted into a reference mixing layer, using the following operation:

$$\mathbf{z}' \leftarrow \operatorname{rearrange}(\mathbf{z},\; b\,c\,v\,(h\,w) \rightarrow b\,c\,(v\,h\,w))$$

$$\mathbf{z}' \leftarrow l_\phi^i(\mathbf{z}', \mathbf{z}')$$

$$\mathbf{z}' \leftarrow \operatorname{rearrange}(\mathbf{z}',\; b\,c\,(v\,h\,w) \rightarrow b\,c\,v\,(h\,w))$$

where $l_\phi^i$ is a self-attention layer applied over the $vhw$ dimension. The model is fine-tuned with a frozen VAE encoder and a LoRA fine-tuned decoder, and is trained to take the degraded rendered image $\tilde{I}$ directly as input at a noise level of $\tau = 200$.
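A hedged einops/PyTorch sketch of the reference mixing operation, with torch.nn.MultiheadAttention standing in for the adapted self-attention layer $l_\phi$:

```python
import torch
from einops import rearrange

def reference_mixing(z, attn):
    """Self-attention applied jointly over all views' spatial tokens.

    z:    latents of shape (b, c, v, h, w) -- batch, channels, views, height, width
    attn: torch.nn.MultiheadAttention with batch_first=True and embed_dim == c
    """
    b, c, v, h, w = z.shape
    tokens = rearrange(z, "b c v h w -> b (v h w) c")   # b c v (h w) -> b c (v h w)
    tokens, _ = attn(tokens, tokens, tokens)            # l_phi over the v*h*w axis
    return rearrange(tokens, "b (v h w) c -> b c v h w", v=v, h=h, w=w)

# usage: one target view mixed with two reference views
z = torch.randn(1, 64, 3, 16, 16)
attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
z_mixed = reference_mixing(z, attn)
```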

The diffusion model is supervised with losses derived from 2D supervision, including an L2 reconstruction loss, a perceptual LPIPS loss, and a style loss term.
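A minimal sketch of such a combined objective, assuming a Gram-matrix style term and the off-the-shelf lpips package; the loss weights and the VGG feature extraction (left to the caller) are placeholders, not the paper's exact setup:

```python
import torch
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net="vgg")  # expects images scaled to [-1, 1]

def gram(feat):
    """Gram matrix of a (b, c, h, w) feature map, used for the style term."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def difix_loss(pred, target, feats_pred, feats_target, w_lpips=1.0, w_style=1.0):
    """L2 reconstruction + perceptual LPIPS + style loss over feature pairs."""
    l_rec = (pred - target).pow(2).mean()
    l_lpips = lpips_fn(pred, target).mean()
    l_style = sum((gram(fp) - gram(ft)).pow(2).sum()
                  for fp, ft in zip(feats_pred, feats_target))
    return l_rec + w_lpips * l_lpips + w_style * l_style
```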

Data Curation

The model requires a large dataset of image pairs containing artifacts and corresponding "clean" ground-truth images. The paper explores several strategies to increase the number of training examples, including cycle reconstruction, model underfitting, and cross-reference techniques.

Difix3D+: NVS with Diffusion Priors

To address inconsistencies across different poses/frames, the outputs of the diffusion model are distilled back into the 3D representation during training, which improves multi-view consistency and yields higher perceptual quality. A final neural-enhancer step is applied at render time to remove residual artifacts.

To achieve multi-view consistency, an iterative training scheme is adopted that progressively grows the set of 3D cues.
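A schematic sketch of this progressive loop; all interfaces (fit, render, the pose schedule, the difix call) are hypothetical, not the paper's API:

```python
def difix3d_training(recon, difix, train_views, pose_schedule):
    """Progressive 3D updates: render, clean, and distill back, round by round.

    recon:         NeRF/3DGS model exposing fit(views) and render(pose)
    difix:         single-step fixer, difix(image, refs=[images]) -> cleaned image
    train_views:   list of (pose, image) pairs for the initial reconstruction
    pose_schedule: per-round lists of novel poses, stepping toward the target views
    """
    recon.fit(train_views)
    ref_images = [img for _, img in train_views]
    for poses in pose_schedule:
        rendered = [recon.render(p) for p in poses]          # artifact-prone renders
        cleaned = [difix(img, refs=ref_images) for img in rendered]
        train_views += list(zip(poses, cleaned))             # distill cleaned views back into 3D
        recon.fit(train_views)                               # refine the representation
    return recon
```

At inference time, the trained reconstruction renders the requested pose and Difix runs once more as a near real-time enhancer.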

Experiments

The paper evaluates Difix3D+ on in-the-wild scenes and automotive scenes, compares the approach against several baselines, analyzes its ability to enhance both NeRF- and 3DGS-based pipelines, and ablates its components.

In-the-Wild Artifact Removal

Difix is trained on a random selection of scenes from the DL3DV benchmark dataset. The evaluation is performed with Nerfacto and 3DGS backbones on held-out scenes from the DL3DV benchmark and the Nerfbusters dataset. The evaluation metrics include PSNR, SSIM, LPIPS, and FID score. The results show that Difix3D+ outperforms comparison methods, with significant improvements in perceptual quality and visual fidelity.

Automotive Scene Enhancement

The automotive capture rig contains three cameras with 40-degree overlap between adjacent cameras. Difix is trained on 40 scenes; NeRF is trained on the center camera, and the two side cameras are held out as novel views for evaluation. The method outperforms its baselines across all metrics.

Diagnostics

The method is ablated by adding pipeline components incrementally: Nerfacto alone, directly running Difix on rendered views, distilling Difix outputs back via 3D updates, applying those 3D updates incrementally, and finally adding Difix as a post-rendering step. The results validate the effectiveness of the incremental update strategy and the benefits of post-rendering processing.

Conclusion

The paper concludes by summarizing Difix3D+ as a pipeline for enhancing 3D reconstruction and novel-view synthesis. The approach leverages a single-step diffusion model (Difix) for real-time artifact removal and improves 3D representation quality through a progressive 3D update scheme.