- The paper introduces HAD, a novel framework that integrates pixel-wise hallucination scoring to detect and suppress inconsistent generative artifacts in diffusion priors.
- HAD employs a frozen multi-view encoder paired with a learnable scoring branch to compute per-pixel errors and mask unreliable regions during 3D reconstruction.
- Experimental results on DL3DV and MipNeRF360 show significant improvements in PSNR, SSIM, and LPIPS, validating HAD’s effectiveness and broad applicability.
Hallucination-Aware Diffusion Priors for High-Fidelity 3D Reconstruction
Problem Statement and Motivation
Diffusion models used as generative priors have substantially advanced sparse-view 3D reconstruction, notably by improving novel view synthesis with scene representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). However, these priors generate novel views that, while photorealistic, often fail to stay faithful to the provided multi-view observations. This generative inconsistency manifests as hallucinated scene elements (spurious structures or textures absent from the true observations) which, once integrated into the underlying 3D representation, propagate as systematic errors that degrade the geometry and appearance of the final reconstruction.
Conventional mitigation strategies that attempt to enforce fidelity within the generative diffusion process provide only partial remedies. The core limitation remains that diffusion models, by construction, can synthesize locally plausible but globally inconsistent content when conditioned on sparse or imperfect observations. This paper addresses the necessity of explicit hallucination detection and suppression within the generative prior pipeline for 3D reconstruction.
Hallucination-Aware Diffusion (HAD): Methodology
Hallucination Score Modeling
The cornerstone of this work is the introduction of Hallucination-Aware Diffusion priors (HAD). HAD augments diffusion-based view synthesis by generating, for every diffusion-augmented novel view, a pixel-wise hallucination score map that quantifies the degree of inconsistency between that view and the set of observed images. The hallucination score is estimated by a network comprising a frozen multi-view encoder and a learnable scoring branch. The multi-view encoder is instantiated by the feature backbone of a state-of-the-art feedforward novel view synthesis (NVS) network, LVSM, which encodes context from all input views at the novel target pose. This setup leverages the geometric reasoning capabilities of the NVS encoder, which are critically important for hallucination detection under sparse-view conditions.
The hallucination score branch receives as input the encoded context, the synthesized image from the diffusion prior, and optionally the direct 3DGS rendering. The network is trained to predict the mean absolute per-pixel error between the diffusion-augmented image and ground-truth novel observations, thus directly supervising the detection of hallucinated content.
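The supervision target described above can be illustrated with a minimal NumPy sketch. It assumes the "mean absolute per-pixel error" is averaged over color channels and that images are floats in [0, 1]; the function name `hallucination_target` is hypothetical and not taken from the paper.

```python
import numpy as np

def hallucination_target(generated: np.ndarray, ground_truth: np.ndarray) -> np.ndarray:
    """Per-pixel absolute error, averaged over color channels (an assumption).

    Both inputs are float arrays of shape (H, W, C) with values in [0, 1].
    Returns an (H, W) map; larger values mean stronger disagreement.
    """
    if generated.shape != ground_truth.shape:
        raise ValueError("images must have the same shape")
    return np.abs(generated - ground_truth).mean(axis=-1)

# Toy example: a 2x2 image where one pixel deviates from the ground truth.
gt = np.zeros((2, 2, 3))
gen = gt.copy()
gen[0, 1] = [0.3, 0.6, 0.9]  # the single "hallucinated" pixel
target = hallucination_target(gen, gt)
```

In this toy case only the modified pixel receives a non-zero target, which is the signal a scoring branch would be trained to regress.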
Hallucination Masking and Multi-Sampling Fusion
These hallucination score maps are thresholded to construct binary reliability masks. During 3DGS optimization, only reliable (low-hallucination) pixels from the augmented novel views are used to supervise the model, thereby suppressing the propagation of generative artifacts.
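A rough sketch of this masking step follows. The threshold value, the L1 form of the photometric loss, and the helper names `reliability_mask` and `masked_l1` are illustrative assumptions; the source does not specify them.

```python
import numpy as np

def reliability_mask(score_map: np.ndarray, threshold: float) -> np.ndarray:
    """Boolean (H, W) mask: True where the hallucination score is below the threshold."""
    return score_map < threshold

def masked_l1(rendered: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """Mean L1 error between (H, W, C) images, computed over reliable pixels only."""
    per_pixel = np.abs(rendered - target).mean(axis=-1)
    n_reliable = mask.sum()
    if n_reliable == 0:
        return 0.0  # no reliable pixel: this view contributes no supervision
    return float((per_pixel * mask).sum() / n_reliable)

scores = np.array([[0.02, 0.50],
                   [0.05, 0.30]])
mask = reliability_mask(scores, threshold=0.1)  # threshold is a hypothetical value

rendered = np.zeros((2, 2, 3))      # stand-in for a 3DGS rendering
augmented = np.zeros((2, 2, 3))     # stand-in for a diffusion-augmented view
augmented[0, 0] = 0.3               # reliable pixel, small disagreement
augmented[0, 1] = 1.0               # unreliable pixel, excluded by the mask
loss = masked_l1(rendered, augmented, mask)
```

The large error at the high-score pixel does not affect the loss, so it cannot pull the 3D representation toward hallucinated content.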
To further increase hallucination suppression and coverage of reliable content, HAD adopts a multi-sampling fusion schedule: multiple versions of the same novel viewpoint are generated by conditioning the diffusion prior on different randomly selected reference views. For each pixel, the version with the lowest hallucination score is selected. This strategy maximizes the utility of complementary multi-view cues and minimizes exposure to hallucinated artifacts without requiring architectural retraining or reconfiguration of the underlying diffusion prior.
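The per-pixel selection can be sketched as below, assuming the K sampled versions and their score maps are stacked along a leading axis. The function name `fuse_argmin` and the array layout are assumptions made for illustration.

```python
import numpy as np

def fuse_argmin(images: np.ndarray, scores: np.ndarray):
    """Select, for every pixel, the sample with the lowest hallucination score.

    images: (K, H, W, C) stack of K versions of the same novel view.
    scores: (K, H, W) matching hallucination score maps.
    Returns the fused (H, W, C) image, its (H, W) score map, and the
    (H, W) index map recording which version each pixel came from.
    """
    best = scores.argmin(axis=0)
    fused_img = np.take_along_axis(images, best[None, ..., None], axis=0)[0]
    fused_score = scores.min(axis=0)
    return fused_img, fused_score, best

# Two versions of a 1x2 single-channel view: version 0 is all 0.0, version 1 is all 1.0.
images = np.stack([np.zeros((1, 2, 1)), np.ones((1, 2, 1))])
scores = np.array([[[0.1, 0.9]],
                   [[0.5, 0.2]]])
fused, fused_score, best = fuse_argmin(images, scores)
```

Here the left pixel is taken from version 0 and the right pixel from version 1, and the fused score map can then be thresholded exactly as for a single sample.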
HAD is agnostic to the specific choice of the diffusion prior and merely operates downstream by filtering its outputs. This allows the framework to generalize across image, video, and multi-view diffusion pipelines.
Experimental Evaluation and Ablation Analysis
HAD is evaluated on DL3DV and MipNeRF360, two benchmarks for sparse-view novel view synthesis and 3DGS-based 3D reconstruction. The method is compared against a range of baselines and state-of-the-art methods, including both direct NVS methods (e.g., LVSM, DepthSplat) and diffusion-assisted 3DGS pipelines (e.g., Difix3D, GenFusion). Metrics include PSNR, SSIM, and LPIPS for overall photometric and perceptual fidelity.
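For reference, PSNR is a logarithmic function of the mean squared error, so the dB differences reported for this metric correspond to ratios of MSE. The sketch below uses the standard definition and assumes images scaled to [0, 1]; it is not the paper's evaluation code.

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

reference = np.zeros((4, 4, 3))
prediction = np.full((4, 4, 3), 0.1)  # uniform error of 0.1, so MSE = 0.01
value = psnr(prediction, reference)   # about 20 dB
```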
On DL3DV, HAD improves PSNR by 0.78 dB over Difix3D, with higher SSIM and lower LPIPS; the full pipeline reaches 22.13 dB PSNR, 0.757 SSIM, and 0.19 LPIPS. In cross-domain evaluation on MipNeRF360, a similar gain is observed (+0.69 dB PSNR over Difix3D), supporting the effectiveness of the hallucination-aware framework and its potential to generalize.
Hallucination Score Network Analysis
Ablation studies show that the pretrained multi-view encoder is essential: retraining from scratch or omitting this backbone degrades hallucination score estimation (MAE increases from 0.043 to 0.054) and reduces 3D reconstruction accuracy. Adding 3DGS-rendered views as an auxiliary input for hallucination scoring yields only marginal improvements.
Multi-Sampling and Fusion
Increasing the number of multi-sampled versions per novel view systematically improves PSNR, but with diminishing returns. The pixel-wise ArgMin fusion strategy (selecting the version with the minimum hallucination score for each pixel) outperforms weighted averaging, confirming that direct selection is more robust for artifact suppression.
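One possible intuition for this result is that averaging always mixes in some hallucinated content, whereas selection can discard it entirely. The sketch below contrasts the two; the softmax weighting and `temperature` value are assumptions, since the source does not specify how the weighted average is computed.

```python
import numpy as np

def fuse_weighted(images: np.ndarray, scores: np.ndarray, temperature: float = 0.1) -> np.ndarray:
    """Blend K versions with softmax weights that favor low hallucination scores."""
    logits = -scores / temperature
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)
    return (weights[..., None] * images).sum(axis=0)

def fuse_argmin(images: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Keep, per pixel, only the version with the lowest hallucination score."""
    best = scores.argmin(axis=0)
    return np.take_along_axis(images, best[None, ..., None], axis=0)[0]

# One pixel, two versions: a clean value (0.2) and a hallucinated one (0.9).
versions = np.array([[[[0.2]]], [[[0.9]]]])   # shape (K=2, H=1, W=1, C=1)
version_scores = np.array([[[0.05]], [[0.30]]])  # shape (K=2, H=1, W=1)

blended = fuse_weighted(versions, version_scores)[0, 0, 0]
selected = fuse_argmin(versions, version_scores)[0, 0, 0]
```

In this toy case the selected value equals the clean one, while the blended value is shifted toward the hallucinated sample even though its weight is small.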
Generalization and Transferability
HAD's hallucination scoring network, trained only on image diffusion-augmented samples, detects and masks hallucinations in video and multi-view diffusion pipelines (e.g., GenFusion, Stable Virtual Camera) without any additional fine-tuning, which indicates that it transfers across generative paradigms. Integrating HAD into existing state-of-the-art pipelines directly improves their quantitative performance (e.g., +0.23 dB PSNR for GenFusion).
Dense-View Regime
While the hallucination issue is most acute in sparse-view settings, HAD provides measurable (albeit reduced) improvements even in the dense-view regime (24 views), suggesting that the benefit of hallucination suppression persists across the tested input sparsity levels.
Implications and Future Directions
HAD makes the case that hallucination awareness is an important component for the safe and reliable use of generative diffusion priors in 3D scene reconstruction, particularly as foundation models become more deeply integrated into geometric pipelines. Explicit detection and filtering of hallucinated content is crucial for maintaining consistency with the input views in downstream tasks such as inverse rendering, video-based novel view synthesis, and AR/VR scene editing, where geometric or semantic fidelity is non-negotiable.
The work opens several immediate research directions:
- Scalability: Removing the dependence on curated datasets with ground-truth target views and shifting toward uncalibrated or weakly-supervised settings, as pioneered in self-supervised NVS pipelines.
- Generalization to Unseen Regions: Extending hallucination-aware modeling to enable controlled synthesis, rather than mere suppression, in genuinely unseen or ambiguous regions.
- Diffusion Model Calibration: Leveraging the hallucination score as an adaptive prior in the diffusion process itself, possibly incorporating feedback to reduce generative inconsistencies at source.
- Uncertainty Quantification: Bridging hallucination scores with calibrated uncertainty maps to inform downstream tasks about the reliability of generated 3D models.
Conclusion
HAD provides a systematic, effective approach for integrating diffusion-based generative priors into 3DGS pipelines while ensuring view fidelity via explicit hallucination detection and suppression. The approach is modular, generalizes across diverse generative paradigms, and sets a new performance baseline for sparse-view 3D reconstruction by demonstrating that hallucination awareness is critical for high-fidelity, reliable geometry synthesis (2605.16873).