MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene

Published 20 Apr 2026 in cs.CV | (2604.17965v1)

Abstract: Generalizable Neural Radiance Fields (GeNeRFs) enable high-quality scene reconstruction from sparse views and can generalize to unseen scenes. However, in real-world settings, transient distractors break cross-view structural consistency, corrupting supervision and degrading reconstruction quality. Existing distractor-free NeRF methods rely on per-scene optimization and estimate uncertainty from per-view reconstruction errors, which are not reliable for GeNeRFs and often misjudge inconsistent static structures as distractors. To this end, we propose MU-GeNeRF, a Multi-view Uncertainty-guided distractor-aware GeNeRF framework designed to alleviate GeNeRF's robust modeling challenges in the presence of transient distractions. We decompose distractor awareness into two complementary uncertainty components: Source-view Uncertainty, which captures structural discrepancies across source views caused by viewpoint changes or dynamic factors; and Target-view Uncertainty, which detects observation anomalies in the target image induced by transient distractors. These two uncertainties address distinct error sources and are combined through a heteroscedastic reconstruction loss, which guides the model to adaptively modulate supervision, enabling more robust distractor suppression and geometric modeling. Extensive experiments show that our method not only surpasses existing GeNeRFs but also achieves performance comparable to scene-specific distractor-free NeRFs.


Authors (11)

Summary

The paper introduces MU-GeNeRF, a Generalizable Neural Radiance Field (GeNeRF) framework that achieves robust 3D scene reconstruction and novel view synthesis in dynamic environments by decomposing distractor awareness into two complementary uncertainty components: Source-view Uncertainty and Target-view Uncertainty.
MU-GeNeRF integrates these two uncertainty components into a heteroscedastic reconstruction loss, adaptively modulating the supervision signal to effectively suppress transient distractors and improve geometric modeling.
Experiments show that MU-GeNeRF outperforms existing GeNeRFs such as ReTR and MuRF on distractor-heavy datasets, reaching fine-tuned PSNRs of 21.77 on the 'Corner' scene and 20.33 on the 'Patio-High' scene of the On-the-go dataset, while requiring significantly less fine-tuning than scene-specific NeRF methods.

MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene

Neural Radiance Fields (NeRF) and, more recently, Generalizable Neural Radiance Fields (GeNeRFs) have demonstrated significant capabilities in novel view synthesis and 3D reconstruction. A fundamental assumption in these frameworks is the static nature of the scene during data capture. This assumption is frequently violated in real-world scenarios due to the presence of transient distractors, such as dynamic objects or varying shadows, which introduce inconsistencies across multiple views. These inconsistencies corrupt the supervisory signal and consequently degrade reconstruction quality. Existing distractor-free NeRF methods typically rely on per-scene optimization and infer uncertainty from per-view reconstruction errors. However, this approach is often unreliable in GeNeRF settings, frequently misclassifying inconsistent static structures as distractors. The paper introduces MU-GeNeRF, a Multi-view Uncertainty-guided distractor-aware GeNeRF framework designed to enhance robust modeling in the presence of transient distractions by decomposing distractor awareness into two complementary uncertainty components: Source-view Uncertainty and Target-view Uncertainty (2604.17965).

Decomposing Uncertainty for Robust Modeling

MU-GeNeRF addresses the limitations of prior distractor-free NeRF methods in generalizable settings by proposing a novel framework that decouples the sources of reconstruction error. The core of this approach lies in the estimation of two distinct uncertainty types.

Source-view Uncertainty: This component is designed to capture structural discrepancies arising from viewpoint changes or dynamic factors across the source views. It is modeled through a generalizable feed-forward process. The feed-forward network, based on Transformer architectures from VolRecon [35] and ReTR [20], processes projected features, color samples, and spatial information from source images. The output includes both the predicted color and the Source-view Uncertainty for each sampled point. These point-wise uncertainties are then aggregated into a pixel-level Source-view Uncertainty using a Gaussian Mixture Model (GMM) (2604.17965). This GMM formulation allows for the inference of both color and uncertainty, producing an uncertainty map that reflects the reliability of aggregated multi-view information.
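One way to read the aggregation step, as a minimal sketch under stated assumptions: treat each sampled point on a ray as a Gaussian component with a predicted color (mean) and a point-wise uncertainty (standard deviation), and use normalized per-sample weights as mixture weights. The mixture's first two moments then give a pixel color and a pixel-level uncertainty. This summary does not give the paper's exact formula, so the function name `aggregate_ray`, the use of volume-rendering weights as mixture weights, and the single color channel are all illustrative assumptions.

```python
import math

def aggregate_ray(weights, colors, betas):
    """Collapse per-point Gaussians into one pixel-level mean and std.

    weights: non-negative per-sample weights (assumed here to come from
             volume rendering; normalized below so they sum to 1)
    colors:  per-sample predicted color (one channel, for brevity)
    betas:   per-sample standard deviation (point-wise uncertainty)
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    w = [x / total for x in weights]
    # Mixture mean: weighted average of component means.
    mean = sum(wi * ci for wi, ci in zip(w, colors))
    # Mixture second moment: E[x^2] = sum_i w_i * (sigma_i^2 + mu_i^2).
    second = sum(wi * (bi * bi + ci * ci) for wi, ci, bi in zip(w, colors, betas))
    variance = max(second - mean * mean, 0.0)  # clamp tiny negative round-off
    return mean, math.sqrt(variance)

# Samples that agree on color: pixel uncertainty stays at the per-point level.
agree_color, agree_beta = aggregate_ray([0.5, 0.5], [0.4, 0.4], [0.1, 0.1])
# Samples that disagree on color: the spread of the means inflates uncertainty.
clash_color, clash_beta = aggregate_ray([0.5, 0.5], [0.1, 0.7], [0.1, 0.1])
```

In this toy case both rays render the same mean color, but the second ray carries a larger standard deviation because its components disagree, which is the behavior one would want from a map that reflects how reliable the aggregated multi-view information is.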

Target-view Uncertainty: This component is specifically designed to detect observation anomalies in the target image that are induced by transient distractors. It is estimated from semantic features extracted from the target image using a pre-trained DINOv2 network [31], which are then passed through a decoder to generate a dense uncertainty map (2604.17965). This approach provides a dense, spatially distributed uncertainty map that localizes potential distractors within the target view. Unlike per-ray uncertainty estimations in some prior works, the dense prediction fully leverages the spatial modeling capabilities of CNNs, simplifying training and enhancing robustness.
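The feature-to-map step can be illustrated with a toy sketch. The decoder architecture is not described in this summary, so the code below substitutes a hypothetical one-layer decoder: a linear projection of each patch feature, a softplus to keep the output positive, and nearest-neighbor upsampling to pixel resolution. The feature values are made up and are not real DINOv2 outputs; `decode_uncertainty`, `weight`, `bias`, and `patch` are illustrative names.

```python
import math

def softplus(x):
    # Numerically stable softplus; keeps the predicted uncertainty positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def decode_uncertainty(patch_feats, weight, bias, patch):
    """Map a grid of patch features to a dense per-pixel uncertainty map.

    patch_feats:  rows x cols grid of feature vectors (stand-in for
                  features from a frozen pre-trained backbone)
    weight, bias: parameters of a hypothetical one-layer decoder
    patch:        side length, in pixels, covered by each patch feature
    """
    coarse = [
        [softplus(sum(w * f for w, f in zip(weight, feat)) + bias) for feat in row]
        for row in patch_feats
    ]
    dense = []
    for row in coarse:
        # Repeat each value `patch` times horizontally, then the row vertically.
        wide = [v for v in row for _ in range(patch)]
        dense.extend(list(wide) for _ in range(patch))
    return dense

# 2 x 2 grid of 2-D features; only the top-right patch "looks like" a distractor.
feats = [[[0.0, 0.0], [3.0, 1.0]],
         [[0.0, 0.0], [0.0, 0.0]]]
beta_t = decode_uncertainty(feats, weight=[1.0, 1.0], bias=-2.0, patch=2)
```

The result is a 4 x 4 map in which the pixels under the top-right patch receive a higher uncertainty than the rest, so the anomaly is localized spatially rather than predicted ray by ray.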

Multi-view Uncertainty-Guided Robust Distractor Suppression

The two complementary uncertainty components, Source-view Uncertainty ($\beta_S$) and Target-view Uncertainty ($\beta_T$), are integrated into a heteroscedastic reconstruction loss. They are blended into a combined uncertainty $\beta_{TS} = w \cdot \beta_T + (1-w) \cdot \beta_S$, where $w$ weights the two terms; this combined uncertainty adaptively modulates the supervision signal, enabling more robust distractor suppression and geometric modeling. The loss function incorporates both Structural Similarity Index Measure (SSIM) [50] and Mean Squared Error (MSE) terms, reformulated as:

$L_{\text{Multi-uncer}} = L_{\text{SSIM}}(P(r), \hat{P}(r)) + \frac{L_{\text{MSE}}(P(r), \hat{P}(r))}{2\beta_{TS}^2(r)} + \lambda \log \beta_{TS}(r)$
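
A short worked derivation, added here for intuition and not taken from the paper, shows how the last two terms modulate supervision. Holding the prediction fixed and assuming $\lambda > 0$ and $L_{\text{MSE}} > 0$, setting the derivative with respect to $\beta_{TS}$ to zero gives:

$\frac{\partial}{\partial \beta_{TS}} \left[ \frac{L_{\text{MSE}}}{2\beta_{TS}^2} + \lambda \log \beta_{TS} \right] = -\frac{L_{\text{MSE}}}{\beta_{TS}^3} + \frac{\lambda}{\beta_{TS}} = 0 \quad \Rightarrow \quad \beta_{TS}^2 = \frac{L_{\text{MSE}}}{\lambda}$

The second derivative at this point is $2\lambda^2 / L_{\text{MSE}} > 0$, so it is a minimum. Rays with a large residual are therefore pushed toward a larger $\beta_{TS}$, which scales their MSE gradient down by $1/(2\beta_{TS}^2)$, while the $\lambda \log \beta_{TS}$ term keeps the model from inflating the uncertainty everywhere.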

The patch-based SSIM loss is crucial for this framework. Both $\beta_S$ and $\beta_T$ rely on capturing locally correlated structural or semantic variations, which necessitate spatially consistent supervision. The SSIM loss provides spatially smooth and context-aware gradients, promoting the learning of coherent uncertainty distributions and preventing overfitting to isolated noise. This significantly enhances the reliability of uncertainty estimations and the robustness of the training process (2604.17965).
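The loss can be sketched on a single patch as follows. This is an illustration under assumptions, not the authors' implementation: the SSIM term is taken as $1 - \text{SSIM}$ computed over one window with the usual constants for values in [0, 1], the MSE and log terms are applied per pixel and averaged over the patch, and `w` and `lam` are placeholder values for $w$ and $\lambda$.

```python
import math

def patch_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM between two flattened patches with values in [0, 1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2)
    )

def multi_uncertainty_loss(target, pred, beta_t, beta_s, w=0.5, lam=1.0):
    """SSIM term + uncertainty-weighted MSE + log regularizer over one patch."""
    # Blend target-view and source-view uncertainty per pixel.
    beta_ts = [w * bt + (1.0 - w) * bs for bt, bs in zip(beta_t, beta_s)]
    ssim_term = 1.0 - patch_ssim(target, pred)  # assumed form of L_SSIM
    weighted = [
        (t - p) ** 2 / (2.0 * b * b) + lam * math.log(b)
        for t, p, b in zip(target, pred, beta_ts)
    ]
    return ssim_term + sum(weighted) / len(weighted)

# The last target pixel is covered by a distractor the prediction does not see.
target = [0.2, 0.4, 0.6, 0.8]
pred = [0.2, 0.4, 0.6, 0.1]
beta_s = [0.2, 0.2, 0.2, 0.2]

loss_uniform = multi_uncertainty_loss(target, pred, [0.2, 0.2, 0.2, 0.2], beta_s)
loss_aware = multi_uncertainty_loss(target, pred, [0.2, 0.2, 0.2, 1.2], beta_s)
```

Raising the target-view uncertainty on the corrupted pixel lowers the total loss, because the large residual there is divided by a larger $2\beta_{TS}^2$ at the cost of only a modest log penalty; the clean pixels are unaffected.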

Experimental Validation and Implications

Extensive experiments on the On-the-go [34] and RobustNeRF [36] datasets demonstrate that MU-GeNeRF consistently outperforms existing GeNeRFs, such as ReTR [20] and MuRF [51], in handling dynamic scenes and effectively suppressing distractors. Quantitatively, on the On-the-go dataset (after fine-tuning), MU-GeNeRF achieves a PSNR of 21.77 and SSIM of 0.669 on the 'Corner' scene, and a PSNR of 20.33 and SSIM of 0.564 on 'Patio-High'. On 'Corner', these results surpass ReTR (PSNR 19.76, SSIM 0.611) by 2.01 dB and MuRF (PSNR 14.03, SSIM 0.363) by 7.74 dB in PSNR (2604.17965).
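To put the 'Corner' PSNR gaps in perspective, the snippet below converts them into approximate error ratios using the standard definition PSNR = 10 log10(MAX^2 / MSE). The scores are the ones quoted above; treating the difference of reported (averaged) PSNRs as a single MSE ratio is a simplification, so the ratios are indicative only.

```python
# Fine-tuned PSNR (dB) on the On-the-go 'Corner' scene, as quoted in the text.
scores = {"MU-GeNeRF": 21.77, "ReTR": 19.76, "MuRF": 14.03}

def mse_ratio(psnr_gain_db):
    # A gain of g dB in PSNR divides the MSE by 10 ** (g / 10).
    return 10 ** (psnr_gain_db / 10)

ours = scores["MU-GeNeRF"]
gains = {name: round(ours - s, 2) for name, s in scores.items() if name != "MU-GeNeRF"}
ratios = {name: mse_ratio(g) for name, g in gains.items()}

for name in gains:
    print(f"vs {name}: +{gains[name]:.2f} dB, about {ratios[name]:.1f}x lower MSE")
```

Under that simplification, the gaps correspond to roughly 1.6 times lower MSE than ReTR and roughly 5.9 times lower MSE than MuRF on this scene.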

While the performance of MU-GeNeRF is marginally lower than that of scene-specific distractor-free NeRFs such as NeRF on-the-go [34], it operates under a fundamentally different paradigm. MU-GeNeRF is a feed-forward GeNeRF designed for cross-scene generalization, requiring significantly less fine-tuning (approximately 60K iterations, or 2 hours) than NeRF on-the-go's 250K iterations (48 hours) of per-scene optimization (2604.17965), which is roughly 4 times fewer iterations and 24 times less wall-clock time. The method thus trades a small amount of peak quality, attainable only with extensive per-scene optimization, for large gains in efficiency and generalizability.

The ablation studies further underscore the importance of each component. Removing uncertainty modeling altogether leads to a marked drop in reconstruction quality. The individual contributions of Source-view and Target-view Uncertainties are also critical. Relying solely on Target-view Uncertainty can misclassify inconsistent static structures as distractors, while Source-view Uncertainty alone cannot precisely localize transient distractors in the target view (2604.17965). The synergistic combination of both uncertainties under the heteroscedastic framework effectively mitigates these failure modes, leading to more robust distractor suppression and accurate geometric modeling.

Conclusion

MU-GeNeRF provides a robust framework for overcoming the challenges posed by transient distractors in Generalizable Neural Radiance Fields. By decoupling reconstruction errors through distinct Source-view and Target-view Uncertainties, it enables accurate distractor identification and adaptive suppression. While the framework relies on a robust supervision mechanism rather than explicit removal of distractors, its efficacy in generalizing across diverse dynamic scenes with significantly reduced per-scene optimization overhead presents tangible advantages. Future research could explore explicit distractor modeling and tighter integration with dynamic scene understanding techniques to enhance both reliability and interpretability, particularly in scenarios with high occlusion.
