- The paper introduces GeoQuery, integrating explicit geometric correspondences into diffusion-based pipelines to mitigate query contamination in sparse-view 3D reconstruction.
- A novel Geometry-Guided Cross-View Attention mechanism replaces corrupted queries with proxy queries from reference images, improving PSNR by 1.09 dB and reducing FID by 2.63 relative to DIFIX3D+ on artifact removal.
- Extensive experiments demonstrate GeoQuery’s robustness in artifact removal and detail preservation even at extreme sparsity, using as few as three input views.
Geometry-Query Diffusion for Robust Sparse-View 3D Reconstruction
Introduction and Motivation
Sparse-view 3D reconstruction and novel view synthesis (NVS) are challenging because scene geometry and texture are under-constrained when only a few views are available. 3D Gaussian Splatting (3DGS) has emerged as a real-time, high-fidelity explicit representation for novel view rendering, but it is susceptible to severe rendering artifacts, such as geometric collapse and floating structures, when observations are scarce. Previous attempts to address these failures, notably diffusion-based render-and-refine frameworks, rely on multi-view self-attention to aggregate contextual information across source images. These approaches are derailed by what the paper terms "Query Contamination": artifact-prone or corrupted queries from noisy novel-view renderings retrieve non-corresponding features during attention, which propagates or even amplifies artifacts. The result is hallucinated semantics and inconsistent refinement.
The paper introduces GeoQuery—a geometry-guided diffusion approach designed to resolve the query contamination bottleneck. GeoQuery integrates explicit geometric correspondences between views into the diffusion framework, leveraging a novel Geometry-Guided Cross-View Attention (GCA) mechanism. Instead of querying with potentially corrupted features from the novel view, GCA constructs proxy queries derived directly from reference images, anchored by dense geometric correspondences that are estimated via depth maps and camera poses. These proxy queries then retrieve features in a local neighborhood, mitigating spurious or non-physical matches.
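The depth-and-pose correspondence step described above can be sketched with a standard pinhole reprojection. This is a minimal illustration, not the paper's implementation: the function name `reproject_to_target`, the shared intrinsics matrix `K`, and the 4x4 transform `T_ref_to_tgt` are assumptions introduced here. The sketch maps reference pixels forward into the novel view; turning that into a per-target-pixel lookup (for example by scattering with a z-buffer) is omitted.

```python
import numpy as np

def reproject_to_target(depth_ref, K, T_ref_to_tgt):
    """Map every reference pixel to its location in the target (novel) view.

    depth_ref:    (H, W) reference-view depth along the camera z-axis
    K:            (3, 3) pinhole intrinsics, assumed shared by both views
    T_ref_to_tgt: (4, 4) rigid transform from reference to target camera frame
    Returns (H, W, 2) target pixel coordinates (u, v) and (H, W) target depth.
    """
    H, W = depth_ref.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)

    # Unproject: X_ref = depth * K^{-1} [u, v, 1]^T
    rays = pix @ np.linalg.inv(K).T
    pts_ref = rays * depth_ref[..., None]

    # Move the 3D points into the target camera frame.
    R, t = T_ref_to_tgt[:3, :3], T_ref_to_tgt[:3, 3]
    pts_tgt = pts_ref @ R.T + t

    # Project with the same intrinsics and divide by depth.
    proj = pts_tgt @ K.T
    z = proj[..., 2]
    uv = proj[..., :2] / np.clip(z, 1e-8, None)[..., None]
    return uv, z
```

With an identity transform every pixel maps to itself; a pure sideways translation shifts pixels horizontally by an amount inversely proportional to depth, which is the parallax the correspondence map has to capture.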
GeoQuery is positioned as a drop-in module for modern diffusion-based pipelines, offering significant boosts in both artifact removal and sparse-view reconstruction, especially under extreme sparsity (e.g., 3 views).
Figure 1: Overview of GeoQuery progressive refinement on sparse training data via 3D Gaussian Splatting and geometry-guided rendering supervision.
Technical Contributions
Explicit Geometry-Guided Cross-View Attention
GeoQuery's core innovation is the GCA module, which pivots from global semantic feature aggregation to a geometry-aligned retrieval strategy. For each target pixel in the artifact-prone novel view, a dense correspondence map—constructed using the predicted depth from the reference and camera projection—locates the spatially aligned pixel in the reference view. Instead of querying with the feature embeddings from the corrupted rendering, the corresponding feature from the reference is used as a proxy query. This approach entirely bypasses unreliable information in the contaminated region and localizes the scope of attention spatially using a k×k window, enhancing geometric faithfulness and suppressing erroneous matches.
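To make the proxy-query idea concrete, here is a small NumPy sketch of scaled dot-product attention restricted to a k×k window. The summary does not spell out which feature map supplies the keys and values, so the sketch takes them as separate arguments; the function name `local_window_attention` and the `centers` array (integer window centres, one per target pixel) are hypothetical. A real module would run batched on a GPU rather than in Python loops.

```python
import numpy as np

def local_window_attention(query, keys, values, centers, k=3):
    """Scaled dot-product attention restricted to a k x k window per pixel.

    query:   (H, W, C) proxy queries, e.g. reference features gathered
             at the geometrically corresponding pixels
    keys:    (Hk, Wk, C) key feature map (its source is an assumption)
    values:  (Hk, Wk, D) value feature map
    centers: (H, W, 2) integer (row, col) window centres in the key map
    """
    H, W, C = query.shape
    Hk, Wk, _ = keys.shape
    D = values.shape[-1]
    r = k // 2
    out = np.zeros((H, W, D))
    for i in range(H):
        for j in range(W):
            # Clamp the centre into the map, then clip the window at borders.
            ci = min(max(int(centers[i, j, 0]), 0), Hk - 1)
            cj = min(max(int(centers[i, j, 1]), 0), Wk - 1)
            i0, i1 = max(ci - r, 0), min(ci + r + 1, Hk)
            j0, j1 = max(cj - r, 0), min(cj + r + 1, Wk)
            k_win = keys[i0:i1, j0:j1].reshape(-1, C)
            v_win = values[i0:i1, j0:j1].reshape(-1, D)
            logits = k_win @ query[i, j] / np.sqrt(C)
            weights = np.exp(logits - logits.max())  # stable softmax
            weights /= weights.sum()
            out[i, j] = weights @ v_win
    return out
```

Because each query only sees at most k×k candidates around its geometric match, a corrupted region elsewhere in the image cannot be retrieved, which is the locality property the paragraph above describes.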
The system deploys a learnable blending gate to combine geometry-guided and global attention results adaptively at each pixel, balancing local geometric evidence and global semantic completion.
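One plausible form of such a gate is a per-pixel sigmoid that produces a convex combination of the two attention outputs. The summary does not give the gate's actual parameterization, so the linear layer below (weights `w`, bias `b`) and the name `gated_blend` are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_blend(geo_out, glob_out, w, b):
    """Per-pixel convex combination of two attention outputs.

    geo_out, glob_out: (H, W, C) geometry-guided and global attention results
    w: (2C,) gate weights, b: scalar bias (both would be learned in practice)
    Returns the blended (H, W, C) features and the (H, W) gate values.
    """
    gate_in = np.concatenate([geo_out, glob_out], axis=-1)  # (H, W, 2C)
    g = sigmoid(gate_in @ w + b)[..., None]                 # (H, W, 1) in (0, 1)
    blended = g * geo_out + (1.0 - g) * glob_out
    return blended, g[..., 0]
```

A gate near 1 trusts the local geometric evidence, while a gate near 0 defers to global semantic completion; intermediate values mix the two.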
Figure 2: Qualitative comparisons on artifact removal: GeoQuery versus DIFIX3D+ and ground truth, demonstrating improved recovery of fine structure.
Query Contamination and Region-Level Analysis
The study provides a quantitative and qualitative dissection of query contamination effects. Standard diffusion-based refinement methods that employ multi-view self-attention are shown to misaggregate features in regions afflicted by artifacts, often propagating errors into otherwise reliable regions. GeoQuery measurably reduces this degradation: a region-based PSNR analysis shows a 4 dB improvement over DIFIX3D+ in high-error regions.
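A region-level PSNR of this kind can be computed by restricting the mean squared error to a mask. The sketch below is illustrative only: the summary does not say how high-error regions are selected, so the quantile-based `high_error_mask` (top 10% of per-pixel error in the input rendering) is an assumption, as are both function names.

```python
import numpy as np

def masked_psnr(pred, target, mask, max_val=1.0):
    """PSNR computed only over pixels where mask is True.

    pred, target: (H, W, 3) images in [0, max_val]; mask: (H, W) boolean.
    """
    diff = (pred - target)[mask]
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def high_error_mask(render, target, quantile=0.9):
    """Mark pixels whose squared error falls in the top (1 - quantile) fraction.

    Assumed selection rule; the source does not define its regions.
    """
    err = np.mean((render - target) ** 2, axis=-1)
    return err >= np.quantile(err, quantile)
```

Evaluating two refinement methods with the same mask, derived once from the unrefined rendering, isolates how each behaves exactly where the queries were most contaminated.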
Figure 3: Additional qualitative comparisons further demonstrate the robustness of GeoQuery for artifact removal across diverse scenarios.
Experimental Results
Artifact Removal
GeoQuery is validated on the DL3DV-Benchmark, where it achieves the top performance across all major metrics (PSNR, SSIM, LPIPS, FID) for rendering artifact removal. Specifically, GeoQuery improves PSNR by 1.09 dB and reduces FID by 2.63 compared to DIFIX3D+ under the same conditions.
Sparse-View Novel View Synthesis
On both DL3DV-Benchmark and Mip-NeRF360, GeoQuery consistently surpasses both regularization-based and prior diffusion-based methods (e.g., FSGS, DIFIX3D+) at 3, 6, and 9 input views. Its advantage is most pronounced under extreme sparsity; in the 3-view setting, it outperforms DIFIX3D+ by 0.92 dB PSNR on Mip-NeRF360 and by 0.78 dB on DL3DV-Benchmark.
Figure 4: Same-scene comparison under varying input views, highlighting accuracy and stability of GeoQuery as view sparsity increases.
Robustness, Graceful Degradation, and Ablation Studies
GeoQuery demonstrates not only higher average reconstruction fidelity but also graceful degradation as the number of input views decreases. Ablation studies attribute the main performance gains to geometry-indexed proxy queries and localized attention: replacing rendering-derived queries with geometry-derived proxies increases PSNR and further reduces artifact-induced errors. Tuning the window size of the local attention module shows that k=3 gives the best tradeoff between locality and retrieval ambiguity.
Figure 5: Extended qualitative results showing the generalization of GeoQuery improvements across the large-scale Mip-NeRF360 dataset.
Figure 6: Extended qualitative results on the DL3DV-Benchmark, emphasizing consistent artifact mitigation and detail preservation.
Figure 7: Summary visual comparisons across both 360-degree and indoor scenes, where GeoQuery produces renderings that are more semantically plausible and geometrically robust than those of current baselines.
Limitations and Implications
While GeoQuery leverages explicit geometric correspondences to bypass query contamination, it inherits failure modes of correspondence estimation—specifically, unreliable operation in textureless or highly specular regions, or in scenarios with extreme viewpoint displacement where depth-based correspondences cannot be reliably computed. In such cases, the framework must revert to generative completion, potentially sacrificing geometric consistency.
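One simple way to picture this fallback is a validity mask over the reprojected correspondences that forces the blend toward the global (generative) branch wherever the geometry cannot be trusted. This is a hedged sketch rather than the paper's mechanism: the checks below (positive depth, in-frame projection) only catch the viewpoint-displacement case, not textureless or specular depth errors, and the names `correspondence_validity` and `fallback_blend` are hypothetical.

```python
import numpy as np

def correspondence_validity(uv, z, height, width):
    """True where a reprojected point has positive depth and lands in the image.

    uv: (H, W, 2) projected pixel coordinates (u, v); z: (H, W) projected depth.
    """
    u, v = uv[..., 0], uv[..., 1]
    in_front = z > 0
    in_bounds = (u >= 0) & (u <= width - 1) & (v >= 0) & (v <= height - 1)
    return in_front & in_bounds

def fallback_blend(geo_out, glob_out, gate, valid):
    """Zero the gate where the correspondence is invalid, so those pixels
    take the global attention result only (assumed fallback rule)."""
    g = np.where(valid, gate, 0.0)[..., None]
    return g * geo_out + (1.0 - g) * glob_out
```

Pixels that fail the check receive no geometric guidance at all, which matches the trade-off noted above: completion stays plausible but is no longer anchored to scene geometry.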
Despite these limitations, the introduction of explicit geometry-guided retrieval mechanisms substantially pushes forward the robustness of sparse-view reconstruction pipelines. The practical implications include reliable 3D reconstruction from minimal input, improved rendering consistency for downstream vision and graphics tasks, and a pathway toward integrating more powerful priors as diffusion backbones evolve.
Conclusion
GeoQuery establishes that integrating explicit geometric guidance into the cross-view diffusion framework resolves a longstanding bottleneck in sparse-view 3D reconstruction: the propagation of artifacts due to query contamination in self-attention mechanisms. By employing proxy queries grounded in geometric correspondences and restricting attention to spatially localized windows, the method consistently yields superior artifact removal and robust novel view synthesis, particularly under extreme data sparsity. Its architectural modularity enables seamless adoption within existing diffusion-based pipelines. Future work may pair this geometry-guided approach with more advanced diffusion priors to further mitigate dependency on correspondence quality and extend applicability to more challenging regimes.