GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction

Published 12 May 2026 in cs.CV | (2605.12399v1)

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a prominent paradigm for 3D reconstruction and novel view synthesis. However, it remains vulnerable to severe artifacts when trained under sparse-view constraints. While recent methods attempt to rectify artifacts in rendered views using image diffusion models, they typically rely on multi-view self-attention to retrieve information from reference images. We observe that this mechanism often fails when the rendered novel views output by 3DGS are heavily corrupted: damaged query features lead to erroneous cross-view retrieval, resulting in inconsistent rendering refinement. To address this, we propose GeoQuery, a geometry-guided diffusion framework that integrates generative priors with explicit geometric cues via a novel Geometry-guided Cross-view Attention (GCA) mechanism. First, by leveraging predicted depth maps and camera poses, we construct a geometry-induced correspondence field to sample reference features, forming a geometry-aligned proxy query that replaces the corrupted rendering features. Furthermore, we design a new cross-view feature aggregation pipeline, in which we restrict the cross-view attention to a local window around each proxy query to effectively retrieve useful features while suppressing spurious matches. GeoQuery can be seamlessly integrated into existing diffusion-based pipelines, enabling robust reconstruction even under extreme view sparsity. Extensive experiments on sparse-view novel view synthesis and rendering artifact removal demonstrate the effectiveness of our approach.

Authors (7)

Summary

The paper introduces GeoQuery, integrating explicit geometric correspondences into diffusion-based pipelines to mitigate query contamination in sparse-view 3D reconstruction.
A novel Geometry-Guided Cross-View Attention (GCA) mechanism replaces corrupted queries with proxy queries sampled from reference images; on DL3DV-Benchmark artifact removal it improves PSNR by 1.09 dB and reduces FID by 2.63 relative to DIFIX3D+.
Extensive experiments demonstrate GeoQuery’s robustness in artifact removal and detail preservation even at extreme sparsity, using as few as three input views.

Geometry-Query Diffusion for Robust Sparse-View 3D Reconstruction

Introduction and Motivation

Sparse-view 3D reconstruction and novel view synthesis (NVS) are challenging because scene geometry and texture are under-constrained when only a few views are available. 3D Gaussian Splatting (3DGS) has emerged as a real-time, high-fidelity explicit representation for novel view rendering. However, it is susceptible to severe rendering artifacts, such as geometric collapse and floating structures, when observations are scarce. Previous attempts to address these failures, chiefly diffusion-based render-and-refine frameworks, rely on multi-view self-attention to aggregate contextual information across source images. These approaches are derailed by what the paper terms "Query Contamination": corrupted queries from artifact-laden novel-view renderings retrieve non-corresponding features during attention, propagating or even amplifying the artifacts. The result is hallucinated semantics and inconsistent refinement.

The paper introduces GeoQuery—a geometry-guided diffusion approach designed to resolve the query contamination bottleneck. GeoQuery integrates explicit geometric correspondences between views into the diffusion framework, leveraging a novel Geometry-Guided Cross-View Attention (GCA) mechanism. Instead of querying with potentially corrupted features from the novel view, GCA constructs proxy queries derived directly from reference images, anchored by dense geometric correspondences that are estimated via depth maps and camera poses. These proxy queries then retrieve features in a local neighborhood, mitigating spurious or non-physical matches.
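The correspondence step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a pinhole camera with intrinsics `K` shared by both views, a reference depth map storing z-depth, and a rigid transform `T_ref_to_tgt` from the reference to the target camera frame. The function name and the convention of marking unmatched target pixels with -1 are hypothetical.

```python
import numpy as np

def correspondence_field(depth_ref, K, T_ref_to_tgt, tgt_shape):
    """For every target pixel, return the (row, col) of the reference pixel
    that projects onto it, or -1 where no reference pixel lands.

    depth_ref:    (H, W) z-depth of the reference view
    K:            (3, 3) pinhole intrinsics, assumed shared by both views
    T_ref_to_tgt: (4, 4) rigid transform, reference frame -> target frame
    tgt_shape:    (Ht, Wt) size of the target view
    """
    H, W = depth_ref.shape
    Ht, Wt = tgt_shape
    rows, cols = np.mgrid[0:H, 0:W]
    rows, cols = rows.ravel(), cols.ravel()

    # Homogeneous pixel coordinates (x = col, y = row), one column per pixel.
    pix = np.stack([cols, rows, np.ones_like(rows)], axis=0).astype(np.float64)
    # Unproject to 3D in the reference frame, then move to the target frame.
    pts_ref = (np.linalg.inv(K) @ pix) * depth_ref.ravel()
    pts_tgt = T_ref_to_tgt[:3, :3] @ pts_ref + T_ref_to_tgt[:3, 3:4]

    # Project into the target image, guarding against points behind the camera.
    z = pts_tgt[2]
    in_front = z > 1e-6
    z_safe = np.where(in_front, z, 1.0)
    proj = K @ pts_tgt
    u = np.round(proj[0] / z_safe).astype(np.int64)
    v = np.round(proj[1] / z_safe).astype(np.int64)
    valid = in_front & (u >= 0) & (u < Wt) & (v >= 0) & (v < Ht)

    # Z-buffer: when several reference pixels hit one target pixel,
    # keep only the one closest to the target camera.
    idx = np.flatnonzero(valid)
    lin = v[idx] * Wt + u[idx]
    order = np.lexsort((z[idx], lin))          # sort by target pixel, then depth
    lin_sorted = lin[order]
    first = np.ones(lin_sorted.shape, dtype=bool)
    first[1:] = lin_sorted[1:] != lin_sorted[:-1]
    winners = idx[order[first]]

    field = np.full((Ht, Wt, 2), -1, dtype=np.int64)
    field[v[winners], u[winners], 0] = rows[winners]
    field[v[winners], u[winners], 1] = cols[winners]
    return field
```

Target pixels left at -1 are those that no reference pixel reaches (occlusions or regions outside the reference frustum); a practical system would have to fall back to another source of information there.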

GeoQuery is positioned as a drop-in module for modern diffusion-based pipelines, improving both artifact removal and sparse-view reconstruction, with the largest gains under extreme sparsity (e.g., 3 views).

Figure 1: Overview of GeoQuery progressive refinement on sparse training data via 3D Gaussian Splatting and geometry-guided rendering supervision.

Technical Contributions

Explicit Geometry-Guided Cross-View Attention

GeoQuery's core innovation is the GCA module, which pivots from global semantic feature aggregation to a geometry-aligned retrieval strategy. For each target pixel in the artifact-prone novel view, a dense correspondence map, constructed from the predicted depth of the reference view and camera projection, locates the spatially aligned pixel in the reference view. Instead of querying with the feature embeddings from the corrupted rendering, the corresponding feature from the reference is used as a proxy query. This bypasses the unreliable information in the contaminated region. Attention is then restricted to a $k\times k$ window around each proxy query, which improves geometric faithfulness and suppresses erroneous matches.
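A rough sketch of proxy-query sampling followed by windowed attention is given below. It is a simplification under stated assumptions rather than the paper's module: the learned query/key/value projections of a real attention layer are omitted, the window is edge-padded at image borders, and the input `field` is assumed to hold, for each target pixel, the (row, col) of its corresponding reference pixel with -1 marking missing correspondences. The function name `local_window_attention` is hypothetical.

```python
import numpy as np

def local_window_attention(ref_feat, field, k=3):
    """Aggregate reference features in a k x k window around each correspondence.

    ref_feat: (H, W, C) reference-view feature map
    field:    (Ht, Wt, 2) reference (row, col) per target pixel, -1 if none
    Returns (Ht, Wt, C) aggregated features and a (Ht, Wt) validity mask.
    """
    H, W, C = ref_feat.shape
    r = k // 2
    valid = field[..., 0] >= 0
    rc = np.clip(field[..., 0], 0, H - 1)
    cc = np.clip(field[..., 1], 0, W - 1)

    # Proxy query: the clean reference feature at the corresponding location,
    # used in place of the feature from the corrupted rendering.
    q = ref_feat[rc, cc]                                       # (Ht, Wt, C)

    # Keys/values: the k x k neighbourhood around the same location.
    padded = np.pad(ref_feat, ((r, r), (r, r), (0, 0)), mode="edge")
    win = np.stack(
        [padded[rc + dr, cc + dc] for dr in range(k) for dc in range(k)],
        axis=2,
    )                                                          # (Ht, Wt, k*k, C)

    # Scaled dot-product attention restricted to the window.
    logits = np.einsum("hwc,hwnc->hwn", q, win) / np.sqrt(C)
    logits = logits - logits.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = np.einsum("hwn,hwnc->hwc", weights, win)

    out[~valid] = 0.0   # no correspondence: leave empty for a fallback path
    return out, valid
```

Each target pixel attends to only $k^2$ candidates instead of every reference location, which is what limits the opportunity for spurious matches.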

The system deploys a learnable blending gate to combine geometry-guided and global attention results adaptively at each pixel, balancing local geometric evidence and global semantic completion.
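One plausible form of such a gate is a per-pixel sigmoid weight that forms a convex combination of the two attention outputs. The sketch below assumes exactly that; the summary does not specify the gate's functional form, and the function name and the `gate_logits` input (which in a trained model would come from a learned layer) are hypothetical.

```python
import numpy as np

def blend_attention(geo_out, global_out, gate_logits):
    """Per-pixel convex blend of geometry-guided and global attention outputs.

    geo_out, global_out: (H, W, C) feature maps from the two attention paths
    gate_logits:         (H, W, 1) unnormalized gate values (assumed learned)
    """
    gate = 1.0 / (1.0 + np.exp(-gate_logits))   # sigmoid, values in (0, 1)
    # gate near 1 trusts local geometric evidence; near 0 trusts global context.
    return gate * geo_out + (1.0 - gate) * global_out
```

With this form, pixels that lack a reliable correspondence can be handled by driving the gate toward zero, so that the global path fills them in.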

Figure 2: Qualitative comparisons on artifact removal. Compared against DIFIX3D+ and ground truth, GeoQuery recovers fine structure more faithfully.

Query Contamination and Region-Level Analysis

The study provides a quantitative and qualitative dissection of query contamination effects. Standard diffusion-based refinement methods employing multi-view self-attention are shown to misaggregate features in regions afflicted by artifacts, often propagating errors into otherwise reliable regions. GeoQuery measurably reduces this degradation: a region-based PSNR analysis shows a 4 dB improvement over DIFIX3D+ in high-error regions.
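A region-based PSNR of this kind can be computed by restricting the mean squared error to a mask. The sketch below is illustrative only. It assumes images scaled to [0, 1] and defines "high-error regions" as the pixels where the unrefined rendering deviates most from ground truth (a quantile threshold); the paper's actual region definition is not given in this summary, and both function names are hypothetical.

```python
import numpy as np

def high_error_mask(render, gt, quantile=0.8):
    """Boolean (H, W) mask of pixels whose absolute error is in the top tail."""
    err = np.abs(render - gt)
    if err.ndim == 3:                 # average over colour channels
        err = err.mean(axis=-1)
    return err >= np.quantile(err, quantile)

def masked_psnr(pred, gt, mask, max_val=1.0):
    """PSNR in dB computed only over pixels where mask is True."""
    if not mask.any():
        raise ValueError("mask selects no pixels")
    sq = (pred - gt) ** 2
    if sq.ndim == 3:
        sq = sq.mean(axis=-1)
    mse = sq[mask].mean()
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Comparing `masked_psnr` for two refinement methods on the same mask isolates how each behaves where the input rendering was most corrupted, rather than averaging that behaviour away over the whole image.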

Figure 3: Additional qualitative comparisons further demonstrate the robustness of GeoQuery for artifact removal across diverse scenarios.

Experimental Results

Artifact Removal

GeoQuery is validated on the DL3DV-Benchmark, where it achieves the top performance across all major metrics (PSNR, SSIM, LPIPS, FID) for rendering artifact removal. Specifically, GeoQuery improves PSNR by 1.09 dB and reduces FID by 2.63 compared to DIFIX3D+ under the same conditions.

Sparse-View Novel View Synthesis

On both DL3DV-Benchmark and Mip-NeRF360, GeoQuery consistently surpasses both regularization-based and prior diffusion-based methods (e.g., FSGS, DIFIX3D+) at 3, 6, and 9 input views. Its advantage is most pronounced under extreme sparsity; in the 3-view setting, it outperforms DIFIX3D+ by 0.92 dB PSNR on Mip-NeRF360 and by 0.78 dB on DL3DV-Benchmark.

Figure 4: Same-scene comparison under varying input views, highlighting accuracy and stability of GeoQuery as view sparsity increases.

Robustness, Graceful Degradation, and Ablation Studies

GeoQuery demonstrates not only higher average reconstruction fidelity but also graceful degradation as the number of input views decreases. Ablation studies attribute the main performance gains to geometry-indexed proxy queries and localized attention: substituting rendering-derived queries with geometry-derived proxies increases PSNR and further reduces artifact-induced errors. Tuning the window size of the local attention module shows that $k=3$ gives the best tradeoff between locality and retrieval ambiguity.

Figure 5: Extended qualitative results showing the generalization of GeoQuery improvements across the large-scale Mip-NeRF360 dataset.

Figure 6: Extended qualitative results on the DL3DV-Benchmark, emphasizing consistent artifact mitigation and detail preservation.

Figure 7: Summary visual comparisons across both 360-degree and indoor scenes, where GeoQuery produces semantically plausible and geometrically robust renderings beyond current baselines.

Limitations and Implications

While GeoQuery leverages explicit geometric correspondences to bypass query contamination, it inherits failure modes of correspondence estimation—specifically, unreliable operation in textureless or highly specular regions, or in scenarios with extreme viewpoint displacement where depth-based correspondences cannot be reliably computed. In such cases, the framework must revert to generative completion, potentially sacrificing geometric consistency.

Despite these limitations, explicit geometry-guided retrieval substantially improves the robustness of sparse-view reconstruction pipelines. The practical implications include reliable 3D reconstruction from minimal input, improved rendering consistency for downstream vision and graphics tasks, and a pathway toward integrating more powerful priors as diffusion backbones evolve.

Conclusion

GeoQuery establishes that integrating explicit geometric guidance into the cross-view diffusion framework resolves a longstanding bottleneck in sparse-view 3D reconstruction: the propagation of artifacts due to query contamination in self-attention mechanisms. By employing proxy queries grounded in geometric correspondences and restricting attention to spatially localized windows, the method consistently yields superior artifact removal and robust novel view synthesis, particularly under extreme data sparsity. Its architectural modularity enables seamless adoption within existing diffusion-based pipelines. Future work may pair this geometry-guided approach with more advanced diffusion priors to further mitigate dependency on correspondence quality and extend applicability to more challenging regimes.