
Geometry-Guided Inpainting Module

Updated 2 December 2025
  • Geometry-guided inpainting modules are architectural components that inject geometric information such as depth maps, surface normals, and camera poses into the inpainting process to enforce structural consistency and sharpen object boundaries.
  • They rely on mechanisms such as attention-based feature propagation and depth-image warping to achieve high-fidelity completions and reliable cross-view consistency, as reflected in improved PSNR, SSIM, and LPIPS scores.
  • By segmenting the workflow into geometry extraction, propagation, and guided synthesis, these modules enable robust editing, object removal, and multi-view inpainting in complex and ill-posed scenes.

A geometry-guided inpainting module refers to any architectural or algorithmic component that injects explicit or implicit geometric structure—such as scene depth, surface normals, 3D keypoints, camera poses, or segmentation boundaries—into the inpainting process. This guidance enforces structural and cross-view consistency, sharpens object boundaries, and enables semantically and geometrically plausible completion in ill-posed regions. Geometry guidance spans a spectrum: from low-level features (depth maps, edges) to high-level constructs (semantic masks, 3D scene representations), and appears in both 2D and 3D inpainting frameworks.

1. Geometry-Guided Inpainting: Architectural Principles

Geometry-guided inpainting modules universally target the core challenge of geometric consistency—especially under large masks, multi-view scenarios, or semantic edits. Architectures are typically modular, segmenting the workflow into geometry extraction, propagation, and geometry-guided synthesis or optimization.
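
As a minimal structural sketch, this decomposition can be expressed as three pluggable stages; the names below (GeometryCues, extract_geometry, propagate, synthesize) are illustrative placeholders rather than APIs from any cited system:

```python
from dataclasses import dataclass
from typing import Callable, Optional

import numpy as np


@dataclass
class GeometryCues:
    """Geometric evidence handed from the extraction stage to the later stages."""
    depth: np.ndarray              # (H, W) per-pixel depth
    normals: Optional[np.ndarray]  # optional (H, W, 3) surface normals
    pose: np.ndarray               # (4, 4) camera-to-world matrix


def geometry_guided_inpaint(
    image: np.ndarray,
    mask: np.ndarray,
    extract_geometry: Callable[[np.ndarray], GeometryCues],
    propagate: Callable[[np.ndarray, np.ndarray, GeometryCues], np.ndarray],
    synthesize: Callable[[np.ndarray, np.ndarray, np.ndarray, GeometryCues], np.ndarray],
) -> np.ndarray:
    """Skeleton of the extract -> propagate -> synthesize workflow."""
    cues = extract_geometry(image)               # 1. geometry extraction (e.g. monocular depth)
    hints = propagate(image, mask, cues)         # 2. propagate known content toward the hole
    return synthesize(image, mask, hints, cues)  # 3. geometry-conditioned synthesis / optimization
```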

In DiGA3D (Pan et al., 1 Jul 2025), geometry guidance begins with appearance and geometry representations derived from rendered views and monocular depth estimation. A two-stage process propagates information:

  • Coarse Stage: Diffusion U-Net latent features from selected reference views are propagated across views using an attention-based mechanism (Attention Feature Propagation, AFP), enforcing coherent embeddings.
  • Fine Stage: Depth-image-based warping conveys explicit texture and depth cues from reference to target, supporting multi-control conditioning (e.g., via Texture-Geometry Score Distillation Sampling, TG-SDS) when optimizing the underlying 3D Gaussian cloud.

Elsewhere, approaches such as VEIGAR (Do et al., 13 Jun 2025) and GeoFill (Zhao et al., 2022) exploit pixel-space geometric priors, reprojecting anchor-view inpainted content into other views via refined depth and camera pose for direct input to inpainting backbones, enforcing sub-pixel alignment.
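
The reprojection step these methods share reduces to back-projecting reference pixels with their depth, transforming them by the relative camera pose, and re-projecting them into the target view. The sketch below is a generic depth-image-based forward warp under a pinhole model, not the exact VEIGAR or GeoFill implementation; shared intrinsics K and a known 4x4 relative pose are assumed:

```python
import numpy as np


def warp_reference_to_target(ref_rgb, ref_depth, K, T_ref_to_tgt):
    """Forward-warp an inpainted reference view into a target view.

    ref_rgb:      (H, W, 3) reference image
    ref_depth:    (H, W) metric depth of the reference view
    K:            (3, 3) pinhole intrinsics (assumed shared by both views)
    T_ref_to_tgt: (4, 4) pose mapping reference-camera points into the target camera frame
    Returns the warped image and a validity mask for pixels that received content.
    """
    H, W = ref_depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project reference pixels to 3D points in the reference camera frame.
    pts_ref = (np.linalg.inv(K) @ pix.T) * ref_depth.reshape(1, -1)

    # Move the points into the target camera frame.
    pts_h = np.vstack([pts_ref, np.ones((1, pts_ref.shape[1]))])
    pts_tgt = (T_ref_to_tgt @ pts_h)[:3]

    # Project into the target image plane.
    proj = K @ pts_tgt
    z = proj[2]
    u = np.round(proj[0] / np.maximum(z, 1e-8)).astype(int)
    v = np.round(proj[1] / np.maximum(z, 1e-8)).astype(int)

    warped = np.zeros_like(ref_rgb)
    zbuf = np.full((H, W), np.inf)
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    # Nearest-pixel splatting with a z-buffer to resolve occlusions.
    src_colors = ref_rgb.reshape(-1, 3)
    for i in np.flatnonzero(valid):
        if z[i] < zbuf[v[i], u[i]]:
            zbuf[v[i], u[i]] = z[i]
            warped[v[i], u[i]] = src_colors[i]

    return warped, np.isfinite(zbuf)
```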

In the context of stereo or video inpainting (IGGNet (Li et al., 2022), XR face video (Lohesara et al., 17 Aug 2025)), geometry guidance is realized through epipolar-aware attention modules or dense semantic keypoints, enabling models to propagate and preserve structure even when portions of both views/frames are missing.

2. Mathematical Formulations and Training Losses

Geometry-guided modules commonly employ mathematically explicit losses that couple appearance and geometric consistency.

In DiGA3D (Pan et al., 1 Jul 2025), the TG-SDS loss builds upon the classical Score Distillation Sampling (SDS) gradient:

$$
\nabla_\theta \mathcal{L}_{\mathrm{TG\text{-}SDS}} = \mathbb{E}_{t,\epsilon}\!\left[w(t)\,\bigl(\epsilon_\phi(I^i_t;\, m_i, y, t,\, C'_i,\, D'_i)-\epsilon\bigr)\,\frac{\partial I_i}{\partial\theta}\right].
$$

This incorporates masked renders, warped textures $C'_i$, and warped depths $D'_i$ as multi-control conditional cues in diffusion model learning.
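
In practice, an SDS-style update of this form is typically implemented by noising the rendered view, querying the conditional noise predictor, and pushing the (detached) residual back through the differentiable renderer. The following is a generic PyTorch sketch under that reading, not DiGA3D's code: eps_phi stands in for the multi-control U-Net, alphas_cumprod for the diffusion schedule, and w(t) = 1 - alpha_bar_t is an assumed weighting choice:

```python
import torch


def tg_sds_grad_step(render, mask, text_emb, warped_tex, warped_depth,
                     eps_phi, alphas_cumprod, t):
    """One SDS-style gradient step on a rendered view `render` (requires_grad=True).

    eps_phi is assumed to be a conditional noise predictor with signature
    eps_phi(noisy_image, mask, text_emb, t, warped_tex, warped_depth) -> predicted noise.
    """
    a_t = alphas_cumprod[t]
    noise = torch.randn_like(render)
    noisy = a_t.sqrt() * render + (1.0 - a_t).sqrt() * noise   # forward diffusion to level t

    with torch.no_grad():
        eps_pred = eps_phi(noisy, mask, text_emb, t, warped_tex, warped_depth)

    w_t = 1.0 - a_t                                            # assumed weighting w(t)
    grad = w_t * (eps_pred - noise)                            # (eps_phi - eps), detached from the U-Net

    # Backpropagate only through the renderer (d render / d theta), i.e. the SDS update rule.
    render.backward(gradient=grad)
```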

VEIGAR (Do et al., 13 Jun 2025) introduces a scale-invariant depth loss:

$$
\mathcal{L}_{\mathrm{SI}} = \frac{1}{n}\sum_{i=1}^{n}\bigl(\log D^{\mathrm{render}}_i - \log D^{\mathrm{predict}}_i\bigr)^2 - \frac{1}{n^2}\Bigl(\sum_{i=1}^{n}\bigl(\log D^{\mathrm{render}}_i - \log D^{\mathrm{predict}}_i\bigr)\Bigr)^2,
$$

which obviates per-view scale-and-shift calibration.
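
As written, the loss is a direct log-space computation; a minimal PyTorch transcription (assuming strictly positive depths over the valid region) is:

```python
import torch


def scale_invariant_depth_loss(d_render: torch.Tensor, d_predict: torch.Tensor) -> torch.Tensor:
    """Scale-invariant depth loss over the flattened valid pixels (both depths > 0)."""
    diff = torch.log(d_render) - torch.log(d_predict)
    n = diff.numel()
    return (diff ** 2).sum() / n - diff.sum() ** 2 / (n ** 2)
```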

SplatFill (Dahaghin et al., 9 Sep 2025) employs scale-free depth bin losses and object-aware contrastive objectives to tie 3D splat placement to monocular geometry and semantic boundaries.

Stereo video modules like IGGNet (Li et al., 2022) leverage geometry-aware cost volumes and soft attention along epipolar lines, in conjunction with $L_1$ and GAN objectives, to regularize pixel-level correspondence and global appearance.

3. Cross-View and Multi-Modal Propagation Mechanisms

Propagation of geometric information across views or modalities is achieved through:

  • Attention Feature Propagation (AFP): Cross-attention fuses latent features from reference views into all targets, balancing reliance on reference versus self-view (DiGA3D (Pan et al., 1 Jul 2025)); a minimal sketch of this pattern appears after this list.
  • Epipolar/Disparity-Aware Attention: IGGNet (Li et al., 2022) builds cost volumes along epipolar lines for stereo, avoiding explicit disparity calculation and instead learning soft correspondences.
  • Pixel-Space Warping/Reprojection: VEIGAR (Do et al., 13 Jun 2025), GeoFill (Zhao et al., 2022), and TransFill (Zhou et al., 2021) use depth prediction and camera pose to map pixels/features from source/anchor to target, with downstream learning refining color, shape, and alignment.
  • Warped Feature Conditioning: After warping the inpainted reference and its depth, the resulting features are concatenated with image and mask inputs for the inpainting network (e.g., multi-control diffusion in DiGA3D).
  • Iterative/Alternating Guidance: Stereo/temporal consistency is improved by alternately updating views or frames, shrinking holes, and propagating feature guidance (IGGNet ICG, (Li et al., 2022)).
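
The cross-attention propagation named in the first bullet can be reduced, for illustration, to a single-head, projection-free form. AFP in DiGA3D is a learned, multi-head mechanism inside the diffusion U-Net, so the following only schematizes how reference tokens steer target tokens, with `blend` as a hypothetical mixing weight:

```python
import torch


def propagate_reference_features(target_feat: torch.Tensor,
                                 ref_feat: torch.Tensor,
                                 blend: float = 0.5) -> torch.Tensor:
    """Single-head cross-attention from reference-view tokens into target-view tokens.

    target_feat: (N_t, C) flattened latent tokens of the target view (queries)
    ref_feat:    (N_r, C) flattened latent tokens of a reference view (keys/values)
    blend:       how strongly to lean on the reference versus the self-view
    """
    scale = target_feat.shape[-1] ** 0.5
    attn = torch.softmax(target_feat @ ref_feat.T / scale, dim=-1)  # (N_t, N_r) correspondence weights
    from_ref = attn @ ref_feat                                      # reference-aligned features
    return blend * from_ref + (1.0 - blend) * target_feat           # convex mix with the self-view
```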

4. Reference View/Region Selection and Conditioning

Reference selection is critical for effective geometric propagation and minimizing long-range errors:

  • Clustering in Pose Space: In DiGA3D (Pan et al., 1 Jul 2025), input camera centers are clustered (K-means) and cluster centroids select reference views, ensuring spatial coverage of the scene and minimizing propagation distance (see the sketch after this list).
  • Density and Distribution: K (number of references) is a tunable parameter; increased K generally improves PSNR and LPIPS, as shown in empirical ablations, but incurs additional computational cost.
  • Mask Extraction: For 3D inpainting, test-time mask extraction via geometry-aware object-mask dilation and scene-aware masking (using learned Gaussian opacity) identifies inpainting targets while avoiding over-extension into seen regions (IMFine (Shi et al., 6 Mar 2025)).
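
A pose-space clustering selector in the spirit of the first bullet can be sketched as follows; scikit-learn's KMeans stands in for whatever clustering the papers actually use, and picking the training view nearest each centroid is one reasonable way to tie centroids back to real cameras:

```python
import numpy as np
from sklearn.cluster import KMeans


def select_reference_views(camera_centers: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Pick k reference views by clustering camera centers in pose space.

    camera_centers: (N, 3) world-space positions of all training cameras.
    Returns indices of the views closest to the k cluster centroids.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(camera_centers)
    refs = {int(np.argmin(np.linalg.norm(camera_centers - c, axis=1)))
            for c in km.cluster_centers_}
    return np.array(sorted(refs))
```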

5. Experimental Assessment and Ablation Studies

Geometry-guided modules demonstrate measurable gains in fidelity, consistency, and aesthetic quality. Representative results include:

| Component    | PSNR (dB) | SSIM | LPIPS |
|--------------|-----------|------|-------|
| Baseline     | 20.45     | 0.57 | 0.30  |
| +AFP         | 20.66     | 0.57 | 0.29  |
| +AFP+TG-SDS  | 20.71     | 0.58 | 0.28  |

(From DiGA3D (Pan et al., 1 Jul 2025); full 3DGS pipeline, SPIn-NeRF object-removal benchmark.)

Additional findings:

  • Geometry guidance (contours, semantic masks): Increases PSNR by 0.5 dB and SSIM by >0.002 in 2D settings (Xiong et al., 2019).
  • Pixel-space alignment via geometric priors: Substantially improves full-image FID and patch-level LPIPS for high-resolution photos (SuperCAF; Zhang et al., 2022).
  • Multi-view geometric conditioning: Improves perceptual quality and correspondence consistency (MUSIQ, Corrs) relative to view-independent baselines (Pan et al., 1 Jul 2025, Salimi et al., 18 Feb 2025).

Ablation studies consistently show that disabling geometric warping or cross-attention mechanisms degrades both global consistency and local detail, often causing view-dependent artifacts or blurring.

6. Methodological Variants and Applications

Geometry-guided inpainting modules have been deployed across a wide range of tasks and input modalities, including 2D image editing and object removal, stereo and video inpainting, multi-view 3D scene inpainting over NeRF and 3D Gaussian Splatting representations, and XR face video restoration, as discussed in the preceding sections.

7. Limitations and Future Directions

Limitations of geometry-guided modules are often dictated by the quality and reliability of upstream geometric cues (depth, pose). Failure modes arise with highly inaccurate depth predictors, insufficient reference coverage, or propagation of geometric artifacts through test-time adaptation. Some pipelines necessitate per-scene adaptation (e.g., IMFine (Shi et al., 6 Mar 2025)), which impacts throughput and scalability. There can also be a substantial computational and memory overhead from geometric warping and multi-branch design.

Areas of ongoing or future research include robust self-supervised or multi-modal geometric estimation, improved confidence and risk modeling when fusing geometry cues, further acceleration, and extension to open-domain non-Lambertian scenes or scenes with severe occlusions and highly dynamic content.


Geometry-guided inpainting modules, regardless of their specific architectural instantiation, share the common goal of structurally regularizing the solution space by enforcing consistency with estimated or known geometry, yielding state-of-the-art results in inpainting across 2D/3D, single/multi-view, and high-fidelity domains (Pan et al., 1 Jul 2025, Do et al., 13 Jun 2025, Salimi et al., 18 Feb 2025, Zhao et al., 2022, Li et al., 2022, Zhang et al., 2022, Dahaghin et al., 9 Sep 2025, Shi et al., 6 Mar 2025, Hocking et al., 2016, Xiong et al., 2019).
