View-Conditional Inpainting Model

Updated 23 February 2026

View-conditional inpainting models are generative approaches that complete missing regions in images by leveraging multi-view cues to maintain both local plausibility and 3D geometric consistency.
They integrate methodologies like diffusion-based joint optimization, reference-guided alignment, and latent-space fusion to accurately merge information across multiple viewpoints.
Applications span 3D scene editing and novel view synthesis, although challenges remain in computational cost, depth estimation, and handling complex scene dynamics.

A view-conditional inpainting model is a class of generative methods designed for filling missing or masked regions in images or scenes, such that the completed content is not only locally plausible but also consistent across multiple viewpoints. Unlike single-view inpainting, view-conditional inpainting exploits geometric and appearance cues from multiple or reference views and explicitly addresses cross-view consistency, geometry alignment, and often full 3D structure, enabling applications in multi-view editing, novel view synthesis, and unconstrained scene completion. A broad range of recent work, including NeRF- and 3DGS-based frameworks, conditional diffusion models, video-based 3D object filling, and geometry-guided refinement approaches, reflects a rapidly advancing field that leverages advances in generative modeling, radiance field representations, and multi-view geometry.

1. Key Concepts and Challenges in View-Conditional Inpainting

View-conditional inpainting seeks to generate completions that are not only plausible for each isolated frame but that also maintain consistency across all views of a static or dynamic scene. Fundamental challenges addressed by such models include:

View Consistency: Ensuring that the inpainted content describes the same 3D structure or surface appearance when seen from different viewpoints, avoiding view-specific hallucinations or “drift.”
Geometry Alignment: Maintaining correspondence between inpainted RGB content and scene geometry (e.g., depth or normals), so that compositing results in realistic, distortion-free 3D reconstructions.
Cross-view Information Fusion: Effectively aggregating cues from visible (unmasked) regions, reference frames, or computed geometry to achieve seamless completions even under sparse or wide-baseline observation conditions.
Handling Wide-baseline and Pose Ambiguity: Addressing the increased ambiguity and potential for geometric inconsistency as the separation or motion between views increases, often under incomplete supervisory signals.

Early approaches that inpaint each view independently in 2D (e.g., LaMa, Stable Diffusion Inpaint) break down in multi-view settings: because each view is completed without reference to the others, the inpainted regions can differ drastically in appearance and align poorly with scene geometry across views. This motivates frameworks with explicit view-conditioning or multi-view optimization (Chen et al., 2024, Zhao et al., 2022, Cao et al., 2024, Salimi et al., 18 Feb 2025).

2. Methodological Frameworks for View-Conditional Inpainting

A diverse set of frameworks has been developed, often combining geometric modeling, cross-view alignment, and generative priors. Notable methodologies include:

Multi-view Joint Optimization under Diffusion Priors: As in MVIP-NeRF, a NeRF’s parameters are optimized directly by aligning synthetic renderings with the noise-prediction scores of a diffusion model, using Score Distillation Sampling (SDS) computed jointly across multiple views to enforce cross-view appearance and geometry consistency (Chen et al., 2024).
Reference-guided Alignment and Inpainting: Models such as 3DFill first align reference and target views through self-supervised 3D projection and 2D local warping, then use conditional generative networks to inpaint while leveraging aligned content, substantially boosting performance for large holes and strong viewpoint changes (Zhao et al., 2022).
Latent-space Fusion and Geometry-aware Diffusion: Geometry-aware diffusion models (Salimi et al., 18 Feb 2025) avoid blending artifacts by fusing multi-view information in latent noise space, using per-view geometric cues (e.g., projections via reference depth), and learning attention-based aggregation to maintain sharpness and view-consistency.
Video Diffusion for Multi-view Consistency: ObjFiller-3D transforms multi-view inpainting into a video denoising and completion task, using adapted video diffusion models to inpaint a sequence of projected views treated as video frames, and leveraging flow matching losses for temporal (and thus cross-view) coherence (Feng et al., 25 Aug 2025).
Pixel-space Projection with Lightweight Geometric Alignment: VEIGAR removes expensive initial 3D reconstructions, instead projecting from an anchor view into all others using depth and camera estimation, performing explicit 2D inpainting, and then refining via a geometry-aligned 3D Gaussian Splatting stage (Do et al., 13 Jun 2025).
Test-time Geometry-guided Refinement and Fusion: IMFine employs geometry prior extraction from pruned 3DGS scenes, warps inpainted reference images to all target views, and performs multi-view diffusion-based refinement with space-time attention, adapting the refinement network at test-time for each scene to boost coherence across even unconstrained camera trajectories (Shi et al., 6 Mar 2025).
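Several of the frameworks above rely on projecting content from an anchor or reference view into other views using depth and camera estimates. The following is a minimal NumPy sketch of that reprojection step, not the implementation of any cited method. It assumes a pinhole camera with intrinsics `K` shared by both views and a known rigid transform `T_anchor_to_target`; the function name and all parameters are hypothetical.

```python
import numpy as np

def reproject(depth, K, T_anchor_to_target):
    """Map every anchor-view pixel to target-view pixel coordinates.

    depth: (H, W) depth along the anchor camera's z-axis.
    K: (3, 3) pinhole intrinsics, assumed identical for both views.
    T_anchor_to_target: (4, 4) rigid transform from anchor to target camera frame.
    Returns (H, W, 2) target pixel coordinates (u, v) and (H, W) target depth.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel grids, each (H, W)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project: X = depth * K^{-1} [u, v, 1]^T
    rays = pix @ np.linalg.inv(K).T
    pts_anchor = rays * depth[..., None]
    # Move the 3D points into the target camera frame
    R, t = T_anchor_to_target[:3, :3], T_anchor_to_target[:3, 3]
    pts_target = pts_anchor @ R.T + t
    # Project with the same intrinsics and divide by depth
    proj = pts_target @ K.T
    z = proj[..., 2]
    uv = proj[..., :2] / z[..., None]
    return uv, z

# Toy example: a fronto-parallel plane at depth 2 and a 0.1-unit shift along x.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 2.0)
T = np.eye(4)
T[0, 3] = 0.1
uv, z = reproject(depth, K, T)
```

In this toy setup every pixel moves by fx * tx / depth = 100 * 0.1 / 2 = 5 pixels horizontally. A practical pipeline would additionally handle occlusion, out-of-bounds coordinates, and resampling of colors at the non-integer target locations.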

3. Mathematical Formulation and Objective Functions

View-conditional inpainting models formulate learning objectives that explicitly couple multi-view appearance and geometry. Common formulations include:

Score Distillation Sampling (SDS) Loss:

$\mathcal{L}_{\mathrm{SDS}}(\theta) = \mathbb{E}_{t,\,\epsilon}\left[ w(t)\,\big\|\epsilon_\phi^\omega(x_t;y,t) - \epsilon\big\|_2^2 \right]$

where $x_t$ are noised renderings, $\epsilon_\phi^\omega$ is the diffusion model’s predicted noise, and $w(t)$ weights each noise level $t$ (Chen et al., 2024).
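
For orientation, the gradient that SDS-based methods typically backpropagate to the scene parameters can be sketched as follows. This is the commonly used form under the assumption that the Jacobian of the diffusion network is omitted and that $x$ denotes the clean rendering before noising; it is a sketch, not an equation quoted from the cited paper.

$\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta) \approx \mathbb{E}_{t,\,\epsilon}\left[ w(t)\,\big(\epsilon_\phi^\omega(x_t;y,t) - \epsilon\big)\, \frac{\partial x}{\partial \theta} \right]$

The multi-view and normal-map variants that follow have the same structure, with a per-view index or a different rendered quantity in place of $x$.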

Multi-view SDS Aggregation:

$\nabla_\theta \mathcal{L}_\text{multi-view} = \sum_{i=1}^N\, \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\epsilon_\phi^\omega(z_t^i; m^i, y, t) - \epsilon^i\big)\, \frac{\partial z^i}{\partial x^i}\, \frac{\partial x^i}{\partial \theta} \right]$

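Here $x^i$ is the rendering of view $i$ and $z^i$ its latent encoding, as the chain-rule factors indicate. Because every term differentiates with respect to the same parameters $\theta$, summing over the $N$ views couples them through shared gradients.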
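
The aggregation above can be illustrated with a toy example. This is a sketch under strong assumptions: the "renderer" of each view is a fixed linear map `A` (so that the Jacobian of the rendering is simply `A`), there is no latent encoder, and `toy_eps_pred` is a hypothetical stand-in for the diffusion noise predictor that is exact for a prior concentrated on a single target image. None of these names come from the cited work.

```python
import numpy as np

def toy_eps_pred(x_t, alpha_bar, target):
    """Hypothetical noise predictor: exact for a prior concentrated at `target`."""
    return (x_t - np.sqrt(alpha_bar) * target) / np.sqrt(1.0 - alpha_bar)

def multiview_sds_grad(theta, renderers, targets, rng, w=1.0):
    """Sum SDS-style gradients from all views onto the shared parameters."""
    grad = np.zeros_like(theta)
    for A, target in zip(renderers, targets):
        x = A @ theta                              # rendering of this view
        alpha_bar = rng.uniform(0.1, 0.9)          # random noise level
        eps = rng.standard_normal(x.shape)         # injected noise
        x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
        residual = toy_eps_pred(x_t, alpha_bar, target) - eps
        grad += w * (A.T @ residual)               # chain rule: dx/dtheta = A
    return grad

# Two views that each observe three of four scene parameters; the middle two
# parameters are seen by both views and therefore receive gradients from both.
rng = np.random.default_rng(0)
theta_star = np.array([1.0, 2.0, 3.0, 4.0])
renderers = [np.eye(4)[:3], np.eye(4)[1:]]
targets = [A @ theta_star for A in renderers]
theta = np.zeros(4)
for _ in range(300):
    theta -= 0.1 * multiview_sds_grad(theta, renderers, targets, rng)
```

Because both views write into the same `grad` vector, parameters visible from several views are pulled toward a solution that all of those views agree on, which is the coupling the summation expresses.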

Normal SDS Loss:

$\nabla_\theta \mathcal{L}_\mathrm{normal} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\epsilon_\phi^\omega(z_t;m,y,t) - \epsilon\big)\, \frac{\partial z}{\partial n}\, \frac{\partial n}{\partial \theta} \right]$

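Here $n$ denotes the rendered normal map, so the diffusion prior is applied to geometry as well as appearance, which promotes normal-map consistency (Chen et al., 2024).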

Reference-based Flow Matching Loss:

$\mathcal{L}_\mathrm{FM}(\theta) = \mathbb{E}_{t,\,x\sim p_t}\left\| u_\theta(t,x|\text{cond}) - u^*(t,x) \right\|_2^2$

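where $u_\theta$ is the learned velocity field, $u^*$ is the target velocity, and $p_t$ is the sample distribution at time $t$. This objective is used for the continuous velocity fields of video inpainting models (Feng et al., 25 Aug 2025).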
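
A Monte Carlo estimate of a flow matching loss can be sketched as below. The probability path is an assumption: the sketch uses a straight-line interpolation between a noise sample `x0` and a data sample `x1`, for which the per-sample target velocity is `x1 - x0`. The source does not specify the path used by the cited model, and `u_theta` here is any hypothetical callable with signature `(t, x, cond)`.

```python
import numpy as np

def flow_matching_loss(u_theta, x0, x1, cond, rng):
    """Estimate the flow matching loss for a linear path x_t = (1 - t) x0 + t x1.

    u_theta: callable (t, x_t, cond) -> predicted velocity, same shape as x_t.
    x0, x1: (B, D) noise samples and data samples (e.g. flattened frames).
    """
    t = rng.uniform(size=(x0.shape[0], 1))       # one time per sample
    x_t = (1.0 - t) * x0 + t * x1                # point on the straight path
    u_star = x1 - x0                             # target velocity for this path
    pred = u_theta(t, x_t, cond)
    return float(np.mean(np.sum((pred - u_star) ** 2, axis=1)))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 3))
x1 = rng.standard_normal((8, 3))
zero_field = lambda t, x, cond: np.zeros_like(x)  # trivial baseline predictor
loss_zero = flow_matching_loss(zero_field, x0, x1, None, rng)
```

In a multi-view setting, `cond` would carry the masked frames and any reference views, and each row of `x1` would be a full frame sequence rather than a three-dimensional toy vector.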

Latent-space Fusion and Confidence Mask Supervision: Geometry-aware methods employ hierarchical confidence masks—for front-face, back-face, shadows—combining multiple diffusion streams and supervising against reference-derived ground truths (Salimi et al., 18 Feb 2025).
Scale-invariant Depth and Photometric Losses: To enforce pixel-level and geometric consistency, explicit supervised losses over depth, normal, and color channels are also used, including scale-invariant depth penalties and perceptual scores such as LPIPS (Do et al., 13 Jun 2025, Seo et al., 11 Jul 2025).
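
As an illustration of a scale-invariant depth penalty, the sketch below implements one widely used log-space form from the monocular depth literature. It is an assumption that this variant resembles what the cited works use; they may define the penalty differently. The function name and the `lam` and `eps` parameters are hypothetical.

```python
import numpy as np

def scale_invariant_depth_loss(pred, gt, mask=None, lam=1.0, eps=1e-8):
    """Log-space depth error that ignores a global scale factor when lam = 1.

    pred, gt: positive depth maps of identical shape.
    mask: optional boolean array selecting the pixels to evaluate.
    lam: 1.0 gives full scale invariance, 0.0 gives a plain log-depth MSE.
    """
    if mask is None:
        mask = np.ones(pred.shape, dtype=bool)
    d = np.log(pred[mask] + eps) - np.log(gt[mask] + eps)
    # Mean squared log error minus the part explained by a global scale offset
    return float(np.mean(d ** 2) - lam * np.mean(d) ** 2)

gt = np.array([[1.0, 2.0], [4.0, 8.0]])
loss_scaled = scale_invariant_depth_loss(3.0 * gt, gt)     # differs only by scale
loss_shuffled = scale_invariant_depth_loss(gt[::-1], gt)   # wrong structure
```

A prediction that differs from the reference only by a constant factor yields a loss of approximately zero, which is useful when monocular depth estimates have unknown scale.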

4. Strategies for Cross-View Consistency

Achieving high-fidelity, cross-view consistent completions involves:

Joint Optimization Across Views: Enabling gradients from multiple views to jointly update the 3D latent or radiance field parameters reduces per-view drift and ensures every view “votes” on a shared solution (Chen et al., 2024, Salimi et al., 18 Feb 2025).
Differentiable 3D Warping: Aligning reference inpaints into target views using current geometry, and employing warping-based or flow-based blending, allows selective trust in regions with accurate geometry and appearance (Zhao et al., 2022, Seo et al., 11 Jul 2025, Shi et al., 6 Mar 2025).
Space-time and Cross-attention Mechanisms: Diffusion-based models often replace vanilla attention layers with block-sparse spatial and temporal attention, or introduce reference-key/value attention, enhancing cross-view matching and feature propagation (Cao et al., 2024, Shi et al., 6 Mar 2025, Feng et al., 25 Aug 2025).
Confidence-weighted or Poisson-blended Fusion: Regions of high geometric or appearance correspondence are weighted more heavily in model objectives and warping procedures, mitigating adverse effects from residual misalignment or artifacts (Seo et al., 11 Jul 2025, Salimi et al., 18 Feb 2025).
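
The confidence-weighted variant of fusion can be sketched as a per-pixel convex blend. This is a simplified illustration, not the procedure of any cited method: the confidence is derived here from relative depth agreement with a hypothetical temperature `tau`, and the blend is plain alpha compositing rather than Poisson blending.

```python
import numpy as np

def depth_confidence(warped_depth, target_depth, tau=0.05):
    """Confidence in (0, 1]: close to 1 where warped and target depth agree."""
    rel_err = np.abs(warped_depth - target_depth) / np.maximum(target_depth, 1e-8)
    return np.exp(-rel_err / tau)

def fuse(warped_ref, inpainted, confidence):
    """Blend two (H, W, C) images using an (H, W) confidence map."""
    c = np.clip(confidence, 0.0, 1.0)[..., None]
    # Trust the warped reference where confidence is high, the inpaint elsewhere
    return c * warped_ref + (1.0 - c) * inpainted

warped_ref = np.full((2, 2, 3), 0.8)
inpainted = np.full((2, 2, 3), 0.2)
target_depth = np.full((2, 2), 2.0)
warped_depth = np.array([[2.0, 2.0], [2.0, 6.0]])  # last pixel is misaligned
fused = fuse(warped_ref, inpainted, depth_confidence(warped_depth, target_depth))
```

Pixels whose warped depth matches the target view keep the reference content, while the misaligned pixel falls back almost entirely to the inpainted value.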

5. Empirical Evaluation and Results

Multiple works benchmark view-conditional inpainting using standard appearance- and geometry-oriented metrics over masked regions:

Appearance Metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), Fréchet Inception Distance (FID), and Learned Perceptual Image Patch Similarity (LPIPS) are widely adopted.
Geometry Metrics: L2 depth error, normal consistency, and feature-matching consistency (LoFTR) assess geometric fidelity.
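
Since these metrics are computed over masked regions, a restricted PSNR can be sketched as follows. The function name and the convention that `mask` is a boolean (H, W) array marking inpainted pixels are assumptions for illustration; individual papers may crop or weight the region differently.

```python
import numpy as np

def masked_psnr(pred, gt, mask, max_val=1.0):
    """PSNR in dB computed only over pixels where `mask` is True.

    pred, gt: (H, W, C) images with values in [0, max_val].
    mask: (H, W) boolean array marking the inpainted region.
    """
    m = np.broadcast_to(mask[..., None], pred.shape)  # expand mask over channels
    mse = np.mean((pred[m] - gt[m]) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

gt = np.full((4, 4, 3), 0.5)
pred = gt.copy()
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
pred[mask] += 0.1      # uniform error inside the hole
pred[0, 0] = 1.0       # error outside the hole, ignored by the metric
psnr = masked_psnr(pred, gt, mask)
```

With a uniform error of 0.1 inside the mask the mean squared error is 0.01, so the masked PSNR is 20 dB regardless of the error placed outside the mask.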

The cited works report that view-conditional inpainting models outperform sequential or independent 2D inpainting in both appearance and geometry metrics, especially under large spatial baselines or complex occlusions. Representative reported values are listed below:

Model	PSNR ↑	LPIPS ↓	Training Time ↓ (hrs unless noted)	FID ↓	Notes
MVIP-NeRF	–	0.181–0.507	–	–	depth error 1.499–0.021
3DFill	34.97 (sm)	0.0146 (sm)	–	–	5.5 fps
VEIGAR	16.385	0.524	0.42	–	–
ObjFiller-3D	26.6	0.19	~10 min	–	–
IMFine	19.67	0.2685	~1	149.52	–
RePaintGS	–	0.2431	–	70.73	–
Geometry-aware Diffusion	28.59	0.05	0.92–1.5	–	–

(“sm” marks results reported for the small-mask regime. Values are taken from the respective papers, each under its own reported regime; see (Chen et al., 2024, Zhao et al., 2022, Do et al., 13 Jun 2025, Feng et al., 25 Aug 2025, Shi et al., 6 Mar 2025, Seo et al., 11 Jul 2025, Salimi et al., 18 Feb 2025).)

Ablation studies show that eliminating joint optimization across views or geometry-aware cues sharply reduces performance, and that components such as reference-based attention, warping-based fusion, and geometry-guided refinement yield marked, quantifiable benefits.

6. Applications and Limitations

The principal application domains for view-conditional inpainting models are:

3D Scene and Object Editing: Robust removal, insertion, or synthesis of objects in 3D scenes, maintaining visual and geometric coherence from any viewpoint.
Novel View Synthesis and Free-viewpoint Video: Generating unseen perspectives of realistic scenes or edited content, supporting immersive media and AR/VR.
High-fidelity 3D Reconstruction: Bridging inpainting and radiance field–based multi-view reconstruction, especially in unconstrained, sparse, or incomplete observations.

Limitations and open challenges include:

Computation Cost: Many frameworks, especially those with scene adaptation or iterative optimization, remain slow, with fine-tuning or test-time adaptation required for best performance (Shi et al., 6 Mar 2025).
Dependency on Depth and Segmentation Quality: Failures in monocular depth estimation or mask segmentation propagate errors, limiting achievable fidelity in challenging scenes (Shi et al., 6 Mar 2025, Seo et al., 11 Jul 2025).
Handling of Non-Lambertian Effects or Dynamic Scenes: Most practical frameworks assume static, Lambertian surfaces, while extending to full BRDF modeling or temporally coherent inpainting remains an active area (Chen et al., 2024, Shi et al., 6 Mar 2025).

7. Future Directions and Research Opportunities

Ongoing and prospective research themes include:

Parameter-efficient Fine-tuning: Leveraging hypernetworks, adapter modules, or LoRA to reduce per-scene adaptation time and increase deployment scalability (Feng et al., 25 Aug 2025).
End-to-End Geometry-Inpainting Models: Tightening the integration between geometry estimation and inpainting, potentially via unified architectures that learn scene and completion jointly (Shi et al., 6 Mar 2025).
Dynamic and Relightable Scene Inpainting: Extending priors to non-static scenes, learned lighting and shadow priors, and spatio-temporal diffusion for dynamic content (Chen et al., 2024).
Explicit Handling of Pose-Free and Few-View Regimes: Further refinement of methods for sparse-view or pose-agnostic settings—essential for many real-world capture pipelines (Cao et al., 2024, Lu et al., 30 Oct 2025, Salimi et al., 18 Feb 2025).
Improved Confidence Estimation and Adaptive Fusion: Automatic modulation of attention and confidence in fusion steps (e.g., view-conditioned weighting) for increased robustness (Seo et al., 11 Jul 2025, Salimi et al., 18 Feb 2025).
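
To make the parameter-efficient fine-tuning theme concrete, the sketch below shows the basic low-rank adaptation (LoRA) idea on a single linear layer: the pretrained weight stays frozen and only two small factors are trained. This is a generic NumPy illustration with hypothetical names, not the adaptation scheme of any cited method.

```python
import numpy as np

class LoRALinear:
    """Linear layer y = x W^T plus a trainable low-rank update (alpha / r) * B A."""

    def __init__(self, W, rank, alpha, rng):
        self.W = W                                   # frozen weight, (d_out, d_in)
        d_out, d_in = W.shape
        self.A = rng.standard_normal((rank, d_in)) * 0.01  # trainable
        self.B = np.zeros((d_out, rank))             # trainable, zero at start
        self.scale = alpha / rank

    def __call__(self, x):
        # Frozen path plus low-rank path; the update is zero until B is trained
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
layer = LoRALinear(W, rank=4, alpha=8.0, rng=rng)
x = rng.standard_normal((2, 128))
y = layer(x)
```

With rank 4 the layer trains 4 * (128 + 64) = 768 values instead of the 8192 in the full weight matrix, which is why such adapters are attractive when a model must be adapted separately for every scene.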

The view-conditional inpainting paradigm continues to evolve beyond per-frame, per-view completion, focusing on holistic, globally consistent synthesis under diverse conditions and with explicit treatment of geometry and appearance priors. This marks a significant frontier in 3D computer vision and generative modeling.