
Inpaint360GS: 3D Scene Inpainting Framework

Updated 16 November 2025
  • Inpaint360GS is a 3D scene inpainting framework that uses 3D Gaussian Splatting to enable photorealistic, view-consistent editing of 360° unbounded scenes.
  • It employs object-aware 2D-to-3D mask association and virtual view synthesis to accurately remove multiple objects and handle never-before-seen regions.
  • Quantitative evaluations show state-of-the-art performance with improved PSNR, SSIM, and FID, outperforming prior 2D diffusion and GAN-based methods.

Inpaint360GS is a 3D scene inpainting framework that leverages 3D Gaussian Splatting (3DGS) as its core representation to enable efficient, object-aware, and photorealistic completion of 360° unbounded scenes following multi-object removal. In contrast to diffusion- or GAN-based generative 2D pipelines, Inpaint360GS directly manipulates the explicit Gaussian field, enabling full-resolution, view-consistent editing and scalable handling of large or severely occluded regions in complex multi-view, multi-object environments (Wang et al., 9 Nov 2025). The system introduces object association via 2D-to-3D mask consistency, virtual camera context for unseen region synthesis, and hybrid 3D/2D supervisory strategies, achieving state-of-the-art performance on novel 360° inpainting benchmarks.

1. Problem Setting and Challenges

Inpaint360GS addresses the problem of filling object removal holes in unbounded 360° scene reconstructions, where the scene is given as a set of posed images and object segmentation masks. The method's distinguishing challenge lies in:

  • 3D Object Identification: Unlike front-facing or single-object scenarios, 360° captures contain many objects with occlusions, ambiguous boundary projections, and no natural image plane for mask propagation. Cross-view inconsistent editing arises if per-view segmentations are naively lifted to 3D.
  • Handling Never-Before-Seen (NBS) Regions: Object removal exposes background or other objects that have never been observed, necessitating precise identification and plausible synthesis for inpainting.
  • Multi-object, Large-hole, and Cross-view Consistency: Inpainting must be geometrically and photometrically coherent under arbitrary viewpoint changes and object masking, with scalability to complex and large-scale scenes.

These constraints push beyond previously proposed 3DGS inpainting pipelines, which often assume single-object editing, forward-facing geometry, or rely on 2D diffusion/GAN priors (Huang et al., 2023, Wu et al., 7 Feb 2025, Dahaghin et al., 9 Sep 2025).

2. 3D Gaussian Splatting Scene Representation

The foundation of Inpaint360GS is the explicit 3D Gaussian Splatting (3DGS) representation:

  • Each scene is represented as a collection of $N$ Gaussians $G=\{g_i\}_{i=1}^N$, where
    • $p_i \in \mathbb{R}^3$: 3D center
    • $s_i \in \mathbb{R}^3$, $q_i \in \mathbb{R}^4$: scale and orientation (defining the covariance $\Sigma_i$)
    • $o_i \in \mathbb{R}$: opacity
    • $c_i$: color (SH coefficients)
  • Rendering is performed by projecting each Gaussian into image space, alpha-blending along the depth order:

$$C = \sum_{i \in \mathcal{N}} c_i\,\alpha_i\,T_i, \qquad T_i = \prod_{j<i} (1-\alpha_j)$$

This representation eschews implicit volumetric functions or view-dependent MLPs, allowing direct manipulation, fast editing, and efficient optimization during object-aware inpainting. It further enables the aggregation of multi-view 2D cues into a unified, spatially explicit field.
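
To make the compositing rule concrete, here is a minimal NumPy sketch of front-to-back alpha blending for a single pixel, assuming the per-Gaussian opacities $\alpha_i$ and colors $c_i$ have already been produced by projection and depth sorting (both are hypothetical inputs here, not part of the paper's code):

```python
import numpy as np

def composite_pixel(colors: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """Front-to-back alpha blending of depth-sorted Gaussian contributions.

    colors: (N, 3) per-Gaussian RGB values c_i for this pixel.
    alphas: (N,) per-Gaussian opacities alpha_i, sorted nearest to farthest.
    """
    pixel = np.zeros(3)
    transmittance = 1.0  # T_i = prod_{j<i} (1 - alpha_j)
    for c, a in zip(colors, alphas):
        pixel += c * a * transmittance
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:  # early termination, as in standard 3DGS rasterizers
            break
    return pixel

# Example: two Gaussians covering the pixel, nearest first.
print(composite_pixel(np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]),
                      np.array([0.6, 0.8])))
```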

3. Object-aware 2D-to-3D Mask Association

Inpaint360GS introduces a key object management mechanism to solve the multi-view object alignment problem:

  • Per-view segmentation is obtained with a foundation 2D segmenter (e.g., HQSAM (Wang et al., 9 Nov 2025)).
  • For each view, Gaussians are projected onto the image, and K-means clustering ($K=2$) on projected depth is used to separate the foreground (nearest-depth) cluster from the background, assigning object labels to 3D anchors.
  • A global object database $\mathcal{D}_{ID} = \{P_1,\dots,P_Q\}$ tracks sets of 3D Gaussians belonging to candidate objects.
  • New view-object assignments are matched against the database by Gaussian-set IoU (a minimal sketch of this association step follows the list):

$$\mathrm{GS\text{-}IoU}_{ij} = \frac{|P_i \cap P_j|}{|P_i \cup P_j|}$$

with a match accepted when the IoU exceeds the threshold $\sigma = 0.2$.

  • Multi-view consistent object IDs result, enabling precise removal and aggregation of multiple objects, even under occlusion.
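
The following is a hedged sketch of the association step, assuming the IDs and projected depths of the Gaussians falling inside a view's 2D object mask are available; the $K=2$ clustering, GS-IoU definition, and threshold $\sigma=0.2$ follow the paper, while the helper names and data structures are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

SIGMA = 0.2  # GS-IoU acceptance threshold from the paper

def foreground_gaussians(ids_in_mask: np.ndarray, depths: np.ndarray) -> set:
    """Split Gaussians projecting into a 2D object mask into foreground/background
    by K-means (K=2) on depth; the nearer cluster is taken as the object."""
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(depths.reshape(-1, 1))
    means = [depths[labels == k].mean() for k in (0, 1)]
    fg = int(np.argmin(means))
    return set(ids_in_mask[labels == fg].tolist())

def gs_iou(a: set, b: set) -> float:
    return len(a & b) / max(len(a | b), 1)

def associate(view_object: set, database: list) -> int:
    """Match a per-view Gaussian set against the global object database D_ID.
    Returns the index of the matched entry, appending a new entry if no match."""
    for idx, entry in enumerate(database):
        if gs_iou(view_object, entry) > SIGMA:
            entry |= view_object  # merge the new observation into the object
            return idx
    database.append(set(view_object))
    return len(database) - 1
```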

Each Gaussian $g_i$ is endowed with a learnable object identity feature $f_i \in \mathbb{R}^D$ (typically $D=16$), facilitating a cross-view cross-entropy object loss and a 3D spatial consistency regularization:

$$\mathcal{L}_{Dis} = \mathcal{L}_{obj} + \lambda\,\mathcal{L}_{space}$$

where $\lambda = 5\times 10^{-4}$ and $\mathcal{L}_{space}$ enforces consistency of identity features among each Gaussian's K nearest neighbors in 3D space. This distillation step is performed over 2000 Adam steps with all non-identity parameters frozen.
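
A hedged PyTorch sketch of the distillation objective $\mathcal{L}_{Dis}$, assuming per-Gaussian identity features have been rendered into a per-pixel feature map and that ground-truth object IDs come from the associated masks; the classifier head and KNN helper are illustrative choices rather than confirmed details of the paper:

```python
import torch.nn.functional as F

LAMBDA_SPACE = 5e-4  # weight on the spatial consistency term (from the paper)

def distillation_loss(rendered_feat, id_logits_head, gt_ids, gaussian_feat, knn_idx):
    """L_Dis = L_obj + lambda * L_space.

    rendered_feat:  (H*W, D) identity features rendered into a training view.
    id_logits_head: linear layer mapping D-dim features to object-class logits.
    gt_ids:         (H*W,) per-pixel object IDs from the associated 2D masks.
    gaussian_feat:  (N, D) per-Gaussian identity features f_i.
    knn_idx:        (N, K) indices of each Gaussian's K nearest neighbors in 3D.
    """
    # Cross-view object loss: classify each rendered pixel's identity feature.
    l_obj = F.cross_entropy(id_logits_head(rendered_feat), gt_ids)

    # Spatial consistency: neighboring Gaussians should share similar features.
    neighbours = gaussian_feat[knn_idx]                       # (N, K, D)
    l_space = (gaussian_feat.unsqueeze(1) - neighbours).pow(2).mean()

    return l_obj + LAMBDA_SPACE * l_space
```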

4. Virtual View Synthesis and Recursive Conditional Inpainting

For each object removed:

  • The 3D centroid of the object's Gaussians is computed.
  • Principal components of the original orbital camera arrangement define a set of virtual viewpoints $\{p_i\}_{i=1}^L$ (typically $L \approx 30$), forming a synthetic trajectory around the removed object (see the sketch after this list).
  • For each pose, a rendered color $C_i$, depth $D_i$, and a predicted "never-before-seen" (NBS) mask $M_i$ (via SAM-Tracking, sometimes refined with objectness detectors) are generated.
  • These virtual context views allow identification of minimal unseen regions for each object and act as templates for further 2D inpainting.
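
The following is a minimal sketch of how such an orbital virtual trajectory might be generated around the removed object, assuming the object centroid and an orbit radius and height estimated from the original capture are available; the paper's principal-component construction is summarized here by a simple circular orbit for illustration:

```python
import numpy as np

def look_at(eye: np.ndarray, target: np.ndarray, up=np.array([0.0, 0.0, 1.0])):
    """Build a camera-to-world rotation and position looking from eye at target."""
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    R = np.stack([right, true_up, -forward], axis=1)  # columns: x, y, z camera axes
    return R, eye

def orbital_trajectory(centroid: np.ndarray, radius: float, height: float, L: int = 30):
    """L virtual viewpoints on a circle around the removed object's centroid,
    all oriented toward the centroid (the paper uses roughly 30 such poses)."""
    poses = []
    for theta in np.linspace(0.0, 2.0 * np.pi, L, endpoint=False):
        eye = centroid + np.array([radius * np.cos(theta),
                                   radius * np.sin(theta),
                                   height])
        poses.append(look_at(eye, centroid))
    return poses
```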

Recursive conditional 2D inpainting is performed using, for example, LaMa (Wang et al., 9 Nov 2025): latent encodings of adjacent rendered frames are decoded and optimized jointly over a small number of steps, propagating appearance and geometry along the virtual trajectory to maintain cross-view consistency during fill-in of the NBS region.
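
A schematic sketch of this recursive fill-in along the virtual trajectory is shown below; `render_view` and `inpaint_2d` stand in for the 3DGS rasterizer and the LaMa-style 2D inpainter, and their interfaces are assumptions rather than the paper's API. Each new virtual view is inpainted conditioned on the previously completed frame so that appearance propagates coherently around the object:

```python
def recursive_conditional_inpainting(gaussians, virtual_poses, render_view, inpaint_2d):
    """Fill NBS regions view by view, conditioning each fill on the previous one.

    render_view(gaussians, pose) -> (color, depth, nbs_mask)     # hypothetical
    inpaint_2d(color, mask, reference) -> completed_color        # hypothetical LaMa wrapper
    """
    completed = []
    previous = None
    for pose in virtual_poses:
        color, depth, nbs_mask = render_view(gaussians, pose)
        # Condition the 2D fill on the adjacent, already-completed frame so
        # texture and geometry stay consistent along the trajectory.
        filled = inpaint_2d(color, nbs_mask, reference=previous)
        completed.append((pose, filled, depth, nbs_mask))
        previous = filled
    return completed
```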

5. Depth-Guided Gaussian Initialization and Hybrid 3D Supervision

To fill the NBS regions:

  • Missing depth is inpainted by 2D depth-completion in virtual views; the completed color and depth, together with masks, are backprojected to produce a candidate point cloud in the unseen region.
  • New Gaussians are initialized at these 3D locations with colors from the inpainted images and default small covariances (see the sketch after this list).
  • All pre-existing Gaussians outside the NBS region are frozen.
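
A minimal sketch of this depth-guided initialization, assuming pinhole intrinsics and a camera-to-world pose for each virtual view; pixels inside the NBS mask are unprojected with the completed depth and used to seed new Gaussians with small default covariances (the parameter names and default values are illustrative):

```python
import numpy as np

def backproject_nbs(depth, color, nbs_mask, K, R, t):
    """Unproject NBS pixels of one virtual view into world-space seed points.

    depth:    (H, W) completed depth map.
    color:    (H, W, 3) inpainted RGB image.
    nbs_mask: (H, W) boolean never-before-seen mask.
    K:        (3, 3) camera intrinsics; R, t: camera-to-world rotation/translation.
    """
    v, u = np.nonzero(nbs_mask)
    z = depth[v, u]
    pix = np.stack([u, v, np.ones_like(u)], axis=0).astype(np.float64)
    cam_pts = np.linalg.inv(K) @ pix * z          # camera rays scaled by depth
    world_pts = (R @ cam_pts).T + t               # (M, 3) points in world space
    return world_pts, color[v, u]

def init_new_gaussians(world_pts, colors, scale=0.01):
    """Seed new Gaussians at backprojected points with small isotropic covariance."""
    N = world_pts.shape[0]
    return {
        "positions": world_pts,
        "colors": colors,
        "scales": np.full((N, 3), scale),
        "rotations": np.tile([1.0, 0.0, 0.0, 0.0], (N, 1)),  # identity quaternions
        "opacities": np.full((N, 1), 0.1),
    }
```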

Hybrid supervision is performed by:

  • For each view (original and virtual), rendering the edited scene $C_{inp}$, comparing it against the inpainted reference $\hat{Y}$, and computing

$$\mathcal{L}_{3DInp} = (1-\lambda_1)\,\|\mathcal{M}\odot(C_{inp}-\hat{Y})\|_1 + \lambda_1\,\mathcal{L}_{D\text{-}SSIM}(C_{inp},\hat{Y}) + \lambda_2\,\mathcal{L}_{LPIPS}(C_{inp},\hat{Y},\mathcal{M})$$

where $\lambda_1 = 0.2$, $\lambda_2 = 0.005$, and $\mathcal{M}$ is the NBS mask. Optimization again uses 2000 Adam steps.
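
A hedged PyTorch sketch of this hybrid supervision term, using the `lpips` package for the perceptual component and a hypothetical `d_ssim` helper for the structural component; only the weights $\lambda_1=0.2$, $\lambda_2=0.005$ and the masked-L1 structure come from the paper:

```python
import lpips

LAMBDA_1, LAMBDA_2 = 0.2, 0.005
lpips_fn = lpips.LPIPS(net="vgg")  # perceptual metric from the lpips package

def hybrid_inpaint_loss(rendered, reference, nbs_mask, d_ssim):
    """L_3DInp = (1 - l1) * masked L1 + l1 * D-SSIM + l2 * masked LPIPS.

    rendered, reference: (1, 3, H, W) tensors in [0, 1].
    nbs_mask:            (1, 1, H, W) binary never-before-seen mask.
    d_ssim:              callable returning the D-SSIM term (assumed helper).
    """
    l1 = (nbs_mask * (rendered - reference)).abs().mean()
    ssim_term = d_ssim(rendered, reference)
    # LPIPS expects inputs in [-1, 1]; restrict it to the masked region.
    perc = lpips_fn(rendered * nbs_mask * 2 - 1, reference * nbs_mask * 2 - 1).mean()
    return (1 - LAMBDA_1) * l1 + LAMBDA_1 * ssim_term + LAMBDA_2 * perc
```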

6. Experimental Evaluation and Quantitative Results

Inpaint360GS was evaluated on a bespoke 360° inpainting dataset (Wang et al., 9 Nov 2025):

  • Dataset: 11 scenes (7 single-object, 4 multi-object), 100–200 images per scene, with controlled object removal, consistent photometric settings, object masks (via HQSAM + association), and ground truth NBS region masks for quantitative metrics.
  • Metrics (↑ higher is better, ↓ lower is better):
    • Full-image PSNR ↑: 24.40
    • Masked PSNR ↑: 36.29
    • Full-image SSIM ↑: 0.837
    • Masked SSIM ↑: 0.9886
    • Full-image LPIPS ↓: 0.130
    • Masked LPIPS ↓: 0.0078
    • FID ↓: 35.93

Inpaint360GS outperforms baseline methods such as SPIn-NeRF, GScream, AuraFusion360, and Gaussian Grouping, showing notable gains in masked region quality and FID. Ablations confirm that all core modules—including object association, depth guidance, virtual view context, conditional 2D inpainting, and 3D hybrid supervision—contribute independently to both quality and convergence.

7. Limitations, Future Directions, and Significance

Current limitations of Inpaint360GS include incomplete handling of shadows cast by removed objects (residual halo effects), challenges in inpainting highly irregular or fine-grained textures with LaMa-based 2D fill-ins, and lack of efficient, learned diffusion-based multi-view consistency priors. The methodology does not model explicit lighting or shadows and may face difficulties with highly dynamic scenes or complex occlusions outside the training distribution.

Suggested future avenues include:

  • Learned multi-view diffusion priors for more plausible completion in difficult textures.
  • Explicit shadow modeling for complete object-removal effects.
  • Multi-reference fusion to enhance coverage, particularly in environments with large occlusions or non-redundant backgrounds.
  • Semantic priors and 4DGS extensions for handling temporally coherent, dynamic scenes.

Taken together, Inpaint360GS establishes an object-centric, 3D-aware, and scalable inpainting pipeline that outperforms prior art in both geometric and photometric consistency for multi-object, multi-view 360° scene editing. Its modular framework and reliance on explicit, editable representations position it as a foundation for future work in practical 3D scene manipulation, digital content creation, and augmented/virtual reality applications (Wang et al., 9 Nov 2025).
