- The paper presents a novel framework that combines semantic depth alignment with multi-view geometry-guided supervision to improve 3D reconstruction from sparse views.
- The paper uses SfM-anchored semantic segmentation and monocular depth estimation to generate dense point clouds for accurate Gaussian splatting initialization.
- The paper demonstrates significant improvements in rendering fidelity and computational efficiency, achieving higher PSNR and more consistent appearance than prior methods.
Multi-Appearance Sparse-View 3D Gaussian Splatting in the Wild: A Technical Analysis of MS-GS
Introduction
MS-GS introduces a robust framework for novel view synthesis and scene reconstruction from sparse, multi-appearance image collections using 3D Gaussian Splatting (3DGS). The method addresses two major challenges: (1) insufficient geometric support from sparse Structure-from-Motion (SfM) point clouds, and (2) photometric inconsistencies arising from images captured under varying conditions. MS-GS leverages semantic depth alignment for dense point cloud initialization and multi-view geometry-guided supervision to regularize appearance and geometry, outperforming prior NeRF- and 3DGS-based approaches both quantitatively and qualitatively.
Figure 1: With 20 input views, DNGS and FSGS produce overly smooth renderings in regions lacking support from the sparse point cloud initialization, while MS-GS recovers fine details and coherent structure.
Semantic Depth Alignment for Dense Initialization
The initialization of 3DGS is critical, especially in sparse-view scenarios where conventional SfM point clouds are insufficiently dense and lack coverage. MS-GS proposes a semantic depth alignment strategy that combines monocular depth estimation with local semantic segmentation, anchored by SfM points. The process involves:
- Projecting SfM points onto each image and extracting semantic regions using a point-prompted segmentation model (e.g., Segment Anything); a prompting sketch follows below.
- Iteratively refining semantic masks to ensure sufficient SfM point support and merging overlapping masks for completeness.
- Aligning monocular depth within each semantic region to SfM depth via least-squares optimization of scale and shift parameters (see the alignment sketch below).
- Back-projecting aligned depths to generate a dense, semantically meaningful point cloud for 3DGS initialization.
This approach mitigates the ambiguities and noise inherent in global monocular depth alignment, providing reliable geometric anchors for subsequent Gaussian densification and pruning.
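To make the point-prompted segmentation step concrete, here is a minimal sketch using the public Segment Anything (segment-anything) predictor API. The checkpoint path, the image array, and the projected SfM pixel coordinate are placeholders; the paper's exact prompting strategy and iterative mask refinement may differ.

```python
# Minimal sketch: point-prompted mask extraction with Segment Anything.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

def point_prompted_mask(predictor, image_rgb, u, v):
    """Return the highest-scoring mask prompted by one projected SfM point (u, v)."""
    predictor.set_image(image_rgb)                            # HxWx3 uint8 image
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[u, v]], dtype=np.float32),    # single point prompt
        point_labels=np.array([1]),                           # 1 = foreground
        multimask_output=True,
    )
    return masks[np.argmax(scores)]

# Usage sketch (placeholder checkpoint path and inputs):
# sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
# mask = point_prompted_mask(SamPredictor(sam), image_rgb, u, v)
```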
Figure 2: Overview of depth prior initialization in MS-GS, showing semantic mask extraction and depth alignment for dense point cloud construction.
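The per-region alignment itself reduces to a closed-form least-squares fit of a scale and a shift between the monocular depth and the SfM depths inside each mask, followed by back-projection of the aligned depth. Below is a minimal numpy sketch under these assumptions; any robust weighting, outlier rejection, or iterative mask refinement used in the paper is omitted.

```python
import numpy as np

def align_region_depth(mono_depth, sfm_depth, region_mask):
    """Fit scale s and shift t so that s * mono + t approximates SfM depth in one region.

    mono_depth:  (H, W) monocular depth prediction
    sfm_depth:   (H, W) sparse depth from projected SfM points (0 where unobserved)
    region_mask: (H, W) boolean semantic mask anchoring the region
    """
    valid = region_mask & (sfm_depth > 0)
    A = np.stack([mono_depth[valid], np.ones(valid.sum())], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, sfm_depth[valid], rcond=None)   # closed-form fit
    return np.where(region_mask, s * mono_depth + t, 0.0), (s, t)

def backproject(depth, K, c2w):
    """Lift valid per-pixel depths to world-space 3D points (pinhole camera model)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0
    rays = np.linalg.inv(K) @ np.stack([u[valid], v[valid], np.ones(valid.sum())])
    cam_pts = rays * depth[valid]                     # 3 x N points in camera space
    return (c2w[:3, :3] @ cam_pts).T + c2w[:3, 3]     # N x 3 points in world space
```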
Figure 3: Visualizations of rendering with different point cloud initializations, highlighting the superiority of semantic alignment over sparse and image-level alignment.
Multi-View Geometry-Guided Supervision
To address overfitting and appearance inconsistency in sparse, multi-appearance settings, MS-GS introduces multi-view geometry-guided supervision:
- Appearance Modeling: Decomposes appearance into per-image embeddings (capturing global variations) and per-Gaussian feature embeddings (encoding canonical scene appearance). An MLP fuses these embeddings to predict Gaussian colors (a minimal sketch follows this list).
- Virtual View Creation: Interpolates camera poses between training views to generate virtual viewpoints, using SLERP for rotation and linear interpolation for translation and FOV.
- 3D Warping Supervision: Establishes pixel-wise correspondences between training and virtual views via backprojection and forward projection, enforcing appearance consistency with an explicit pixel loss and occlusion masking (both steps are sketched below).
- Semantic Feature Supervision: Applies feature-level loss using VGG-extracted feature maps, mapped via 3D warping correspondences, to regularize appearance at a coarser semantic level and handle occlusions.
This dual-level supervision constrains both fine-grained and coarse appearance, promoting multi-view consistency and reducing artifacts.
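As a concrete illustration of the appearance-modeling component, here is a minimal PyTorch sketch of fusing a per-image embedding with per-Gaussian feature embeddings through an MLP to predict colors. The embedding dimensions, layer widths, class name, and sigmoid output are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AppearanceModel(nn.Module):
    """Fuse a per-image embedding with per-Gaussian feature embeddings to predict RGB."""

    def __init__(self, num_images, img_dim=32, gauss_dim=16, hidden=64):
        super().__init__()
        self.image_embed = nn.Embedding(num_images, img_dim)   # global, per-capture appearance
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + gauss_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),                 # RGB in [0, 1]
        )

    def forward(self, image_idx, gaussian_features):
        # gaussian_features: (N, gauss_dim) learnable per-Gaussian appearance codes
        img = self.image_embed(image_idx).expand(gaussian_features.shape[0], -1)
        return self.mlp(torch.cat([img, gaussian_features], dim=-1))

# Usage sketch:
# model = AppearanceModel(num_images=20)
# feats = torch.randn(100_000, 16)          # per-Gaussian embeddings, optimized jointly
# colors = model(torch.tensor(0), feats)    # colors under image 0's appearance
```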
Figure 4: Overview of multi-view geometry-guided supervision, illustrating virtual view creation and 3D warping for pixel and feature loss computation.
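The virtual-view creation and 3D-warping steps can likewise be sketched in numpy: interpolate a virtual camera between two training poses (SLERP on quaternions for rotation, linear interpolation for translation and FOV), then backproject the training view's depth and reproject it into the virtual view to obtain pixel correspondences. Quaternion conventions, matrix layouts, and the simple bounds/positive-depth validity test below are assumptions; the paper's occlusion masking would additionally compare reprojected depths against the virtual view's rendered depth.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions."""
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:                              # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:                           # nearly parallel: linear fallback
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def virtual_pose(t0, t1, q0, q1, alpha):
    """Virtual camera between two training poses: lerp translation, SLERP rotation.
    FOV would be interpolated linearly in the same way (omitted here)."""
    return (1 - alpha) * t0 + alpha * t1, slerp(q0, q1, alpha)

def warp_to_virtual(depth_src, K, w2c_src, w2c_virt, eps=1e-3):
    """Map every source pixel to the virtual view by backprojection + reprojection.

    Returns target pixel coordinates (2, H, W) and a validity mask (H, W); occluded
    pixels would further be removed with a depth-consistency check in the virtual view.
    """
    H, W = depth_src.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])        # homogeneous pixels, 3 x HW
    cam = (np.linalg.inv(K) @ pix) * depth_src.ravel()            # source camera-space points
    c2w = np.linalg.inv(w2c_src)
    world = c2w[:3, :3] @ cam + c2w[:3, 3:4]                      # world-space points
    virt = w2c_virt[:3, :3] @ world + w2c_virt[:3, 3:4]           # virtual camera space
    z = virt[2]
    proj = K @ (virt / np.clip(z, eps, None))                     # perspective projection
    uv = np.round(proj[:2]).astype(int)
    valid = (z > eps) & (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H)
    return uv.reshape(2, H, W), valid.reshape(H, W)
```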
Experimental Results and Ablation
MS-GS is evaluated on three benchmarks: Sparse Mip-NeRF 360, Sparse Phototourism, and a newly introduced Sparse Unbounded Drone dataset. The method demonstrates substantial improvements over state-of-the-art NeRF and 3DGS variants, especially in perceptual metrics (LPIPS, DSIM) and rendering speed.
- Quantitative Results: On the Sparse Unbounded Drone dataset, MS-GS achieves a PSNR of 19.87, SSIM of 0.580, LPIPS of 0.322, and DSIM of 0.096, outperforming the best prior method by 2.54 dB in PSNR and reducing perceptual error by over 65%.
- Qualitative Results: MS-GS consistently reconstructs fine details and maintains appearance coherence across novel views, whereas prior methods exhibit floaters, artifacts, and inconsistent radiance.
Ablation studies confirm the complementary benefits of semantic dense initialization and multi-view supervision. Dense initialization alone improves PSNR by 0.8 dB, while the addition of geometry-guided losses further boosts performance.
Figure 5: Qualitative comparison of novel view synthesis across datasets, with MS-GS excelling at detailed structure and appearance consistency.
Limitations and Future Directions
MS-GS is not designed to handle transient objects, which remain a challenge in sparse-view settings due to increased uncertainty and ambiguity. Future work may explore dynamic scene modeling, improved appearance disentanglement, and integration with large-scale, unbounded scene representations.
Conclusion
MS-GS establishes a new baseline for multi-appearance, sparse-view 3D Gaussian Splatting, combining semantic depth alignment and multi-view geometry-guided supervision to achieve superior reconstruction fidelity and appearance consistency. The method is computationally efficient, broadly applicable, and sets the stage for further advances in robust, in-the-wild novel view synthesis.