Revisiting Depth Representations for Feed-Forward 3D Gaussian Splatting (2506.05327v1)

Published 5 Jun 2025 in cs.CV

Abstract: Depth maps are widely used in feed-forward 3D Gaussian Splatting (3DGS) pipelines by unprojecting them into 3D point clouds for novel view synthesis. This approach offers advantages such as efficient training, the use of known camera poses, and accurate geometry estimation. However, depth discontinuities at object boundaries often lead to fragmented or sparse point clouds, degrading rendering quality -- a well-known limitation of depth-based representations. To tackle this issue, we introduce PM-Loss, a novel regularization loss based on a pointmap predicted by a pre-trained transformer. Although the pointmap itself may be less accurate than the depth map, it effectively enforces geometric smoothness, especially around object boundaries. With the improved depth map, our method significantly improves the feed-forward 3DGS across various architectures and scenes, delivering consistently better rendering results. Our project page: https://aim-uofa.github.io/PMLoss

Summary

  • The paper presents PM-Loss, a novel regularization that uses pointmaps from a pre-trained model as pseudo-ground truth to refine 3D Gaussian centers and enhance rendering.
  • It combines an efficient Umeyama alignment with a single-directional Chamfer distance to mitigate geometric inaccuracies caused by depth discontinuities.
  • Experiments on RealEstate10K and DL3DV demonstrate improved rendering metrics (PSNR, SSIM, LPIPS), while evaluation on DTU confirms better geometry and reduced artifacts near object boundaries.

This paper (2506.05327) addresses a key limitation in feed-forward 3D Gaussian Splatting (3DGS) models: the degradation of novel view synthesis quality due to geometric inaccuracies. Traditional feed-forward 3DGS often relies on unprojecting predicted depth maps to create 3D points (which become Gaussian centers). However, depth maps frequently exhibit discontinuities, particularly at object boundaries. When these discontinuities are unprojected, they lead to fragmented and inaccurate 3D Gaussian representations, negatively impacting the final rendered images.

The authors propose a novel regularization technique called PM-Loss (PointMap Loss) to mitigate this issue. PM-Loss leverages the geometric prior provided by pointmaps: per-pixel 3D points regressed directly from input images by large pre-trained, transformer-based 3D reconstruction models. Unlike depth maps, pointmaps tend to capture smoother and more coherent 3D geometry, especially around complex boundaries.

The core idea of PM-Loss is to use the pointmap generated by a pre-trained model as a pseudo-ground truth 3D structure to supervise the 3D Gaussian centers predicted by the feed-forward 3DGS network during training. The process involves several steps:

  1. A pre-trained pointmap regression model (e.g., VGGT [wang2025vggt]) takes the input image(s) and camera poses to generate a dense pointmap, where each pixel corresponds to a 3D point in world space.
  2. The feed-forward 3DGS model predicts a depth map for the input image(s), which is then unprojected using camera intrinsics and extrinsics to obtain a set of 3D points representing the Gaussian centers.
  3. Crucially, because both the unprojected depth points and the pointmap points originate from the same image pixels, there's a natural one-to-one correspondence between them.
  4. Due to potential discrepancies in scale, rotation, or translation between the coordinate systems of the predicted 3DGS centers and the pointmap points (even if both are ostensibly in world space), an alignment step is necessary. The authors use the efficient Umeyama algorithm [88573], which leverages the one-to-one correspondence to find the optimal similarity transformation (scale, rotation, translation) between the two point clouds.
  5. After alignment, a single-directional Chamfer distance is computed between the set of predicted 3DGS centers and the aligned pointmap points. This $L_{\text{PM}}$ loss penalizes deviations of predicted Gaussian centers from the pointmap prior, effectively encouraging the 3DGS model to learn a geometry closer to the smoother pointmap structure. The formulation is $L_{\text{PM}}(X_{\text{3DGS}}, X'_{\text{PM}}) = \frac{1}{N_{\text{total\_pts}}} \sum_{\mu \in X_{\text{3DGS}}} \min_{p' \in X'_{\text{PM}}} \|\mu - p'\|_2^2$, where $X_{\text{3DGS}}$ are the predicted centers and $X'_{\text{PM}}$ are the aligned pointmap points. This 3D nearest-neighbor Chamfer loss is shown to be more effective than a 2D pixel-aligned depth loss.
  6. The total training loss is a combination of the standard rendering loss (e.g., MSE + LPIPS) and the proposed $L_{\text{PM}}$, weighted by a coefficient $\lambda_{\text{PM}}$; a minimal sketch of steps 2-5 is given after this list.
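
To make steps 2-5 concrete, here is a minimal PyTorch-style sketch (not the authors' released implementation) of depth unprojection, Umeyama alignment, and the single-directional Chamfer term. Function names such as `unproject_depth` and `pm_loss` are illustrative, the brute-force `torch.cdist` nearest-neighbor search stands in for the chunked or KNN-based search one would need at ~450k points, and detaching the predicted centers when estimating the alignment is an assumption rather than a documented detail.

```python
import torch


def unproject_depth(depth, K, cam_to_world):
    """Lift a depth map (H, W) into world-space 3D points (H*W, 3)."""
    H, W = depth.shape
    v, u = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype, device=depth.device),
        torch.arange(W, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)  # homogeneous pixels
    pts_cam = (pix @ torch.linalg.inv(K).T) * depth.reshape(-1, 1)        # camera-space points
    pts_h = torch.cat([pts_cam, torch.ones_like(pts_cam[:, :1])], dim=-1)
    return (pts_h @ cam_to_world.T)[:, :3]                                # world-space points


def umeyama_align(src, dst):
    """Similarity transform (s, R, t) mapping paired points src -> dst, both (N, 3)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / src.shape[0]
    U, S, Vt = torch.linalg.svd(cov)
    diag = torch.ones(3, dtype=src.dtype, device=src.device)
    diag[2] = torch.sign(torch.det(U @ Vt))          # guard against reflections
    R = U @ torch.diag(diag) @ Vt
    var_src = (xs ** 2).sum() / src.shape[0]
    s = (S * diag).sum() / var_src
    t = mu_d - s * (R @ mu_s)
    return s, R, t


def pm_loss(gauss_centers, pointmap_pts):
    """Single-directional Chamfer distance from predicted Gaussian centers to the
    pointmap prior, after aligning the pointmap into the 3DGS coordinate frame."""
    # The one-to-one pixel correspondence makes the closed-form alignment possible;
    # centers are detached so the alignment acts as a fixed target (an assumption here).
    s, R, t = umeyama_align(pointmap_pts, gauss_centers.detach())
    aligned_pm = s * (pointmap_pts @ R.T) + t
    # Brute-force all-pairs distances for clarity; replace with a KNN search at scale.
    d2 = torch.cdist(gauss_centers, aligned_pm).pow(2)
    return d2.min(dim=1).values.mean()


# Illustrative total objective (lambda_pm is a tunable weight):
# loss = rendering_loss + lambda_pm * pm_loss(pred_centers, pointmap_pts)
```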

For practical implementation, the authors recommend generating the pointmaps offline using a pre-trained model (like VGGT-1B), which takes around 0.3 seconds per scene. These pre-computed pointmaps are then loaded during training. The online computational overhead of PM-Loss itself is minimal, primarily comprising the Umeyama alignment (very fast, ~0.9 ms) and the Chamfer distance calculation (~64.1 ms for a typical scene with ~450k points). This makes PM-Loss efficient to integrate into existing training pipelines without significantly slowing them down. The memory overhead during training with offline pointmap generation is also shown to be modest (~0.96GB added VRAM for DepthSplat).
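
As a rough sketch of this offline workflow (the pointmap model's exact interface is not given here, so `regress_pointmap` and `load_views` below are hypothetical placeholders for whatever wrappers are used around a model such as VGGT):

```python
from pathlib import Path

import torch


def precompute_pointmaps(scene_ids, load_views, regress_pointmap, out_dir="pointmaps"):
    """Run the pre-trained pointmap model once per scene and cache the result,
    so the training loop only loads tensors from disk."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for scene_id in scene_ids:
        views = load_views(scene_id)                 # input images (plus poses, if required)
        with torch.no_grad():
            pointmap = regress_pointmap(views)       # dense per-pixel world-space 3D points
        torch.save(pointmap.cpu(), out / f"{scene_id}.pt")


# During training, the dataloader then simply loads the cached prior:
# pointmap_pts = torch.load(Path("pointmaps") / f"{scene_id}.pt").reshape(-1, 3)
```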

The effectiveness of PM-Loss is demonstrated by applying it to fine-tune two representative feed-forward 3DGS models, MVSplat [chen2024mvsplat] and DepthSplat [xu2024depthsplat], on large-scale datasets like DL3DV [ling2024dl3dv] and RealEstate10K [zhou2018stereo]. Experiments show consistent improvements in novel view synthesis metrics (PSNR, SSIM, LPIPS), particularly in challenging view extrapolation scenarios where geometric accuracy at boundaries is critical (Figure 1). Qualitative comparisons of the predicted 3D Gaussians also show reduced floating artifacts and cleaner borders when PM-Loss is used (Figures 5, 6). Quantitative evaluation on the DTU dataset [jensen2014large] confirms improved geometric quality (lower Accuracy, Completeness, and Overall Chamfer Distance for the predicted point clouds) across varying numbers of input views (Table 2).

The authors discuss that the choice of the pre-trained pointmap model affects performance; using a higher-quality model like VGGT [wang2025vggt] yields slightly better results than Fast3R [yang2025fast3r], although both improve upon baselines without PM-Loss. This highlights a limitation: PM-Loss's effectiveness is bounded by the quality of the pointmap prior, and errors in the pointmap can propagate into the learned Gaussians.

Compared to alternative methods that attempt to directly integrate pointmap features into the 3DGS network architecture (e.g., using specialized Gaussian heads), PM-Loss offers advantages in efficiency. Architectures that directly process pointmap features might require larger models and potentially complex/slow test-time pose alignment for optimal performance (Table A.3), whereas PM-Loss acts purely as a training-time regularizer and adds no inference overhead.

In summary, PM-Loss is presented as a practical, efficient, and plug-and-play method to improve the geometric quality of feed-forward 3DGS models by leveraging the prior knowledge from pre-trained pointmap regression models, specifically targeting the issue of depth-induced discontinuities at object boundaries.
