
DepthSplat: 3D Gaussian Splatting Framework

Updated 20 April 2026
  • The paper introduces DepthSplat, a feed-forward 3D Gaussian Splatting framework that integrates monocular and multi-view depth estimation for novel-view synthesis.
  • It employs a dual-branch architecture using a ResNet/Swin Transformer and a frozen ViT to predict per-pixel depths and generate 3D Gaussian splats for photorealistic rendering.
  • Experiments demonstrate state-of-the-art improvements in PSNR, SSIM, and LPIPS, with PM-Loss regularization enhancing boundary consistency and geometric accuracy.

DepthSplat is a feed-forward 3D Gaussian Splatting (3DGS) framework that bridges monocular/multi-view depth estimation and novel-view synthesis, enabling high-fidelity, real-time scene reconstruction from sparse input imagery and known camera poses. The method integrates depth map prediction, unprojection to 3D, and parameterized Gaussian splat rendering within a unified deep learning pipeline, with specific architectural provisions for both supervised and unsupervised settings. Key innovations include fusing monocular depth priors with multi-view feature-matching, leveraging cost volumes, and incorporating specialized training objectives that enhance boundary consistency and geometric accuracy.

1. Core Methodology and Pipeline

DepthSplat begins with a pipeline that predicts per-pixel depths for each RGB input image, fuses these multi-view depths with monocular priors, and unprojects the result to generate a 3D point cloud, where each point serves as the center of a parameterized Gaussian. The architecture comprises two main branches:

  • Multi-view branch: Extracts plane-sweep stereo features using a shared ResNet and a Swin Transformer with alternating self- and cross-attention, then computes a cost volume across multiple views.
  • Monocular branch: Utilizes a frozen, pre-trained ViT (Depth-Anything V2) to provide strong per-pixel depth priors, especially in textureless or occluded regions.

Features from these branches are concatenated and processed by a regression U-Net, producing dense depth predictions via softmax over depth bins and expectation. Hierarchical matching and scale refinement are applied to enhance resolution. The predicted depths are then unprojected using camera intrinsics and extrinsics, yielding the centers of 3D Gaussians. Additional small U-Net heads predict each Gaussian’s opacity, covariance, and color (Xu et al., 2024, Long et al., 7 Jan 2026).
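
The depth-regression and unprojection step can be sketched as below, assuming a PyTorch-style tensor layout; function names, shapes, and the brute-force unprojection are illustrative rather than the official DepthSplat implementation.

```python
import torch
import torch.nn.functional as F

def depth_from_logits(logits: torch.Tensor, depth_candidates: torch.Tensor) -> torch.Tensor:
    """Convert per-pixel logits over D depth bins into a dense depth map.

    logits:           (B, D, H, W) output of the regression U-Net
    depth_candidates: (D,) plane-sweep depth hypotheses
    returns:          (B, H, W) expected depth per pixel (softmax + expectation)
    """
    prob = F.softmax(logits, dim=1)                                   # distribution over depth bins
    return (prob * depth_candidates.view(1, -1, 1, 1)).sum(dim=1)

def unproject_to_gaussian_centers(depth, K, cam_to_world):
    """Lift a depth map to 3D points (Gaussian centers) in world coordinates.

    depth:        (B, H, W) predicted depths
    K:            (B, 3, 3) camera intrinsics
    cam_to_world: (B, 4, 4) camera extrinsics (camera-to-world)
    returns:      (B, H*W, 3) centers of the per-pixel Gaussians
    """
    B, H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()     # (H, W, 3) homogeneous pixels
    pix = pix.view(1, -1, 3).expand(B, -1, -1)                        # (B, H*W, 3)
    rays = torch.einsum("bij,bnj->bni", torch.inverse(K), pix)        # camera-frame rays
    pts_cam = rays * depth.view(B, -1, 1)                             # scale rays by depth
    pts_cam_h = torch.cat([pts_cam, torch.ones_like(pts_cam[..., :1])], dim=-1)
    return torch.einsum("bij,bnj->bni", cam_to_world, pts_cam_h)[..., :3]
```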

Rendering is achieved by differentiable point splatting, where Gaussian densities are composited using front-to-back or back-to-front alpha compositing to synthesize photorealistic novel views. Losses are applied either on rendered images (photometric, LPIPS), on depths (ℓ₁+gradient loss on inverse depths), or both.
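
For intuition, a toy sketch of front-to-back alpha compositing over the Gaussians covering a single pixel is given below; real 3DGS renderers perform this per tile in fused CUDA kernels, and the early-termination threshold here is an illustrative choice.

```python
import torch

def composite_front_to_back(colors: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """colors: (N, 3), alphas: (N,) for N Gaussians already sorted near-to-far."""
    out = torch.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        out = out + transmittance * a * c        # accumulate alpha-weighted color
        transmittance = transmittance * (1.0 - a)
        if transmittance < 1e-4:                 # early termination once effectively opaque
            break
    return out
```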

2. Addressing Depth Discontinuities and Geometric Consistency

A major practical challenge arises from depth discontinuities at object boundaries, resulting in fragmented or noisy geometry when depths are unprojected. DepthSplat variants and follow-ups address this through regularizer design.

A specialized training loss, PM-Loss (PointMap-Loss), harnesses a dense pseudo-ground-truth pointmap generated by a pre-trained transformer as a geometric prior. These pointmaps are aligned with the predicted point clouds via Umeyama alignment, then a single-directional Chamfer distance is minimized from each predicted Gaussian center to the nearest point in the aligned pointmap:

L_{\mathrm{PM}} = \frac{1}{N_{\mathrm{pts}}} \sum_{\mu \in X_{\mathrm{3DGS}}} \min_{p' \in X'_{\mathrm{PM}}} \|\mu - p'\|_2^2

This constrains predicted geometry to be more spatially smooth and consistent at occlusion boundaries. The total loss objective combines photometric, perceptual (LPIPS), and PM-Loss terms, with PM-Loss weighted by a small coefficient (λ_PM=0.005). This regularizer is applied only during training and incurs no inference overhead (Shi et al., 5 Jun 2025).
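
A hedged sketch of how such a loss can be computed is shown below, assuming the pointmap and the predicted centers share per-pixel ordering so a closed-form Umeyama alignment applies; the brute-force nearest-neighbour search and all names are illustrative rather than the released code of Shi et al. (2025).

```python
import torch

def umeyama_align(src: torch.Tensor, dst: torch.Tensor):
    """Similarity transform (s, R, t) mapping src -> dst; both (N, 3), assumed paired."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / src.shape[0]                        # cross-covariance
    U, S, Vt = torch.linalg.svd(cov)
    d = torch.sign(torch.det(U @ Vt)).item()              # reflection guard
    R = U @ torch.diag(torch.tensor([1.0, 1.0, d])) @ Vt
    s = (S * torch.tensor([1.0, 1.0, d])).sum() * src.shape[0] / xs.pow(2).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def pm_loss(pred_centers: torch.Tensor, pointmap: torch.Tensor) -> torch.Tensor:
    """One-directional Chamfer from predicted Gaussian centers to the aligned pointmap."""
    s, R, t = umeyama_align(pointmap, pred_centers)       # align pointmap to prediction
    aligned = s * pointmap @ R.T + t
    d2 = torch.cdist(pred_centers, aligned).pow(2)        # (N_pred, N_pm) squared distances
    return d2.min(dim=1).values.mean()

# total loss (weights from the paper): photometric + LPIPS + 0.005 * pm_loss(centers, pointmap)
```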

3. Architectural and Algorithmic Details

The DepthSplat backbone consists of:

  • Shared ResNet for early image encoding, with downsampling for computational efficiency.
  • Multi-view Swin Transformer for cross-view feature matching, constructing a cost volume with D depth candidates (typically D=64).
  • Frozen Depth-Anything V2 ViT module for strong monocular priors, upsampled to match the cost volume stride.
  • Feature fusion by channel concatenation to form input to a regression U-Net; output logits over depth bins resolve depth via softmax+expectation.
  • Hierarchical scale-refinement: a coarse-to-fine depth refinement conducted in two stages (stride 1/8 then locally finer).
  • Lightweight heads (MLP or small U-Net) predict per-pixel Gaussian parameters: opacity α, covariance Σ (typically diagonal), and color c.

Implementation uses PyTorch, xFormers for efficient attention, and fused CUDA kernels for plane-sweep warping. The inference path is strictly feed-forward and highly parallelized (Xu et al., 2024).
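
As a concrete illustration of the lightweight parameter heads listed above, the sketch below predicts per-pixel opacity, diagonal covariance scales, and color from the fused feature map; layer sizes and activations are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GaussianParamHead(nn.Module):
    """Per-pixel head: opacity (1) + diagonal covariance scales (3) + color (3) = 7 channels."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(feat_dim, 7, 1),
        )

    def forward(self, feats: torch.Tensor):
        out = self.head(feats)                    # (B, 7, H, W)
        opacity = torch.sigmoid(out[:, 0:1])      # alpha in (0, 1)
        scales = torch.exp(out[:, 1:4])           # positive diagonal covariance entries
        color = torch.sigmoid(out[:, 4:7])        # RGB in (0, 1)
        return opacity, scales, color
```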

4. Training Regimes and Losses

Training can be conducted in:

  • Depth-supervised mode: Applies an L_depth objective on the inverse predicted and ground-truth depths:

L_{\mathrm{depth}} = \alpha\,\bigl|1/D_{\mathrm{pred}} - 1/D_{\mathrm{gt}}\bigr| + \beta\,\bigl|\nabla(1/D_{\mathrm{pred}}) - \nabla(1/D_{\mathrm{gt}})\bigr|

  • Unsupervised ("Gaussian-only") mode: The rendering loss,

L_{\mathrm{gs}} = \sum_{m=1}^{M} \left[ \mathrm{MSE}(I^{m}_{\mathrm{render}}, I^{m}_{\mathrm{gt}}) + \lambda\,\mathrm{LPIPS}(I^{m}_{\mathrm{render}}, I^{m}_{\mathrm{gt}}) \right]

supervises only photometric consistency in rendered novel views. This mode also acts as unsupervised pre-training for the depth network, leveraging unlabeled, multi-view video data.
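
The two objectives can be written compactly as below; the loss weights, the epsilon guard, and the `lpips_fn` argument are placeholders rather than the paper's exact settings (LPIPS would come from the `lpips` package or an equivalent perceptual loss).

```python
import torch
import torch.nn.functional as F

def depth_loss(d_pred, d_gt, alpha=1.0, beta=1.0, eps=1e-6):
    """L1 on inverse depths plus L1 on their finite-difference gradients; inputs are (B, H, W)."""
    inv_pred, inv_gt = 1.0 / (d_pred + eps), 1.0 / (d_gt + eps)
    l1 = (inv_pred - inv_gt).abs().mean()
    gx = (inv_pred[:, :, 1:] - inv_pred[:, :, :-1]) - (inv_gt[:, :, 1:] - inv_gt[:, :, :-1])
    gy = (inv_pred[:, 1:, :] - inv_pred[:, :-1, :]) - (inv_gt[:, 1:, :] - inv_gt[:, :-1, :])
    return alpha * l1 + beta * (gx.abs().mean() + gy.abs().mean())

def gaussian_only_loss(renders, targets, lpips_fn, lam=0.05):
    """Photometric MSE plus weighted LPIPS, summed over M rendered target views."""
    loss = 0.0
    for render, target in zip(renders, targets):
        loss = loss + F.mse_loss(render, target) + lam * lpips_fn(render, target)
    return loss
```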

Unsupervised pre-training has been shown to regularize and improve cross-dataset transfer, as demonstrated by significant improvements in downstream geometry estimation after subsequent fine-tuning (Xu et al., 2024).

5. Experimental Results and Comparative Performance

DepthSplat demonstrates state-of-the-art and robust results across multiple datasets and tasks:

Dataset          Metric   MVSplat   DepthSplat
RealEstate10K    PSNR     26.39     27.44
RealEstate10K    SSIM     0.869     0.887
RealEstate10K    LPIPS    0.128     0.119
DL3DV (4-view)   PSNR     21.63     22.82
ScanNet          AbsRel   –         0.044

Notably, DepthSplat achieves PSNR gains (1–2 dB) over preceding baselines including pixelNeRF, GPNR, MuRF, pixelSplat, MVSplat, and TranSplat for novel-view synthesis. On ScanNet (two-view depth estimation), it reduces AbsRel to 0.044, outperforming DeepV2D and UniMatch. In real-time scenarios, DepthSplat reconstructs from 12 views at high resolution (512×960) in 0.6 seconds on a single A100 GPU (Xu et al., 2024, Shi et al., 5 Jun 2025, Long et al., 7 Jan 2026).

Adding PM-Loss regularization further boosts PSNR by more than 2 dB and improves SSIM and LPIPS, with the largest gains at object boundaries and in point-cloud smoothness (Shi et al., 5 Jun 2025).

6. Extensions: Physical Defocus and Multi-View Consistency

Recent extensions integrate physical defocus modeling and multi-view geometric supervision. The pipeline simulates depth-of-field (DOF) using optics-based defocus kernels (Gaussian, SmoothStep, Polygonal), computes physically accurate defocused images, and introduces corresponding DOF losses. Monocular depth priors are scale-normalized via geometric constraints derived from semi-dense LoFTR feature matches. Combined loss terms include:

  • In-focus RGB loss (L1 + SSIM) for sharp reconstruction.
  • Out-of-focus loss penalizing mismatch in simulated defocus.
  • Geometric consistency and local least-squares depth consistency losses using epipolar geometry and depth ratio matching across correspondences.
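
To make the defocus term concrete, the sketch below computes a thin-lens circle of confusion from depth and blurs the sharp rendering with a single Gaussian kernel; the single-kernel simplification and all variable names are assumptions (the extension also supports SmoothStep and polygonal kernels and, in practice, spatially varying blur).

```python
import torch
import torch.nn.functional as F

def circle_of_confusion(depth, focus_dist, aperture, focal_len):
    """Thin-lens circle-of-confusion diameter (sensor units) per pixel depth."""
    return aperture * focal_len * (depth - focus_dist).abs() / (depth * (focus_dist - focal_len))

def gaussian_defocus(image, coc, kernel_size=9):
    """Blur a sharp rendering with a Gaussian whose std follows the mean CoC.

    image: (B, 3, H, W) sharp rendering; coc: (B, 1, H, W) per-pixel CoC.
    """
    sigma = coc.mean().clamp(min=1e-3)
    coords = (torch.arange(kernel_size) - kernel_size // 2).float()
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = (g / g.sum()).to(image.device)
    k2d = torch.outer(g, g).view(1, 1, kernel_size, kernel_size).repeat(3, 1, 1, 1)
    return F.conv2d(image, k2d, padding=kernel_size // 2, groups=3)

# out-of-focus loss: compare gaussian_defocus(render, coc) against the observed defocused image
```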

This results in further PSNR/SSIM improvement (e.g., +1.13 dB PSNR on Waymo Open Dataset; consistent improvements on Mip-NeRF 360, SS3DM, and other challenging benchmarks) and qualitatively sharper near/far geometry (Deng et al., 13 Nov 2025).

7. Limitations and Future Directions

Limitations of vanilla DepthSplat include under-utilization of multi-view cues due to reliance on a single plane-sweep warp and limited fine-scale resolution owing to cost volume construction at stride 1/4 or 1/8. Boundary artifacts may persist without explicit regularization, and real-time inference assumes known camera poses.

Suggested avenues for further research follow directly from these limitations: relaxing the known-pose assumption, raising the resolution of multi-view matching, exploiting multiple plane-sweep warps or iterative refinement, and strengthening boundary regularization.

DepthSplat and its derivatives remain central baselines and frameworks for generalizable, real-time neural 3D reconstruction, continuing to drive improvements in the broader 3D Gaussian Splatting and neural rendering literature.
