SparSplat: Efficient 3D Reconstruction

Updated 12 January 2026
  • SparSplat is a family of methods for efficient 3D reconstruction and novel view synthesis using Gaussian splatting under sparse-view and dynamic scene conditions.
  • It integrates fast per-pixel 2D Gaussian splatting with neural multi-view stereo and dynamic deformation fields for robust segmentation and rendering.
  • Benchmark evaluations demonstrate state-of-the-art performance with significant speed improvements and high-quality reconstructions for both static and dynamic scenarios.

SparSplat encompasses a family of methods and architectures for efficient, high-fidelity 3D reconstruction and novel view synthesis under sparse-view and dynamic scene conditions. While multiple works employ the term, two major frameworks dominate: (1) SparSplat for generalizable multi-view reconstruction via fast per-pixel 2D Gaussian Splatting (Jena et al., 4 May 2025), and (2) Splatography—alternatively titled SparSplat—for sparse, dynamic multi-view deformation and segmentation with 3D Gaussian Splatting (Azzarelli et al., 7 Nov 2025). Both approaches are rooted in the Gaussian Splatting paradigm but target distinct problem regimes: static/sparse reconstructions versus dynamic, unconstrained scenes common in resource-limited filmmaking.

1. Methodological Foundations of SparSplat

SparSplat (Jena et al., 4 May 2025) addresses the challenge of reconstructing high-quality 3D scene geometry and photorealistic novel views from a small set of calibrated images, extending prior Multi-View Stereo (MVS) work. The pipeline integrates image-driven, generalizable neural MVS with pixel-aligned 2D Gaussian Splatting (2DGS) for surface representation, achieving direct, feed-forward inference without per-scene optimization.

Key pipeline steps (a minimal depth-regression sketch follows the list):

  • Input Representation: Takes $N$ sparse input images $\{I_i\}_{i=1}^N$ with known intrinsics $K_i$ and poses $P_i$; supports optional enrichment with DINOv2 monocular features and MASt3R pairwise correspondences.
  • Feature Extraction: Employs a shared Feature Pyramid Network (FPN) backbone to encode per-image features, optionally concatenated with DINOv2 (384-dim) and MASt3R (24-dim/pair) to improve matching under wide-baseline or low-texture conditions.
  • Homography Warping: For a target view $P_t$, features $f_i$ are warped via plane-sweep homographies $H_{i\to t}$ across $D$ depth planes, providing geometric alignment for MVS and attribute regression.
  • Cost Volume and Depth Regression: Fused warped features build a cost volume $C(u,v,d)$, processed by a 3D CNN into a probability volume $p(u,v,d)$ and per-pixel depth $D_t(u,v)=\sum_d d\cdot p(u,v,d)$.
  • 2DGS Surface Element Regression: A pixel-aligned branch regresses per-pixel splat attributes (planar scale $s\in\mathbb R^2$, quaternion $q\in\mathbb R^4$, base opacity $\alpha$, color $c\in\mathbb R^3$) via a light-weight CNN/MLP, with the 3D center reprojected from depth.
  • Rendering and Reconstruction: For a target novel view, splats are composited via depth sorting and alpha blending; for surface mesh extraction, depth maps at input camera poses are fused by TSDF (voxel size $\approx 1.5$ mm), with mesh extraction via marching cubes.
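
To make the depth-regression and pixel-aligned splat placement steps concrete, the following PyTorch sketch performs a soft-argmax over a plane-sweep probability volume and back-projects the per-pixel depths to 3D splat centers. The function names, tensor shapes, and toy values are illustrative assumptions rather than the released SparSplat implementation.

```python
# Minimal sketch (PyTorch): soft-argmax depth regression over a plane-sweep
# probability volume, then back-projection of per-pixel depths to 3D splat
# centers in the target camera frame. Shapes and names are assumptions.
import torch

def regress_depth(prob_volume: torch.Tensor, depth_values: torch.Tensor) -> torch.Tensor:
    """prob_volume: (B, D, H, W), normalized over the D depth planes.
    depth_values: (D,) candidate plane depths. Returns (B, H, W) expected depth."""
    return torch.einsum("bdhw,d->bhw", prob_volume, depth_values)

def backproject_centers(depth: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """depth: (B, H, W); K: (3, 3) intrinsics. Returns (B, H, W, 3) splat centers
    in camera coordinates, one pixel-aligned Gaussian per pixel."""
    B, H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)   # (H, W, 3) homogeneous pixels
    rays = pix @ torch.linalg.inv(K).T                       # (H, W, 3) camera rays
    return rays.unsqueeze(0) * depth.unsqueeze(-1)           # scale rays by expected depth

# Toy example: D = 8 fine-stage planes on a 4x4 patch
D, H, W = 8, 4, 4
prob = torch.softmax(torch.randn(1, D, H, W), dim=1)
depths = torch.linspace(0.5, 2.0, D)
K = torch.tensor([[500.0, 0.0, 2.0], [0.0, 500.0, 2.0], [0.0, 0.0, 1.0]])
centers = backproject_centers(regress_depth(prob, depths), K)   # (1, 4, 4, 3)
```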

Splatography (alternatively titled SparSplat; Azzarelli et al., 7 Nov 2025) extends the GS paradigm to dynamic, multi-view-sparse scenarios, decoupling foreground (dynamic) and background (static/quasi-static) scene components using sparse, per-view binary masks and independent deformation fields.

Pipeline overview (a mask-splitting sketch follows the list):

  • Initialization: Coarse 3D Gaussian point cloud via Mip-Splatting, yielding $\approx$50,000 points for model economy.
  • Foreground/Background Split: By projecting Gaussians into image and mask space, the sets $G_f$ (foreground) and $G_b$ (background) are defined—background points must lie outside the mask in at least one view.
  • Canonical Pre-Training ($t=0$): $G_f$ and $G_b$ are optimized with mask-specific loss functions, suppressing floaters and cross-bleeding while anchoring to initial-frame data only.
  • Dynamic Fine-Tuning: Hex-plane networks model deformation fields—$\Lambda_f$ yields translation, rotation, and color drift for $G_f$; $\Lambda_b$ encodes only translation for $G_b$. A temporal Gaussian opacity profile provides transient, time-localized support for dynamic points.
  • Reference-free Foreground Densification: Points with large temporal displacement are cloned to ensure high-frequency motion is adequately captured.
  • Dynamic Rendering and Evaluation: Alpha-blended splatting with per-point opacity and covariance, evaluating via photometric, SSIM, and LPIPS losses.
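
As a concrete illustration of the mask-driven split, the sketch below assigns Gaussians to foreground or background by projecting their centers into each view and testing membership in the binary masks. The projection helper, tensor shapes, and rounding details are assumptions for illustration, not the authors' code; the decision rule follows the text above (a point is background if it falls outside the mask in at least one view).

```python
# Minimal sketch (PyTorch): foreground/background assignment from per-view
# binary masks at t = 0. Helper names and shapes are illustrative assumptions.
import torch

def project(points: torch.Tensor, K: torch.Tensor, w2c: torch.Tensor) -> torch.Tensor:
    """points: (N, 3) world coordinates -> (N, 2) pixel coordinates for one camera.
    K: (3, 3) intrinsics; w2c: (4, 4) world-to-camera transform."""
    pts_h = torch.cat([points, torch.ones(points.shape[0], 1)], dim=1)   # (N, 4)
    cam = (w2c @ pts_h.T).T[:, :3]                                       # (N, 3) camera coords
    pix = (K @ cam.T).T                                                  # (N, 3)
    return pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)

def split_foreground_background(points, masks, Ks, w2cs):
    """masks: list of (H, W) bool tensors, one per view. Returns a bool tensor
    is_fg of shape (N,): True only if the point projects inside every mask."""
    is_fg = torch.ones(points.shape[0], dtype=torch.bool)
    for mask, K, w2c in zip(masks, Ks, w2cs):
        uv = project(points, K, w2c).round().long()
        H, W = mask.shape
        inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        in_mask = torch.zeros_like(is_fg)
        in_mask[inside] = mask[uv[inside, 1], uv[inside, 0]]
        is_fg &= in_mask     # outside the mask in any view => background
    return is_fg
```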

2. Mathematical Formulation of Gaussian Splatting

The unifying principle of both frameworks is the use of Gaussian primitives as surface elements (a short splatting sketch follows this list):

  • 2D Gaussian Splatting (2DGS) (Jena et al., 4 May 2025): Each “splat” is a planar anisotropic Gaussian in the rendered image plane, defined with local tangent coordinates. The Gaussian profile for pixel $x=(x,y)$ is $G(x)=\exp\!\left(-\tfrac{1}{2}\left[u(x)^2+v(x)^2\right]\right)$, with per-splat opacity modulation $\alpha'(x)=\alpha\cdot G(x)$. The color contribution is $\alpha'(x)\cdot c$.
  • Compositing: For each pixel, $C(x)=\sum_{i=1}^{M} \alpha'_i(x)\,c_i \prod_{j<i}\left(1-\alpha'_j(x)\right)$, where $M$ is the number of overlapping splats.
  • 3D Gaussian Splatting (3DGS) (Azzarelli et al., 7 Nov 2025): Each primitive $g_i$ is parameterized by its center $x_i\in\mathbb R^3$, scale $s_i\in\mathbb R^3$, quaternion $r_i\in\mathbb R^4$, color $c_i\in\mathbb R^3$, and opacity $\sigma_i$. The covariance is $\Sigma_i=R_i S_i S_i^\top R_i^\top$, with $R_i$ the rotation matrix. The alpha-blending rule follows 2DGS but with covariances projected into image space.
  • Dynamic Profile: For Splatography, opacity is parameterized as a peaked function, $\sigma_i(t) = h_i\cdot \exp\!\left[-\omega_i^2\,|t-\mu_i|^2\right]$, with $h_i$, $\omega_i$, $\mu_i$ learned per Gaussian, providing temporal localization.
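
The following sketch ties these formulas together: it builds a 3DGS covariance from scale and quaternion, composites depth-sorted splats at a single pixel, and evaluates the temporal opacity profile. It is a minimal, self-contained illustration under assumed variable names, not the papers' renderers.

```python
# Minimal sketch (PyTorch) of the shared splatting machinery. Toy values and
# names are illustrative assumptions.
import torch

def quat_to_rotmat(q: torch.Tensor) -> torch.Tensor:
    """q = (w, x, y, z), assumed normalized. Returns a 3x3 rotation matrix."""
    w, x, y, z = q.tolist()
    return torch.tensor([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)]])

def covariance(scale: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Sigma = R S S^T R^T with S = diag(scale)."""
    R, S = quat_to_rotmat(q), torch.diag(scale)
    return R @ S @ S.T @ R.T

def composite_pixel(alphas: torch.Tensor, colors: torch.Tensor) -> torch.Tensor:
    """alphas: (M,) opacities alpha'_i(x), sorted front to back; colors: (M, 3).
    Implements C(x) = sum_i alpha'_i c_i prod_{j<i} (1 - alpha'_j)."""
    transmittance = torch.cumprod(torch.cat([torch.ones(1), 1 - alphas[:-1]]), dim=0)
    return (transmittance * alphas).unsqueeze(-1).mul(colors).sum(dim=0)

def temporal_opacity(t, h, omega, mu):
    """sigma_i(t) = h_i * exp(-omega_i^2 |t - mu_i|^2), learned per Gaussian."""
    return h * torch.exp(-(omega ** 2) * (t - mu) ** 2)

# Toy example: identity-rotation covariance, three overlapping splats, one time query
print(covariance(torch.tensor([0.1, 0.2, 0.05]), torch.tensor([1.0, 0.0, 0.0, 0.0])))
print(composite_pixel(torch.tensor([0.6, 0.5, 0.9]),
                      torch.tensor([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])))
print(temporal_opacity(torch.tensor(0.3), torch.tensor(1.0),
                       torch.tensor(4.0), torch.tensor(0.5)))
```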

3. Loss Functions, Training Regimes, and Supervision

For the static and dynamic variants (a loss-computation sketch follows this list):

  • Static/Feed-Forward SparSplat (Jena et al., 4 May 2025): Multi-objective loss $L^k = L_{\mathrm{mse}} + \lambda_s L_{\mathrm{ssim}} + \lambda_p L_{\mathrm{perc}} + \lambda_d L_d + \lambda_n L_n + \lambda_{\mathrm{depth}} L_{\mathrm{depth}}$ per stage $k$. Key losses include:
    • $L_{\mathrm{mse}}$: Mean-squared error image loss.
    • $L_{\mathrm{ssim}}$: Structural similarity loss.
    • $L_{\mathrm{perc}}$: LPIPS perceptual loss.
    • $L_{\mathrm{depth}}$: Per-pixel absolute depth error.
    • $L_d$: Depth-distortion loss (mip-NeRF style), concentrating opacity along rays.
    • $L_n$: Splat normal alignment loss.
    • Loss aggregation across coarse-to-fine stages with learned weights.
  • Dynamic Splatography (Azzarelli et al., 7 Nov 2025):
    • Canonical Stage: The foreground loss employs virtual background color blending to prevent floaters, $\mathcal{L}_f = \big\|\,[M^*\odot I^* + (1-M^*)\odot B] - [\alpha_f\odot I_f + (1-\alpha_f)\odot B]\,\big\|_2$. The background loss supervises the region inside the mask with a blurred copy of the ground truth to suppress bleed-in, $\mathcal{L}_b = \big\|\,[(1-M^*)\odot I^* + M^*\odot\tilde I^{\,b}] - I_b\,\big\|_2$.
    • Dynamic Fine-Tuning: Panoptic photometric loss, $\mathcal{L}_{\mathrm{photo}} = \sum_t \big\|I_t^* - I_t\big(G'_f(t),\,G'_b(t)\big)\big\|_2$.
    • Regularization: Opacity peaks and bandwidths are regularized as $\mathcal{L}_{h,\omega} = \lambda_h\,|1-h_i| + \lambda_\omega\,|\omega_i|$ to ensure temporal consistency.
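
A minimal sketch of the canonical-stage masked losses is given below, following the two formulas above. The constant virtual background colour, the average-pooling blur, and the squared-error reduction are assumptions made for illustration; the released training code may differ.

```python
# Minimal sketch (PyTorch) of the canonical-stage masked losses.
# B (virtual background colour), the blur, and tensor names are assumptions.
import torch
import torch.nn.functional as F

def foreground_loss(I_gt, M, I_f, alpha_f, B):
    """L_f: blend both ground truth and rendering over a shared virtual background B,
    so background pixels cannot reward stray foreground splats."""
    target = M * I_gt + (1.0 - M) * B
    pred = alpha_f * I_f + (1.0 - alpha_f) * B
    return (target - pred).pow(2).mean()

def background_loss(I_gt, M, I_b, kernel_size=9):
    """L_b: inside the foreground mask, supervise with a blurred copy of the
    ground truth so background splats are not pulled into the dynamic region."""
    blur = F.avg_pool2d(I_gt, kernel_size, stride=1, padding=kernel_size // 2)
    target = (1.0 - M) * I_gt + M * blur
    return (target - I_b).pow(2).mean()

# Shapes: images (1, 3, H, W); mask (1, 1, H, W) broadcast over channels.
H = W = 8
I_gt = torch.rand(1, 3, H, W)
M = (torch.rand(1, 1, H, W) > 0.5).float()
B = torch.full_like(I_gt, 0.5)          # constant virtual background colour (assumption)
print(foreground_loss(I_gt, M, torch.rand_like(I_gt), torch.rand(1, 1, H, W), B))
print(background_loss(I_gt, M, torch.rand_like(I_gt)))
```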

Supervision demands are minimized in Splatography, requiring only a single binary mask per view at $t=0$.

4. Network Architectures and Feature Integration

  • Feed-Forward SparSplat (Jena et al., 4 May 2025) employs a four-scale, 256-channel-wide FPN backbone (ImageNet-pretrained), with depth and attribute branches comprising 3D convolutions over the cost volume (64→8 depth planes, coarse to fine) and compact per-pixel CNN/MLP heads for splat regression.
  • Feature Augmentation: DINOv2 monocular features (384-dim) and MASt3R pairwise correspondences (24-dim/view-pair) are concatenated for richer cross-view alignment.
  • Splatography uses Mip-Splatting or coarse volumetric techniques for point initialization, followed by hex-plane neural networks that model spatio-temporal deformations across the separated foreground/background splat populations (see the deformation-query sketch below).
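
The deformation-field query can be sketched as a hex-plane lookup: six learned 2D feature planes over coordinate pairs of $(x,y,z,t)$ are bilinearly sampled, fused, and decoded by a small MLP into per-Gaussian offsets. Plane resolution, product fusion, and the decoder below are assumptions, not the Splatography architecture verbatim.

```python
# Minimal sketch (PyTorch) of a hex-plane deformation query. Sizes, product
# fusion, and the MLP decoder are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HexPlaneDeform(nn.Module):
    def __init__(self, feat_dim=16, res=32, out_dim=3):
        super().__init__()
        # One (1, C, res, res) plane per coordinate pair: xy, xz, yz, xt, yt, zt
        self.planes = nn.ParameterList(
            [nn.Parameter(torch.randn(1, feat_dim, res, res) * 0.1) for _ in range(6)])
        self.pairs = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]
        self.decoder = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                     nn.Linear(64, out_dim))

    def forward(self, xyzt: torch.Tensor) -> torch.Tensor:
        """xyzt: (N, 4), coordinates normalized to [-1, 1]. Returns (N, out_dim)
        deformation (e.g. translation only for background Gaussians)."""
        feat = 1.0
        for plane, (a, b) in zip(self.planes, self.pairs):
            grid = xyzt[:, [a, b]].view(1, -1, 1, 2)                  # (1, N, 1, 2)
            sampled = F.grid_sample(plane, grid, align_corners=True)  # (1, C, N, 1)
            feat = feat * sampled.squeeze(0).squeeze(-1).T            # (N, C), product fusion
        return self.decoder(feat)

deform = HexPlaneDeform()
offsets = deform(torch.rand(100, 4) * 2 - 1)    # query 100 Gaussians at a given time
```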

5. Quantitative Benchmarks and Comparative Performance

Detailed results are reported against established baselines:

3-view Surface Reconstruction on DTU

| Method | Mean Chamfer Distance (mm) | Inference Time |
|---|---|---|
| COLMAP | 1.52 | ~10 s (MVS + fusion) |
| SparseNeuS | 1.27 | ~30 s |
| VolRecon | 1.38 | ~31 s |
| ReTR | 1.17 | ~37 s |
| GeoTransfer | 1.12 | ~32 s |
| UfoRecon | 1.05 | ~66 s |
| SparSplat | 1.04 | 0.8 s |

SparSplat achieves state-of-the-art mean Chamfer distance with nearly two orders of magnitude lower inference time compared to volumetric or implicit models.

Novel-View Synthesis on DTU (3 input views)

| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| IBRNet | 26.04 | 0.917 | 0.191 |
| MVSNeRF | 26.63 | 0.931 | 0.168 |
| ENeRF | 27.61 | 0.957 | 0.089 |
| MVSGaussian | 28.21 | 0.963 | 0.076 |
| SparSplat | 28.33 | 0.938 | 0.073 |

Generalization is demonstrated on BlendedMVS and Tanks & Temples datasets, with visual mesh quality and novel-view synthesis on par or superior to competing approaches.

3D ViVo Dataset (Dynamic Cinema Scenes)

| Model | PSNR (full) | PSNR (mask) | Model Size (MB) |
|---|---|---|---|
| 4D-GS | 14.22 | 21.22 | 134–320 |
| STG | 13.83 | 21.72 | 134–320 |
| SC-GS | 13.81 | 20.57 | 134–320 |
| SparSplat | 16.05 | 24.80 | 60 |

2.5D DyNeRF Dataset

| Model | PSNR (full) | PSNR (mask) | Model Size (MB) |
|---|---|---|---|
| 4D-GS | 24.51 | 26.45 | 34–119 |
| Waveplanes | 24.32 | 26.56 | 34–119 |
| ITGS | 21.95 | 24.93 | 34–119 |
| SparSplat | 24.41 | 26.28 | 47 |

Qualitative reconstructions for dynamic, semi-transparent props (fire, smoke) and unmasked foreground segmentation consistently outperform prior GS-based methods.

6. Ablative and Analytical Results

  • Replacing 2DGS with 3DGS (as in MVSGaussian) in SparSplat leads to suboptimal TSDF fusions and disconnected surfaces (Jena et al., 4 May 2025).
  • Removal of depth supervision in SparSplat increases mean Chamfer distance by roughly 71.8%.
  • Feature ablation shows that concatenating DINOv2 monocular features and MASt3R correspondences reduces mean Chamfer distance by 0.85% and 11.1% respectively.
  • Splatography’s sparse mask protocol achieves competitive segmentation and dynamic reconstruction without dense mask supervision, with significant parameter and storage savings.

7. Limitations and Applicability

SparSplat methods, while notably efficient and practical for sparse-input and dynamic scenes, exhibit limitations including:

  • Residual difficulty disentangling large rapid foreground motions from view-dependent radiance/shadows, particularly in ambiguous lighting or occluded regions.
  • Absence of explicit depth priors in Splatography may yield geometric ambiguities in cases of extreme sparsity.
  • For static scenes, the per-pixel 2DGS regression in feed-forward SparSplat is directly limited by the fidelity of neural feature alignment and cannot benefit from post-hoc, per-scene optimization.
  • Splatography’s background segmentation presumes relatively static, separable backgrounds—a plausible assumption in filmmaking, but potentially restrictive elsewhere.

Despite these caveats, SparSplat architectures demonstrate state-of-the-art performance for fast, high-fidelity scene reconstruction and novel view synthesis in both static and dynamic, sparse-view scenarios (Jena et al., 4 May 2025, Azzarelli et al., 7 Nov 2025).
