QuickSplat: Fast 3D Surface Reconstruction
- QuickSplat is a data-driven 3D surface reconstruction method that uses learned Gaussian initialization and densification to achieve rapid and accurate indoor scene modeling.
- It replaces slow, heuristic-based per-scene optimization with an end-to-end pipeline that jointly minimizes photometric, depth, occupancy, and distortion losses.
- Experimental results show QuickSplat reduces depth errors by up to 48% and is over 300× faster than competing methods while preserving fine scene details.
QuickSplat is a data-driven 3D surface reconstruction method that leverages learned Gaussian initialization and densification to accelerate and enhance large-scale indoor scene modeling from posed multi-view RGB images. By replacing hand-crafted heuristics and slow per-scene optimization with learned, feed-forward initialization and densification networks, QuickSplat enables high-accuracy surface reconstruction within seconds, addressing the long-standing challenges of under-observed or textureless regions in traditional pipelines (Liu et al., 8 May 2025).
1. Problem Context and Limitations of Prior Approaches
Surface reconstruction from RGB images is foundational to applications in computer vision, graphics, mixed reality, and robotics. Existing methods can be divided into volumetric/implicit-function approaches (such as NeRF, UNISURF, and MonoSDF) and surface-oriented Gaussian Splatting methods (e.g., 3DGS, 2DGS, SuGaR, GS2Mesh). Volumetric and implicit-function approaches optimize multilayer perceptrons (MLPs) over hundreds of thousands of scene-specific gradient-descent steps (30 min to 10 h for room-scale scenes), yet remain susceptible to artifacts such as floating geometry and holes in large, untextured, or occluded regions.
Gaussian Splatting methods accelerate novel-view synthesis by representing the scene as a sparse set of 3D Gaussian primitives, reducing render time but retaining the bottleneck of per-scene, heuristic-guided densification, which often leaves holes and curved artifacts on flat structures such as walls and ceilings. Heuristic densification (“grow near high-error rays”) is insufficiently robust in under-observed or textureless regions and can result in incomplete or inaccurate reconstructions (Liu et al., 8 May 2025).
2. Data-Driven Gaussian Initialization
QuickSplat replaces the standard pipeline's sparse, rule-based initialization with a learned, dense Gaussian initialization trained end-to-end. The initializer network predicts a well-distributed set of 2D Gaussians directly from a sparse Structure-from-Motion (SfM) point cloud in a single feed-forward pass.
2.1. Surface Representation
- Each scene is represented as a set of Gaussian “splats” $\{g_i\}$, where each $g_i$ encodes a center $\boldsymbol{\mu}_i \in \mathbb{R}^3$, a scale $\mathbf{s}_i$ (defining the covariance $\Sigma_i$), a rotation (quaternion $\mathbf{q}_i$), an opacity $o_i \in [0,1]$, and a diffuse color $\mathbf{c}_i$.
- Rendering projects each Gaussian onto the image plane as an elliptical disk and alpha-blends the per-splat contributions along each ray (see the compositing equations below).
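In standard Gaussian-splatting notation (generic symbols rather than the paper's exact ones), the per-splat contribution at a pixel and the front-to-back compositing along a ray take the form

$$\alpha_i(\mathbf{x}) = o_i \exp\!\left(-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu}_i)^{\top}\Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\right), \qquad \mathbf{C}(\mathbf{x}) = \sum_{i} \mathbf{c}_i\,\alpha_i(\mathbf{x}) \prod_{j<i}\bigl(1-\alpha_j(\mathbf{x})\bigr),$$

with splats sorted front-to-back along the ray; the depth and normal maps used by the losses in Section 2.3 can be composited analogously.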
2.2. Decoder-Style Initialization
- The sparse SfM point cloud is voxelized onto a 3D grid (voxel side length 4 cm); occupied voxels are assigned learnable 64-dimensional features.
- A sparse 3D U-Net encoder–decoder with 4 down/upsampling layers predicts densified features, guided by an occupancy head at each upsampling stage.
- A small MLP decodes each voxel feature into Gaussians (two splats per voxel); positions are decoded as bounded offsets from the corresponding voxel center, with scales, rotations, opacities, and colors predicted from the same feature (see the sketch below).
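As a concrete, non-authoritative illustration, a minimal sketch of such a per-voxel decoding head is shown below. It assumes a tanh-bounded position offset and standard activations; all names and dimensions are illustrative, and in the actual method the features come from the sparse 3D U-Net rather than random tensors.

```python
# Minimal sketch of the per-voxel Gaussian decoding head (hypothetical names).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOXEL_SIZE = 0.04        # 4 cm voxels
SPLATS_PER_VOXEL = 2
FEAT_DIM = 64
# per 2D splat: 3 (position offset) + 2 (scale) + 4 (rotation quaternion)
#               + 1 (opacity) + 3 (color) = 13 parameters (assumed layout)
PARAMS_PER_SPLAT = 13

class SplatDecoder(nn.Module):
    """Decodes each 64-dim voxel feature into two 2D Gaussian splats."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(FEAT_DIM, 128), nn.ReLU(),
            nn.Linear(128, SPLATS_PER_VOXEL * PARAMS_PER_SPLAT),
        )

    def forward(self, feats, voxel_centers):
        # feats: (V, 64) features of occupied voxels; voxel_centers: (V, 3)
        raw = self.mlp(feats).view(-1, SPLATS_PER_VOXEL, PARAMS_PER_SPLAT)
        offset, scale, quat, opacity, color = raw.split([3, 2, 4, 1, 3], dim=-1)
        # Keep positions near the voxel center via a bounded offset (assumption).
        position = voxel_centers[:, None, :] + torch.tanh(offset) * VOXEL_SIZE
        return {
            "position": position,                       # (V, 2, 3)
            "scale": torch.exp(scale),                  # positive scales
            "rotation": F.normalize(quat, dim=-1),      # unit quaternions
            "opacity": torch.sigmoid(opacity),
            "color": torch.sigmoid(color),
        }

decoder = SplatDecoder()
splats = decoder(torch.randn(10, FEAT_DIM), torch.rand(10, 3))
```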
2.3. Training Losses
The initializer loss combines photometric consistency ($\mathcal{L}_{\mathrm{rgb}}$), depth accuracy ($\mathcal{L}_{\mathrm{depth}}$), occupancy ($\mathcal{L}_{\mathrm{occ}}$), normal alignment ($\mathcal{L}_{\mathrm{normal}}$), and a distortion regularizer ($\mathcal{L}_{\mathrm{dist}}$) into a weighted training objective of the form

$$\mathcal{L}_{\mathrm{init}} = \mathcal{L}_{\mathrm{rgb}} + \lambda_{\mathrm{depth}}\,\mathcal{L}_{\mathrm{depth}} + \lambda_{\mathrm{occ}}\,\mathcal{L}_{\mathrm{occ}} + \lambda_{\mathrm{normal}}\,\mathcal{L}_{\mathrm{normal}} + \lambda_{\mathrm{dist}}\,\mathcal{L}_{\mathrm{dist}},$$

where the $\lambda$ coefficients weight the individual terms.
Training is conducted on ScanNet++ (902 training scenes), directly optimizing the quality of scene initialization for robust downstream refinement.
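For concreteness, such a weighted combination can be assembled as below; the weights are placeholders, not values from the paper.

```python
# Sketch: weighted sum of the initializer's loss terms.
# The weights are illustrative placeholders, not the paper's values.
LOSS_WEIGHTS = {"rgb": 1.0, "depth": 0.1, "occ": 0.1, "normal": 0.05, "dist": 0.01}

def initializer_loss(losses):
    """losses: dict mapping 'rgb', 'depth', 'occ', 'normal', 'dist' to scalar tensors."""
    return sum(LOSS_WEIGHTS[k] * losses[k] for k in LOSS_WEIGHTS)
```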
3. Iterative Densification and Joint Optimization
After initialization, QuickSplat employs a sequence of learned “optimization” steps, each refining scene geometry and introducing new splats in under-modeled areas.
3.1. Pipeline Overview
- Rendering Gradients: For all training images, compute and aggregate the gradient of the total loss (photometric, depth, distortion) with respect to each voxel feature.
- Learned Densifier: A sparse 3D CNN ingests the current voxel features, their aggregated rendering gradients, and the step index $t$, and predicts candidate new voxel features; a fixed budget of new voxels is importance-sampled from these candidates at each step.
- Learned Optimizer: The existing and newly densified voxel features are combined (with zero gradients concatenated for the new features) and refined jointly by a sparse 3D U-Net; the predicted feature update is bounded (e.g., via a tanh on the residual) to keep refinement stable (see the loop sketch below).
- Loss and Gradient Detachment: Densifier and optimizer are jointly supervised at each step; gradients are detached between steps, following meta-learning protocols.
The learned densifier eliminates heuristic densification, enabling robust, data-driven expansion in regions where photometric information is limited or ambiguous.
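A schematic sketch of one such refinement loop is shown below under simplifying assumptions: the sparse 3D networks are stood in for by small dense MLPs over per-voxel features, importance sampling of new voxels is omitted, and the rendering loss is a placeholder. All names are illustrative rather than the paper's.

```python
# Illustrative densify-then-optimize loop (all names hypothetical).
import torch
import torch.nn as nn

FEAT_DIM = 64

class LearnedDensifier(nn.Module):
    """Predicts candidate new voxel features from features, gradients, and step index."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM * 2 + 1, 128), nn.ReLU(),
                                 nn.Linear(128, FEAT_DIM))

    def forward(self, feats, grads, step):
        step_col = torch.full_like(feats[:, :1], float(step))
        return self.net(torch.cat([feats, grads, step_col], dim=-1))

class LearnedOptimizer(nn.Module):
    """Predicts a bounded update for all voxel features (existing + newly added)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM * 2, 128), nn.ReLU(),
                                 nn.Linear(128, FEAT_DIM))

    def forward(self, feats, grads):
        return feats + torch.tanh(self.net(torch.cat([feats, grads], dim=-1)))

def render_loss(feats):
    # Placeholder for rendering the splats decoded from `feats` over the training
    # views and summing the photometric, depth, and distortion losses.
    return (feats ** 2).mean()

def refine(feats, densifier, optimizer, num_steps=5):
    for t in range(num_steps):
        # 1. Aggregate rendering gradients w.r.t. the current voxel features;
        #    detaching here mirrors the gradient detachment between steps.
        feats = feats.detach().requires_grad_(True)
        grads, = torch.autograd.grad(render_loss(feats), feats)
        # 2. Densifier proposes new voxel features in under-modeled regions.
        new_feats = densifier(feats, grads, t)
        # 3. Learned optimizer refines all features; new features get zero gradients.
        all_feats = torch.cat([feats, new_feats], dim=0)
        all_grads = torch.cat([grads, torch.zeros_like(new_feats)], dim=0)
        feats = optimizer(all_feats, all_grads)
    return feats

refined = refine(torch.randn(100, FEAT_DIM), LearnedDensifier(), LearnedOptimizer())
```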
4. Experimental Protocol and Performance
4.1. Training and Inference Details
- Dataset: ScanNet++ (902 training scenes, 20 test scenes).
- Image Resolution: for training; for testing.
- Optimizer: Adam, with gradient accumulation over 100 random views per loop.
- Network Scale: M parameters; 2 Gaussians per 4 cm voxel.
- Training Regime: about 3 days on an RTX A6000, with the densifier and optimizer trained jointly over 5 unrolled steps per loop.
- Optional Fine-Tuning: 2,000 steps of vanilla SGD on all splats (26 s); not strictly necessary for high performance.
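A minimal sketch of this optional fine-tuning stage is given below; the function names, learning rate, and toy usage are illustrative, with the rendering-and-loss computation left as a caller-supplied callable.

```python
# Sketch of the optional per-scene fine-tuning: a short run of plain SGD on
# all splat parameters (names and hyperparameters are illustrative).
import torch

def finetune(splat_params, render_and_loss, steps=2000, lr=1e-3):
    """splat_params: dict of tensors (positions, scales, rotations, opacities, colors)."""
    params = [p.requires_grad_(True) for p in splat_params.values()]
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = render_and_loss(splat_params)   # photometric + depth + distortion
        loss.backward()
        opt.step()
    return splat_params

# Toy usage with a stand-in loss.
toy = {"positions": torch.randn(5, 3), "opacities": torch.rand(5, 1)}
finetune(toy, lambda p: sum(v.pow(2).mean() for v in p.values()), steps=10)
```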
4.2. Quantitative Results
QuickSplat’s performance is benchmarked against SuGaR, 2DGS, GS2Mesh, and MonoSDF. Metrics on ScanNet++ (20 test scenes) are summarized:
| Method | Depth AbsErr (m) | Acc@2cm | Acc@5cm | Acc@10cm | Chamfer Dist. | Time |
|---|---|---|---|---|---|---|
| SuGaR | 0.2061 | 0.1157 | 0.2774 | 0.4794 | 0.2078 | 3,130s |
| 2DGS | 0.1127 | 0.4021 | 0.6027 | 0.7422 | 0.2420 | 1,796s |
| GS2Mesh | 0.1212 | 0.4028 | 0.6039 | 0.7406 | 0.2012 | 973s |
| MonoSDF | 0.0569 | 0.5774 | 0.8006 | 0.8850 | 0.1450 | >10h |
| QuickSplat w/o opt | 0.0732 | 0.5263 | 0.7674 | 0.8583 | 0.1461 | 26s |
| QuickSplat w/ opt | 0.0578 | 0.5783 | 0.8035 | 0.8887 | 0.1347 | 124s |
Key outcomes:
- Depth error reductions of up to 48% relative to the best Gaussian-splatting baseline (2DGS: 0.1127 m → 0.0578 m), with even larger reductions versus SuGaR and GS2Mesh.
- Roughly 8× speedup over GS2Mesh (973 s → 124 s) and about 14× over 2DGS; QuickSplat w/ opt is around 300× faster than MonoSDF (>10 h) with matching accuracy.
- Even without opt (26 s), QuickSplat is roughly 70× faster than 2DGS and more accurate than all Gaussian-splatting baselines.
5. Qualitative Analysis and Observations
QuickSplat consistently reconstructs flat wall regions with high geometric fidelity, eliminating the “floating” or curved patches common in baselines. The learned initializer and densifier enable rapid closure of holes in under-observed or textureless areas, such as plain white corners, and suppress floating fragments. Fine scene details (e.g., ladders, chair legs), often missed by monocular-depth-guided volumetric methods, are preserved thanks to the flexible, spatially adaptive Gaussian representation (Liu et al., 8 May 2025).
A plausible implication is that learned Gaussian representations generalize better across scene structures, particularly in challenging illumination and sparsity regimes.
6. Limitations and Future Directions
Mirrors and specular surfaces continue to challenge QuickSplat, resulting in occasional ghost geometry due to the reliance on photometric losses, which favor reconstructing reflections rather than true geometry. The method assumes static scenes and does not model dynamic or semi-dynamic objects. While QuickSplat greatly accelerates the reconstruction pipeline (to seconds or minutes), it is not yet real-time.
Prospective research directions include:
- Extension to dynamic/semi-dynamic settings via per-frame Gaussian updates.
- Integration into live SLAM pipelines, such as SplaTAM, for on-the-fly 3D reconstruction.
- Incorporation of more powerful, expressive data-driven priors such as learned shape grammars, potentially permitting effective initialization from even sparser observations.
7. Summary and Impact
QuickSplat demonstrates that learning both initialization and densification of Gaussian splats, as opposed to relying on per-scene gradient descent and manual heuristics, provides state-of-the-art performance for 3D surface reconstruction of large indoor environments. The method achieves orders-of-magnitude acceleration, significantly reduces depth errors in challenging regions, and produces high-fidelity geometry. Its modular design supports both rapid batch and incremental scene updates, providing a viable platform for future research in scene reconstruction and real-time spatial understanding (Liu et al., 8 May 2025).