QuickSplat: Fast 3D Surface Reconstruction
- QuickSplat is a data-driven 3D surface reconstruction method that uses learned Gaussian initialization and densification to achieve rapid and accurate indoor scene modeling.
- It replaces slow, heuristic-based per-scene optimization with an end-to-end pipeline that jointly minimizes photometric, depth, occupancy, and distortion losses.
- Experimental results show QuickSplat reduces depth errors by up to 48% and is over 300× faster than competing methods while preserving fine scene details.
QuickSplat is a data-driven 3D surface reconstruction method that leverages learned Gaussian initialization and densification to accelerate and enhance large-scale indoor scene modeling from posed multi-view RGB images. By replacing hand-crafted heuristics and slow per-scene optimization with learned, feed-forward initialization and densification networks, QuickSplat enables high-accuracy surface reconstruction within seconds, addressing the long-standing challenges of under-observed or textureless regions in traditional pipelines (Liu et al., 8 May 2025).
1. Problem Context and Limitations of Prior Approaches
Surface reconstruction from RGB images is foundational to applications in computer vision, graphics, mixed reality, and robotics. Existing methods can be divided into volumetric/implicit-function approaches (such as NeRF, UNISURF, and MonoSDF) and surface-oriented Gaussian Splatting methods (e.g., 3DGS, 2DGS, SuGaR, GS2Mesh). Volumetric and implicit-function approaches optimize multilayer perceptrons (MLPs) over hundreds of thousands of scene-specific gradient-descent steps (30 min to 10 h for room-scale scenes), yet remain susceptible to artifacts such as floating geometry and holes in large, untextured, or occluded regions.
Gaussian Splatting methods accelerate novel-view synthesis by representing the scene as a sparse set of 3D Gaussian primitives, reducing render time but retaining the bottleneck of per-scene, heuristic-guided densification, which often leaves holes and curved artifacts on flat structures such as walls and ceilings. Heuristic densification (“grow near high-error rays”) is insufficiently robust in under-observed or textureless regions and can result in incomplete or inaccurate reconstructions (Liu et al., 8 May 2025).
2. Data-Driven Gaussian Initialization
QuickSplat replaces the standard pipeline's sparse, rule-based initialization with a learned, dense Gaussian initialization trained end-to-end. The initializer network predicts a well-distributed set of 2D Gaussians directly from a sparse Structure-from-Motion (SfM) point cloud in a single feed-forward pass.
2.1. Surface Representation
- Each scene is represented as a set of Gaussian “splats” $\{g_i\}$, where each $g_i$ encodes a center $\boldsymbol{\mu}_i \in \mathbb{R}^3$, a scale $\mathbf{s}_i$ (defining the covariance $\Sigma_i$), a rotation (quaternion $\mathbf{q}_i$), an opacity $o_i \in [0,1]$, and a diffuse color $\mathbf{c}_i$.
- Rendering projects each Gaussian onto the image plane as an elliptical disk and alpha-blends the per-splat contributions along each ray (see the compositing equations below).
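In standard Gaussian-splatting notation (generic symbols rather than the paper's exact ones), the per-splat contribution at a pixel and the front-to-back compositing along a ray take the form

$$\alpha_i(\mathbf{x}) = o_i \exp\!\left(-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu}_i)^{\top}\Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\right), \qquad \mathbf{C}(\mathbf{x}) = \sum_{i} \mathbf{c}_i\,\alpha_i(\mathbf{x}) \prod_{j<i}\bigl(1-\alpha_j(\mathbf{x})\bigr),$$

with splats sorted front-to-back along the ray; the depth and normal maps used by the losses in Section 2.3 can be composited analogously.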
2.2. Decoder-Style Initialization
- The sparse SfM point cloud is voxelized onto a 3D grid (voxel side length 4 cm); occupied voxels are assigned learnable 64-dimensional features.
- A sparse 3D U-Net encoder–decoder with 4 down/upsampling layers predicts densified features, guided by an occupancy head at each upsampling stage.
- A small MLP decodes each voxel feature into Gaussians (two splats per voxel); positions are decoded as bounded offsets from the corresponding voxel center, with scales, rotations, opacities, and colors predicted from the same feature (see the sketch below).
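As a concrete, non-authoritative illustration, a minimal sketch of such a per-voxel decoding head is shown below. It assumes a tanh-bounded position offset and standard activations; all names and dimensions are illustrative, and in the actual method the features come from the sparse 3D U-Net rather than random tensors.

```python
# Minimal sketch of the per-voxel Gaussian decoding head (hypothetical names).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOXEL_SIZE = 0.04        # 4 cm voxels
SPLATS_PER_VOXEL = 2
FEAT_DIM = 64
# per 2D splat: 3 (position offset) + 2 (scale) + 4 (rotation quaternion)
#               + 1 (opacity) + 3 (color) = 13 parameters (assumed layout)
PARAMS_PER_SPLAT = 13

class SplatDecoder(nn.Module):
    """Decodes each 64-dim voxel feature into two 2D Gaussian splats."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(FEAT_DIM, 128), nn.ReLU(),
            nn.Linear(128, SPLATS_PER_VOXEL * PARAMS_PER_SPLAT),
        )

    def forward(self, feats, voxel_centers):
        # feats: (V, 64) features of occupied voxels; voxel_centers: (V, 3)
        raw = self.mlp(feats).view(-1, SPLATS_PER_VOXEL, PARAMS_PER_SPLAT)
        offset, scale, quat, opacity, color = raw.split([3, 2, 4, 1, 3], dim=-1)
        # Keep positions near the voxel center via a bounded offset (assumption).
        position = voxel_centers[:, None, :] + torch.tanh(offset) * VOXEL_SIZE
        return {
            "position": position,                       # (V, 2, 3)
            "scale": torch.exp(scale),                  # positive scales
            "rotation": F.normalize(quat, dim=-1),      # unit quaternions
            "opacity": torch.sigmoid(opacity),
            "color": torch.sigmoid(color),
        }

decoder = SplatDecoder()
splats = decoder(torch.randn(10, FEAT_DIM), torch.rand(10, 3))
```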
2.3. Training Losses
The initializer loss combines photometric consistency ($\mathcal{L}_{\mathrm{rgb}}$), depth accuracy ($\mathcal{L}_{\mathrm{depth}}$), occupancy ($\mathcal{L}_{\mathrm{occ}}$), normal alignment ($\mathcal{L}_{\mathrm{normal}}$), and a distortion regularizer ($\mathcal{L}_{\mathrm{dist}}$) into a weighted training objective of the form

$$\mathcal{L}_{\mathrm{init}} = \mathcal{L}_{\mathrm{rgb}} + \lambda_{\mathrm{depth}}\,\mathcal{L}_{\mathrm{depth}} + \lambda_{\mathrm{occ}}\,\mathcal{L}_{\mathrm{occ}} + \lambda_{\mathrm{normal}}\,\mathcal{L}_{\mathrm{normal}} + \lambda_{\mathrm{dist}}\,\mathcal{L}_{\mathrm{dist}},$$

where the $\lambda$ coefficients weight the individual terms.
Training is conducted on ScanNet++ (902 training scenes), directly optimizing the quality of scene initialization for robust downstream refinement.
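For concreteness, such a weighted combination can be assembled as below; the weights are placeholders, not values from the paper.

```python
# Sketch: weighted sum of the initializer's loss terms.
# The weights are illustrative placeholders, not the paper's values.
LOSS_WEIGHTS = {"rgb": 1.0, "depth": 0.1, "occ": 0.1, "normal": 0.05, "dist": 0.01}

def initializer_loss(losses):
    """losses: dict mapping 'rgb', 'depth', 'occ', 'normal', 'dist' to scalar tensors."""
    return sum(LOSS_WEIGHTS[k] * losses[k] for k in LOSS_WEIGHTS)
```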
3. Iterative Densification and Joint Optimization
After initialization, QuickSplat employs a sequence of learned “optimization” steps, each refining scene geometry and introducing new splats in under-modeled areas.
3.1. Pipeline Overview
- Rendering Gradients: For all training images, compute and aggregate the gradient of the total loss (photometric, depth, distortion) with respect to each voxel feature.
- Learned Densifier: A sparse 3D CNN ingests the current voxel features, their aggregated rendering gradients, and the step index $t$, and predicts candidate new voxel features; a fixed budget of new voxels is importance-sampled from these candidates at each step.
- Learned Optimizer: The existing and newly densified voxel features are combined (with zero gradients concatenated for the new features) and refined jointly by a sparse 3D U-Net; the predicted feature update is bounded (e.g., via a tanh on the residual) to keep refinement stable (see the loop sketch below).
- Loss and Gradient Detachment: Densifier and optimizer are jointly supervised at each step; gradients are detached between steps, following meta-learning protocols.
The learned densifier eliminates heuristic densification, enabling robust, data-driven expansion in regions where photometric information is limited or ambiguous.
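A schematic sketch of one such refinement loop is shown below under simplifying assumptions: the sparse 3D networks are stood in for by small dense MLPs over per-voxel features, importance sampling of new voxels is omitted, and the rendering loss is a placeholder. All names are illustrative rather than the paper's.

```python
# Illustrative densify-then-optimize loop (all names hypothetical).
import torch
import torch.nn as nn

FEAT_DIM = 64

class LearnedDensifier(nn.Module):
    """Predicts candidate new voxel features from features, gradients, and step index."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM * 2 + 1, 128), nn.ReLU(),
                                 nn.Linear(128, FEAT_DIM))

    def forward(self, feats, grads, step):
        step_col = torch.full_like(feats[:, :1], float(step))
        return self.net(torch.cat([feats, grads, step_col], dim=-1))

class LearnedOptimizer(nn.Module):
    """Predicts a bounded update for all voxel features (existing + newly added)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM * 2, 128), nn.ReLU(),
                                 nn.Linear(128, FEAT_DIM))

    def forward(self, feats, grads):
        return feats + torch.tanh(self.net(torch.cat([feats, grads], dim=-1)))

def render_loss(feats):
    # Placeholder for rendering the splats decoded from `feats` over the training
    # views and summing the photometric, depth, and distortion losses.
    return (feats ** 2).mean()

def refine(feats, densifier, optimizer, num_steps=5):
    for t in range(num_steps):
        # 1. Aggregate rendering gradients w.r.t. the current voxel features;
        #    detaching here mirrors the gradient detachment between steps.
        feats = feats.detach().requires_grad_(True)
        grads, = torch.autograd.grad(render_loss(feats), feats)
        # 2. Densifier proposes new voxel features in under-modeled regions.
        new_feats = densifier(feats, grads, t)
        # 3. Learned optimizer refines all features; new features get zero gradients.
        all_feats = torch.cat([feats, new_feats], dim=0)
        all_grads = torch.cat([grads, torch.zeros_like(new_feats)], dim=0)
        feats = optimizer(all_feats, all_grads)
    return feats

refined = refine(torch.randn(100, FEAT_DIM), LearnedDensifier(), LearnedOptimizer())
```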
4. Experimental Protocol and Performance
4.1. Training and Inference Details
- Dataset: ScanNet++ (902 training scenes, 20 test scenes).
- Image Resolution: for training; for testing.
- Optimizer: Adam, with gradient accumulation over 100 random views per loop.
- Network Scale: M parameters; 2 Gaussians per 4 cm voxel.
- Training Regime: about 3 days on an RTX A6000, with the densifier and optimizer trained jointly over 5 unrolled steps per loop.
- Optional Fine-Tuning: 2,000 steps of vanilla SGD on all splats (26 s); not strictly necessary for high performance.
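A minimal sketch of this optional fine-tuning stage is given below; the function names, learning rate, and toy usage are illustrative, with the rendering-and-loss computation left as a caller-supplied callable.

```python
# Sketch of the optional per-scene fine-tuning: a short run of plain SGD on
# all splat parameters (names and hyperparameters are illustrative).
import torch

def finetune(splat_params, render_and_loss, steps=2000, lr=1e-3):
    """splat_params: dict of tensors (positions, scales, rotations, opacities, colors)."""
    params = [p.requires_grad_(True) for p in splat_params.values()]
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = render_and_loss(splat_params)   # photometric + depth + distortion
        loss.backward()
        opt.step()
    return splat_params

# Toy usage with a stand-in loss.
toy = {"positions": torch.randn(5, 3), "opacities": torch.rand(5, 1)}
finetune(toy, lambda p: sum(v.pow(2).mean() for v in p.values()), steps=10)
```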
4.2. Quantitative Results
QuickSplat’s performance is benchmarked against SuGaR, 2DGS, GS2Mesh, and MonoSDF. Metrics on ScanNet++ (20 test scenes) are summarized:
| Method | Depth AbsErr (m) | Acc@2cm | Acc@5cm | Acc@10cm | Chamfer Dist. | Time |
|---|---|---|---|---|---|---|
| SuGaR | 0.2061 | 0.1157 | 0.2774 | 0.4794 | 0.2078 | 3,130s |
| 2DGS | 0.1127 | 0.4021 | 0.6027 | 0.7422 | 0.2420 | 1,796s |
| GS2Mesh | 0.1212 | 0.4028 | 0.6039 | 0.7406 | 0.2012 | 973s |
| MonoSDF | 0.0569 | 0.5774 | 0.8006 | 0.8850 | 0.1450 | >10h |
| QuickSplat w/o opt | 0.0732 | 0.5263 | 0.7674 | 0.8583 | 0.1461 | 26s |
| QuickSplat w/ opt | 0.0578 | 0.5783 | 0.8035 | 0.8887 | 0.1347 | 124s |
Key outcomes:
- Depth error reductions of up to 48% relative to the best Gaussian-splatting baseline (2DGS: 0.1127 m → 0.0578 m), with even larger reductions versus SuGaR and GS2Mesh.
- Roughly 8× speedup over GS2Mesh (973 s → 124 s) and about 14× over 2DGS; QuickSplat w/ opt is around 300× faster than MonoSDF (>10 h) with matching accuracy.
- Even without opt (26 s), QuickSplat is roughly 70× faster than 2DGS and more accurate than all Gaussian-splatting baselines.
5. Qualitative Analysis and Observations
QuickSplat consistently reconstructs flat wall regions with high geometric fidelity, eliminating the “floating” or curved patches common in baselines. The learned initializer and densifier enable rapid closure of holes in under-observed or textureless areas, such as plain white corners, and suppress floating fragments. Fine scene details (e.g., ladders, chair legs), often missed by monocular-depth-guided volumetric methods, are preserved thanks to the flexible, spatially adaptive Gaussian representation (Liu et al., 8 May 2025).
A plausible implication is that learned Gaussian representations generalize better across scene structures, particularly in challenging illumination and sparsity regimes.
6. Limitations and Future Directions
Mirrors and specular surfaces continue to challenge QuickSplat, resulting in occasional ghost geometry due to the reliance on photometric losses, which favor reconstructing reflections rather than true geometry. The method assumes static scenes and does not model dynamic or semi-dynamic objects. While QuickSplat greatly accelerates the reconstruction pipeline (to seconds or minutes), it is not yet real-time.
Prospective research directions include:
- Extension to dynamic/semi-dynamic settings via per-frame Gaussian updates.
- Integration into live SLAM pipelines, such as SplaTAM, for on-the-fly 3D reconstruction.
- Incorporation of more powerful, expressive data-driven priors such as learned shape grammars, potentially permitting effective initialization from even sparser observations.
7. Summary and Impact
QuickSplat demonstrates that learning both initialization and densification of Gaussian splats, as opposed to relying on per-scene gradient descent and manual heuristics, provides state-of-the-art performance for 3D surface reconstruction of large indoor environments. The method achieves orders-of-magnitude acceleration, significantly reduces depth errors in challenging regions, and produces high-fidelity geometry. Its modular design supports both rapid batch and incremental scene updates, providing a viable platform for future research in scene reconstruction and real-time spatial understanding (Liu et al., 8 May 2025).