- The paper introduces novel learned initializer, densifier, and optimizer networks that replace heuristic methods for swift, accurate 3D surface reconstruction.
- It anchors 2D Gaussian Splatting primitives on a sparse 3D voxel grid, which helps fill holes and handle under-observed, textureless regions.
- Experiments on ScanNet++ and ARKitScenes demonstrate 8x acceleration and improved geometry quality, benefiting digital twins, AR/VR, and robotics mapping.
QuickSplat (2505.05591) addresses the challenges in 3D surface reconstruction from multi-view images, particularly the slow per-scene optimization times and difficulties in reconstructing textureless or under-observed regions. The paper proposes a data-driven approach that learns priors for initializing, densifying, and optimizing a 2D Gaussian Splatting (2DGS) representation, enabling significantly faster and more accurate reconstructions, especially for large indoor scenes.
The core idea is to replace the traditional heuristic-based initialization and gradient-descent optimization of 2DGS with learned neural networks. QuickSplat anchors the Gaussian primitives and their associated latent features to a sparse 3D voxel grid structure.
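To make this data structure concrete, here is a minimal sketch of such a voxel-anchored state in PyTorch, assuming one latent feature vector per occupied voxel; the class, tensor shapes, and feature dimension are illustrative, not the authors' implementation.

```python
import torch

class SparseGaussianGrid:
    """Illustrative sparse voxel state: latent features anchored to
    occupied voxels, from which Gaussian parameters are decoded."""

    def __init__(self, coords: torch.Tensor, feature_dim: int = 32):
        # coords: (N, 3) integer indices of occupied voxels
        self.coords = coords
        # one learnable latent vector per occupied voxel (dim is assumed)
        self.features = torch.zeros(coords.shape[0], feature_dim)

    def voxel_centers(self, voxel_size: float = 0.04) -> torch.Tensor:
        # world-space voxel centers; the paper uses v_d = 4 cm
        return (self.coords.float() + 0.5) * voxel_size
```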
The method consists of three main components:
- Learned Initializer Network: Instead of relying solely on sparse Structure-from-Motion (SfM) point clouds, a sparse 3D convolutional encoder-decoder network is trained to predict dense initial Gaussian parameters (position, scale, rotation, opacity, color) from the SfM points. This network is designed to increase voxel density and fill holes, particularly in areas where SfM fails, such as textureless walls. The initializer is supervised using a combination of rendering loss (L_c), depth loss (L_d), an occupancy loss (L_occ) against ground-truth mesh geometry, and a normal loss (L_n) comparing Gaussian normals to mesh normals.
- The position decoding uses the voxel center v_c and a learned offset within a radius R = 4·v_d relative to the voxel: g_c = v_c + R·(2σ(x) − 1), where σ(x) is the sigmoid of the network's raw offset prediction (see the decoding sketch after this list).
- Opacity is directly mapped from the occupancy prediction after the last upsampling layer, allowing the rendering loss to inform voxel allocation.
- Learned Densifier Network: To address remaining holes or areas requiring more detail after initialization, a densifier network is introduced. It takes the current state of the Gaussians (G_t) and their rendering gradients (∇G_t) as input. Similar in architecture to the initializer but without dense bottleneck blocks, it predicts additional voxel features (new Gaussians) neighboring existing ones, with the number of new Gaussians added per step decreasing over time. An importance-sampling strategy, weighted by the network's predicted occupancy (used as opacity), selects the new Gaussians, prioritizing those more likely to contribute to the surface. This replaces heuristic densification and is illustrated in the loop sketch further below.
- Learned Optimizer Network: An optimizer network (a sparse 3D UNet) learns to predict update vectors (ΔG_t) for the latent features of existing Gaussians from G_t, ∇G_t, and the current timestep t. This mimics a gradient-descent step but is learned from data.
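A minimal sketch of the parameter decoding described above, assuming raw per-voxel network outputs for the position offset and occupancy; the function and tensor names are hypothetical.

```python
import torch

def decode_center_and_opacity(voxel_centers: torch.Tensor,
                              offset_logits: torch.Tensor,
                              occ_logits: torch.Tensor,
                              voxel_size: float = 0.04):
    # offset radius R = 4 * v_d, as in the paper
    R = 4.0 * voxel_size
    # g_c = v_c + R * (2 * sigmoid(x) - 1): the decoded center stays
    # within distance R of its voxel center
    centers = voxel_centers + R * (2.0 * torch.sigmoid(offset_logits) - 1.0)
    # opacity is mapped directly from the occupancy prediction, so the
    # rendering loss can inform voxel allocation
    opacity = torch.sigmoid(occ_logits)
    return centers, opacity
```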
These three components are integrated into a densification-optimization loop (sketched in code after the list). After initialization by the initializer network, the loop iterates for a fixed number of timesteps (e.g., T = 5). In each timestep:
- Rendering gradients are computed for all training images with respect to the current Gaussian parameters (or their latent features).
- The densifier predicts additional voxel features based on the current state and gradients.
- Existing and new voxel features are concatenated.
- The optimizer predicts update vectors for all features.
- The features are updated using the predicted vectors.
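Putting the steps together, the following is a sketch of the loop, with `densifier`, `optimizer_net`, and `render_and_backprop` standing in for the paper's sparse 3D networks and differentiable rasterizer; the shrinking sampling budget is an assumed schedule.

```python
import torch

def densify_optimize_loop(features, coords, densifier, optimizer_net,
                          render_and_backprop, T=5):
    """Sketch of the densification-optimization loop (names hypothetical).
    features: (N, C) per-voxel latents; coords: (N, 3) voxel indices.
    render_and_backprop is assumed to return dL/dfeatures accumulated
    over the training images."""
    for t in range(T):
        # 1. Rendering gradients for all training images
        grads = render_and_backprop(features, coords)

        # 2. Densifier proposes candidate neighbors; keep a shrinking
        #    budget via occupancy-weighted importance sampling
        cand_feats, cand_coords, cand_occ = densifier(features, grads, coords)
        budget = max(1, cand_feats.shape[0] // (t + 1))  # assumed schedule
        keep = torch.multinomial(cand_occ.clamp_min(1e-6), budget)

        # 3. Concatenate existing and new voxel features (zero-gradient
        #    placeholders for the freshly added candidates)
        features = torch.cat([features, cand_feats[keep]], dim=0)
        coords = torch.cat([coords, cand_coords[keep]], dim=0)
        grads = torch.cat([grads, torch.zeros_like(cand_feats[keep])], dim=0)

        # 4-5. Optimizer predicts residual updates for all features
        features = features + optimizer_net(features, grads, coords, t)

        # During training, gradients are detached between timesteps so
        # each step learns to predict the next improvement
        features = features.detach()
    return features, coords
```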
The densifier and optimizer networks are trained jointly in an end-to-end manner in a second training stage, while the initializer remains frozen. Losses are applied at each timestep of the optimization loop, with gradients detached between timesteps to focus on predicting the next improvement step.
After the learned iterative optimization, the surface mesh is extracted by rendering depth maps from training views and performing TSDF fusion, similar to 2DGS. Optional post-training with a few thousand steps of standard gradient descent can further refine the results.
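For the extraction step, here is a sketch using Open3D's TSDF fusion (the paper follows the 2DGS procedure; the voxel length, truncation distance, and the `rendered_views` container below are assumptions):

```python
import open3d as o3d

volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.04,   # match the 4 cm voxel size (assumed)
    sdf_trunc=0.16,      # assumed truncation distance
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

# fill with (color, depth, intrinsic, extrinsic) per training view,
# where depth is rendered from the optimized Gaussians
rendered_views = []

for color, depth, intrinsic, extrinsic in rendered_views:
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color, depth, depth_scale=1000.0, depth_trunc=4.0,
        convert_rgb_to_intensity=False)
    volume.integrate(rgbd, intrinsic, extrinsic)

mesh = volume.extract_triangle_mesh()
```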
Implementation Considerations:
- Voxel Grid: The choice of voxel size (v_d) and the number of Gaussians per voxel (v_g) trades off resolution against memory. The paper uses v_d = 4 cm and v_g = 2.
- Network Architecture: Sparse 3D convolutions are used for efficiency in handling large scenes. The optimizer uses a UNet structure.
- Training Stages: Training is split into two stages: initializer training and joint densifier/optimizer training. This modularity allows focusing on different aspects of the reconstruction process.
- Gradient Accumulation: Gradients are accumulated across a batch of training images (e.g., 100) before being fed to the optimizer (see the sketch after this list).
- Memory Management: Importance sampling in the densifier is crucial to limit the number of new Gaussians added per iteration and manage memory usage during training.
- Loss Functions: A combination of photometric (L_c), geometric (L_d, L_n), and structural (L_occ, L_dist) losses is necessary to supervise the different networks effectively toward accurate surface reconstruction.
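As a concrete illustration of the gradient-accumulation step above, a sketch in which per-image gradients sum into a single tensor before the learned optimizer consumes them; `render_loss` is a stand-in for the differentiable rasterizer plus the losses listed above.

```python
import torch

def accumulate_render_gradients(features, coords, views, render_loss,
                                batch: int = 100):
    """Accumulate dL/dfeatures over a batch of training images
    (a sketch; names and signatures are hypothetical)."""
    feats = features.detach().requires_grad_(True)
    for view in views[:batch]:
        loss = render_loss(feats, coords, view)
        loss.backward()   # gradients sum across images into feats.grad
    return feats.grad     # (N, C) accumulated gradient tensor
```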
Real-World Application:
QuickSplat is primarily demonstrated for large-scale indoor scene reconstruction using multi-view RGB images. Its advantages make it suitable for applications requiring fast and accurate 3D models from image sequences, such as:
- Digital Twin creation: Rapidly generating detailed 3D models of indoor environments for simulation, visualization, or asset management.
- AR/VR content creation: Creating static 3D environments for immersive experiences faster than traditional methods.
- Robotics mapping: Generating dense and accurate maps for navigation and interaction in complex indoor spaces. The learned priors help in challenging areas often encountered by robots (e.g., large, featureless walls).
The method demonstrates significant speed-ups (8x faster optimization) and improved geometry quality (lower depth errors and Chamfer distance) compared to state-of-the-art baselines like 2DGS, SuGaR, and GS2Mesh on the ScanNet++ dataset. It also shows promising generalization results on the ARKitScenes dataset without fine-tuning.
While QuickSplat significantly improves efficiency and robustness in many cases, limitations remain, such as difficulties with highly reflective surfaces and dynamic scenes. Future work could explore integrating it with real-time SLAM systems for online reconstruction or extending it to handle non-static environments.