- The paper introduces novel learned initializer, densifier, and optimizer networks that replace heuristic methods for swift, accurate 3D surface reconstruction.
- It anchors 2D Gaussian Splatting primitives on a sparse 3D voxel grid, which helps fill holes and handle under-observed, textureless regions.
- Experiments on ScanNet++ and ARKitScenes demonstrate 8x acceleration and improved geometry quality, benefiting digital twins, AR/VR, and robotics mapping.
QuickSplat (2505.05591) addresses the challenges in 3D surface reconstruction from multi-view images, particularly the slow per-scene optimization times and difficulties in reconstructing textureless or under-observed regions. The paper proposes a data-driven approach that learns priors for initializing, densifying, and optimizing a 2D Gaussian Splatting (2DGS) representation, enabling significantly faster and more accurate reconstructions, especially for large indoor scenes.
The core idea is to replace the traditional heuristic-based initialization and gradient-descent optimization of 2DGS with learned neural networks. QuickSplat anchors the Gaussian primitives and their associated latent features to a sparse 3D voxel grid structure.
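To make this data structure concrete, here is a minimal sketch of such a voxel-anchored state in PyTorch, assuming one latent feature vector per occupied voxel; the class, tensor shapes, and feature dimension are illustrative, not the authors' implementation.

```python
import torch

class SparseGaussianGrid:
    """Illustrative sparse voxel state: latent features anchored to
    occupied voxels, from which Gaussian parameters are decoded."""

    def __init__(self, coords: torch.Tensor, feature_dim: int = 32):
        # coords: (N, 3) integer indices of occupied voxels
        self.coords = coords
        # one learnable latent vector per occupied voxel (dim is assumed)
        self.features = torch.zeros(coords.shape[0], feature_dim)

    def voxel_centers(self, voxel_size: float = 0.04) -> torch.Tensor:
        # world-space voxel centers; the paper uses v_d = 4 cm
        return (self.coords.float() + 0.5) * voxel_size
```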
The method consists of three main components:
- Learned Initializer Network: Instead of relying solely on sparse Structure-from-Motion (SfM) point clouds, a sparse 3D convolutional encoder-decoder network is trained to predict dense initial Gaussian parameters (position, scale, rotation, opacity, color) from the SfM points. This network is designed to increase voxel density and fill holes, particularly in areas where SfM fails, such as textureless walls. The initializer is supervised using a combination of rendering loss (L_c), depth loss (L_d), an occupancy loss (L_occ) against ground-truth mesh geometry, and a normal loss (L_n) comparing Gaussian normals to mesh normals.
- The position decoding uses the voxel center v_c and a learned offset within a radius R = 4·v_d relative to the voxel: g_c = v_c + R·(2σ(x) − 1), where σ(x) is the sigmoid of the network's raw offset prediction (see the decoding sketch after this list).
- Opacity is directly mapped from the occupancy prediction after the last upsampling layer, allowing the rendering loss to inform voxel allocation.
- Learned Densifier Network: To address remaining holes or areas requiring more detail after initialization, a densifier network is introduced. It takes the current state of the Gaussians (G_t) and their rendering gradients (∇G_t) as input. Similar in architecture to the initializer but without dense bottleneck blocks, it predicts additional voxel features (new Gaussians) neighboring existing ones, with the number of new Gaussians added per step decreasing over time. An importance-sampling strategy, weighted by the network's predicted occupancy (used as opacity), selects the new Gaussians, prioritizing those more likely to contribute to the surface. This replaces heuristic densification and is illustrated in the loop sketch further below.
- Learned Optimizer Network: An optimizer network (a sparse 3D UNet) learns to predict update vectors (ΔG_t) for the latent features of existing Gaussians from G_t, ∇G_t, and the current timestep t. This mimics a gradient-descent step but is learned from data.
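A minimal sketch of the parameter decoding described above, assuming raw per-voxel network outputs for the position offset and occupancy; the function and tensor names are hypothetical.

```python
import torch

def decode_center_and_opacity(voxel_centers: torch.Tensor,
                              offset_logits: torch.Tensor,
                              occ_logits: torch.Tensor,
                              voxel_size: float = 0.04):
    # offset radius R = 4 * v_d, as in the paper
    R = 4.0 * voxel_size
    # g_c = v_c + R * (2 * sigmoid(x) - 1): the decoded center stays
    # within distance R of its voxel center
    centers = voxel_centers + R * (2.0 * torch.sigmoid(offset_logits) - 1.0)
    # opacity is mapped directly from the occupancy prediction, so the
    # rendering loss can inform voxel allocation
    opacity = torch.sigmoid(occ_logits)
    return centers, opacity
```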
These three components are integrated into a densification-optimization loop (sketched in code after the list). After initialization by the initializer network, the loop iterates for a fixed number of timesteps (e.g., T = 5). In each timestep:
- Rendering gradients are computed for all training images with respect to the current Gaussian parameters (or their latent features).
- The densifier predicts additional voxel features based on the current state and gradients.
- Existing and new voxel features are concatenated.
- The optimizer predicts update vectors for all features.
- The features are updated using the predicted vectors.
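Putting the steps together, the following is a sketch of the loop, with `densifier`, `optimizer_net`, and `render_and_backprop` standing in for the paper's sparse 3D networks and differentiable rasterizer; the shrinking sampling budget is an assumed schedule.

```python
import torch

def densify_optimize_loop(features, coords, densifier, optimizer_net,
                          render_and_backprop, T=5):
    """Sketch of the densification-optimization loop (names hypothetical).
    features: (N, C) per-voxel latents; coords: (N, 3) voxel indices.
    render_and_backprop is assumed to return dL/dfeatures accumulated
    over the training images."""
    for t in range(T):
        # 1. Rendering gradients for all training images
        grads = render_and_backprop(features, coords)

        # 2. Densifier proposes candidate neighbors; keep a shrinking
        #    budget via occupancy-weighted importance sampling
        cand_feats, cand_coords, cand_occ = densifier(features, grads, coords)
        budget = max(1, cand_feats.shape[0] // (t + 1))  # assumed schedule
        keep = torch.multinomial(cand_occ.clamp_min(1e-6), budget)

        # 3. Concatenate existing and new voxel features (zero-gradient
        #    placeholders for the freshly added candidates)
        features = torch.cat([features, cand_feats[keep]], dim=0)
        coords = torch.cat([coords, cand_coords[keep]], dim=0)
        grads = torch.cat([grads, torch.zeros_like(cand_feats[keep])], dim=0)

        # 4-5. Optimizer predicts residual updates for all features
        features = features + optimizer_net(features, grads, coords, t)

        # During training, gradients are detached between timesteps so
        # each step learns to predict the next improvement
        features = features.detach()
    return features, coords
```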
The densifier and optimizer networks are trained jointly in an end-to-end manner in a second training stage, while the initializer remains frozen. Losses are applied at each timestep of the optimization loop, with gradients detached between timesteps to focus on predicting the next improvement step.
After the learned iterative optimization, the surface mesh is extracted by rendering depth maps from training views and performing TSDF fusion, similar to 2DGS. Optional post-training with a few thousand steps of standard gradient descent can further refine the results.
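For the extraction step, here is a sketch using Open3D's TSDF fusion (the paper follows the 2DGS procedure; the voxel length, truncation distance, and the `rendered_views` container below are assumptions):

```python
import open3d as o3d

volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.04,   # match the 4 cm voxel size (assumed)
    sdf_trunc=0.16,      # assumed truncation distance
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

# fill with (color, depth, intrinsic, extrinsic) per training view,
# where depth is rendered from the optimized Gaussians
rendered_views = []

for color, depth, intrinsic, extrinsic in rendered_views:
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color, depth, depth_scale=1000.0, depth_trunc=4.0,
        convert_rgb_to_intensity=False)
    volume.integrate(rgbd, intrinsic, extrinsic)

mesh = volume.extract_triangle_mesh()
```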
Implementation Considerations:
- Voxel Grid: The choice of voxel size (v_d) and the number of Gaussians per voxel (v_g) trades off resolution against memory. The paper uses v_d = 4 cm and v_g = 2.
- Network Architecture: Sparse 3D convolutions are used for efficiency in handling large scenes. The optimizer uses a UNet structure.
- Training Stages: Training is split into two stages: initializer training and joint densifier/optimizer training. This modularity allows focusing on different aspects of the reconstruction process.
- Gradient Accumulation: Gradients are accumulated across a batch of training images (e.g., 100) before being fed to the optimizer (see the sketch after this list).
- Memory Management: Importance sampling in the densifier is crucial to limit the number of new Gaussians added per iteration and manage memory usage during training.
- Loss Functions: A combination of photometric (L_c), geometric (L_d, L_n), and structural (L_occ, L_dist) losses is necessary to supervise the different networks effectively toward accurate surface reconstruction.
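As a concrete illustration of the gradient-accumulation step above, a sketch in which per-image gradients sum into a single tensor before the learned optimizer consumes them; `render_loss` is a stand-in for the differentiable rasterizer plus the losses listed above.

```python
import torch

def accumulate_render_gradients(features, coords, views, render_loss,
                                batch: int = 100):
    """Accumulate dL/dfeatures over a batch of training images
    (a sketch; names and signatures are hypothetical)."""
    feats = features.detach().requires_grad_(True)
    for view in views[:batch]:
        loss = render_loss(feats, coords, view)
        loss.backward()   # gradients sum across images into feats.grad
    return feats.grad     # (N, C) accumulated gradient tensor
```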
Real-World Application:
QuickSplat is primarily demonstrated for large-scale indoor scene reconstruction using multi-view RGB images. Its advantages make it suitable for applications requiring fast and accurate 3D models from image sequences, such as:
- Digital Twin creation: Rapidly generating detailed 3D models of indoor environments for simulation, visualization, or asset management.
- AR/VR content creation: Creating static 3D environments for immersive experiences faster than traditional methods.
- Robotics mapping: Generating dense and accurate maps for navigation and interaction in complex indoor spaces. The learned priors help in challenging areas often encountered by robots (e.g., large, featureless walls).
The method demonstrates significant speed-ups (8x faster optimization) and improved geometry quality (lower depth errors and Chamfer distance) compared to state-of-the-art baselines like 2DGS, SuGaR, and GS2Mesh on the ScanNet++ dataset. It also shows promising generalization results on the ARKitScenes dataset without fine-tuning.
While QuickSplat significantly improves efficiency and robustness in many cases, limitations remain, such as difficulties with highly reflective surfaces and dynamic scenes. Future work could explore integrating it with real-time SLAM systems for online reconstruction or extending it to handle non-static environments.