IDESplat: Efficient 3D Gaussian Splatting
- IDESplat is an iterative depth estimation framework that fuses multi-warp epipolar attention maps via a novel Depth Probability Boosting Unit for precise 3D Gaussian splatting.
- It employs an iterative coarse-to-fine strategy that refines depth estimates over increasing resolutions, ensuring robust multi-view consistency.
- The method achieves state-of-the-art performance with reduced parameters and memory usage, enabling efficient real-time scene reconstruction and novel view synthesis.
IDESplat is an iterative depth estimation framework for generalizable 3D Gaussian Splatting (3DGS), designed to achieve precise depth probability estimation and efficient Gaussian parameter prediction in feed-forward scene reconstruction. Unlike prior approaches that rely on single-warp cost volumes and suffer from instability in depth estimation, IDESplat introduces novel architectural components—primarily the Depth Probability Boosting Unit (DPBU) and an iterative coarse-to-fine inference cascade—that multiplicatively fuse multi-warp epipolar attention maps for robust depth refinement. This strategy enables high-fidelity placement of 3D Gaussian centers, significantly advancing multi-view consistent scene representations and novel view synthesis within a memory- and parameter-efficient pipeline (Long et al., 7 Jan 2026).
1. Generalizable 3D Gaussian Splatting and Depth Estimation Challenges
Generalizable 3D Gaussian Splatting learns a feed-forward network that, given calibrated multi-view input, predicts per-pixel parameters for oriented 3D Gaussians: mean ($\mu$), covariance ($\Sigma$), opacity ($\alpha$), and color ($c$). Accurate prediction of the Gaussian center $\mu$ is notably challenging; prior works circumvent direct estimation by first predicting a depth map $D$ and unprojecting each pixel into 3D to form $\mu$. The reliability of $D$ is vital, as misaligned depths propagate to erroneous 3D Gaussian placements and degrade rendered view fidelity.
Traditional methods like MVSplat, DepthSplat, and MonoSplat utilize single cross-view warp operations to compute per-pixel cost volumes, applying softmax across discrete depth candidates to obtain depth probabilities. However, single-warp constructions are subject to severe instability—particularly in occluded, textureless, or noisy regions—resulting in coarse, unreliable depth maps and suboptimal scene reconstructions.
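The single-warp baseline can be sketched as follows: correlate target features against source features pre-warped to each depth candidate, then softmax over the depth axis. This is a minimal NumPy illustration of the general cost-volume-plus-softmax pattern, not the exact implementation of any of the cited methods; the function name and tensor layout are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over depth candidates.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_warp_depth_probs(feat_t, feat_s_warped):
    """Per-pixel depth probabilities from a single cross-view warp.

    feat_t:        (H, W, C)     target-view features
    feat_s_warped: (D, H, W, C)  source features pre-warped to D depth candidates
    Returns depth probabilities of shape (H, W, D).
    """
    # Cost volume = inner product of the target feature with each warped source feature.
    cost = np.einsum('hwc,dhwc->hwd', feat_t, feat_s_warped)
    return softmax(cost, axis=-1)
```

Because a single cost volume is the only evidence, any matching failure (occlusion, lack of texture) directly corrupts the resulting probability row, which is the instability IDESplat targets.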
2. Depth Probability Boosting Unit (DPBU): Architecture and Equations
A central element of IDESplat, the Depth Probability Boosting Unit (DPBU), resolves the instability of single-warp approaches by employing multiple parallel Warp-Index Epipolar Attention layers and a multiplicative fusion strategy. The method operates at feature resolution $h \times w$ over a set of $D$ depth candidates $\{d_k\}_{k=1}^{D}$, leveraging source ($F_s$) and target ($F_t$) view features.
Warp-Index Epipolar Attention
For each source-target view pair, the index map

$$\mathcal{I}_{t \to s}(p, d_k) = \mathrm{warp}(p, d_k)$$

describes where target pixel $p$ projects into the source view at depth candidate $d_k$, with $\mathrm{warp}$ denoting the warp indexing procedure. Sparse matrix multiplication constructs the $D$-channel epipolar correlation map:

$$C(p, k) = \left\langle \hat{F}_t(p),\; F_s\big(\mathcal{I}_{t \to s}(p, d_k)\big) \right\rangle,$$

where the gather operation extracts features from $F_s$ as specified by $\mathcal{I}_{t \to s}$ and computes their inner products with the normalized target features $\hat{F}_t$. After refinement by a 2D U-Net and upsampling to the current working resolution, the correlation becomes $\tilde{C}$, with softmax applied along the depth axis to yield single-warp attention (depth-probability) maps:

$$A(p, k) = \frac{\exp \tilde{C}(p, k)}{\sum_{k'=1}^{D} \exp \tilde{C}(p, k')}.$$
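The index-only warp above can be sketched with a precomputed index map and a NumPy gather; no dense feature warping is performed. This is a simplified sketch under assumed tensor layouts (pixels flattened to one axis), with hypothetical argument names, and it omits the 2D U-Net refinement and upsampling steps.

```python
import numpy as np

def warp_index_epipolar_attention(feat_t, feat_s, index_map):
    """Single-warp attention via index-based gathering of source features.

    feat_t:    (N, C)  normalized target features (N = H*W pixels, flattened)
    feat_s:    (M, C)  source-view features (M source pixels, flattened)
    index_map: (N, D)  for each target pixel and depth candidate, the index of
                       the source pixel it projects to (assumed precomputed)
    Returns single-warp attention maps of shape (N, D), softmaxed over depth.
    """
    gathered = feat_s[index_map]                       # (N, D, C) gather, no dense warp
    corr = np.einsum('nc,ndc->nd', feat_t, gathered)   # D-channel epipolar correlation
    corr -= corr.max(axis=1, keepdims=True)            # numerically stable softmax
    e = np.exp(corr)
    return e / e.sum(axis=1, keepdims=True)
```

Since the warp reduces to integer indexing, repeating it several times per DPBU stays cheap, which is what makes the six-pass schedule described below practical.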
Multiplicative Fusion of Attention Maps
Within each DPBU, $M$ parallel attention layers produce outputs $A^{(1)}, \dots, A^{(M)}$, fused multiplicatively:

$$A^{\mathrm{boost}}(p, k) = \mathrm{Norm}\!\left( \prod_{m=1}^{M} A^{(m)}(p, k) \right).$$

Here the product designates element-wise multiplication over the depth axis, and $\mathrm{Norm}$ normalizes such that $\sum_{k} A^{\mathrm{boost}}(p, k) = 1$. Depth candidates receiving consistently high attention scores across warps are amplified in the final $A^{\mathrm{boost}}$, while outliers are suppressed.
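The fusion rule is a product of probability rows followed by renormalization. A minimal sketch (function name and `eps` guard are assumptions):

```python
import numpy as np

def fuse_attention_maps(attn_maps, eps=1e-8):
    """Multiplicatively fuse M single-warp attention maps.

    attn_maps: list of arrays of shape (N, D), each row summing to 1.
    A candidate scored highly by every warp survives the product; a candidate
    that any single warp assigns low probability is driven toward zero.
    """
    boosted = np.prod(np.stack(attn_maps, axis=0), axis=0)        # element-wise product over warps
    return boosted / (boosted.sum(axis=-1, keepdims=True) + eps)  # renormalize over depth
```

Compared with averaging, the product acts like a logical AND over warps: agreement is rewarded geometrically, so a confident outlier in one warp cannot dominate the fused map.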
3. Iterative Coarse-to-Fine Depth Estimation
IDESplat employs $N$ stacked DPBUs for progressive depth refinement, with each stage reranking depth candidates and increasing spatial resolution. After the $i$-th DPBU, the boosted probability map is transformed into a depth estimate by soft-argmax expectation over the candidates:

$$\hat{d}_i(p) = \sum_{k=1}^{D} A^{\mathrm{boost}}_i(p, k)\, d_k.$$
At each iteration, depth candidates are re-centered on the current estimate and halved in range, and feature resolution is doubled (e.g., $64 \to 128 \to 256$), culminating in tightly bracketed, high-resolution depth maps. In practice, $N = 3$ DPBUs with $M = 2$ attention layers each yield six total warp passes, with the final iteration executed at the highest ($256 \times 256$) resolution over the refined candidate depths.
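One refinement step (expectation, then re-center and halve the bracket) can be sketched as follows; the exact re-centering schedule in the paper may differ, so treat this as an illustrative assumption.

```python
import numpy as np

def refine_candidates(depth_candidates, probs):
    """One coarse-to-fine step: soft-argmax depth, then re-center and halve the range.

    depth_candidates: (D,)   current shared candidate depths
    probs:            (N, D) boosted depth probabilities per pixel
    Returns per-pixel refined candidate sets of shape (N, D).
    """
    D = depth_candidates.shape[0]
    expected = probs @ depth_candidates                              # (N,) soft-argmax estimate
    half_range = (depth_candidates[-1] - depth_candidates[0]) / 4.0  # new bracket = half the old
    offsets = np.linspace(-half_range, half_range, D)                # tighter candidate grid
    return expected[:, None] + offsets[None, :]
```

After each step the candidates are per-pixel rather than shared, so the next DPBU spends its fixed candidate budget on a bracket that is half as wide around the current estimate.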
4. 3D Gaussian Parameter Prediction and Integration
Once the final per-pixel depth $\hat{d}(p)$ is obtained, unprojection recovers the 3D Gaussian center in the camera's coordinate frame:

$$\mu(p) = \hat{d}(p)\, K^{-1} [u, v, 1]^{\top},$$

where $K$ is the camera intrinsic matrix and $(u, v)$ are the pixel coordinates of $p$.
Other Gaussian parameters ($\Sigma$, $\alpha$, $c$) are predicted in parallel using a dedicated six-layer Gaussian Focused Module, incorporating window-based transformers and sparse top-$k$ reweighting for local attention across 3D Gaussian neighborhoods. The enhanced accuracy of $\mu$ supports artifact-free splatting and improves overall reconstruction fidelity.
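The unprojection step is standard pinhole back-projection and can be sketched directly; the function name and array layout are assumptions.

```python
import numpy as np

def unproject_to_gaussian_centers(depth, K):
    """Unproject a per-pixel depth map to 3D Gaussian centers in the camera frame.

    depth: (H, W)  final refined depth map
    K:     (3, 3)  camera intrinsic matrix
    Returns centers mu of shape (H, W, 3), i.e. mu = d * K^{-1} [u, v, 1]^T.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)  # (H, W, 3) homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                                 # back-projected rays
    return depth[..., None] * rays                                  # scale each ray by its depth
```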
5. Training Pipeline and Implementation Details
IDESplat utilizes a feature backbone formed by fusing outputs from Unimatch (multi-view stereo, 1/4 resolution) and ViT-small DepthAnything V2 (monocular depth cues). Key configuration parameters include three DPBUs (each with two Warp-Index Epipolar Attention layers), escalating resolutions (64, 128, 256), and a Gaussian Focused Module with 6 heads and 256 channels.
Typical model size is 37.6M parameters (versus 354M for DepthSplat). Inference memory footprint is 2.3 GB (vs. 3.3 GB for DepthSplat) and runtime is 0.11 s per forward pass at $256 \times 256$ resolution (two-view input). No explicit depth supervision is used; depth learning proceeds implicitly via minimization of view-synthesis error through a photometric rendering loss on the synthesized target views.
Training uses AdamW for 300K iterations with cosine learning-rate decay, a reduced learning rate for the backbone relative to the remaining modules, and batch size 16 on 8 RTX 4090 GPUs.
6. Empirical Performance and Benchmarks
IDESplat demonstrates state-of-the-art reconstruction and generalization on multiple datasets, summarized in the following table:
| Dataset | Model | Params | PSNR (dB) | SSIM | LPIPS | Notes |
|---|---|---|---|---|---|---|
| RealEstate10K | DepthSplat | 354M | 27.47 | 0.889 | 0.114 | |
| RealEstate10K | IDESplat | 37.6M | 27.80 (+0.33) | 0.893 | 0.107 | ~10.7% params, ~70% mem of DepthSplat |
| ACID | MonoSplat | — | 28.63 | — | — | |
| ACID | IDESplat | — | 29.04 (+0.41) | — | — | |
| DTU (cross-dataset) | DepthSplat | — | 15.38 | — | — | Trained on RE10K |
| DTU (cross-dataset) | IDESplat | — | 18.33 (+2.95) | — | — | |
| DL3DV (2/4/6 views) | MVSplat/DepthSplat | — | — | — | — | IDESplat +0.4–0.6 dB |
IDESplat exhibits consistent improvements in multi-view consistency and generalization, outperforming established baselines across standard datasets with substantially reduced cost: roughly 10.7% of the parameters and 70% of the inference memory of DepthSplat.
7. Technical Novelties, Efficiency, and Significance
IDESplat is characterized by:
- The Depth Probability Boosting Unit: A multiplicative fusion mechanism for aggregating multiple epipolar attention maps across warps, effectively enhancing reliable depth probability candidates while inhibiting outliers.
- Iterative coarse-to-fine architecture: Stacked DPBUs incrementally refine depth estimates over narrowing candidate brackets and increasing spatial resolution, achieving high-quality, robust depth maps.
- Real-time efficiency: The index-only warp structure enables the execution of six warp passes in practical real-time (0.11s/frame at standard input size), supporting fast feed-forward scene synthesis.
- Generalization and parameter efficiency: Empirical results demonstrate advanced generalization, multi-view consistency, and competitive rendering quality with drastically reduced computational overhead compared to predecessors such as DepthSplat.
A plausible implication is that IDESplat’s pipeline architecture and boosting-unit abstraction could inform future approaches to multi-view depth estimation and differentiable volumetric rendering, particularly in domains requiring compact models and deployment efficiency (Long et al., 7 Jan 2026).