
IDESplat: Efficient 3D Gaussian Splatting

Updated 14 January 2026
  • IDESplat is an iterative depth estimation framework that fuses multi-warp epipolar attention maps via a novel Depth Probability Boosting Unit for precise 3D Gaussian splatting.
  • It employs an iterative coarse-to-fine strategy that refines depth estimates over increasing resolutions, ensuring robust multi-view consistency.
  • The method achieves state-of-the-art performance with reduced parameters and memory usage, enabling efficient real-time scene reconstruction and novel view synthesis.

IDESplat is an iterative depth estimation framework for generalizable 3D Gaussian Splatting (3DGS), designed to achieve precise depth probability estimation and efficient Gaussian parameter prediction in feed-forward scene reconstruction. Unlike prior approaches that rely on single-warp cost volumes and suffer from instability in depth estimation, IDESplat introduces novel architectural components—primarily the Depth Probability Boosting Unit (DPBU) and an iterative coarse-to-fine inference cascade—that multiplicatively fuse multi-warp epipolar attention maps for robust depth refinement. This strategy enables high-fidelity placement of 3D Gaussian centers, significantly advancing multi-view consistent scene representations and novel view synthesis within a memory- and parameter-efficient pipeline (Long et al., 7 Jan 2026).

1. Generalizable 3D Gaussian Splatting and Depth Estimation Challenges

Generalizable 3D Gaussian Splatting learns a feed-forward network that, given calibrated multi-view input, predicts per-pixel parameters for oriented 3D Gaussians—including mean ($\mu$), covariance ($\Sigma$), opacity ($\alpha$), and color ($c$). The accurate prediction of the Gaussian center $\mu$ is notably challenging; prior works circumvent direct estimation by first predicting a depth map $D(u,v)$ and unprojecting each pixel $(u, v, D(u,v))$ into 3D to form $\mu$. The reliability of $D(\cdot)$ is vital, as misaligned depths propagate to erroneous 3D Gaussian placements and degrade rendered view fidelity.

Traditional methods like MVSplat, DepthSplat, and MonoSplat utilize single cross-view warp operations to compute per-pixel cost volumes, applying softmax across $D$ discrete depth candidates to obtain depth probabilities. However, single-warp constructions are subject to severe instability—particularly in occluded, textureless, or noisy regions—resulting in coarse, unreliable depth maps and suboptimal scene reconstructions.
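The single-warp baseline described above reduces to a softmax over $D$ depth candidates followed by an expectation (soft-argmax). A minimal numpy sketch of that step, with illustrative names not taken from any of the cited papers:

```python
import numpy as np

def depth_from_cost_volume(cost, candidates):
    """Soft-argmax depth from a per-pixel cost volume.

    cost:       (H, W, D) matching scores for D depth candidates
    candidates: (D,) candidate depth values d_1..d_D
    returns:    (H, W) expected depth under the softmax distribution
    """
    # Softmax over the depth axis turns matching scores into probabilities.
    c = cost - cost.max(axis=-1, keepdims=True)   # for numerical stability
    p = np.exp(c)
    p /= p.sum(axis=-1, keepdims=True)
    return (p * candidates).sum(axis=-1)

H, W, D = 4, 4, 8
rng = np.random.default_rng(0)
cost = rng.normal(size=(H, W, D))             # stand-in for a real cost volume
candidates = np.linspace(1.0, 8.0, D)
depth = depth_from_cost_volume(cost, candidates)
```

When the cost volume is noisy (as in occluded or textureless regions), the softmax distribution is diffuse and the expected depth is unreliable—the failure mode that motivates IDESplat's multi-warp fusion.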

2. Depth Probability Boosting Unit (DPBU): Architecture and Equations

A central element of IDESplat, the Depth Probability Boosting Unit (DPBU), resolves the instability of single-warp approaches by employing multiple parallel Warp-Index Epipolar Attention layers and multiplicative fusion strategies. The method operates at feature resolution $H' \times W'$ over a set of depth candidates $\mathcal{G} = \{d_1, \ldots, d_D\}$, leveraging source ($j$) and target ($i$) view features.

Warp-Index Epipolar Attention

For each source-target view pair, the index map

$$I^{j \to i}(x, m) = \mathrm{IW}(F^j, P^i, P^j, \mathcal{G})_{x, m}$$

describes where target pixel $x$ projects into source view $j$ at depth $d_m$, with $\mathrm{IW}(\cdot)$ denoting the warp indexing procedure. Sparse matrix multiplication constructs the $D$-channel epipolar correlation map:

$$C^i(x, m) = \Psi\left( F^i(x), F^j, I^{j \to i}(x, m) \right)$$

where $\Psi$ extracts features from $F^j$ as specified by $I$ and computes their inner products with normalized $F^i(x)$. After refinement by a 2D U-Net and upsampling to $H \times W$, the correlation becomes

$$\tilde{C}^i \in \mathbb{R}^{H \times W \times D}$$

with softmax applied to yield single-warp attention (depth-probability) maps:

$$A^i(x, m) = \mathrm{softmax}_m\left(\tilde{C}^i(x, m)\right)$$
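The correlation-and-softmax step amounts to a gather of source features at the warped indices, an inner product with the target feature, and a softmax over the depth axis. A numpy sketch under simplifying assumptions: the index map is random here rather than derived from epipolar geometry via $\mathrm{IW}$, and the U-Net refinement and upsampling are omitted.

```python
import numpy as np

def epipolar_correlation(Fi, Fj_flat, idx):
    """Warp-index epipolar correlation as a gather + inner product.

    Fi:      (H, W, C) normalized target-view features F^i
    Fj_flat: (H*W, C)  source-view features F^j, flattened
    idx:     (H, W, D) integer index map I^{j->i}: where each pixel x
             lands in view j for each of the D depth candidates
    returns: (H, W, D) correlation map C^i
    """
    gathered = Fj_flat[idx]                         # (H, W, D, C)
    return np.einsum('hwc,hwdc->hwd', Fi, gathered)

def attention_map(corr):
    """Softmax over the depth axis -> per-pixel depth probabilities A^i."""
    c = corr - corr.max(axis=-1, keepdims=True)
    p = np.exp(c)
    return p / p.sum(axis=-1, keepdims=True)

H, W, C, D = 4, 4, 16, 8
rng = np.random.default_rng(1)
Fi = rng.normal(size=(H, W, C))
Fi /= np.linalg.norm(Fi, axis=-1, keepdims=True)    # normalize F^i(x)
Fj_flat = rng.normal(size=(H * W, C))
idx = rng.integers(0, H * W, size=(H, W, D))        # stand-in for IW(.)
A = attention_map(epipolar_correlation(Fi, Fj_flat, idx))
```

Because only integer indices (not full warped feature volumes) are stored, each additional warp pass is cheap—the property the paper exploits to run several passes per DPBU.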

Multiplicative Fusion of Attention Maps

Within each DPBU, $M$ parallel attention layers produce outputs $\{A_1, \ldots, A_M\}$, fused multiplicatively:

$$P_0(x, m) := 1$$

$$P_m(x, m) = \mathrm{Norm}\left(P_{m-1}(x, m) \odot A_m(x, m)\right), \quad m = 1, \ldots, M$$

Here, $\odot$ designates element-wise multiplication over the depth axis, and $\mathrm{Norm}$ renormalizes $P_m$ so that its entries at each pixel sum to one over the depth candidates. Depth candidates receiving consistently high attention scores across warps are amplified in the final $P_M$, while outliers supported by only a single warp are suppressed.
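The amplification effect can be seen numerically. Below is a minimal sketch of the fusion rule (the function name is illustrative): two toy attention maps each split their mass between a shared candidate and a warp-specific outlier, and multiplication plus renormalization concentrates the fused mass on the candidate both warps agree on.

```python
import numpy as np

def fuse_attention_maps(maps, eps=1e-8):
    """Multiplicative fusion of M depth-probability maps (DPBU core idea).

    maps: list of (H, W, D) attention maps A_1..A_M, each summing to 1
          over the depth axis.
    returns P_M: candidates scored highly by *all* warps are amplified,
          candidates supported by only one warp are suppressed.
    """
    P = np.ones_like(maps[0])                          # P_0 := 1
    for A in maps:
        P = P * A                                      # element-wise over depth
        P = P / (P.sum(axis=-1, keepdims=True) + eps)  # Norm: renormalize
    return P

# Toy example with D = 4 candidates, identical at every pixel:
# both warps like candidate 0; each also likes a different outlier.
A1 = np.broadcast_to(np.array([0.4, 0.4, 0.1, 0.1]), (2, 2, 4))
A2 = np.broadcast_to(np.array([0.4, 0.1, 0.4, 0.1]), (2, 2, 4))
P = fuse_attention_maps([A1, A2])
# Candidate 0's mass grows from 0.4 to 0.64; the outliers shrink.
```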

3. Iterative Coarse-to-Fine Depth Estimation

IDESplat employs $N$ stacked DPBUs for progressive depth refinement, with each stage reranking depth candidates and improving spatial resolution. After the $k$-th DPBU, the boosted probability map $P^{(k)}(x, m)$ is transformed to a residual depth estimate:

$$\Delta D^{(k)}(x) = \sum_{m=1}^{D} P^{(k)}(x, m) \cdot d_m$$

$$D^{(k)}(x) = D^{(k-1)}(x) + \Delta D^{(k)}(x), \quad D^{(0)} \equiv 0$$

At each iteration, depth candidates $\mathcal{G}$ are re-centered and halved in range, and feature resolution is doubled (e.g., $64^2 \rightarrow 128^2 \rightarrow 256^2$), culminating in tightly bracketed, high-resolution depth maps. In practice, $N = 3$, $M = 2$ yields six total warp passes, with the final iteration executed at $256 \times 256$ resolution over refined candidate depths.
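The cascade's control flow—expected residual over the current candidate bracket, then re-centering and range halving—can be sketched as follows. The probability model here is a toy stand-in for a DPBU (it simply prefers candidates near a hidden true depth), and all names are illustrative; only the update rule mirrors the equations above.

```python
import numpy as np

def coarse_to_fine_depth(prob_fn, init_depth, init_half, D=8, N=3):
    """Sketch of the N-stage cascade: expected residual, then bracket halving.

    prob_fn(cands) -> probabilities over the D candidates (last axis);
    a stand-in for one DPBU, not the paper's network.
    """
    depth, half = init_depth, init_half
    for _ in range(N):
        offsets = np.linspace(-half, half, D)        # residual candidates
        cands = depth[..., None] + offsets           # (H, W, D) bracket
        P = prob_fn(cands)
        depth = depth + (P * offsets).sum(axis=-1)   # D^(k) = D^(k-1) + dD^(k)
        half /= 2.0                                  # halve the search range
    return depth

def prob_fn(cands, true_depth=6.3, tau=0.5):
    """Toy 'DPBU': softmax preference for candidates near a true depth."""
    logits = -((cands - true_depth) ** 2) / tau
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return p / p.sum(axis=-1, keepdims=True)

# Start from a coarse guess of 5.0 with a wide +/-5 bracket.
depth = coarse_to_fine_depth(prob_fn, np.full((2, 2), 5.0), init_half=5.0)
```

Three stages of bracket halving recover the hidden depth to within a small fraction of the initial candidate spacing, illustrating why the final stage can afford a tight bracket at full resolution.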

4. 3D Gaussian Parameter Prediction and Integration

Once the final per-pixel depth $D_{\text{final}}(u, v)$ is obtained, unprojection recovers the 3D Gaussian center in camera $i$'s coordinate frame:

$$\mu_i(u, v) = (P^i)^{-1}\, [u, v, 1]^T\, D_{\text{final}}(u, v)$$

Other Gaussian parameters—$\Sigma$, $\alpha$, $c$—are predicted in parallel using a dedicated six-layer Gaussian Focused Module, incorporating window-based transformers and sparse top-$k$ reweighting for local attention across 3D Gaussian neighborhoods. The enhanced accuracy of $\mu_i$ supports artifact-free splatting and improves overall reconstruction fidelity.
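The unprojection step is standard pinhole-camera geometry. A minimal numpy sketch, assuming $P^i$ is the intrinsic matrix $K$ (extrinsics are omitted since the equation above works in camera $i$'s own frame):

```python
import numpy as np

def unproject(depth, K):
    """Lift a depth map to 3D points (Gaussian centers mu) in the camera frame.

    depth: (H, W) final per-pixel depth D_final
    K:     (3, 3) camera intrinsics (stand-in for P^i above)
    returns: (H, W, 3) points mu(u, v)
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                    # K^-1 [u, v, 1]^T
    return rays * depth[..., None]                     # scale rays by depth

K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
mu = unproject(np.full((4, 4), 2.0), K)
```

With this convention the z-coordinate of each recovered point equals its depth, so depth errors translate directly into displaced Gaussian centers—which is why the preceding refinement matters.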

5. Training Pipeline and Implementation Details

IDESplat utilizes a feature backbone formed by fusing outputs from Unimatch (multi-view stereo, 1/4 resolution) and ViT-small DepthAnything V2 (monocular depth cues). Key configuration parameters include three DPBUs (each with two Warp-Index Epipolar Attention layers), escalating resolutions ($64^2$, $128^2$, $256^2$), and a Gaussian Focused Module with 6 attention heads and 256 channels.

Typical model size is ≈37.6M parameters (versus 354M for DepthSplat). Inference memory footprint is ≈2.3 GB (vs. 3.3 GB for DepthSplat), and runtime is ≈0.11 s per forward pass at $256 \times 256$ resolution with two-view input. No explicit depth supervision is used; depth learning proceeds implicitly by minimizing the view-synthesis error through

$$L_{\text{color}} = \| \hat{Y} - Y \|_2^2 + \lambda\, \mathrm{LPIPS}(\hat{Y}, Y)$$
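As a sketch, the loss combines a squared L2 photometric term with a learned perceptual term. In the snippet below the LPIPS metric is left as a pluggable callable (e.g. from the `lpips` package), and the weight value is an illustrative assumption—the paper's $\lambda$ is not specified here.

```python
import numpy as np

def color_loss(Y_hat, Y, lam=0.05, lpips_fn=None):
    """View-synthesis loss: squared L2 plus a weighted perceptual term.

    Y_hat, Y: rendered and ground-truth images, same shape
    lam:      perceptual weight (illustrative value, not from the paper)
    lpips_fn: stand-in for a learned LPIPS metric; omitted -> L2 only
    """
    l2 = np.sum((Y_hat - Y) ** 2)            # || Y_hat - Y ||_2^2
    perceptual = lpips_fn(Y_hat, Y) if lpips_fn is not None else 0.0
    return l2 + lam * perceptual

loss = color_loss(np.ones((2, 2, 3)), np.zeros((2, 2, 3)))
```

Because no depth ground truth enters the loss, gradients reach the DPBUs only through the rendered images, making the depth estimates an emergent byproduct of view synthesis.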

Training uses AdamW for 300K iterations with cosine learning-rate decay, a backbone learning rate of $2 \times 10^{-6}$, $2 \times 10^{-4}$ for the remaining modules, and batch size 16 on 8 RTX 4090 GPUs.

6. Empirical Performance and Benchmarks

IDESplat demonstrates state-of-the-art reconstruction and generalization on multiple datasets, summarized in the following table:

| Dataset | Model | Params | PSNR (dB) | SSIM | LPIPS | Notes |
|---|---|---|---|---|---|---|
| RealEstate10K | DepthSplat | 354M | 27.47 | 0.889 | 0.114 | |
| RealEstate10K | IDESplat | 37.6M | 27.80 (+0.33) | 0.893 | 0.107 | Params ↓ to 10.7%, memory ↓ to 70% |
| ACID | MonoSplat | — | 28.63 | — | — | |
| ACID | IDESplat | — | 29.04 (+0.41) | — | — | |
| DTU (cross-dataset) | DepthSplat | — | 15.38 | — | — | Trained on RE10K |
| DTU (cross-dataset) | IDESplat | — | 18.33 (+2.95) | — | — | |
| DL3DV (2/4/6 views) | MVSplat / DepthSplat | — | — | — | — | IDESplat +0.4–0.6 dB |

IDESplat exhibits consistent improvements in multi-view consistency and generalization, outperforming established baselines across standard datasets with substantially reduced parameter count and memory usage (≈10.7% of the parameters and ≈70% of the memory of DepthSplat).

7. Technical Novelties, Efficiency, and Significance

IDESplat is characterized by:

  • The Depth Probability Boosting Unit: A multiplicative fusion mechanism for aggregating multiple epipolar attention maps across warps, effectively enhancing reliable depth probability candidates while inhibiting outliers.
  • Iterative coarse-to-fine architecture: Stacked DPBUs incrementally refine depth estimates over narrowing candidate brackets and increasing spatial resolution, achieving high-quality, robust depth maps.
  • Real-time efficiency: The index-only warp structure enables the execution of six warp passes in practical real time (≈0.11 s/frame at standard input size), supporting fast feed-forward scene synthesis.
  • Generalization and parameter efficiency: Empirical results demonstrate advanced generalization, multi-view consistency, and competitive rendering quality with drastically reduced computational overhead compared to predecessors such as DepthSplat.

A plausible implication is that IDESplat’s pipeline architecture and boosting-unit abstraction could inform future approaches to multi-view depth estimation and differentiable volumetric rendering, particularly in domains requiring compact models and deployment efficiency (Long et al., 7 Jan 2026).
