
IDESplat: Efficient 3D Gaussian Splatting

Updated 14 January 2026
  • IDESplat is an iterative depth estimation framework that fuses multi-warp epipolar attention maps via a novel Depth Probability Boosting Unit for precise 3D Gaussian splatting.
  • It employs an iterative coarse-to-fine strategy that refines depth estimates over increasing resolutions, ensuring robust multi-view consistency.
  • The method achieves state-of-the-art performance with reduced parameters and memory usage, enabling efficient real-time scene reconstruction and novel view synthesis.

IDESplat is an iterative depth estimation framework for generalizable 3D Gaussian Splatting (3DGS), designed to achieve precise depth probability estimation and efficient Gaussian parameter prediction in feed-forward scene reconstruction. Unlike prior approaches that rely on single-warp cost volumes and suffer from instability in depth estimation, IDESplat introduces novel architectural components—primarily the Depth Probability Boosting Unit (DPBU) and an iterative coarse-to-fine inference cascade—that multiplicatively fuse multi-warp epipolar attention maps for robust depth refinement. This strategy enables high-fidelity placement of 3D Gaussian centers, significantly advancing multi-view consistent scene representations and novel view synthesis within a memory- and parameter-efficient pipeline (Long et al., 7 Jan 2026).

1. Generalizable 3D Gaussian Splatting and Depth Estimation Challenges

Generalizable 3D Gaussian Splatting learns a feed-forward network that, given calibrated multi-view input, predicts per-pixel parameters for oriented 3D Gaussians—including mean ($\mu$), covariance ($\Sigma$), opacity ($\alpha$), and color ($c$). The accurate prediction of the Gaussian center $\mu$ is notably challenging; prior works circumvent direct estimation by first predicting a depth map $D(u,v)$ and unprojecting each pixel $(u, v, D(u,v))$ into 3D to form $\mu$. The reliability of $D(\cdot)$ is vital, as misaligned depths propagate to erroneous 3D Gaussian placements and degrade rendered view fidelity.

Traditional methods like MVSplat, DepthSplat, and MonoSplat utilize single cross-view warp operations to compute per-pixel cost volumes, applying softmax across $D$ discrete depth candidates to obtain depth probabilities. However, single-warp constructions are subject to severe instability—particularly in occluded, textureless, or noisy regions—resulting in coarse, unreliable depth maps and suboptimal scene reconstructions.
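The single-warp baseline described above reduces to a softmax over $D$ depth candidates followed by an expectation (soft-argmax). A minimal numpy sketch of that step, with illustrative names not taken from any of the cited papers:

```python
import numpy as np

def depth_from_cost_volume(cost, candidates):
    """Soft-argmax depth from a per-pixel cost volume.

    cost:       (H, W, D) matching scores for D depth candidates
    candidates: (D,) candidate depth values d_1..d_D
    returns:    (H, W) expected depth under the softmax distribution
    """
    # Softmax over the depth axis turns matching scores into probabilities.
    c = cost - cost.max(axis=-1, keepdims=True)   # for numerical stability
    p = np.exp(c)
    p /= p.sum(axis=-1, keepdims=True)
    return (p * candidates).sum(axis=-1)

H, W, D = 4, 4, 8
rng = np.random.default_rng(0)
cost = rng.normal(size=(H, W, D))             # stand-in for a real cost volume
candidates = np.linspace(1.0, 8.0, D)
depth = depth_from_cost_volume(cost, candidates)
```

When the cost volume is noisy (as in occluded or textureless regions), the softmax distribution is diffuse and the expected depth is unreliable—the failure mode that motivates IDESplat's multi-warp fusion.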

2. Depth Probability Boosting Unit (DPBU): Architecture and Equations

A central element of IDESplat, the Depth Probability Boosting Unit (DPBU), resolves the instability of single-warp approaches by employing multiple parallel Warp-Index Epipolar Attention layers and multiplicative fusion strategies. The method operates at feature resolution $H' \times W'$ over a set of depth candidates $\mathcal{G} = \{d_1, \ldots, d_D\}$, leveraging source ($j$) and target ($i$) view features.

Warp-Index Epipolar Attention

For each source-target view pair, the index map

$$I^{j \to i}(x, m) = \mathrm{IW}(F^j, P^i, P^j, \mathcal{G})_{x, m}$$

describes where target pixel $x$ projects into source view $j$ at depth $d_m$, with $\mathrm{IW}(\cdot)$ denoting the warp indexing procedure. Sparse matrix multiplication constructs the $D$-channel epipolar correlation map:

$$C^i(x, m) = \Psi\left( F^i(x), F^j, I^{j \to i}(x, m) \right)$$

where $\Psi$ extracts features from $F^j$ as specified by $I$ and computes their inner products with normalized $F^i(x)$. After refinement by a 2D U-Net and upsampling to $H \times W$, the correlation becomes

$$\tilde{C}^i \in \mathbb{R}^{H \times W \times D}$$

with softmax applied to yield single-warp attention (depth-probability) maps:

$$A^i(x, m) = \mathrm{softmax}_m\left(\tilde{C}^i(x, m)\right)$$
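The correlation-and-softmax step amounts to a gather of source features at the warped indices, an inner product with the target feature, and a softmax over the depth axis. A numpy sketch under simplifying assumptions: the index map is random here rather than derived from epipolar geometry via $\mathrm{IW}$, and the U-Net refinement and upsampling are omitted.

```python
import numpy as np

def epipolar_correlation(Fi, Fj_flat, idx):
    """Warp-index epipolar correlation as a gather + inner product.

    Fi:      (H, W, C) normalized target-view features F^i
    Fj_flat: (H*W, C)  source-view features F^j, flattened
    idx:     (H, W, D) integer index map I^{j->i}: where each pixel x
             lands in view j for each of the D depth candidates
    returns: (H, W, D) correlation map C^i
    """
    gathered = Fj_flat[idx]                         # (H, W, D, C)
    return np.einsum('hwc,hwdc->hwd', Fi, gathered)

def attention_map(corr):
    """Softmax over the depth axis -> per-pixel depth probabilities A^i."""
    c = corr - corr.max(axis=-1, keepdims=True)
    p = np.exp(c)
    return p / p.sum(axis=-1, keepdims=True)

H, W, C, D = 4, 4, 16, 8
rng = np.random.default_rng(1)
Fi = rng.normal(size=(H, W, C))
Fi /= np.linalg.norm(Fi, axis=-1, keepdims=True)    # normalize F^i(x)
Fj_flat = rng.normal(size=(H * W, C))
idx = rng.integers(0, H * W, size=(H, W, D))        # stand-in for IW(.)
A = attention_map(epipolar_correlation(Fi, Fj_flat, idx))
```

Because only integer indices (not full warped feature volumes) are stored, each additional warp pass is cheap—the property the paper exploits to run several passes per DPBU.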

Multiplicative Fusion of Attention Maps

Within each DPBU, $M$ parallel attention layers produce outputs $\{A_1, \ldots, A_M\}$, fused multiplicatively:

$$P_0(x, m) := 1$$

$$P_m(x, m) = \mathrm{Norm}\left(P_{m-1}(x, m) \odot A_m(x, m)\right), \quad m = 1, \ldots, M$$

Here, $\odot$ designates element-wise multiplication over the depth axis, and $\mathrm{Norm}$ renormalizes $P_m$ so that its entries at each pixel sum to one over the depth candidates. Depth candidates receiving consistently high attention scores across warps are amplified in the final $P_M$, while outliers supported by only a single warp are suppressed.
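The amplification effect can be seen numerically. Below is a minimal sketch of the fusion rule (the function name is illustrative): two toy attention maps each split their mass between a shared candidate and a warp-specific outlier, and multiplication plus renormalization concentrates the fused mass on the candidate both warps agree on.

```python
import numpy as np

def fuse_attention_maps(maps, eps=1e-8):
    """Multiplicative fusion of M depth-probability maps (DPBU core idea).

    maps: list of (H, W, D) attention maps A_1..A_M, each summing to 1
          over the depth axis.
    returns P_M: candidates scored highly by *all* warps are amplified,
          candidates supported by only one warp are suppressed.
    """
    P = np.ones_like(maps[0])                          # P_0 := 1
    for A in maps:
        P = P * A                                      # element-wise over depth
        P = P / (P.sum(axis=-1, keepdims=True) + eps)  # Norm: renormalize
    return P

# Toy example with D = 4 candidates, identical at every pixel:
# both warps like candidate 0; each also likes a different outlier.
A1 = np.broadcast_to(np.array([0.4, 0.4, 0.1, 0.1]), (2, 2, 4))
A2 = np.broadcast_to(np.array([0.4, 0.1, 0.4, 0.1]), (2, 2, 4))
P = fuse_attention_maps([A1, A2])
# Candidate 0's mass grows from 0.4 to 0.64; the outliers shrink.
```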

3. Iterative Coarse-to-Fine Depth Estimation

IDESplat employs $N$ stacked DPBUs for progressive depth refinement, with each stage reranking depth candidates and improving spatial resolution. After the $k$-th DPBU, the boosted probability map $P^{(k)}(x, m)$ is transformed to a residual depth estimate:

$$\Delta D^{(k)}(x) = \sum_{m=1}^{D} P^{(k)}(x, m) \cdot d_m$$

$$D^{(k)}(x) = D^{(k-1)}(x) + \Delta D^{(k)}(x), \quad D^{(0)} \equiv 0$$

At each iteration, depth candidates $\mathcal{G}$ are re-centered and halved in range, and feature resolution is doubled (e.g., $64^2 \rightarrow 128^2 \rightarrow 256^2$), culminating in tightly bracketed, high-resolution depth maps. In practice, $N = 3$, $M = 2$ yields six total warp passes, with the final iteration executed at $256 \times 256$ resolution over refined candidate depths.
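The cascade's control flow—expected residual over the current candidate bracket, then re-centering and range halving—can be sketched as follows. The probability model here is a toy stand-in for a DPBU (it simply prefers candidates near a hidden true depth), and all names are illustrative; only the update rule mirrors the equations above.

```python
import numpy as np

def coarse_to_fine_depth(prob_fn, init_depth, init_half, D=8, N=3):
    """Sketch of the N-stage cascade: expected residual, then bracket halving.

    prob_fn(cands) -> probabilities over the D candidates (last axis);
    a stand-in for one DPBU, not the paper's network.
    """
    depth, half = init_depth, init_half
    for _ in range(N):
        offsets = np.linspace(-half, half, D)        # residual candidates
        cands = depth[..., None] + offsets           # (H, W, D) bracket
        P = prob_fn(cands)
        depth = depth + (P * offsets).sum(axis=-1)   # D^(k) = D^(k-1) + dD^(k)
        half /= 2.0                                  # halve the search range
    return depth

def prob_fn(cands, true_depth=6.3, tau=0.5):
    """Toy 'DPBU': softmax preference for candidates near a true depth."""
    logits = -((cands - true_depth) ** 2) / tau
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return p / p.sum(axis=-1, keepdims=True)

# Start from a coarse guess of 5.0 with a wide +/-5 bracket.
depth = coarse_to_fine_depth(prob_fn, np.full((2, 2), 5.0), init_half=5.0)
```

Three stages of bracket halving recover the hidden depth to within a small fraction of the initial candidate spacing, illustrating why the final stage can afford a tight bracket at full resolution.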

4. 3D Gaussian Parameter Prediction and Integration

Once the final per-pixel depth $D_{\text{final}}(u, v)$ is obtained, unprojection recovers the 3D Gaussian center in camera $i$'s coordinate frame:

$$\mu_i(u, v) = (P^i)^{-1}\, [u, v, 1]^T\, D_{\text{final}}(u, v)$$

Other Gaussian parameters—$\Sigma$, $\alpha$, $c$—are predicted in parallel using a dedicated six-layer Gaussian Focused Module, incorporating window-based transformers and sparse top-$k$ reweighting for local attention across 3D Gaussian neighborhoods. The enhanced accuracy of $\mu_i$ supports artifact-free splatting and improves overall reconstruction fidelity.
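The unprojection step is standard pinhole-camera geometry. A minimal numpy sketch, assuming $P^i$ is the intrinsic matrix $K$ (extrinsics are omitted since the equation above works in camera $i$'s own frame):

```python
import numpy as np

def unproject(depth, K):
    """Lift a depth map to 3D points (Gaussian centers mu) in the camera frame.

    depth: (H, W) final per-pixel depth D_final
    K:     (3, 3) camera intrinsics (stand-in for P^i above)
    returns: (H, W, 3) points mu(u, v)
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                    # K^-1 [u, v, 1]^T
    return rays * depth[..., None]                     # scale rays by depth

K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
mu = unproject(np.full((4, 4), 2.0), K)
```

With this convention the z-coordinate of each recovered point equals its depth, so depth errors translate directly into displaced Gaussian centers—which is why the preceding refinement matters.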

5. Training Pipeline and Implementation Details

IDESplat utilizes a feature backbone formed by fusing outputs from Unimatch (multi-view stereo, 1/4 resolution) and ViT-small DepthAnything V2 (monocular depth cues). Key configuration parameters include three DPBUs (each with two Warp-Index Epipolar Attention layers), escalating resolutions ($64^2$, $128^2$, $256^2$), and a Gaussian Focused Module with 6 attention heads and 256 channels.

Typical model size is ≈37.6M parameters (versus 354M for DepthSplat). Inference memory footprint is ≈2.3 GB (vs. 3.3 GB for DepthSplat), and runtime is ≈0.11 s per forward pass at $256 \times 256$ resolution with two-view input. No explicit depth supervision is used; depth learning proceeds implicitly by minimizing the view-synthesis error through

$$L_{\text{color}} = \| \hat{Y} - Y \|_2^2 + \lambda\, \mathrm{LPIPS}(\hat{Y}, Y)$$
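As a sketch, the loss combines a squared L2 photometric term with a learned perceptual term. In the snippet below the LPIPS metric is left as a pluggable callable (e.g. from the `lpips` package), and the weight value is an illustrative assumption—the paper's $\lambda$ is not specified here.

```python
import numpy as np

def color_loss(Y_hat, Y, lam=0.05, lpips_fn=None):
    """View-synthesis loss: squared L2 plus a weighted perceptual term.

    Y_hat, Y: rendered and ground-truth images, same shape
    lam:      perceptual weight (illustrative value, not from the paper)
    lpips_fn: stand-in for a learned LPIPS metric; omitted -> L2 only
    """
    l2 = np.sum((Y_hat - Y) ** 2)            # || Y_hat - Y ||_2^2
    perceptual = lpips_fn(Y_hat, Y) if lpips_fn is not None else 0.0
    return l2 + lam * perceptual

loss = color_loss(np.ones((2, 2, 3)), np.zeros((2, 2, 3)))
```

Because no depth ground truth enters the loss, gradients reach the DPBUs only through the rendered images, making the depth estimates an emergent byproduct of view synthesis.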

Training uses AdamW for 300K iterations with cosine learning-rate decay, a backbone learning rate of $2 \times 10^{-6}$, $2 \times 10^{-4}$ for the remaining modules, and batch size 16 on 8 RTX 4090 GPUs.

6. Empirical Performance and Benchmarks

IDESplat demonstrates state-of-the-art reconstruction and generalization on multiple datasets, summarized in the following table:

| Dataset | Model | Params | PSNR (dB) | SSIM | LPIPS | Notes |
|---|---|---|---|---|---|---|
| RealEstate10K | DepthSplat | 354M | 27.47 | 0.889 | 0.114 | |
| RealEstate10K | IDESplat | 37.6M | 27.80 (+0.33) | 0.893 | 0.107 | Params ↓ to 10.7%, memory ↓ to 70% |
| ACID | MonoSplat | — | 28.63 | — | — | |
| ACID | IDESplat | — | 29.04 (+0.41) | — | — | |
| DTU (cross-dataset) | DepthSplat | — | 15.38 | — | — | Trained on RE10K |
| DTU (cross-dataset) | IDESplat | — | 18.33 (+2.95) | — | — | |
| DL3DV (2/4/6 views) | MVSplat / DepthSplat | — | — | — | — | IDESplat +0.4–0.6 dB |

IDESplat exhibits consistent improvements in multi-view consistency and generalization, outperforming established baselines across standard datasets with substantially reduced parameter count and memory usage (≈10.7% of the parameters and ≈70% of the memory of DepthSplat).

7. Technical Novelties, Efficiency, and Significance

IDESplat is characterized by:

  • The Depth Probability Boosting Unit: A multiplicative fusion mechanism for aggregating multiple epipolar attention maps across warps, effectively enhancing reliable depth probability candidates while inhibiting outliers.
  • Iterative coarse-to-fine architecture: Stacked DPBUs incrementally refine depth estimates over narrowing candidate brackets and increasing spatial resolution, achieving high-quality, robust depth maps.
  • Real-time efficiency: The index-only warp structure enables the execution of six warp passes in practical real time (≈0.11 s/frame at standard input size), supporting fast feed-forward scene synthesis.
  • Generalization and parameter efficiency: Empirical results demonstrate advanced generalization, multi-view consistency, and competitive rendering quality with drastically reduced computational overhead compared to predecessors such as DepthSplat.

A plausible implication is that IDESplat’s pipeline architecture and boosting-unit abstraction could inform future approaches to multi-view depth estimation and differentiable volumetric rendering, particularly in domains requiring compact models and deployment efficiency (Long et al., 7 Jan 2026).
