
EcoSplat: Controllable 3D Gaussian Splatting

Updated 25 December 2025
  • The paper introduces EcoSplat, an efficient feed-forward 3D Gaussian Splatting framework that allows explicit control over the number of output primitives through a dual-stage training process.
  • It employs Pixel-aligned Gaussian Training and Importance-aware Gaussian Finetuning to suppress redundant primitives and maintain high rendering fidelity under tight efficiency constraints.
  • Empirical evaluations on datasets like RealEstate10K and ACID demonstrate that EcoSplat achieves state-of-the-art quality with up to 10× fewer primitives compared to existing methods.

EcoSplat is an efficiency-controllable feed-forward 3D Gaussian Splatting (3DGS) framework designed for adaptive scene reconstruction and novel view synthesis from multi-view images. It offers explicit control over the number of 3D Gaussian primitives ($K$) output at inference, addressing the inefficiency of prior feed-forward 3DGS methods that predict pixel-aligned primitives per view and cannot regulate primitive count, especially in dense-view settings. The approach comprises two coupled training stages—Pixel-aligned Gaussian Training (PGT) and Importance-aware Gaussian Finetuning (IGF)—followed by a principled selection and filtering mechanism at inference to obtain high-fidelity, compact 3D representations suitable for flexible rendering tasks (Park et al., 21 Dec 2025).

1. Foundational Concepts and Motivation

Feed-forward 3D Gaussian Splatting (3DGS) generates 3D scene representations by predicting anisotropic 3D Gaussian primitives from multiple calibrated RGB views, enabling efficient, one-pass reconstruction without per-scene optimization. Prior architectures produce pixel-aligned primitives for each view, leading to excessive redundancy when aggregating across views, particularly as the number of input images grows. Existing baselines lack mechanisms for directly constraining the number of predicted Gaussians, resulting in memory and compute inefficiencies or uncontrolled quality degradation when attempting to prune or merge primitives.

EcoSplat introduces the first explicit, user-controlled mechanism for specifying the number of output primitives ($K$) via adaptive network conditioning, supervised importance signals, and architectural modifications, delivering state-of-the-art fidelity under tight efficiency budgets. The method is suitable for static scenes with calibrated multi-view input.

2. Two-Stage Training Procedure

2.1 Pixel-aligned Gaussian Training (PGT)

The PGT stage predicts Gaussian primitives directly from multi-view input images. Each of the $N$ views $\{I_i\}_{i=1}^N$ (of spatial size $H \times W$ pixels) is processed by a shared Vision Transformer (ViT) encoder and a cross-view decoder. For each pixel in every view, the network predicts:

  • Center $\mu_{i,j} \in \mathbb{R}^3$
  • Covariance $\Sigma_{i,j} \in \mathbb{R}^{3 \times 3}$
  • Color $c_{i,j} \in \mathbb{R}^3$
  • Opacity $\alpha_{i,j}$

Specifically,

$$\{\mu_{i,j}\}_{j=1}^{HW} = F_{\mu}(Z_i^{(\ell)}) \in \mathbb{R}^{HW \times 3}$$

$$\{[\alpha_{i,j};\, \Sigma_{i,j};\, c_{i,j}]\}_{j=1}^{HW} = F_\nu(\{Z_i^{(\ell)}\}, \psi(I_i))$$

with $Z_i^{(\ell)}$ the tokenized ViT outputs and $\psi(I_i)$ shallow-CNN features.
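A minimal sketch of these two heads, assuming a PyTorch implementation; the module name, layer widths, and the 6-parameter covariance encoding are illustrative assumptions, not the authors' exact architecture:

```python
import torch
import torch.nn as nn

class GaussianHeads(nn.Module):
    """Illustrative per-pixel Gaussian heads: F_mu predicts centers from
    ViT decoder tokens; F_nu predicts opacity, covariance, and color from
    tokens fused with shallow-CNN features psi(I)."""
    def __init__(self, token_dim=256, cnn_dim=64):
        super().__init__()
        # F_mu: 3D center per pixel
        self.f_mu = nn.Sequential(
            nn.Linear(token_dim, 256), nn.GELU(), nn.Linear(256, 3))
        # F_nu: 1 opacity + 6 covariance params (e.g., log-scales plus a
        # rotation parameterization) + 3 color = 10 output channels here
        self.f_nu = nn.Sequential(
            nn.Linear(token_dim + cnn_dim, 256), nn.GELU(), nn.Linear(256, 10))

    def forward(self, tokens, shallow_feats):
        # tokens: (N, H*W, token_dim) decoder output per view
        # shallow_feats: (N, H*W, cnn_dim) psi(I_i) resampled to pixel grid
        mu = self.f_mu(tokens)                                  # (N, H*W, 3)
        params = self.f_nu(torch.cat([tokens, shallow_feats], dim=-1))
        alpha = torch.sigmoid(params[..., :1])                  # opacity in (0, 1)
        cov_params = params[..., 1:7]                           # covariance parameterization
        color = torch.sigmoid(params[..., 7:])                  # RGB in (0, 1)
        return mu, alpha, cov_params, color
```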

Supervision occurs via differentiable splatting and a photometric loss on held-out novel target views:

$$\mathcal{L}_{\rm render} = \frac{1}{N^{\rm tgt}} \sum_{p=1}^{N^{\rm tgt}} \Big[ \mathcal{L}_{\rm MSE}\big(I_p^{\rm tgt}, \hat{I}_p^{\rm tgt}\big) + 0.05\, \mathcal{L}_{\rm LPIPS}\big(I_p^{\rm tgt}, \hat{I}_p^{\rm tgt}\big) \Big].$$
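A sketch of this loss, assuming the open-source `lpips` package for the perceptual term (the paper's exact LPIPS backbone is not reproduced here) and a differentiable splatting renderer producing `pred_views`:

```python
import torch
import lpips  # pip install lpips

# Backbone choice ("vgg") is an assumption, not the paper's stated setting.
lpips_fn = lpips.LPIPS(net="vgg")

def render_loss(pred_views, target_views):
    """MSE + 0.05 * LPIPS, averaged over N_tgt held-out target views.
    pred_views / target_views: (N_tgt, 3, H, W) in [0, 1]."""
    mse = torch.mean((pred_views - target_views) ** 2)
    # LPIPS expects inputs scaled to [-1, 1]
    perceptual = lpips_fn(pred_views * 2 - 1, target_views * 2 - 1).mean()
    return mse + 0.05 * perceptual
```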

2.2 Importance-aware Gaussian Finetuning (IGF)

To enable primitive-count control, a second stage reuses the pretrained ViT and center head but injects a target count $K$ via a learnable “importance embedding” $R_i$ in the parameter head $F_\nu$. The critical innovation is the suppression of low-importance Gaussians by adjusting their opacities $\tilde{\alpha}_{i,j}$ so that the $K$ most significant primitives can be selected at inference while maintaining image fidelity.
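A minimal sketch of this $K$-conditioning, assuming PyTorch; the module name, CNN depth, and layer widths are illustrative assumptions (the additive injection mirrors the “Shallow Add” strategy described in Section 4):

```python
import torch
import torch.nn as nn

class ImportanceEmbedding(nn.Module):
    """Illustrative K-conditioning: broadcast the preservation ratio
    rho = K / (N*H*W) to a spatial map, encode it with a small CNN,
    and add the result to the F_nu input tokens ("Shallow Add")."""
    def __init__(self, dim=256):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(1, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1))

    def forward(self, tokens, rho, h, w):
        # tokens: (N, H*W, dim); rho: scalar preservation ratio in (0, 1]
        n = tokens.shape[0]
        rho_map = torch.full((n, 1, h, w), float(rho), device=tokens.device)
        r = self.encode(rho_map).flatten(2).transpose(1, 2)  # (N, H*W, dim)
        return tokens + r  # shallow additive injection
```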

A global preservation ratio $\rho = K/(N H W)$ is computed and broadcast via a CNN to generate the embedding $R_i$, allowing the network to adapt its prediction density. Mask generation for supervision uses combined photometric (image-gradient) and geometric (surface-normal-gradient) variation; a sketch of the mask computation follows the list below:

  • Photometric: $g_{{\rm photo},i}(x,y) = \sqrt{\|\nabla_x I_i\|_2^2 + \|\nabla_y I_i\|_2^2}$
  • Geometric: $g_{{\rm geo},i} = \sqrt{\|\nabla_x n_i\|_2^2 + \|\nabla_y n_i\|_2^2}$
  • Importance mask: $g_i = (g_{{\rm photo},i} + g_{{\rm geo},i})/2$
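A sketch of the combined importance signal, assuming PyTorch and that per-pixel surface normals $n_i$ are available; the forward-difference gradient operator is an assumption:

```python
import torch
import torch.nn.functional as F

def importance_map(image, normals):
    """Illustrative importance signal g_i: mean of photometric and geometric
    gradient magnitudes. image: (3, H, W) in [0, 1]; normals: (3, H, W)."""
    def grad_mag(x):
        # forward differences, zero-padded at the far border
        gx = F.pad(x[:, :, 1:] - x[:, :, :-1], (0, 1))           # d/dx
        gy = F.pad(x[:, 1:, :] - x[:, :-1, :], (0, 0, 0, 1))     # d/dy
        # per-pixel L2 norm over channels and both directions -> (H, W)
        return torch.sqrt((gx ** 2 + gy ** 2).sum(dim=0) + 1e-12)

    g_photo = grad_mag(image)    # photometric variation
    g_geo = grad_mag(normals)    # geometric variation
    return 0.5 * (g_photo + g_geo)
```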

High-variation Gaussians are retained directly, while low-variation regions are compacted via K-means within $4 \times 4$ spatial patches. Binary masks $\Omega_i$ supervise importance-aware opacity via a binary cross-entropy loss:

$$\mathcal{L}_{io} = \lambda_{io} \frac{1}{N H W} \sum_{i=1}^N \sum_{j=1}^{HW} \mathrm{BCE}\big(\Omega_{i,j}, \tilde{\alpha}_{i,j}\big),$$

with $\lambda_{io} = 0.1$, combined with a $K$-constrained rendering loss computed using only the top $K$ Gaussians:

$$\mathcal{L} = \mathcal{L}_{io} + \mathcal{L}_{K\text{-render}}.$$

To enhance robustness, Progressive Learning on Gaussian Compaction (PLGC) samples $K$ per batch from $[K_{\min}, K_{\max}]$, where

$$K_{\max} = 0.95\, N H W, \qquad K_{\min} = \max\!\big(0.85 - \lambda_{\rm decay} \lfloor t/S \rfloor,\; 0.05\big) \times N H W.$$
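A minimal sketch of the PLGC sampling schedule and the opacity loss, assuming PyTorch; `sample_k` and `importance_opacity_loss` are hypothetical helper names built from the constants quoted above:

```python
import random
import torch
import torch.nn.functional as F

def sample_k(step, n, h, w, lam_decay=0.05, s=1000):
    """PLGC: sample the target count K uniformly from [K_min, K_max],
    where K_min decays with training step t (constants from the paper)."""
    total = n * h * w
    k_max = int(0.95 * total)
    k_min = int(max(0.85 - lam_decay * (step // s), 0.05) * total)
    return random.randint(k_min, k_max)

def importance_opacity_loss(omega, alpha_tilde, lam_io=0.1):
    """BCE between binary keep-masks Omega and adjusted opacities.
    omega, alpha_tilde: (N, H*W), with alpha_tilde in (0, 1)."""
    return lam_io * F.binary_cross_entropy(alpha_tilde, omega.float())
```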

3. Inference-time Primitive Count Control

At test time, primitive selection proceeds as follows:

  1. Compute the global ratio $\rho = K/(N H W)$.
  2. For each input view $i$, compute a high-frequency score $\eta_i$ from the 2D FFT of $I_i$:

$$\eta_i = 1 - \frac{\sum_{\xi \in \Lambda} |\mathrm{FFT}(I_i)|(\xi)}{\sum_{\xi} |\mathrm{FFT}(I_i)|(\xi)},$$

where $\Lambda$ is the low-frequency band.

  3. Distribute the overall budget across views using a softmax with temperature $T = 0.2$: $\Psi_i = \exp(\eta_i/T) / \sum_q \exp(\eta_q/T)$, then allocate $\rho_i = N \Psi_i \rho$ per view.
  4. Predict Gaussians via $F_\nu$ with $R_i$ injected (updated for each $\rho_i$), and select the top $\lceil \rho_i H W \rceil$ per view, ranked by adjusted opacity $\tilde{\alpha}$.
  5. Aggregate across views and adjust so that the total output primitive count is exactly $K$ (see the allocation sketch below).
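A minimal NumPy sketch of the budget allocation (steps 1–3 plus the final count adjustment), assuming grayscale inputs; the low-frequency band $\Lambda$, modeled here as a centered square covering a `low_band` fraction of the spectrum, is an assumption, as the paper's exact band is not reproduced here:

```python
import numpy as np

def allocate_budget(images, k, temperature=0.2, low_band=0.25):
    """Illustrative per-view budget split. images: (N, H, W) in [0, 1]."""
    n, h, w = images.shape
    rho = k / (n * h * w)                      # global preservation ratio
    eta = np.empty(n)
    for i, img in enumerate(images):
        mag = np.abs(np.fft.fftshift(np.fft.fft2(img)))
        cy, cx = h // 2, w // 2
        ry, rx = int(h * low_band / 2), int(w * low_band / 2)
        low = mag[cy - ry:cy + ry + 1, cx - rx:cx + rx + 1].sum()
        eta[i] = 1.0 - low / mag.sum()         # high-frequency score
    psi = np.exp(eta / temperature)
    psi /= psi.sum()                           # softmax over views, T = 0.2
    rho_i = n * psi * rho                      # per-view preservation ratios
    counts = np.ceil(rho_i * h * w).astype(int)
    # absorb the rounding surplus so the total is exactly K
    counts[-1] += k - counts.sum()
    return counts
```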

This mechanism yields a globally uniform or adaptively non-uniform distribution of spatial detail, providing real-time rendering capability and direct user control over resource-accuracy tradeoffs.

4. Architectural and Training Details

The architecture comprises:

  • Shallow CNN ($\psi$) for initial image features.
  • Shared ViT encoder (patch tokens, e.g., $16 \times 16$ patches).
  • $m = 4$ ViT decoder layers with cross-attention for multi-view fusion.
  • Heads: $F_\mu$ (MLP for 3D center prediction) and $F_\nu$ (MLP for opacity, covariance, and color prediction, with $R_i$ injection).
  • “Shallow Add” injection strategy: $R_i$ is added to the spatial tokens after convolutional refinement.
  • Training: both PGT and IGF stages run for 200k iterations each, with the ViT and $F_\mu$ initialized from MASt3R pretraining.
  • Implementation: trained on 4 NVIDIA A100 GPUs (40 GB), using the gsplat rasterization backend.

Key hyperparameters include $\lambda_{io} = 0.1$, the PLGC decay rate $\lambda_{\rm decay} = 0.05$, sampling interval $S = 1000$, and inference softmax temperature $T = 0.2$.
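For reference, the quoted hyperparameters collected into one illustrative config; the field names are ours, not the authors':

```python
from dataclasses import dataclass

@dataclass
class EcoSplatConfig:
    """Hyperparameters quoted in the paper; names are illustrative."""
    lambda_io: float = 0.1      # weight of the importance-aware opacity loss
    lambda_decay: float = 0.05  # K_min decay rate in PLGC
    s_interval: int = 1000      # steps between K_min decay updates
    temperature: float = 0.2    # softmax temperature for per-view budgets
    iters_pgt: int = 200_000    # Pixel-aligned Gaussian Training iterations
    iters_igf: int = 200_000    # Importance-aware Gaussian Finetuning iterations
    resolution: int = 256       # training image resolution (256 x 256)
```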

5. Empirical Performance and Evaluation

EcoSplat was validated on RealEstate10K (RE10K) and the ACID dataset. Training resolution is $256 \times 256$ with 16, 20, or 24 input views (static, calibrated scenes).

Performance, under varying primitive-count budgets (percentage of total pixel-aligned Gaussians), is summarized below:

| Budget (% of pixel-aligned Gaussians) | PSNR (dB) | SSIM | LPIPS | Baseline performance |
|---|---|---|---|---|
| 5% (24 views) | ~24.7 | 0.82 | 0.18 | Baselines collapse below 10 dB |
| 40% | ~25.1 | – | – | Outpaces baselines by >5 dB |
| ACID, 40% (zero-shot) | ~24.02 | – | – | Best prior: 24.40 with 2× Gaussians |

Ablation studies reveal that omitting PGT or IGF results in severe performance degradation at low budgets (e.g., removing IGF collapses PSNR to ~6.5 dB at the 5% budget). The importance-aware opacity loss $\mathcal{L}_{io}$ and PLGC sampling are critical to maintaining stability and rendering quality (Park et al., 21 Dec 2025).

A plausible implication is that the dual-stage approach, especially IGF and PLGC, is necessary for effective scaling of 3DGS to strict memory and compute regimes.

6. Limitations and Future Research

The current design applies to static scenes and requires calibrated multi-view input; EcoSplat does not natively support dynamic or deforming scenes or unknown camera motion. The authors propose extending the importance-aware, $K$-conditioned 3DGS strategy to dynamic 4D settings as future research. This would likely involve temporal cross-attention and trajectory-level modeling (referencing related ideas such as those in SpatialTrackerV2), and per-frame adaptation of dynamic/static component separation (Park et al., 21 Dec 2025).

7. Context Within Feed-forward 3DGS

EcoSplat represents the first feed-forward architecture enabling direct specification of output primitive cardinality with stable, high-fidelity rendering—even under extremely compressed representational budgets. It outperforms state-of-the-art methods such as AnySplat, WorldMirror, and GGN under strict constraints, often matching quality with 10× fewer primitives. Its importance-aware selection pipeline and progressive compaction training establish new standards for efficiency-controllable 3DGS and provide a foundation for downstream applications requiring rapid, resource-aware scene reconstruction.

For static multi-view scenarios, EcoSplat’s approach offers a reproducible and principled method for balancing quality and efficiency in 3D representation and differentiable graphics (Park et al., 21 Dec 2025).
