EcoSplat: Controllable 3D Gaussian Splatting

Updated 25 December 2025

The paper introduces EcoSplat, an efficient feed-forward 3D Gaussian Splatting framework that allows explicit control over the number of output primitives through a dual-stage training process.
It employs Pixel-aligned Gaussian Training and Importance-aware Gaussian Finetuning to suppress redundant primitives and maintain high rendering fidelity under tight efficiency constraints.
Empirical evaluations on datasets like RealEstate10K and ACID demonstrate that EcoSplat achieves state-of-the-art quality with up to 10× fewer primitives compared to existing methods.

EcoSplat is an efficiency-controllable feed-forward 3D Gaussian Splatting (3DGS) framework designed for adaptive scene reconstruction and novel view synthesis from multi-view images. It offers explicit control over the number of 3D Gaussian primitives (“K”) output at inference, addressing the inefficiency of prior feed-forward 3DGS methods that predict pixel-aligned primitives per-view and cannot regulate primitive count, especially in dense-view settings. The approach comprises two coupled training stages—Pixel-aligned Gaussian Training (PGT) and Importance-aware Gaussian Finetuning (IGF)—followed by a principled selection and filtering mechanism at inference to obtain high-fidelity, compact 3D representations suitable for flexible rendering tasks (Park et al., 21 Dec 2025).

1. Foundational Concepts and Motivation

Feed-forward 3D Gaussian Splatting (3DGS) generates 3D scene representations by predicting anisotropic 3D Gaussian primitives from multiple calibrated RGB views, enabling efficient, one-pass reconstruction without per-scene optimization. Prior architectures produce pixel-aligned primitives for each view, leading to excessive redundancy when aggregating across views, particularly as the number of input images grows. Existing baselines lack mechanisms for directly constraining the number of predicted Gaussians, resulting in memory and compute inefficiencies or uncontrolled quality degradation when attempting to prune or merge primitives.

EcoSplat introduces the first explicit, user-controlled mechanism for specifying the number of output primitives (“K”) via adaptive network conditioning, supervised importance signals, and architectural modifications, delivering state-of-the-art fidelity under tight efficiency budgets. The method is suitable for static scenes with calibrated multi-view input.

2. Two-Stage Training Procedure

2.1 Pixel-aligned Gaussian Training (PGT)

The PGT stage predicts Gaussian primitives directly from multi-view input images. Each of the $N$ views $\{I_i\}_{i=1}^N$ (of spatial size $H\times W$ pixels) is processed by a shared Vision Transformer (ViT) encoder and a cross-view decoder. For each pixel in every view, the network predicts:

Center $\mu_{i,j}\in\mathbb{R}^3$
Covariance $\Sigma_{i,j}\in\mathbb{R}^{3\times3}$
Color $c_{i,j}\in\mathbb{R}^3$
Opacity $\alpha_{i,j}$

Specifically,

$\{\mu_{i,j}\}_{j=1}^{HW} = F_{\mu}(Z_i^{(\ell)}) \in\mathbb{R}^{HW\times 3}$

$\{[\alpha_{i,j};\,\Sigma_{i,j};\,c_{i,j}]\}_{j=1}^{HW} = F_\nu(\{Z_i^{(\ell)}\}, \psi(I_i))$

with $Z_i^{(\ell)}$ tokenized ViT outputs and $\psi(I_i)$ shallow-CNN features.

Supervision occurs via differentiable splatting and photometric loss on held-out novel target views: $\mathcal{L}_{\rm render} = \frac{1}{N^{\rm tgt}} \sum_{p=1}^{N^{\rm tgt}} \Big[\mathcal{L}_{\rm MSE}(I_p^{\rm tgt}, \hat{I}_p^{\rm tgt}) + 0.05\,\mathcal{L}_{\rm LPIPS}(I_p^{\rm tgt}, \hat{I}_p^{\rm tgt})\Big].$

2.2 Importance-aware Gaussian Finetuning (IGF)

To enable primitive-count control, a second stage reuses the pretrained ViT and center head but injects a target count $K$ via a learnable “importance embedding” $R_i$ in the parameter head $F_\nu$ . The critical innovation is the suppression of low-importance Gaussians by adjusting their opacities $\tilde{\alpha}_{i,j}$ so that the $K$ most significant primitives can be selected at inference while maintaining image fidelity.

A global preservation ratio $\rho = K/(N H W)$ is computed and broadcast via a CNN to generate embedding $R_i$ , allowing the network to adapt prediction density. Mask generation for supervision uses combined photometric (image gradients) and geometric (surface normal gradients) variation:

Photometric: $g_{\rm photo},i}(x,y) = \sqrt{\|\nabla_x I_i\|_2^2 + \|\nabla_y I_i\|_2^2}$
Geometric: $g_{\rm geo},i} = \sqrt{\|\nabla_x n_i\|_2^2 + \|\nabla_y n_i\|_2^2}$
Importance mask $g_i = (g_{\rm photo},i + g_{\rm geo},i)/2$

High-variation Gaussians are retained directly, while low-variation regions are compacted via K-means within $4\times 4$ spatial patches. Binary masks $\Omega_i$ supervise importance-aware opacity via binary cross-entropy loss: $\mathcal{L}_{io} = \lambda_{io} \frac{1}{N H W} \sum_{i=1}^N \sum_{j=1}^{HW} \mathrm{BCE}(\Omega_{i,j}, \tilde{\alpha}_{i,j}).$ With $\lambda_{io}=0.1$ , plus a $K$ -constrained rendering loss using only the top $K$ Gaussians: $\mathcal{L} = \mathcal{L}_{io} + \mathcal{L}_{K\text{-render}}.$ To enhance robustness, Progressive Learning on Gaussian Compaction (PLGC) samples $K$ per batch from $[K_{\text{min}}, K_{\text{max}}]$ : $K_{\max}=0.95 N H W\,,\quad K_{\min} = \max(0.85-\lambda_{\rm decay}\lfloor t/S\rfloor,\,0.05) \times N H W.$

3. Inference-time Primitive Count Control

At test time, primitive selection is modulated as follows:

Compute the global ratio $ρ = K/(N H W)$ .
For each input view $i$ , compute a high-frequency score $\eta_i$ using the 2D FFT of $I_i$ ,

$\eta_i = 1 - \frac{\sum_{\xi\in\Lambda} |\mathrm{FFT}(I_i)|(\xi)}{\sum_{\xi} |\mathrm{FFT}(I_i)|(\xi)}$

where $\Lambda$ is the low-frequency band.

Distribute the overall budget using softmax with temperature $T=0.2$ : $\Psi_i = \exp(\eta_i/T)/\sum_q \exp(\eta_q/T)$ , then allocate $\rho_i = N \Psi_i \rho$ per view.
Via $F_\nu$ and injection of $R_i$ (updated for each $\rho_i$ ), predict Gaussians and select the top $\lceil\rho_i H W\rceil$ per view, ranked by adjusted opacity $\tilde{\alpha}$ .
Aggregate across views and adjust to ensure the total output primitive count is precisely $K$ .

This mechanism yields a globally uniform or adaptively non-uniform distribution of spatial detail, providing real-time rendering capability and direct user control over resource-accuracy tradeoffs.

4. Architectural and Training Details

The architecture comprises:

Shallow CNN ( $\psi$ ) for initial image features.
Shared ViT encoder (patch tokens, e.g., $16\times16$ patches).
$m=4$ ViT decoder layers with cross-attention for multi-view fusion.
Heads: $F_{\mu}$ (MLP for 3D center prediction), $F_\nu$ (MLP for opacity, covariance, color prediction with $R_i$ injection).
“Shallow Add” injection strategy: $R_i$ is added after convolutional refinement to spatial tokens.
Training: Both PGT and IGF stages use 200k iterations each, with ViT and $F_\mu$ initialized from MASt3R pretraining.
Implementation: Trained on 4 NVIDIA A100 GPUs (40GB), using the gsplat rasterization backend.

Key hyperparameters include $λ_{io}=0.1$ , learning rate decay $λ_{\rm decay}=0.05$ , sampling interval $S=1000$ , inference softmax temperature $T=0.2$ .

5. Empirical Performance and Evaluation

EcoSplat was validated on RealEstate10K (RE10K) and the ACID dataset. Training resolution is $256 \times 256$ with 16, 20, or 24 input views (static, calibrated scenes).

Performance, under varying primitive-count budgets (percentage of total pixel-aligned Gaussians), is summarized below:

Budget (% pixel-aligned)	PSNR (dB)	SSIM	LPIPS	Baseline Performance
5% (24 views)	~24.7	0.82	0.18	Baselines collapse <10 dB
40%	~25.1	—	—	Outpaces baselines by >5 dB
ACID, 40% (zero-shot)	~24.02	—	—	Best prior: 24.40 with 2× G

Ablation studies reveal that omitting PGT or IGF results in severe performance degradation at low budgets (e.g., removal of IGF collapses to ~6.5 dB PSNR at 5%). The importance-aware opacity loss $\mathcal{L}_{io}$ and PLGC sampling are critical to maintaining stability and rendering quality (Park et al., 21 Dec 2025).

A plausible implication is that the dual-stage approach, especially IGF and PLGC, is necessary for effective scaling of 3DGS to strict memory and compute regimes.

6. Limitations and Future Research

The current design applies to static scenes and requires calibrated multi-view input. EcoSplat does not natively support dynamic or deforming scenes or unknown camera motion. The authors propose as future research the extension of the importance-aware, $K$ -conditioned 3DGS strategy to dynamic 4D settings. This would likely involve temporal cross-attention and trajectory-level modeling (referencing related ideas such as those in SpatialTrackerV2), and per-frame adaptation of dynamic/static component separation (Park et al., 21 Dec 2025).

7. Context Within Feed-forward 3DGS

EcoSplat represents the first feed-forward architecture enabling direct specification of output primitive cardinality with stable, high-fidelity rendering—even under extremely compressed representational budgets. It outperforms state-of-the-art methods such as AnySplat, WorldMirror, and GGN under strict constraints, often matching quality with 10× fewer primitives. Its importance-aware selection pipeline and progressive compaction training establish new standards for efficiency-controllable 3DGS and provide a foundation for downstream applications requiring rapid, resource-aware scene reconstruction.

For static multi-view scenarios, EcoSplat’s approach offers a reproducible and principled method for balancing quality and efficiency in 3D representation and differentiable graphics (Park et al., 21 Dec 2025).

PDF Markdown Chat (Pro)

References (1)

EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images (2025)

EcoSplat: Controllable 3D Gaussian Splatting

1. Foundational Concepts and Motivation

2. Two-Stage Training Procedure

2.1 Pixel-aligned Gaussian Training (PGT)

2.2 Importance-aware Gaussian Finetuning (IGF)

3. Inference-time Primitive Count Control

4. Architectural and Training Details

5. Empirical Performance and Evaluation

6. Limitations and Future Research

7. Context Within Feed-forward 3DGS

Whiteboard

Follow Topic

Continue Learning

EcoSplat: Controllable 3D Gaussian Splatting

1. Foundational Concepts and Motivation

2. Two-Stage Training Procedure

2.1 Pixel-aligned Gaussian Training (PGT)

2.2 Importance-aware Gaussian Finetuning (IGF)

3. Inference-time Primitive Count Control

4. Architectural and Training Details

5. Empirical Performance and Evaluation

6. Limitations and Future Research

7. Context Within Feed-forward 3DGS

Sponsor

Whiteboard

Follow Topic

Continue Learning

Related Topics