Pixel-Aligned Gaussian Training (PGT)

Updated 25 December 2025
  • Pixel-aligned Gaussian Training (PGT) is a technique that initializes 3D Gaussian Splatting by densely predicting per-pixel 3D Gaussian parameters from calibrated multi-view images.
  • It employs a shallow CNN encoder and a Vision Transformer with cross-attention to fuse multi-view features, generating up to 65,536 primitives per 256×256 image.
  • PGT serves as a robust warm-up for the subsequent finetuning stage: its dense output supports differentiable splatting and high-quality novel view synthesis, although PGT itself provides no inherent efficiency control.

Pixel-aligned Gaussian Training (PGT) is the initial stage of the EcoSplat framework for feed-forward 3D Gaussian Splatting (3DGS) from multi-view images. PGT optimizes a deep neural network to predict dense, per-pixel 3D Gaussian primitives from $N$ calibrated input views, providing a strong initialization for subsequent efficiency-controllable scene reconstruction and novel view synthesis. By the end of PGT, the model predicts one 3D Gaussian per input pixel in each view, resulting in a dense, pixel-aligned representation that is central to EcoSplat's two-stage pipeline (Park et al., 21 Dec 2025).

1. Role of PGT in the EcoSplat Pipeline

EcoSplat employs a two-stage training protocol. The first stage, Pixel-aligned Gaussian Training (PGT), serves as a warm-up phase where the model is trained to regress 3D Gaussian parameters (center, covariance, color, and opacity) for every input pixel in each calibrated view. This dense mapping results in $N \cdot H \cdot W$ Gaussian primitives from $N$ input images, where $H$ and $W$ are the image height and width.

PGT is followed by Importance-aware Gaussian Finetuning (IGF), which introduces efficiency control by conditioning on a target primitive count $K$, freezing the backbone and center-head, and learning to suppress less important Gaussians via their opacities. PGT, therefore, equips the model to render novel views via differentiable splatting but lacks mechanisms for pruning, ranking, or primitive budget control (Park et al., 21 Dec 2025).

2. Network Architecture for Pixel-aligned Gaussian Training

The PGT architecture processes $N$ calibrated input views $\{I_i\}_{i=1}^N$ of size $H \times W \times 3$ using the following components:

  • Shallow CNN Encoder $\psi$: Each image $I_i$ is processed through a single convolutional layer to yield low-level features $\psi(I_i) \in \mathbb{R}^{H \times W \times C_0}$.
  • Tokenization and Vision Transformer (ViT) Encoder: Each pixel is tokenized, resulting in $P = H \cdot W$ tokens per image. Tokens from each view pass through a shared ViT encoder, producing per-view sequences of $P$ tokens with embedding dimension $D$.
  • ViT Decoder (Cross-attention Blocks): Token sequences from all views are concatenated and undergo $m$ transformer decoder blocks with cross-attention, enabling multi-view feature fusion. For each layer $\ell$, the decoded tokens are $Z_i^{(\ell)} \in \mathbb{R}^{P \times C_\ell}$.
  • Prediction Heads:
    • Gaussian-center Head $F_\mu$: Maps decoded tokens to 3D Gaussian centers, outputting $\{\mu_{i,j}\}_{j=1}^{HW}$, where $\mu_{i,j} \in \mathbb{R}^3$.
    • Gaussian-parameter Head $F_\nu$: Consumes both decoded tokens and CNN features to regress per-pixel opacity $\alpha_{i,j}$, covariance matrix $\Sigma_{i,j} \in \mathbb{R}^{3 \times 3}$, and color $c_{i,j} \in \mathbb{R}^3$.

Equation (1) in the paper formalizes this process:

\begin{aligned}
\{\mu_{i,j}\}_{j=1}^{HW} &= F_\mu\bigl(\{Z_i^{(\ell)}\}_{\ell=1}^m\bigr), \\
\{[\alpha_{i,j};\, \Sigma_{i,j};\, c_{i,j}]\}_{j=1}^{HW} &= F_\nu\bigl(\{Z_i^{(\ell)}\}_{\ell=1}^m,\, \psi(I_i)\bigr).
\end{aligned}

Each primitive $G_{i,j}$ is thus defined as

G_{i,j} = (\mu_{i,j},\, \Sigma_{i,j},\, c_{i,j},\, \alpha_{i,j}).
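
As an illustration of Equation (1), the PyTorch sketch below shows how pixel-aligned tokens and CNN features could be mapped to per-pixel Gaussian parameters. The layer widths, head depths, and the quaternion-plus-scale covariance parameterization are assumptions made for the sketch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHeads(nn.Module):
    """Per-pixel prediction heads F_mu and F_nu operating on pixel-aligned tokens."""

    def __init__(self, token_dim: int, cnn_dim: int, hidden: int = 128):
        super().__init__()
        # F_mu: decoded ViT tokens -> 3D Gaussian centers mu_{i,j}
        self.center_head = nn.Sequential(
            nn.Linear(token_dim, hidden), nn.GELU(), nn.Linear(hidden, 3))
        # F_nu: decoded tokens + low-level CNN features -> opacity, covariance, color.
        # Covariance is parameterized as a quaternion (4) plus log-scales (3),
        # a common 3DGS convention assumed here; plus 1 opacity and 3 color channels.
        self.param_head = nn.Sequential(
            nn.Linear(token_dim + cnn_dim, hidden), nn.GELU(),
            nn.Linear(hidden, 1 + 4 + 3 + 3))

    def forward(self, tokens: torch.Tensor, cnn_feats: torch.Tensor):
        # tokens: (P, D) decoded tokens Z_i; cnn_feats: (P, C0) pixel-aligned psi(I_i)
        centers = self.center_head(tokens)                                # (P, 3) -> mu
        params = self.param_head(torch.cat([tokens, cnn_feats], dim=-1))
        opacity = torch.sigmoid(params[:, :1])                            # (P, 1) -> alpha
        quats = F.normalize(params[:, 1:5], dim=-1)                       # (P, 4) rotation
        scales = torch.exp(params[:, 5:8])                                # (P, 3) axis scales
        colors = torch.sigmoid(params[:, 8:11])                           # (P, 3) -> c
        return centers, opacity, quats, scales, colors

# Example: one 256x256 view -> P = 65,536 pixel-aligned Gaussians
heads = GaussianHeads(token_dim=768, cnn_dim=32)
tokens = torch.rand(256 * 256, 768)
cnn_feats = torch.rand(256 * 256, 32)
centers, opacity, quats, scales, colors = heads(tokens, cnn_feats)
```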

3. Data Flow: Mapping Multi-view Images to 3D Gaussians

The data flow in PGT is characterized by strict pixel alignment at each processing stage. The workflow can be summarized as follows:

  • Each input image $I_i$ is divided into $P = H \cdot W$ tokens, each aligned with a pixel.
  • The tokenized representations from all $N$ images pass through the shared ViT encoder and are processed by $m = 4$ cross-attention decoder blocks, with per-token channel dimensions $C_\ell$ for each layer.
  • The low-level CNN feature map $\psi(I_i)$ is incorporated into $F_\nu$ to provide local appearance context.
  • Final per-pixel Gaussian parameters are predicted through shallow MLPs or $1 \times 1$ convolutions; because the tokens are pixel-aligned, no complex aggregation is required.
  • For $256 \times 256$ images, this results in 65,536 tokens (primitives) per view (Park et al., 21 Dec 2025).
  • All predicted Gaussians across views are aggregated for downstream differentiable rendering, as sketched below.
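
A shape-level sketch of this flow is given below. The CNN/ViT modules are replaced by stand-in random tensors and the channel widths $C_0$ and $D$ are illustrative assumptions; only the pixel-aligned shapes and the final $N \cdot H \cdot W$ primitive count reflect the paper.

```python
# Shape-level walkthrough of the PGT data flow for N calibrated 256x256 views.
import torch

N, H, W = 2, 256, 256
C0, D = 32, 768                                # illustrative feature widths
P = H * W                                      # 65,536 pixel-aligned tokens per view

images = torch.rand(N, 3, H, W)                # calibrated input views {I_i}
cnn_feats = torch.rand(N, H, W, C0)            # psi(I_i): single conv layer (stand-in)
tokens = torch.rand(N, P, D)                   # decoded tokens Z_i after the shared ViT
                                               # encoder and m = 4 cross-attention blocks

# Per-pixel heads (see the previous sketch) yield one Gaussian per token:
centers = torch.rand(N, P, 3)                  # mu_{i,j}
covs    = torch.rand(N, P, 3, 3)               # Sigma_{i,j}
colors  = torch.rand(N, P, 3)                  # c_{i,j}
opacity = torch.rand(N, P, 1)                  # alpha_{i,j}

# Aggregation across views for differentiable splatting: N * H * W primitives.
all_centers = centers.reshape(N * P, 3)        # 131,072 Gaussians for N = 2
print(all_centers.shape)                       # torch.Size([131072, 3])
```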

4. Training Loss and Optimization Protocol

PGT is trained exclusively with a rendering loss that enforces multi-view consistency. All $N \cdot H \cdot W$ Gaussians generated across views are rendered into a set of held-out target novel views via differentiable splatting (using the open-source gsplat engine).

The rendering loss function combines mean squared error (MSE) and LPIPS, as given by equation (2):

\mathcal{L}_{\mathrm{render}} = \frac{1}{N^{\mathrm{tgt}}} \sum_{p=1}^{N^{\mathrm{tgt}}} \Big[ \mathcal{L}_{\mathrm{MSE}}\bigl(I^{\mathrm{tgt}}_p,\, \hat I^{\mathrm{tgt}}_p\bigr) + 0.05\, \mathcal{L}_{\mathrm{LPIPS}}\bigl(I^{\mathrm{tgt}}_p,\, \hat I^{\mathrm{tgt}}_p\bigr) \Big],

where $N^{\mathrm{tgt}}$ is the number of target views, and $I^{\mathrm{tgt}}_p$ and $\hat I^{\mathrm{tgt}}_p$ are the ground-truth and rendered target images, respectively. The loss incorporates no additional regularizers. $\mathcal{L}_{\mathrm{LPIPS}}$ denotes the learned perceptual image patch similarity metric of Zhang et al.
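
A minimal sketch of this loss is shown below, assuming rendered and ground-truth images as (N_tgt, 3, H, W) tensors in [0, 1] and using the lpips package with a VGG backbone (the backbone choice is an implementation assumption, not specified by the paper).

```python
import torch
import torch.nn.functional as F
import lpips

lpips_fn = lpips.LPIPS(net="vgg")  # learned perceptual metric of Zhang et al.

def pgt_render_loss(rendered: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Eq. (2): mean over target views of MSE + 0.05 * LPIPS.

    rendered, target: (N_tgt, 3, H, W) images in [0, 1].
    """
    mse = F.mse_loss(rendered, target, reduction="none").mean(dim=(1, 2, 3))  # per-view MSE
    perceptual = lpips_fn(rendered * 2 - 1, target * 2 - 1).reshape(-1)       # LPIPS expects [-1, 1]
    return (mse + 0.05 * perceptual).mean()                                   # average over N_tgt views
```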

The training schedule comprises 200,000 AdamW iterations (batch size ≈ 16 multi-view sequences) on 4×A100 GPUs, with the learning rate and other hyperparameters mirroring the MASt3R fine-tuning protocol.

5. Practical Considerations and Implementation

The following table summarizes key implementation details as reported in (Park et al., 21 Dec 2025):

| Component | Setting | Notes |
| --- | --- | --- |
| Image resolution | $256 \times 256$ | 65,536 tokens per view |
| CNN $\psi$ | 3×3 conv, $C_0$ channels | Single layer |
| ViT decoder blocks ($m$) | 4 | With cross-attention |
| Initialization | ViT, $F_\mu$: MASt3R pre-trained | Others: from scratch |
| Renderer | gsplat engine (Ye et al. 2025) | Differentiable splatting |
| Training length | 200,000 iterations, batch size ≈ 16 | 4×A100 GPUs |
| Optimization | AdamW, lr $= 1 \times 10^{-4}$ | MASt3R recipe |

The ViT encoder/decoder and the Gaussian-center head $F_\mu$ are initialized from MASt3R pre-trained weights, while the remaining components (the CNN encoder $\psi$ and the Gaussian-parameter head $F_\nu$) are trained from scratch. This preserves multi-view consistency and leverages prior representations.
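
A training-loop skeleton matching the reported protocol is sketched below. The stand-in model and random data are assumptions so that the snippet runs; a real run would plug in the PGT network (with the ViT and $F_\mu$ initialized from MASt3R) and replace the placeholder loss with the Equation (2) rendering loss computed through gsplat.

```python
import torch
import torch.nn as nn

# Stand-in network so the sketch is runnable; replace with the PGT model
# (CNN psi + ViT encoder/decoder + F_mu / F_nu heads).
model = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 3))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # MASt3R fine-tuning recipe

num_iters, batch_size = 200_000, 16  # ~16 multi-view sequences per step, 4x A100 in the paper
for step in range(num_iters):
    inputs = torch.rand(batch_size, 16)   # stand-in for calibrated multi-view inputs
    target = torch.rand(batch_size, 3)    # stand-in for held-out target views
    rendered = model(inputs)              # real pipeline: predict Gaussians, splat with gsplat
    loss = nn.functional.mse_loss(rendered, target)  # real pipeline: L_render from Eq. (2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```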

6. Context, Significance, and Limitations

Pixel-aligned Gaussian Training is a foundational component enabling the purely feed-forward 3DGS paradigm in EcoSplat. By providing dense initial coverage, PGT positions the model for subsequent efficiency control and pruning in IGF, which is critical in dense-view settings where the primitive budget is a practical constraint. PGT itself does not incorporate mechanisms for primitive importance or efficiency control; it produces maximal, unpruned representations driven by rendering fidelity.

A plausible implication is that the strict pixel-alignment and dense prediction strategy, while computationally intensive in isolation, is essential for initializing downstream selection, pruning, and adaptive compression phases—tasks for which direct end-to-end learning without a dense warm-up has demonstrated suboptimality in prior works.

PGT exclusively supervises rendering quality, with no explicit geometric or regularization losses, emphasizing end-task performance in novel view synthesis rather than intermediate geometric accuracy. This approach aligns with contemporary trends in neural rendering but may limit interpretability and direct geometric supervision capabilities.

7. Relation to Prior Art and Research Directions

PGT in EcoSplat builds directly atop the architectures and initialization regimes established in MASt3R, employing pre-trained weights for improved convergence. The use of pixel-aligned tokens, vision transformer encoders/decoders with cross-attention, and shallow parameter heads reflects an architectural lineage from multi-view transformer-based 3D reconstruction.

Conventional feed-forward 3DGS models either lack primitive economy or do not support explicit efficiency control. PGT’s pixel-aligned representation sidesteps per-scene optimization but necessitates the subsequent IGF stage to yield usable representations under strict primitive-count constraints. This suggests continued research interest in strategies for jointly optimizing density, importance, and compositionality of basis functions for 3D neural rendering.

Overall, Pixel-aligned Gaussian Training constitutes the crucial initialization phase for controllable, efficient 3DGS pipelines, enabling advances in scalable and practical neural scene rendering (Park et al., 21 Dec 2025).
