Voxel-Aligned 3D Gaussian Splatting
- The paper presents a novel voxel-aligned approach that replaces pixel-level Gaussian parameter prediction with direct voxel grid feature fusion to mitigate view-dependent artifacts.
- It integrates a sparse 3D U-Net and an MLP decoder to refine voxel features and extract 18 Gaussian parameters, yielding enhanced geometric and color fidelity.
- The method achieves state-of-the-art results on diverse benchmark datasets by leveraging adaptive Gaussian allocation and an end-to-end training strategy with photometric and perceptual losses.
Voxel-aligned feed-forward 3D Gaussian splatting (as instantiated by VolSplat (Wang et al., 23 Sep 2025)) is a methodology for multi-view 3D scene reconstruction and novel view synthesis that replaces pixel-aligned 3D Gaussian parameter prediction with direct voxel-aligned prediction from a shared, multi-view-consistent, sparse 3D feature grid. In contrast to previous paradigms that operate at the 2D pixel level and suffer from view-dependent artifacts, VolSplat establishes a framework for end-to-end feed-forward 3DGS that integrates robust feature fusion, volumetric refinement, and adaptive, data-driven allocation of 3D Gaussians, yielding state-of-the-art performance across key metrics and datasets.
1. Network Architecture and Pipeline
The VolSplat pipeline ingests $N$ RGB images $\{I_i\}_{i=1}^{N}$ along with their camera intrinsics $\{\mathbf{K}_i\}$ and extrinsics $\{[\mathbf{R}_i \mid \mathbf{t}_i]\}$. The architecture proceeds in multiple stages:
- 2D Feature Extraction and Fusion:
Each view's image is processed by a shared ResNet backbone producing feature maps $\mathbf{F}_i \in \mathbb{R}^{(H/s) \times (W/s) \times C}$ (with $s$ as the stride). Cross-view fusion is implemented via local window attention (Swin Transformer) across the nearest two views, generating multi-view-aware features $\tilde{\mathbf{F}}_i$.
- Plane-Sweep Cost Volume and Depth Regression:
For view $i$ and candidate depths $\{d_k\}_{k=1}^{N_d}$, neighbor features are warped to these candidates and dot-product similarities computed to assemble a plane-sweep cost volume $\mathbf{C}_i$. A compact 2D CNN (Depth Prediction Module) fuses $\tilde{\mathbf{F}}_i$ and $\mathbf{C}_i$ and regresses a dense depth map $\mathbf{D}_i$.
- Lifting to Voxel Grid:
Each pixel $(u,v)$ in view $i$ is unprojected via its regressed depth:
$$\mathbf{P}_i(u,v) = \mathbf{R}_i^\top\!\left( \mathbf{D}_i(u,v)\, \mathbf{K}_i^{-1}\,[u,\, v,\, 1]^\top - \mathbf{t}_i \right).$$
Associated features are aggregated into 3D voxels of side length $s_v$. For voxel $v$, all features falling in its cell $\mathcal{P}_v$ are averaged:
$$\mathbf{F}_v = \frac{1}{|\mathcal{P}_v|} \sum_{p \in \mathcal{P}_v} \mathbf{f}_p,$$
yielding a sparse voxel grid $\mathcal{F} = \{(v, \mathbf{F}_v)\}$.
- Sparse 3D U-Net Refinement:
A hierarchical encoder-decoder, utilizing sparse 3D convolutions and skip connections (cf. Çiçek et al.'s 3D U-Net), refines $\mathcal{F}$ through a residual field $\Delta\mathcal{F}$, producing $\tilde{\mathcal{F}} = \mathcal{F} + \Delta\mathcal{F}$.
- Gaussian Parameter Decoding:
For each occupied voxel $v$, a small MLP predicts 18 parameters
$$\theta_v = \big(\boldsymbol{\Sigma}_v,\ \mathbf{c}_v,\ \alpha_v\big) \in \mathbb{R}^{18},$$
where $\boldsymbol{\Sigma}_v$ is a symmetric positive-definite covariance matrix (parametrized by its six unique entries), $\mathbf{c}_v$ are spherical harmonic color coefficients, and $\alpha_v$ is the opacity. Raw MLP outputs are mapped to valid rendering parameters by activation functions, e.g.
$$\alpha_v = \sigma(\hat{\alpha}_v),$$
with $\sigma$ the sigmoid, and Gaussian centers $\boldsymbol{\mu}_v$ given exactly by the voxel grid indices.
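The lifting step above (unprojection, voxel quantization, feature averaging) can be sketched in a few lines. The camera convention ($\mathbf{X}_w = \mathbf{R}^\top(d\,\mathbf{K}^{-1}\mathbf{u} - \mathbf{t})$) and the plain Python dict standing in for a sparse tensor are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def lift_to_voxels(depth, feats, K, R, t, voxel_size):
    """Unproject per-pixel features into a sparse voxel grid and average
    all features that land in the same cell. The world-from-camera
    convention X_w = R^T (d * K^{-1} u - t) is an assumption; adapt it
    to your dataset's extrinsics."""
    H, W, C = feats.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    cam = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)   # rays scaled by depth
    world = (R.T @ (cam - t.reshape(3, 1))).T                 # (H*W, 3) world points
    idx = np.floor(world / voxel_size).astype(np.int64)       # integer voxel indices
    grid = {}                                                 # sparse grid: index -> (sum, count)
    for key, f in zip(map(tuple, idx), feats.reshape(-1, C)):
        s, n = grid.get(key, (np.zeros(C), 0))
        grid[key] = (s + f, n + 1)
    return {k: s / n for k, (s, n) in grid.items()}           # per-voxel averaged features
```

In a real system the dict would be replaced by a sparse-tensor library so the subsequent sparse 3D U-Net can consume the grid directly.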
2. Mathematical Formulation
The resulting set of Gaussians
$$\mathcal{G} = \big\{(\boldsymbol{\mu}_v, \boldsymbol{\Sigma}_v, \mathbf{c}_v, \alpha_v)\big\}_{v=1}^{M}$$
provides a continuous density field
$$\sigma(\mathbf{x}) = \sum_{v=1}^{M} \alpha_v \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_v)^\top \boldsymbol{\Sigma}_v^{-1}(\mathbf{x}-\boldsymbol{\mu}_v)\right)$$
and an emitted radiance function for ray direction $\mathbf{d}$,
$$c_v(\mathbf{d}) = \sum_{k} c_{v,k}\, Y_k(\mathbf{d}),$$
where $Y_k$ are order-2 spherical harmonic basis functions and $c_{v,k}$ are 8-dimensional learned color coefficients.
Volume rendering is performed along rays $\mathbf{r}(t) = \mathbf{o} + t\,\mathbf{d}$ with transmittance
$$T(t) = \exp\!\left(-\int_{t_0}^{t} \sigma(\mathbf{r}(s))\, \mathrm{d}s\right), \qquad C(\mathbf{r}) = \int_{t_0}^{t_1} T(t)\,\sigma(\mathbf{r}(t))\,c(\mathbf{r}(t), \mathbf{d})\, \mathrm{d}t,$$
implemented via analytic integration of anisotropic Gaussians or discrete accumulation along samples.
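A minimal discrete accumulator along one ray, assuming a density-weighted mix of per-Gaussian colors at each sample (one possible discretization for illustration, not VolSplat's renderer):

```python
import numpy as np

def render_ray(o, d, gaussians, t_max=4.0, n=128):
    """March n samples along r(t) = o + t*d. At each sample, sigma is the
    sum of Gaussian densities; transmittance updates as T *= exp(-sigma*dt)
    and the sample contributes T * (1 - exp(-sigma*dt)) * c to the color.
    `gaussians` is a list of (mu, cov, alpha, color) tuples."""
    dt = t_max / n
    T, C = 1.0, np.zeros(3)
    for k in range(n):
        x = o + (k + 0.5) * dt * d
        dens = np.array([a * np.exp(-0.5 * (x - mu) @ np.linalg.inv(cov) @ (x - mu))
                         for mu, cov, a, _ in gaussians])
        sig = dens.sum()
        if sig > 1e-12:
            # blend colors in proportion to each Gaussian's local density
            c = sum(w * col for w, (_, _, _, col) in zip(dens / sig, gaussians))
            C += T * (1.0 - np.exp(-sig * dt)) * c
        T *= np.exp(-sig * dt)                 # transmittance decays past density
    return C, T
```

With a single Gaussian of constant color, the accumulated color telescopes to exactly $(1 - T_{\text{final}})$ times that color, which is a quick sanity check on the compositing.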
3. Voxel-Alignment Versus Pixel-Alignment Paradigms
| Paradigm | Alignment Error | Multi-view Consistency |
|---|---|---|
| Pixel-aligned | High, due to depth prediction noise and occlusions; view-biased density and "floaters" common | Features are 2D-matched and per-pixel, leading to view bias |
| Voxel-aligned | Low; exact mapping from voxel index to world coordinates, error-prone 2D matching eliminated | Features from all views are volumetrically averaged in the same cell, enforcing robust consistency |
Pixel-aligned 3D Gaussian Splatting (3DGS) predictors assign one 3D Gaussian per pixel per view, with each Gaussian center determined by depth regression and camera parameters. The number of Gaussians scales with image resolution and input views, tying model quality to input coverage. Alignment errors are incurred due to noisy depth predictions, imperfect 2D feature matching, and ambiguities from occlusions or low-texture regions.
Voxel-aligned prediction (VolSplat) fuses all input views into a canonical 3D grid before any Gaussian parameters are estimated. Multi-view consistency is enforced at the volumetric level. The grid-to-world coordinate mapping is exact, with features from every view averaged prior to 3DGS parameter synthesis, thus mitigating view bias, floaters, and misalignment artifacts.
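The exactness of the grid-to-world mapping is visible in code; the `origin` parameter and the half-cell offset below are assumptions about the grid convention, not details from the paper:

```python
import numpy as np

def voxel_center(index, voxel_size, origin=np.zeros(3)):
    """Deterministic voxel-index -> world-coordinate mapping. Unlike
    pixel-aligned centers, no depth regression enters this computation,
    so there is no per-view alignment error to propagate."""
    return origin + (np.asarray(index, dtype=np.float64) + 0.5) * voxel_size
```

Round-tripping `floor(center / voxel_size)` recovers the original index, which is the precise sense in which the voxel-to-world mapping is error-free.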
4. Training Procedures and Objectives
VolSplat is trained in an end-to-end fashion using ground-truth images at novel camera poses. The total loss is
$$\mathcal{L} = \mathcal{L}_{\text{photo}} + \lambda\,\mathcal{L}_{\text{perc}},$$
a photometric (MSE) term on rendered novel views combined with a perceptual (LPIPS-type) term with weight $\lambda$.
No explicit regularization of $\boldsymbol{\Sigma}_v$ or $\mathbf{c}_v$ was required; empirical training yields well-conditioned covariances and view-consistent color representations under photometric and perceptual supervision alone.
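A hedged sketch of this objective, where `perceptual_fn` stands in for an LPIPS-style network and `lam` is an illustrative weight (the paper's value is not given here):

```python
import numpy as np

def photometric_loss(pred, gt):
    """Pixelwise MSE between a rendered novel view and ground truth."""
    return float(np.mean((np.asarray(pred) - np.asarray(gt)) ** 2))

def total_loss(pred, gt, perceptual_fn, lam=0.05):
    """L = L_photo + lam * L_perc. `perceptual_fn` is any callable
    scoring perceptual dissimilarity (e.g. an LPIPS network in the
    real pipeline); `lam` balances the two terms."""
    return photometric_loss(pred, gt) + lam * perceptual_fn(pred, gt)
```

In practice both terms are computed on batches of rendered views and backpropagated through the differentiable splatting renderer into the U-Net and decoder.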
5. Benchmark Results and Empirical Evaluation
Across major multi-view synthesis datasets (RealEstate10K, ScanNet, ACID zero-shot transfer), VolSplat achieves superior quantitative metrics compared to pixel-aligned and prior voxel-aligned baselines. Models are evaluated with 6 input views. Metrics reported are PSNR (higher is better), SSIM (higher is better), LPIPS (lower is better), and PGS (average Gaussians per scene).
| Dataset | Model | PSNR | SSIM | LPIPS | PGS |
|---|---|---|---|---|---|
| RealEstate10K | pixelSplat | 26.09 | 0.863 | 0.136 | 196,608 |
| RealEstate10K | MVSplat | 26.39 | 0.869 | 0.128 | 65,536 |
| RealEstate10K | TranSplat | 26.69 | 0.875 | 0.125 | 65,536 |
| RealEstate10K | DepthSplat | 27.47 | 0.889 | 0.114 | 65,536 |
| RealEstate10K | GGN | 26.18 | 0.825 | 0.154 | 9,375 |
| RealEstate10K | VolSplat | 31.30 | 0.941 | 0.075 | 65,529 |
| ScanNet | FreeSplat | 27.45 | 0.829 | 0.222 | 63,668 |
| ScanNet | FreeSplat++ | 27.45 | 0.829 | 0.223 | 69,569 |
| ScanNet | VolSplat | 28.41 | 0.906 | 0.127 | 65,406 |
| ACID (zero-shot) | DepthSplat | 28.37 | 0.847 | 0.141 | - |
| ACID (zero-shot) | VolSplat | 32.65 | 0.932 | 0.092 | - |
Qualitative reconstructions by VolSplat demonstrate reduced floaters, sharper geometric details, and enhanced fidelity. Adaptive voxel-based allocation concentrates Gaussians in high-frequency regions and corners, sparsifying planar regions.
6. Limitations and Prospective Improvements
Current limitations include a memory-resolution trade-off: decreasing the voxel side length $s_v$ increases detail but grows memory and grid size cubically. At the finest voxel size reported, GPU memory usage reaches 8 GB (A100-class hardware). The present model is static in time; dynamic or deformable scene modeling would necessitate time-varying voxel grids or explicit motion modeling.
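The cubic growth can be checked with a one-line count of dense cells for a hypothetical cubic scene extent (sparse grids store far fewer occupied cells, so real memory is lower):

```python
def voxel_count(extent_m, voxel_size_m):
    """Dense cell count for a cubic volume of the given extent: halving
    the voxel side multiplies the count (and worst-case memory) by 8."""
    n = round(extent_m / voxel_size_m)  # cells per axis
    return n ** 3
```

This is why hierarchical or octree-based grids, listed below, are the natural escape hatch: they spend fine cells only where the scene has detail.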
Potential future research avenues include:
- Hierarchical or octree-based voxel grids for memory-efficient fine-scale representation.
- Learned adaptive voxel sizing to focus detail where most needed.
- Semantic or instance-level integration into voxel feature representations.
- Extensions for dynamic scenes through space–time voxel alignment.
- Real-time updates, relevant for robotics/AR, by enabling streaming updates to the voxel U-Net.
A plausible implication is that the voxel-aligned paradigm generalizes robustly to new scenes and imaging modalities due to direct volumetric feature fusion and consistent 3D spatial correspondence. The framework provides denser, more plausible reconstructions and establishes a scalable foundation for feed-forward 3D scene modeling, multi-view rendering, and potential downstream applications in robotics and augmented reality (Wang et al., 23 Sep 2025).