
Voxel-Aligned 3D Gaussian Splatting

Updated 27 January 2026
  • The paper presents a novel voxel-aligned approach that replaces pixel-level Gaussian parameter prediction with direct voxel grid feature fusion to mitigate view-dependent artifacts.
  • It integrates a sparse 3D U-Net and an MLP decoder to refine voxel features and extract 18 Gaussian parameters, yielding enhanced geometric and color fidelity.
  • The method achieves state-of-the-art benchmarks across diverse datasets by leveraging adaptive Gaussian allocation and an end-to-end training strategy with photometric and perceptual losses.

Voxel-aligned feed-forward 3D Gaussian splatting (as instantiated by VolSplat (Wang et al., 23 Sep 2025)) is a methodology for multi-view 3D scene reconstruction and novel view synthesis that replaces pixel-aligned 3D Gaussian parameter prediction with direct voxel-aligned prediction from a shared, multi-view-consistent, sparse 3D feature grid. In contrast to previous paradigms that operate at the 2D pixel level and suffer from view-dependent artifacts, VolSplat establishes a framework for end-to-end feed-forward 3DGS that integrates robust feature fusion, volumetric refinement, and adaptive, data-driven allocation of 3D Gaussians, yielding state-of-the-art performance across key metrics and datasets.

1. Network Architecture and Pipeline

The VolSplat pipeline ingests $N$ RGB images $\{I_1, \dots, I_N\}$ along with their camera intrinsics $K_i$ and extrinsics $(R_i, T_i)$. The architecture proceeds in multiple stages:

  • 2D Feature Extraction and Fusion:

Each view's image is processed by a shared ResNet backbone producing feature maps $F_i^{mono} \in \mathbb{R}^{H/p \times W/p \times C}$ (with $p$ the stride). Cross-view fusion is implemented via local window attention (Swin Transformer) across the two nearest views, generating multi-view-aware features $F_i$.
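A minimal, dependency-light sketch of the cross-view fusion step, using single-head local-window cross-attention in NumPy (the paper uses a Swin-style Transformer over learned ResNet features; the window size and single-head form here are illustrative assumptions):

```python
import numpy as np

def cross_view_window_attention(feat_q, feat_kv, window=4):
    """Toy single-head local-window cross-attention between two views.
    feat_q, feat_kv: (H, W, C) feature maps from neighboring views;
    queries from one view attend to keys/values of the other within
    each non-overlapping window."""
    H, W, C = feat_q.shape
    out = np.empty_like(feat_q)
    for y0 in range(0, H, window):
        for x0 in range(0, W, window):
            q = feat_q[y0:y0+window, x0:x0+window].reshape(-1, C)
            kv = feat_kv[y0:y0+window, x0:x0+window].reshape(-1, C)
            attn = q @ kv.T / np.sqrt(C)                 # scaled dot-product
            attn = np.exp(attn - attn.max(axis=1, keepdims=True))
            attn /= attn.sum(axis=1, keepdims=True)      # softmax over keys
            out[y0:y0+window, x0:x0+window] = (attn @ kv).reshape(window, window, C)
    return out
```

A real implementation would add linear Q/K/V projections, multiple heads, and shifted windows; this sketch only shows where the cross-view information mixing happens.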

  • Plane-Sweep Cost Volume and Depth Regression:

For view $i$ and $D$ candidate depths $\{d_1, \dots, d_D\}$, neighbor features are warped to these candidates and dot-product similarities computed to assemble plane-sweep cost volumes $C_i \in \mathbb{R}^{H/p \times W/p \times D}$. A compact 2D CNN (Depth Prediction Module) fuses $F_i^{mono}$ and $C_i$ and regresses a dense depth map $D_i \in \mathbb{R}^{H \times W}$.
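The cost-volume assembly and a soft-argmax depth readout can be sketched as follows; the homography warping and the learned 2D-CNN regressor are omitted, with warped neighbor features assumed to be given (an illustrative simplification):

```python
import numpy as np

def cost_volume_and_depth(ref_feat, warped_feats, depth_candidates):
    """ref_feat: (H, W, C) reference-view features.
    warped_feats: (D, H, W, C) neighbor features pre-warped to each of the
    D candidate depth planes (warping omitted for brevity).
    Returns the cost volume (H, W, D) and a soft-argmax depth map (H, W)."""
    C = ref_feat.shape[-1]
    # dot-product similarity per depth plane
    cost = np.einsum('hwc,dhwc->hwd', ref_feat, warped_feats) / np.sqrt(C)
    # softmax over depth, then expectation (stand-in for the CNN regressor)
    p = np.exp(cost - cost.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    depth = p @ np.asarray(depth_candidates, dtype=float)
    return cost, depth
```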

  • Lifting to Voxel Grid:

Each pixel $(u,v)$ in view $i$ is unprojected via

$$P_i(u,v) = R_i \left[ D_i(u,v)\, K_i^{-1}\, [u, v, 1]^T \right] + T_i.$$

Associated features are aggregated into 3D voxels of side $v_s$. For voxel $(i,j,k)$, all features falling in the cell are averaged:

$$V_{i,j,k} = \frac{1}{|S_{i,j,k}|} \sum_{p \in S_{i,j,k}} f_p,$$

yielding a sparse voxel grid $V \in \mathbb{R}^{n_{vox} \times C}$.
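The unprojection and voxel-averaging steps above can be combined into one sketch (NumPy; a dictionary keyed by voxel index stands in for the sparse grid):

```python
import numpy as np

def lift_to_voxels(depth, feat, K, R, T, voxel_size):
    """Unproject per-pixel features via P = R [ D K^{-1} [u,v,1]^T ] + T
    and average features that land in the same voxel, as in the text.
    depth: (H, W); feat: (H, W, C). Returns {voxel_index: mean feature}."""
    H, W, C = feat.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T                # K^{-1} [u, v, 1]^T per pixel
    pts = (depth.reshape(-1, 1) * rays) @ R.T + T  # world-space points
    idx = np.floor(pts / voxel_size).astype(int)   # voxel indices
    voxels, feats = {}, feat.reshape(-1, C)
    for key, f in zip(map(tuple, idx), feats):     # running sums per cell
        s, n = voxels.get(key, (np.zeros(C), 0))
        voxels[key] = (s + f, n + 1)
    return {k: s / n for k, (s, n) in voxels.items()}
```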

A hierarchical encoder-decoder, built from sparse 3D convolutions with skip connections (cf. the 3D U-Net of Çiçek et al.), refines $V$ by predicting a residual field $R \in \mathbb{R}^{n_{vox} \times C}$ and producing $V' = V + R$.
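A toy dense stand-in for the residual refinement $V' = V + R$; the residual here is a fixed 6-neighbor smoothing difference rather than a learned network, purely to illustrate the residual structure (the actual model uses a learned sparse 3D U-Net):

```python
import numpy as np

def refine_voxels(V, alpha=0.5):
    """Toy residual refinement of a dense voxel grid V of shape (X, Y, Z, C).
    The residual R is a fixed smoothing difference here; VolSplat predicts
    R with a sparse 3D U-Net (e.g. via a sparse-convolution library)."""
    # average of the six face neighbors along the three spatial axes
    nbr = sum(np.roll(V, s, axis=a) for a in range(3) for s in (-1, 1)) / 6.0
    R = alpha * (nbr - V)   # stand-in for the predicted residual field
    return V + R            # V' = V + R, matching the text
```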

  • Gaussian Parameter Decoding:

For each occupied voxel $v$, a small MLP predicts 18 parameters

$$[\bar{\mu}_v\,(3),\ \bar{\alpha}_v\,(1),\ \Sigma_v\,(6),\ c_v\,(8)] \in \mathbb{R}^{18},$$

where $\Sigma_v$ is a symmetric positive-definite covariance matrix (parametrized by its six unique entries), $c_v$ are spherical harmonic color coefficients, and $\alpha_v$ is the opacity. Raw outputs are mapped to rendering parameters via

$$\mu_v = \text{Centre}_v + r \cdot \sigma(\bar{\mu}_v), \quad \alpha_v = \sigma(\bar{\alpha}_v),$$

with $\sigma(\cdot)$ the sigmoid, $r$ typically $3 v_s$, and voxel centers given exactly by grid indices.
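The decoding step can be sketched as a mapping from a raw 18-vector to rendering parameters; the exact parameter ordering is not specified in the text, so the layout below is an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_gaussian(raw, voxel_index, voxel_size, r=None):
    """Map a raw 18-vector from the per-voxel MLP to rendering parameters:
    mu_v = Centre_v + r * sigmoid(mu_bar), alpha_v = sigmoid(alpha_bar).
    Assumed layout: [3 center offset | 1 opacity | 6 covariance | 8 SH]."""
    raw = np.asarray(raw, dtype=float)
    if r is None:
        r = 3.0 * voxel_size                # r is typically 3 * v_s (per the text)
    centre = np.asarray(voxel_index, dtype=float) * voxel_size  # center from grid index
    mu = centre + r * sigmoid(raw[:3])
    alpha = sigmoid(raw[3])
    # rebuild the symmetric covariance from its six unique (upper-tri) entries
    Sigma = np.zeros((3, 3))
    Sigma[np.triu_indices(3)] = raw[4:10]
    Sigma = Sigma + Sigma.T - np.diag(np.diag(Sigma))
    sh = raw[10:18]                         # spherical-harmonic color coefficients
    return mu, alpha, Sigma, sh
```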

2. Mathematical Formulation

The resulting set of $N_v$ Gaussians

$$G = \{ (\mu_i, \Sigma_i, \alpha_i, c_i) \}_{i=1}^{N_v}$$

provides a continuous density field

$$\sigma(\mathbf{x}) = \sum_{i=1}^{N_v} \alpha_i \, \mathcal{N}(\mathbf{x} \mid \mu_i, \Sigma_i)$$

and an emitted radiance function for ray direction $\mathbf{d}$,

$$C(\mathbf{x}, \mathbf{d}) = \sum_{i=1}^{N_v} \alpha_i \, \mathcal{N}(\mathbf{x} \mid \mu_i, \Sigma_i) \left[ h(\mathbf{d}) \cdot c_i \right],$$

where $h(\mathbf{d})$ are order-2 spherical harmonic basis functions and $c_i$ are 8-dimensional learned color coefficients.

Volume rendering is performed along rays $r(t) = \mathbf{o} + t\mathbf{d}$ with transmittance

$$T(t) = \exp\!\left(-\int_0^t \sigma(r(s))\, ds\right), \quad C_{ray} = \int_0^{\infty} T(t)\, \sigma(r(t))\, C(r(t), \mathbf{d})\, dt,$$

implemented via analytic integration of anisotropic Gaussians or discrete accumulation along samples.
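A discrete-accumulation renderer for a single ray, following the density and transmittance definitions above (view-dependent SH shading is replaced by a plain RGB color per Gaussian for brevity):

```python
import numpy as np

def render_ray(o, d, gaussians, t_vals):
    """Numerically accumulate color along r(t) = o + t d.
    gaussians: list of (mu, Sigma, alpha, rgb); sigma(x) is the weighted
    sum of Gaussian densities from the text, and transmittance is updated
    per step as T *= exp(-sigma * dt)."""
    d = d / np.linalg.norm(d)
    dt = np.diff(t_vals, prepend=t_vals[0])
    C, T = np.zeros(3), 1.0
    for t, delta in zip(t_vals, dt):
        x = o + t * d
        sigma, c = 0.0, np.zeros(3)
        for mu, Sig, alpha, rgb in gaussians:
            diff = x - mu
            # alpha_i * N(x | mu_i, Sigma_i)
            w = alpha * np.exp(-0.5 * diff @ np.linalg.solve(Sig, diff)) \
                / np.sqrt((2 * np.pi) ** 3 * np.linalg.det(Sig))
            sigma += w
            c += w * rgb
        if sigma > 0:
            a = 1.0 - np.exp(-sigma * delta)   # per-step opacity
            C += T * a * (c / sigma)           # density-weighted mean color
            T *= np.exp(-sigma * delta)        # transmittance update
    return C, T
```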

3. Voxel-Alignment Versus Pixel-Alignment Paradigms

| Paradigm | Alignment Error | Multi-view Consistency |
|---|---|---|
| Pixel-aligned | High: depth-prediction noise and occlusions yield view-biased density and frequent "floaters" | Features are matched in 2D, per pixel, leading to view bias |
| Voxel-aligned | Low: the voxel-index-to-world mapping is exact, eliminating error-prone 2D matching | Features from all views are volumetrically averaged in the same cell, enforcing robust consistency |

Pixel-aligned 3D Gaussian Splatting (3DGS) predictors assign one 3D Gaussian per pixel per view, with each Gaussian center determined by depth regression and camera parameters. The number of Gaussians therefore scales with image resolution and the number of input views, tying model capacity to input coverage. Alignment errors arise from noisy depth predictions $\Delta d$, imperfect 2D feature matching, and ambiguities caused by occlusions or low-texture regions.

Voxel-aligned prediction (VolSplat) fuses all input views into a canonical 3D grid before any Gaussian parameters are estimated. Multi-view consistency is enforced at the volumetric level. The grid-to-world coordinate mapping is exact, with features from every view averaged prior to 3DGS parameter synthesis, thus mitigating view bias, floaters, and misalignment artifacts.

4. Training Procedures and Objectives

VolSplat is trained end-to-end using ground-truth images $\{I_{gt}^{(m)}\}_{m=1}^{M}$ at novel camera poses. The total loss is

$$\mathcal{L} = \sum_{m=1}^{M} \left[ \mathcal{L}_{MSE}\!\left(I_{render}^{(m)}, I_{gt}^{(m)}\right) + \lambda \, \mathcal{L}_{LPIPS}\!\left(I_{render}^{(m)}, I_{gt}^{(m)}\right) \right], \quad \lambda = 0.05.$$
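The objective can be sketched directly; LPIPS is a learned perceptual network in practice, so it is passed in as a callable here (defaulting to zero so the sketch stays dependency-free):

```python
import numpy as np

def total_loss(renders, gts, lam=0.05, perceptual=None):
    """Sum over M novel views of MSE + lambda * perceptual loss.
    renders, gts: lists of (H, W, 3) float arrays.
    `perceptual` stands in for LPIPS; None means that term is skipped."""
    loss = 0.0
    for render, gt in zip(renders, gts):
        mse = np.mean((render - gt) ** 2)
        lp = perceptual(render, gt) if perceptual is not None else 0.0
        loss += mse + lam * lp
    return loss
```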

No explicit regularization of $\Sigma_i$ or $c_i$ is required: photometric and perceptual supervision alone empirically yields well-conditioned covariances and view-consistent color representations.

5. Benchmark Results and Empirical Evaluation

Across major multi-view synthesis datasets (RealEstate10K, ScanNet, and zero-shot transfer to ACID), VolSplat achieves superior quantitative metrics compared to pixel-aligned and prior voxel-aligned baselines. Models are evaluated with 6 input views at $256 \times 256$ resolution. Metrics reported are PSNR (higher is better), SSIM (higher is better), LPIPS (lower is better), and PGS (average number of Gaussians per scene).

| Dataset | Model | PSNR | SSIM | LPIPS | PGS |
|---|---|---|---|---|---|
| RealEstate10K | pixelSplat | 26.09 | 0.863 | 0.136 | 196,608 |
| RealEstate10K | MVSplat | 26.39 | 0.869 | 0.128 | 65,536 |
| RealEstate10K | TranSplat | 26.69 | 0.875 | 0.125 | 65,536 |
| RealEstate10K | DepthSplat | 27.47 | 0.889 | 0.114 | 65,536 |
| RealEstate10K | GGN | 26.18 | 0.825 | 0.154 | 9,375 |
| RealEstate10K | VolSplat | 31.30 | 0.941 | 0.075 | 65,529 |
| ScanNet | FreeSplat | 27.45 | 0.829 | 0.222 | 63,668 |
| ScanNet | FreeSplat++ | 27.45 | 0.829 | 0.223 | 69,569 |
| ScanNet | VolSplat | 28.41 | 0.906 | 0.127 | 65,406 |
| ACID (zero-shot) | DepthSplat | 28.37 | 0.847 | 0.141 | - |
| ACID (zero-shot) | VolSplat | 32.65 | 0.932 | 0.092 | - |

Qualitative reconstructions by VolSplat demonstrate reduced floaters, sharper geometric details, and enhanced fidelity. Adaptive voxel-based allocation concentrates Gaussians in high-frequency regions and corners, sparsifying planar regions.

6. Limitations and Prospective Improvements

Current limitations include a memory-resolution trade-off: decreasing the voxel side $v_s$ increases detail but grows memory and grid size cubically. For $v_s = 0.1$ m, GPU memory usage reaches 8 GB (A100-class hardware). The present model is static in time; dynamic or deformable scenes would require time-varying voxel grids or explicit motion modeling.
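The cubic growth can be checked with simple arithmetic (a dense upper bound; the scene extent and channel count below are illustrative assumptions, and a sparse grid grows more slowly):

```python
def voxel_memory_gb(extent_m, voxel_size_m, channels, bytes_per_val=4):
    """Back-of-envelope dense memory bound for a cubic voxel grid.
    Halving the voxel side multiplies the voxel count (and dense
    memory) by 8; sparse storage only pays for occupied cells."""
    n_per_axis = round(extent_m / voxel_size_m)
    n_vox = n_per_axis ** 3
    return n_vox * channels * bytes_per_val / 1e9

# e.g. a 10 m extent with 32 float32 channels:
# v_s = 0.2 m -> 50^3 voxels; v_s = 0.1 m -> 100^3 voxels (8x more)
```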

Potential future research avenues include:

  • Hierarchical or octree-based voxel grids for memory-efficient fine-scale representation.
  • Learned adaptive voxel sizing to focus detail where most needed.
  • Semantic or instance-level integration into voxel feature representations.
  • Extensions for dynamic scenes through space–time voxel alignment.
  • Real-time updates, relevant for robotics/AR, by enabling streaming updates to the voxel U-Net.

A plausible implication is that the voxel-aligned paradigm generalizes robustly to new scenes and imaging modalities due to direct volumetric feature fusion and consistent 3D spatial correspondence. The framework provides denser, more plausible reconstructions and establishes a scalable foundation for feed-forward 3D scene modeling, multi-view rendering, and potential downstream applications in robotics and augmented reality (Wang et al., 23 Sep 2025).
