
Feed-forward 3D Gaussian Splatting (3DGS)

Updated 24 December 2025
  • Feed-forward 3DGS is a neural representation that predicts anisotropic 3D Gaussian primitives in a single pass, enabling efficient real-time scene reconstruction and novel view synthesis.
  • It employs diverse architectural paradigms, including pixel-aligned, voxel-aligned, and off-the-grid methods, to optimize 3D geometry quality and rendering performance.
  • Recent advances in adaptive density control, compression, and pose-free inference enhance 3DGS scalability and robustness across various computer vision and graphics applications.

Feed-forward 3D Gaussian Splatting (3DGS) is a neural representation and rendering framework that enables real-time, high-fidelity 3D scene reconstruction and novel view synthesis by directly predicting collections of anisotropic 3D Gaussian primitives from image data in a single pass. In contrast to optimization-based 3DGS, where model parameters undergo iterative, scene-specific adjustment, feed-forward methods infer all scene geometry, appearance, and rendering attributes efficiently and without per-scene optimization. This paradigm underpins significant recent advances in the efficiency, scalability, and practicality of neural 3D reconstruction across diverse application regimes in computer vision, graphics, and beyond.

1. Parametric Foundation and Pipeline Overview

Feed-forward 3DGS represents a scene as an unordered set of anisotropic 3D Gaussians, where each primitive is parameterized by a position $\mu_i \in \mathbb{R}^3$, a covariance $\Sigma_i \in \mathbb{R}^{3 \times 3}$ (typically factored into a scale $s_i$ and a rotation $R_i$), an opacity $\alpha_i \in [0, 1]$, and typically spherical harmonic color coefficients $c_i$ for view-dependent appearance. The spatial density at a point $x$ is commonly expressed as

$$G_i(x) = \exp\!\left( -\tfrac{1}{2}\,(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)$$
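
In the standard factorization, $\Sigma_i = R_i\,\mathrm{diag}(s_i)^2\,R_i^T$, which keeps every predicted covariance positive semi-definite. The NumPy sketch below (illustrative only; the function names are not from any cited system) evaluates a single Gaussian's density from its predicted scale and rotation:

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def gaussian_density(x, mu, scale, quat):
    """Evaluate G_i(x) with covariance Sigma = R diag(s)^2 R^T."""
    R = quat_to_rotmat(quat)
    Sigma = R @ np.diag(scale**2) @ R.T
    d = x - mu
    return float(np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)))

# Example: an axis-aligned Gaussian stretched along z, queried slightly off-center.
print(gaussian_density(
    x=np.array([0.1, 0.0, 0.3]),
    mu=np.zeros(3),
    scale=np.array([0.1, 0.1, 0.5]),
    quat=np.array([1.0, 0.0, 0.0, 0.0]),
))
```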

Rendering proceeds by projecting (splatting) these 3D ellipsoids onto the image plane, where their contributions are composited in depth order using a volumetric emission-absorption model. Each Gaussian projects to a 2D ellipse, and front-to-back alpha compositing combines per-pixel occlusion and radiance effects (Matias et al., 20 Oct 2025). The color at pixel $p$ is

$$I(p) = \sum_{i=1}^{M} c_i\,\alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$$

The feed-forward nature means that all Gaussian parameters, and in recent extensions even the camera poses, are inferred in a single pass through a neural network given a set of input images (posed or unposed) (Jiang et al., 29 May 2025, Moreau et al., 17 Dec 2025). This architectural choice distinguishes feed-forward 3DGS from prior per-scene optimization methods.
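
As a concrete illustration of the compositing rule above, the sketch below accumulates depth-sorted, per-pixel splat contributions front to back; the function name and early-termination threshold are illustrative assumptions rather than any renderer's actual API.

```python
import numpy as np

def composite_pixel(colors, alphas):
    """Front-to-back compositing: I(p) = sum_i c_i * a_i * prod_{j<i} (1 - a_j).
    `colors` and `alphas` must already be sorted from nearest to farthest."""
    transmittance = 1.0
    pixel = np.zeros(3)
    for c, a in zip(colors, alphas):
        pixel += transmittance * a * np.asarray(c, dtype=float)
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:  # early termination, as in tile-based rasterizers
            break
    return pixel

# Two splats: a mostly opaque red one in front of a green one.
print(composite_pixel(colors=[(1, 0, 0), (0, 1, 0)], alphas=[0.9, 0.5]))
```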

2. Architectural Paradigms and Variational Alignments

The network architectures underpinning feed-forward 3DGS fall into several principal designs:

  • Pixel-aligned pipelines: Each output Gaussian is associated with a pixel in the input (or reference) view; for each pixel, a depth and additional geometric/color features are predicted and unprojected to construct the 3D primitive. This approach appears in pixelSplat, DepthSplat, and the baseline stages of most feed-forward pipelines. Limitations include redundancy, view bias, and misalignment errors, especially with sparse or occluded input (Wang et al., 23 Sep 2025, Moreau et al., 17 Dec 2025).
  • Voxel-aligned architectures: VolSplat (Wang et al., 23 Sep 2025) and related methods introduce a volumetric, voxel-aligned feature aggregation in 3D, enabling adaptively sparse, geometry-consistent Gaussian prediction. This involves constructing a 3D grid from unprojected multi-view features, applying sparse 3D U-Net volumetric refinement, and regressing per-voxel Gaussian parameters, fundamentally decoupling the number of primitives from image resolution and enabling improved multi-view consistency.
  • Off-the-grid detection: Recent architectures detect Gaussians at sub-pixel, adaptive locations using heatmap-based keypoint detectors, followed by unprojection and parameter regression ("Off-The-Grid" (Moreau et al., 17 Dec 2025)). Such pipelines achieve high detail with significantly fewer primitives.
  • Pose-free and unconstrained pipelines: Some models (AnySplat (Jiang et al., 29 May 2025), PF3plat (Hong et al., 29 Oct 2024)) regress both 3D Gaussians and camera parameters—including intrinsics and extrinsics—solely from unposed image collections.

A typical feed-forward 3DGS pipeline consists of 2D (or hybrid 2D/3D) encoders (CNN or Transformer), cross-view fusion blocks (often with attention-based or cost-volume mechanisms), Gaussian parameter decoders (MLP or shallow CNNs), and a differentiable 3DGS renderer for supervision.
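
A schematic PyTorch sketch of the pixel-aligned variant of such a pipeline follows. The channel layout, layer sizes, and unprojection step are illustrative assumptions and do not reproduce the architecture of any specific cited method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAlignedGaussianHead(nn.Module):
    """Toy pixel-aligned decoder: fused per-pixel features -> one Gaussian per pixel.
    Illustrative channel layout: 1 depth + 1 opacity + 3 scale + 4 rotation (quat) + 3 color."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.head = nn.Conv2d(feat_dim, 12, kernel_size=1)

    def forward(self, feats, K_inv, cam_to_world):
        # feats: (B, C, H, W) fused multi-view features; K_inv: (3, 3); cam_to_world: (4, 4)
        B, _, H, W = feats.shape
        out = self.head(feats)                                    # (B, 12, H, W)
        depth = out[:, 0:1].exp()                                 # positive depth
        opacity = out[:, 1:2].sigmoid()
        scale = out[:, 2:5].exp()
        rotation = F.normalize(out[:, 5:9], dim=1)                # unit quaternion
        color = out[:, 9:12].sigmoid()

        # Unproject each pixel to a world-space Gaussian mean:
        # mu = T_cw @ [depth * K^{-1} [u, v, 1]^T ; 1]
        v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        pix = torch.stack([u, v, torch.ones_like(u)], dim=0).float()    # (3, H, W)
        rays = torch.einsum("ij,jhw->ihw", K_inv, pix)                   # camera-space rays
        pts = torch.cat([depth * rays.unsqueeze(0), torch.ones_like(depth)], dim=1)
        mu = torch.einsum("ij,bjhw->bihw", cam_to_world, pts)[:, :3]

        return {"mu": mu, "opacity": opacity, "scale": scale,
                "rotation": rotation, "color": color}

# Example with random inputs (hypothetical shapes):
head = PixelAlignedGaussianHead(feat_dim=64)
gaussians = head(torch.randn(1, 64, 32, 48), torch.eye(3), torch.eye(4))
print(gaussians["mu"].shape)  # torch.Size([1, 3, 32, 48])
```

A differentiable splatting renderer would then rasterize these per-pixel Gaussians into held-out views, with photometric losses supervising the whole network end to end.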

3. Compression and Storage-Efficient Extensions

Feed-forward 3DGS representations, while efficient to render, are inherently large due to the dense, high-dimensional parameterization of each Gaussian. Several recent frameworks introduce learned feed-forward compression for 3DGS:

  • Fast Compression of 3D Gaussian Splatting (FCGS): FCGS is a pioneering optimization-free compressor that processes a Gaussian scene in a single pass via a multi-path entropy module (MEM), routing geometric features through direct quantization and color features through either an analysis/synthesis autoencoder or direct quantization, conditioned on a binary mask predicted by a small MLP. Inter- and intra-Gaussian context models, which statistically capture redundancy via a 3D spatial grid and intra-channel autoregression, enable arithmetic coding to achieve over 20× compression with less than 0.2 dB fidelity loss and under 10 s encoding time for one million Gaussians, outperforming per-scene optimization methods (Chen et al., 10 Oct 2024).
  • Long-context modeling (LocoMoco): LocoMoco leverages Morton-order serialization to expose long-range spatial correlations and processes Gaussians in large context windows (L = 1024). Attention-based transform coding and a space-channel autoregressive entropy model yield a further ~10% BD-rate gain over FCGS, maintaining 20× compression while modeling long-range interdependence (Liu et al., 30 Nov 2025); a brief sketch of Morton-order serialization follows below.

Both frameworks emphasize strict design separation between sensitive geometry and tolerant color channels, windowed context modeling for entropy regularization, and efficient inference/decoding rates without per-scene retraining.
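
For intuition, the sketch below shows the kind of Morton-order (Z-order) serialization and fixed-length windowing that long-context entropy models such as LocoMoco build on; the bit width, window length, and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def morton_code(q, bits=10):
    """Interleave the bits of quantized (x, y, z) coordinates into one Morton key,
    so Gaussians that are close in 3D tend to be adjacent in the serialized order."""
    codes = np.zeros(len(q), dtype=np.uint64)
    for axis in range(3):
        coord = q[:, axis].astype(np.uint64)
        for b in range(bits):
            codes |= ((coord >> np.uint64(b)) & np.uint64(1)) << np.uint64(3 * b + axis)
    return codes

def serialize_gaussians(positions, bits=10, window=1024):
    """Quantize positions to a 2^bits grid, sort by Morton key, and cut the order into
    fixed-length context windows (e.g., L = 1024) for a long-context entropy model."""
    lo, hi = positions.min(axis=0), positions.max(axis=0)
    q = (positions - lo) / (hi - lo + 1e-9) * (2**bits - 1)
    order = np.argsort(morton_code(np.round(q), bits))
    return [order[i:i + window] for i in range(0, len(order), window)]

# 10,000 random Gaussian centers -> 10 spatially coherent windows of indices.
windows = serialize_gaussians(np.random.rand(10_000, 3))
print(len(windows), windows[0][:5])
```

Serializing along a space-filling curve keeps spatially neighboring Gaussians inside the same window, so the entropy model can exploit local geometric redundancy without attending over the full scene.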

4. Enhancements: Adaptive, Robust, and Specialized Feed-forward 3DGS

Recent research confronts several practical and theoretical limitations of naïve feed-forward 3DGS, introducing domain- or challenge-targeted advances:

  • Density control: EcoSplat implements explicit control of the primitive count by ranking predicted Gaussians by importance based on photometric and geometric variation, enabling user-specified trade-offs between quality and efficiency at inference time (Park et al., 21 Dec 2025); a minimal pruning sketch in this spirit appears after this list.
  • Adaptive primitive placement: "Off-The-Grid" 3DGS dispenses with grid structure, detecting 3D Gaussians at continuous, entropy-adaptive positions, thus dramatically lowering primitive counts (by 5–7× for the same fidelity) and avoiding grid artifacts (Moreau et al., 17 Dec 2025).
  • Wide-baseline and sparse-view robustness: ProSplat augments feed-forward 3DGS with a one-step diffusion refinement network, incorporating explicit reference view conditioning (MORI) and distance-weighted epipolar attention (DWEA) to mitigate blur, enhance geometric consistency, and recover fine detail in challenging wide-baseline settings (Lu et al., 9 Jun 2025).
  • Unposed and unfavorable camera scenarios: Models such as UFV-Splatter (Fujimura et al., 30 Jul 2025) and PF3plat (Hong et al., 29 Oct 2024) address the absence of calibration and pose/overlap constraints through geometry-aware recentering, low-rank attention fine-tuning, and learned correspondence-based pose and depth refinement. These unlock feed-forward 3DGS for in-the-wild or object-centric capture regimes.
  • Omnidirectional and specialized content: Approaches such as OmniSplat (Lee et al., 21 Dec 2024) adapt pixel-based feed-forward models for 360° images using Yin-Yang decomposition, while FastAvatar (Liang et al., 25 Aug 2025) delivers sub-10 ms human face reconstruction from a single image via residual encoding over a data-driven 3DGS template.
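
The following sketch illustrates budgeted pruning in the spirit of EcoSplat's importance-based density control referenced above; the scoring heuristic (opacity times largest scale) is an illustrative assumption, not the criterion used in the paper.

```python
import numpy as np

def prune_to_budget(opacity, scale, budget_k):
    """Keep the budget_k most 'important' Gaussians.
    Illustrative importance: opacity times largest-axis extent, a rough proxy
    for each primitive's contribution to rendered images."""
    importance = opacity * scale.max(axis=1)
    keep = np.argsort(-importance)[:budget_k]
    return np.sort(keep)  # indices of retained Gaussians, in original order

# Example: a network predicts 200k Gaussians; keep a user-specified budget of 80k.
opacity = np.random.rand(200_000)
scale = np.random.rand(200_000, 3) * 0.05
kept = prune_to_budget(opacity, scale, budget_k=80_000)
print(kept.shape)  # (80000,)
```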

5. Benchmark Performance and Efficiency

Feed-forward 3DGS systems achieve real-time or near-instant inference, often sub-second per novel view and, in some cases, per scene. State-of-the-art methods (e.g., VolSplat, LocoMoco, EcoSplat, Off-The-Grid) report:

| Method | Input/Views | #Gaussians | PSNR (dB) | SSIM | LPIPS | Time/Scene | Notable Features |
|---|---|---|---|---|---|---|---|
| VolSplat (Wang et al., 23 Sep 2025) | 6 | ~65,000 | 31.30 | 0.941 | 0.075 | – | Voxel-aligned, MSFT 3D U-Net |
| FCGS (Chen et al., 10 Oct 2024) | – | 1 million | 29.2 | 0.897 | – | <10 s | Compression, 20× reduction |
| LocoMoco (Liu et al., 30 Nov 2025) | – | – | ∼29.2 | – | – | ∼11 s encode | Long-context Morton, 10% lower BD-rate |
| ProSplat (Lu et al., 9 Jun 2025) | 4 | – | 18.20 | 0.536 | 0.347 | – | Diffusion, wide-baseline |
| Off-The-Grid (Moreau et al., 17 Dec 2025) | 3 | 115k | 20.12 | 0.6629 | 0.2962 | – | Adaptive detection, 5–7× fewer splats |
| EcoSplat (Park et al., 21 Dec 2025) | 24 | 78k–629k | 24.72–25.11 | 0.822–0.835 | 0.183–0.164 | – | K-control, optimal fidelity–efficiency |

Across their respective benchmarks, these methods surpass classical grid-aligned baselines in fidelity, resource efficiency, or both; e.g., Off-The-Grid achieves higher PSNR with 5–7× fewer primitives than AnySplat or pixelSplat (Moreau et al., 17 Dec 2025).

6. Current Limitations and Future Research Directions

Limitations remain in handling highly dynamic scenes and specular or transparent materials (due to the local, low-parametric nature of Gaussian splats), and in scaling inference to large (1000+) input-view collections without bottlenecks (addressed in part by ZPressor (Wang et al., 29 May 2025)). Lighting is typically baked into the 3DGS representation, with emerging work on explicit BRDF parameter regression and learned radiance transfer for secondary-ray phenomena (Matias et al., 20 Oct 2025).

Feed-forward methods still struggle with severe occlusions, missing data, or large pose errors in extremely sparse or unconstrained settings. Advances in adaptive density allocation, dynamic scene extension, hierarchical and long-range context modeling, and integration with large-scale geometric and correspondence models (VGGT, DUSt3R) are active research topics (Liu et al., 30 Nov 2025, Moreau et al., 17 Dec 2025). Self-supervised refinement loops (as in "Off-The-Grid") that improve camera and geometry accuracy without labels offer promising avenues for foundation model pretraining.

In summary, feed-forward 3DGS constitutes a scalable, efficient, and extensible paradigm for single-pass 3D scene representation and synthesis, with continual innovations in representation sparsity, entropy modeling, adaptive placement, robustness, and compression. It forms a foundational component for real-time, photorealistic neural rendering and reconstruction systems across both constrained and unconstrained environments (Chen et al., 10 Oct 2024, Wang et al., 23 Sep 2025, Liu et al., 30 Nov 2025, Moreau et al., 17 Dec 2025, Park et al., 21 Dec 2025, Lu et al., 9 Jun 2025).
