Feed-Forward Gaussian Splatting

Updated 4 July 2026

Feed-forward Gaussian splatting is a reconstruction pipeline that predicts scene-specific Gaussian primitives from sparse input images in a single forward pass.
The method replaces per-scene iterative optimization with a generalizable network, enabling rapid novel-view synthesis and multi-view aggregation.
Recent approaches incorporate transformer-based aggregation and condition-aware adaptation to robustly handle noise, varied lighting, and high-resolution demands.

Feed-forward Gaussian splatting denotes a class of generalizable reconstruction pipelines in which a pretrained network predicts a scene-specific set of Gaussian primitives from one or more input images in a single forward pass, rather than optimizing Gaussian parameters per scene. Across recent formulations, the predicted scene is represented as a set such as

$\mathcal{G}=\{(\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j,\mathbf{h}_j,\alpha_j)\}_{j=1}^{N_g},$

or an equivalent parameterization by position, covariance, rotation, scale, opacity, and spherical-harmonic color, and a differentiable splatting renderer produces target images by projecting these Gaussians into the image plane and compositing them along each ray (Jiang et al., 26 May 2026, Jiang et al., 29 May 2025).
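The scale-rotation parameterization mentioned above can be illustrated with a minimal sketch. It assumes the factorization used in standard 3DGS, $\boldsymbol{\Sigma}=\mathbf{R}\,\mathrm{diag}(\mathbf{s})^2\,\mathbf{R}^\top$, with the rotation given as a quaternion in (w, x, y, z) order; the function names are hypothetical and not taken from any cited system.

```python
import math

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix (assumed ordering)."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)  # normalize so R is a valid rotation
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def covariance(q, scale):
    """Build Sigma = R diag(scale)^2 R^T, which is symmetric and positive
    definite whenever every scale entry is non-zero."""
    R = quat_to_rotmat(q)
    return [
        [sum(R[i][k] * scale[k] ** 2 * R[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]
```

Predicting a quaternion and a scale vector, rather than the six free entries of $\boldsymbol{\Sigma}$ directly, keeps every network output a valid covariance without any explicit constraint.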

1. Definition and representational core

In the feed-forward setting, the central operation is a learned mapping from sparse observations to a renderable Gaussian scene. This contrasts with classic 3DGS, where each new scene requires per-scene gradient descent. Recent feed-forward systems therefore replace scene-specific optimization with a reusable predictor that outputs Gaussian parameters directly from images, often together with depth, camera parameters, or other auxiliary scene variables (Jiang et al., 29 May 2025).

The representational core remains close to standard 3DGS. Each Gaussian typically carries a 3D center $\boldsymbol{\mu}$, a positive-definite covariance $\boldsymbol{\Sigma}$ or its scale-rotation reparameterization, an opacity $\alpha \in (0,1)$, and view-dependent color encoded by spherical harmonics. Rendering is performed by projecting each Gaussian into the target camera, transforming $\boldsymbol{\Sigma}$ into a 2D elliptical footprint, and using front-to-back alpha compositing,

$C_{\text{out}} = \sum_j w_j c_j,\qquad w_j = \alpha_j \prod_{k<j}(1-\alpha_k),$

or an equivalent rasterized form (Jiang et al., 26 May 2026, Zhang et al., 2024).
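The compositing rule above can be sketched for a single pixel as follows. This is a minimal illustration, not a rasterizer: it assumes the Gaussians are already sorted from near to far and that each $\alpha_j$ has already been evaluated at the pixel (in practice the opacity is multiplied by the projected 2D Gaussian falloff). The function name is hypothetical.

```python
def composite_front_to_back(colors, alphas):
    """Blend per-Gaussian colors for one pixel.

    colors: list of (r, g, b) tuples, sorted near to far.
    alphas: list of per-pixel alpha values in [0, 1], same order.
    Returns the blended color, the per-Gaussian weights w_j, and the
    transmittance left over after the last Gaussian.
    """
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0  # running product of (1 - alpha_k) for k < j
    weights = []
    for color, alpha in zip(colors, alphas):
        w = alpha * transmittance  # w_j = alpha_j * prod_{k<j} (1 - alpha_k)
        weights.append(w)
        for ch in range(3):
            out[ch] += w * color[ch]
        transmittance *= 1.0 - alpha
    return out, weights, transmittance
```

The weights and the leftover transmittance always sum to one, so the leftover term is the natural coefficient for a background color.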

A major axis of variation is the camera model assumed at inference. Some methods are posed and depth-centric, taking calibrated sparse views and estimating geometry by multi-view reasoning. Others are pose-free or pose-light: AnySplat predicts camera intrinsics and extrinsics jointly with the Gaussian scene from uncalibrated image collections, WildSplatter relies on per-pixel ray origins and directions from a Depth Anything 3 backbone, and DrivingForward uses a pose network only during training and discards it at inference (Jiang et al., 29 May 2025, Fujimura et al., 23 Apr 2026, Tian et al., 2024). This establishes feed-forward Gaussian splatting not as a single architecture, but as a family of explicit radiance-field predictors with different assumptions about geometry, calibration, and appearance.

2. Architectural lineages

One dominant lineage is cost-volume-based multi-view inference. DelowlightSplat, DenoiseSplat, PanSplat, and ProSplat all follow an MVSNet-style paradigm: per-view features are extracted, warped across depth hypotheses, aggregated into a 3D cost volume, regularized by 3D CNNs or attention, and decoded into depth and Gaussian attributes (Jiang et al., 26 May 2026, Jiang et al., 10 Mar 2026, Zhang et al., 2024, Lu et al., 9 Jun 2025). Within this family, the Gaussian center is often obtained by back-projecting a pixel-depth hypothesis,

$\boldsymbol{\mu}(\mathbf{u},d)=\mathbf{o}+d\,\mathbf{r}(\mathbf{u};\mathbf{K},\mathbf{T}),$

after which scale, rotation, opacity, and SH coefficients are predicted by lightweight heads (Jiang et al., 26 May 2026).
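The back-projection step can be sketched as below, under assumptions the text leaves open: a pinhole camera with intrinsics `fx, fy, cx, cy`, a camera-to-world pose given as rotation `R_c2w` and translation `t_c2w`, and `depth` measured along the camera z-axis, so the ray is left unnormalized with unit z component. All names are hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy, R_c2w, t_c2w):
    """Lift pixel (u, v) with a z-depth hypothesis to a 3D Gaussian center.

    Implements mu = o + d * r, where o is the camera center (t_c2w) and
    r is the pixel ray rotated into world coordinates.
    """
    # Ray in camera coordinates, K^{-1} [u, v, 1]^T for a pinhole K without skew.
    ray_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the ray into the world frame.
    ray_world = [sum(R_c2w[i][k] * ray_cam[k] for k in range(3)) for i in range(3)]
    # Walk `depth` units along the ray from the camera center.
    return [t_c2w[i] + depth * ray_world[i] for i in range(3)]
```

With this convention every pixel and depth hypothesis yields exactly one candidate center, which is why pixel-aligned models emit one Gaussian per pixel per view.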

A second lineage replaces explicit cost volumes with transformer token reasoning over views. AnySplat uses a VGGT-style geometry transformer with camera, depth, and Gaussian heads, and then applies differentiable voxelization to merge per-pixel Gaussians into a compact renderable set (Jiang et al., 29 May 2025). The "Z-Order Transformer for Feed-Forward Gaussian Splatting" serializes candidate Gaussians by Morton order, applies sparse attention over spatially coherent sequences, and uses Z-order pooling to suppress redundancy before Gaussian prediction (Wang et al., 13 May 2026). AnyCity introduces an observation-supported geometry latent $Z_{\mathrm{geo}}$ and a separate completion branch that predicts a gated residual $\Delta Z$ before Gaussian decoding, explicitly separating reliable scaffold geometry from prior-driven completion in sparse aerial scenes (Wu et al., 19 May 2026).
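Morton (Z-order) serialization can be illustrated with a generic sketch. It shows only the textbook bit-interleaving idea; the quantization grid, bit depth, and function names are assumptions for illustration and are not taken from the Z-Order Transformer itself.

```python
def morton3d(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three integer grid indices."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code

def morton_order(points, lo, hi, bits=10):
    """Return the indices of `points` sorted along the Z-order curve.

    points: list of (x, y, z); lo, hi: per-axis bounds of the scene box.
    """
    cells = (1 << bits) - 1

    def key(p):
        # Quantize each coordinate to an integer cell, clamped to the grid.
        idx = [
            min(cells, max(0, int((p[a] - lo[a]) / (hi[a] - lo[a]) * cells)))
            for a in range(3)
        ]
        return morton3d(idx[0], idx[1], idx[2], bits)

    return sorted(range(len(points)), key=lambda i: key(points[i]))
```

Because nearby cells tend to share high-order bits, points that are close in 3D usually end up close in the sorted sequence, which is what makes windowed or sparse attention over the sequence spatially meaningful.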

A third lineage emphasizes independent per-view geometry with later aggregation. DrivingForward jointly trains pose, depth, and Gaussian networks for surround-view driving scenes, predicting one Gaussian per pixel from each image and aggregating them without a global multi-view transformer at inference (Tian et al., 2024). Splat-SAP uses scale-aware point maps as an intermediate representation robust to large sparsity, then refines target-view geometry via stereo matching before anchoring Gaussian primitives on the refined plane (Zhou et al., 27 Nov 2025). These designs treat feed-forward Gaussian splatting less as direct regression from fused latent space and more as a structured lift from image-wise geometry to explicit splats.

3. Condition-aware adaptation and generative completion

A significant recent trend is condition-aware adaptation. DelowlightSplat targets the setting in which the context views are captured in low light while the target views are clean. Its Lowlight Adapter applies residual enhancement,

$\tilde I_i=\mathrm{clip}\big(I_i+\lambda\cdot\tanh(\Delta(I_i))\big),$

before any cost-volume construction, with the explicit goal of improving feature matchability rather than standalone image restoration (Jiang et al., 26 May 2026). DenoiseSplat addresses Gaussian, Poisson, speckle, and salt-and-pepper noise by training an MVSplat-style backbone end-to-end from noisy inputs to clean renderings, and adds a dual-branch Gaussian head that decouples geometry from appearance, plus a Cross-Branch Boundary-guided Correction module for uncertain depth boundaries (Jiang et al., 10 Mar 2026). In both cases, the feed-forward scene predictor is optimized to recover a clean Gaussian radiance field from degraded observations rather than reproduce the corruption.
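The residual enhancement $\tilde I_i=\mathrm{clip}(I_i+\lambda\cdot\tanh(\Delta(I_i)))$ can be sketched on a toy grayscale image. In this sketch `delta_fn` is a hypothetical per-pixel stand-in for the learned network $\Delta$, and both the clip range $[0,1]$ and the default value of `lam` are assumptions; none of these details come from the cited paper.

```python
import math

def residual_enhance(image, delta_fn, lam=0.5):
    """Apply clip(I + lam * tanh(delta(I))) to every pixel.

    image: nested list of intensities in [0, 1].
    delta_fn: hypothetical stand-in for the learned residual predictor.
    lam: residual scale; tanh bounds the correction to [-lam, lam].
    """
    return [
        [min(1.0, max(0.0, p + lam * math.tanh(delta_fn(p)))) for p in row]
        for row in image
    ]
```

Because $\tanh$ saturates, the correction added to any pixel never exceeds $\lambda$ in magnitude, so the adapter can brighten features for matching without replacing the input wholesale.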

A related strand separates geometry from appearance under severe viewpoint or illumination variation. WildSplatter explicitly disentangles geometry and appearance by predicting geometry from context images and a global appearance embedding from a reference image, then modulating Gaussian SH colors with that embedding; it reconstructs 3D Gaussians from sparse input views in under one second and enables appearance control under diverse lighting conditions (Fujimura et al., 23 Apr 2026). ProSplat uses a two-stage design for wide-baseline sparse views: a DepthSplat-based Gaussian generator first renders target views, then a one-step diffusion improvement model refines those renderings using Maximum Overlap Reference view Injection and Distance-Weighted Epipolar Attention (Lu et al., 9 Jun 2025). AnyCity extends this logic into a controlled generative regime by letting completion tokens act only as scaffold-conditioned residuals on weakly constrained content (Wu et al., 19 May 2026). This suggests a broader shift from purely deterministic Gaussian regression toward reconstruction-first systems with explicit condition adaptation and tightly bounded generative completion.

4. Task-specific expansions

Feed-forward Gaussian splatting has rapidly specialized to particular sensing regimes. PanSplat adapts the paradigm to wide-baseline panorama synthesis by replacing pixel-aligned splats with a spherical 3D Gaussian pyramid on a Fibonacci lattice, combining a hierarchical spherical cost volume with local Gaussian heads and two-step deferred backpropagation to support resolutions up to 4K (Zhang et al., 2024). The representational change is geometric as much as computational: spherical sampling avoids the non-uniform density induced by equirectangular pixel grids.
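A Fibonacci lattice on the unit sphere can be generated with the standard golden-angle construction sketched below. This shows the generic lattice only; how PanSplat arranges its pyramid levels on top of such a lattice is not reproduced here, and the function name is hypothetical.

```python
import math

def fibonacci_sphere(n):
    """Return n approximately uniformly distributed points on the unit sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        # Evenly spaced z values give equal-area latitude bands.
        z = 1.0 - (2.0 * i + 1.0) / n
        radius = math.sqrt(max(0.0, 1.0 - z * z))
        # Advancing the longitude by the golden angle avoids aligned columns.
        phi = golden_angle * i
        points.append((radius * math.cos(phi), radius * math.sin(phi), z))
    return points
```

Since the z values are evenly spaced, half of the points fall in the two polar caps with $|z|>0.5$, matching their share of the sphere's area. An equirectangular grid with evenly spaced latitude rows instead places two thirds of its samples there.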

Sparse aerial reconstruction introduces a different failure mode, namely evidence imbalance between repeatedly observed roofs and weakly supported facades or occluded structures. AnyCity addresses this by first predicting an observation-grounded scaffold latent and only then using scaffold-conditioned completion tokens and a video diffusion prior to update weak regions before Gaussian decoding (Wu et al., 19 May 2026). In driving scenes, DrivingForward replaces strong overlap assumptions with scale-aware self-supervised depth and pose learning from flexible surround-view input, allowing one or multiple timesteps and different camera combinations without depth ground truth or per-frame extrinsics during training (Tian et al., 2024). In human-centered wide-baseline binocular settings, Splat-SAP uses scale-aware point map reconstruction, iterative affinity learning, and a target-view Gaussian plane to maintain stability under extreme sparsity (Zhou et al., 27 Nov 2025).

The same expansion has occurred in pose-free and semantic directions. AnySplat handles uncalibrated image collections and predicts both camera parameters and Gaussians in one pass, from a single view up to hundreds of views (Jiang et al., 29 May 2025). UFV-Splatter adapts pose-free pretrained models to unfavorable views by recentering inputs, inserting LoRA layers into the backbone, and refining predicted Gaussians with a dedicated adapter and alignment stage (Fujimura et al., 30 Jul 2025). FLEG extends the output space itself: each semantic Gaussian carries a language-aligned feature vector in addition to geometry and color, enabling open-vocabulary querying and segmentation from arbitrary unposed, uncalibrated multi-view images without 3D annotations (Tian et al., 19 Dec 2025). Feed-forward Gaussian splatting therefore increasingly serves not only novel-view synthesis, but also camera estimation, semantic lifting, and controllable scene interaction.

5. Resolution, redundancy, and compression

A central systems challenge is that many feed-forward models are pixel-aligned, so primitive count grows with image resolution. PanSplat addresses this for panoramas through spherical lattices, local Gaussian heads, tiled processing, and deferred backpropagation on a single A100 GPU (Zhang et al., 2024). LGTM attacks the same scaling law more directly by decoupling geometry from rendering resolution: it predicts a compact grid of 2D Gaussian primitives and attaches per-primitive color and alpha textures, allowing high-fidelity 4K synthesis with significantly fewer Gaussian primitives than pixel-aligned feed-forward methods (Lao et al., 26 Mar 2026). The Z-Order Transformer further treats redundancy as a sequence-modeling problem: candidate Gaussians are Morton-ordered, attended over with sparse group and top-$k$ attention, and pooled to produce fewer Gaussian primitives while preserving critical structural details (Wang et al., 13 May 2026).
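The scaling pressure on pixel-aligned models can be made concrete with a small back-of-the-envelope sketch. It assumes one Gaussian per input pixel and the usual 3DGS attribute layout with degree-3 spherical harmonics stored as 32-bit floats; these are illustrative assumptions rather than the configuration of any cited method, and the function names are hypothetical.

```python
def pixel_aligned_count(num_views, height, width, per_pixel=1):
    """Number of Gaussians when every input pixel emits `per_pixel` primitives."""
    return num_views * height * width * per_pixel

def floats_per_gaussian(sh_degree=3):
    """3 position + 3 scale + 4 quaternion + 1 opacity + RGB SH coefficients."""
    return 3 + 3 + 4 + 1 + 3 * (sh_degree + 1) ** 2

def storage_bytes(count, sh_degree=3, bytes_per_float=4):
    """Uncompressed size of `count` Gaussians under the assumed layout."""
    return count * floats_per_gaussian(sh_degree) * bytes_per_float

low_res = pixel_aligned_count(2, 256, 256)      # two 256x256 context views
high_res = pixel_aligned_count(2, 2160, 3840)   # two 4K (3840x2160) context views
print(low_res, high_res, high_res / low_res)
print(storage_bytes(high_res) / 2 ** 30, "GiB")
```

Under these assumptions two 4K views produce roughly 127 times as many primitives as two 256×256 views, which is the growth that texture-based, pooled, or lattice-based representations aim to avoid.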

Compression has become a distinct subfield. CodecSplat moves the coding bottleneck from the irregular 3D Gaussian set back into the structured intermediate 2D Gaussian-generation feature, entropy-codes that latent, and reconstructs depth and Gaussian parameters on the decoder side. It is evaluated on DL3DV and RealEstate10K, where it reports PSNR in dB against per-scene bitstream sizes measured in KiB (Yu et al., 25 May 2026). A complementary feed-forward codec based on long-context modeling uses Morton serialization, attention-based transforms, and a space-channel auto-regressive entropy model over windows of thousands of Gaussians, and reports its 3DGS compression ratio from a single feed-forward inference pass (Liu et al., 30 Nov 2025). These developments suggest that feed-forward Gaussian splatting is increasingly being optimized as a deployable scene representation, not merely as a view-synthesis pipeline.

6. Supervision, evaluation, and open problems

Training signals vary widely but remain overwhelmingly image-driven. Many systems use pixel and perceptual losses between rendered and target views, sometimes with SSIM or LPIPS terms, and often without any 3D ground truth (Jiang et al., 26 May 2026, Jiang et al., 10 Mar 2026). Others supplement rendering loss with geometry-side supervision from pseudo-depth or pseudo-camera estimates: AnySplat distills pose and depth from VGGT while enforcing rendering and depth consistency (Jiang et al., 29 May 2025); FLEG combines photometric novel-view supervision with geometry distillation and instance-guided contrastive feature learning (Tian et al., 19 Dec 2025); DrivingForward uses self-supervised temporal, spatial, and spatial-temporal photometric consistency to train depth and pose before Gaussian prediction (Tian et al., 2024). Evaluation is correspondingly dominated by PSNR, SSIM, and LPIPS, with depth metrics, pose metrics, or semantic mIoU added when the task requires them.
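PSNR, the most common of these metrics, follows directly from the mean squared error between a rendered view and its target. The sketch below assumes images stored as nested lists of intensities in $[0,1]$; the function name is hypothetical, and SSIM and LPIPS are omitted because they need windowed statistics and a pretrained network respectively.

```python
import math

def psnr(rendered, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    squared_errors = [
        (a - b) ** 2
        for row_r, row_t in zip(rendered, target)
        for a, b in zip(row_r, row_t)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

A uniform per-pixel error of 0.1 on a unit intensity range corresponds to 20 dB, and each halving of the error adds about 6 dB.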

Several limitations recur across the literature. Under extremely low SNR, severe artifacts, strong motion blur, or pose inaccuracies, feature matchability and cost-volume alignment can still fail (Jiang et al., 26 May 2026). Sparse aerial reconstruction remains vulnerable to extreme occlusions in dense skyscraper clusters and to domain shift in the aerial video prior (Wu et al., 19 May 2026). PanSplat remains limited by dynamic scenes, where moving content induces misaligned Gaussians and ghosting (Zhang et al., 2024). LGTM explicitly depends on geometry quality and still requires manual texture-resolution selection (Lao et al., 26 Mar 2026). More generally, many feed-forward Gaussian models remain tied to static-scene assumptions, fixed training distributions, or pixel-aligned primitive growth.

Current work points in several directions. One is richer condition-aware adaptation: low light, noise, appearance variation, and unfavorable views are already being handled by adapters, residual branches, and explicit appearance latents (Fujimura et al., 23 Apr 2026, Fujimura et al., 30 Jul 2025). Another is controlled generative augmentation that preserves observation-supported geometry, as in AnyCity and ProSplat (Wu et al., 19 May 2026, Lu et al., 9 Jun 2025). A third is semantic enrichment, where language-aligned Gaussian fields support open-vocabulary interaction rather than only rendering (Tian et al., 19 Dec 2025). A plausible implication is that feed-forward Gaussian splatting is evolving from a fast approximation to optimization-based 3DGS into a broader family of explicit scene predictors that unify reconstruction, rendering, compression, and semantic control. A nearby line of work even replaces Gaussians with directly predicted triangle primitives when explicit manifold geometry is required for simulation, underscoring that the feed-forward principle now extends beyond Gaussians themselves (Jinlin et al., 6 Mar 2026).