SPFSplatV2: Self-Supervised 3D Gaussian Splatting
- SPFSplatV2 is a self-supervised framework that jointly predicts 3D Gaussian primitives and camera poses from sparse, unposed multi-view images.
- It uses a unified vision transformer with masked attention and a pixel-wise reprojection loss to ensure robust geometric alignment and rapid inference.
- The method scales to diverse datasets and outperforms pose-dependent approaches in novel view synthesis benchmarks.
SPFSplatV2 is a state-of-the-art feed-forward, self-supervised 3D Gaussian splatting framework for novel view synthesis from sparse multi-view inputs that requires no ground-truth camera poses during either training or inference. From unposed images it predicts both the 3D scene representation, expressed as sets of Gaussian primitives in a canonical coordinate frame, and the corresponding camera poses, offering substantial advantages in scalability, generalization under severe viewpoint variation, and geometric consistency over pose-required approaches and pose-free approaches that still rely on pose supervision.
1. Architectural Overview
SPFSplatV2 reconstructs 3D scenes by inferring two tightly coupled outputs: a set of 3D Gaussian primitives encoding local geometry and appearance, and the corresponding camera poses in a canonical space, typically anchored to the first input view. All predictions are made from raw, unposed multi-view images. At the method's core is a shared vision transformer (ViT) backbone that encodes each view into feature tokens, followed by specialized heads for Gaussian parameter prediction and pose estimation. This unified backbone enables joint optimization of geometric and camera parameters, improving consistency and eliminating the need for explicit pose annotation or Structure-from-Motion pre-processing.
2. Masked Attention and Feed-Forward Pipeline
SPFSplatV2 introduces a masked multi-view attention mechanism within its decoder design. During joint processing of context (geometry-forming) and target (rendering/supervision) views, masked attention restricts context-image tokens to attend only to other context tokens, ensuring that the reconstructed scene geometry is independent of the target views. Target tokens, in contrast, can integrate both context and target information, enabling robust target pose estimation. All stages execute in a fully feed-forward manner, producing the reconstruction and poses in a single pass for rapid inference even with sparse input.
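This masking pattern can be illustrated with a minimal PyTorch-style sketch (hypothetical helper, assuming tokens from all views are concatenated along the sequence dimension; not the released implementation):

```python
import torch

def build_masked_attention_mask(n_ctx: int, n_tgt: int) -> torch.Tensor:
    """Boolean mask where True marks an allowed query-key pair."""
    n = n_ctx + n_tgt
    mask = torch.zeros(n, n, dtype=torch.bool)
    # Context queries attend only to context keys, so the reconstructed
    # geometry cannot depend on the target views.
    mask[:n_ctx, :n_ctx] = True
    # Target queries attend to all tokens (context + target), supporting
    # target pose estimation.
    mask[n_ctx:, :] = True
    return mask

# Example: 2 context views and 1 target view, 256 tokens each.
attn_mask = build_masked_attention_mask(2 * 256, 1 * 256)
# True entries are attended when passed as `attn_mask` to
# torch.nn.functional.scaled_dot_product_attention.
```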
3. Joint Gaussian and Pose Prediction
The feed-forward ViT backbone outputs feature sequences for each input image. These are routed to separate branches for:
- Gaussian Prediction: Generates a collection of 3D Gaussian primitives (centers, covariances, and appearance parameters) in the canonical reference frame.
- Pose Estimation: Predicts each input view's camera pose relative to the reference view.
The two branches share the ViT-extracted features, enabling geometric reasoning that links predicted scene structure and camera placement, significantly improving alignment and robustness compared to modular or two-stage pipelines.
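A minimal sketch of this shared-feature routing follows, with hypothetical module names and an assumed Gaussian and pose parameterization (the released heads may differ):

```python
import torch
import torch.nn as nn

class JointHeads(nn.Module):
    """Routes shared ViT features to a Gaussian head and a pose head."""

    def __init__(self, feat_dim: int = 768):
        super().__init__()
        # Per-token Gaussian parameters, assumed here as 3 (center) +
        # 3 (scale) + 4 (rotation quaternion) + 1 (opacity) + 3 (color) = 14.
        self.gaussian_head = nn.Linear(feat_dim, 14)
        # Per-view pose relative to the reference view, assumed here as
        # 3 (translation) + 4 (rotation quaternion) = 7.
        self.pose_head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.GELU(), nn.Linear(feat_dim, 7)
        )

    def forward(self, tokens: torch.Tensor):
        # tokens: (num_views, tokens_per_view, feat_dim) from the shared backbone.
        gaussians = self.gaussian_head(tokens)   # (V, N, 14) pixel-aligned Gaussians
        pooled = tokens.mean(dim=1)              # (V, feat_dim) per-view summary
        poses = self.pose_head(pooled)           # (V, 7) relative poses
        return gaussians, poses
```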
4. Pixel-Wise Reprojection Loss
SPFSplatV2 leverages a pixel-level reprojection loss to enforce fine-grained geometric correspondence between the predicted Gaussians and the image evidence. For each Gaussian center $\mu_p$ projected into a target image and its corresponding pixel $p$, the reprojection penalty is

$$
\mathcal{L}_{\text{reproj}} = \frac{1}{|\mathcal{P}|} \sum_{p \in \mathcal{P}} \left\lVert \pi\big(K, \hat{T}, \mu_p\big) - p \right\rVert,
$$

where $\pi(K, \hat{T}, \cdot)$ denotes projection via the estimated camera intrinsics $K$ and pose $\hat{T}$, and $\mathcal{P}$ is the set of pixels with an associated Gaussian. This loss directly anchors Gaussians to the observed pixels, suppressing drift and instability and yielding strong geometric constraints without explicit 3D supervision.
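A minimal sketch of such a loss under assumed tensor conventions (pinhole intrinsics $K$, world-to-camera pose estimate $\hat{T}$; not the authors' code):

```python
import torch

def reprojection_loss(centers, K, T, pixels):
    """centers: (N, 3) predicted Gaussian centers in the canonical frame.
    K: (3, 3) estimated intrinsics; T: (4, 4) estimated world-to-camera pose.
    pixels: (N, 2) pixel coordinates the centers should project onto.
    """
    ones = torch.ones(centers.shape[0], 1, device=centers.device)
    cam = (T @ torch.cat([centers, ones], dim=1).T).T[:, :3]  # camera-frame points
    proj = (K @ cam.T).T                                      # homogeneous image coords
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)           # perspective divide
    return (uv - pixels).norm(dim=-1).mean()                  # mean 2D reprojection error
```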
5. Model Variants and Backbone Compatibility
SPFSplatV2 supports multiple architectural variants for its reconstruction pipeline:
| Variant | Decoder Structure | Reference View Handling |
|---|---|---|
| SPFSplatV2 | Asymmetric (MASt3R) | Distinct |
| SPFSplatV2-L | Unified (VGGT) | Shared |
Both variants retain the core approach of joint Gaussian and pose prediction via feed-forward masked attention, but differ in decoder design and reference-view handling. This provides flexibility in trading off computational budget against reconstruction fidelity.
6. Performance Evaluation
Empirical results show SPFSplatV2 achieves state-of-the-art scores on in-domain and out-of-domain novel view synthesis benchmarks, including scenarios with extreme viewpoint changes and limited image overlap. Quality metrics reported include PSNR, SSIM, and LPIPS, indicating superior pixel fidelity, structural similarity, and perceptual alignment over previous pose-required and pose-supervised frameworks. Notably, even with estimated target poses, SPFSplatV2 performs on par with or better than recent methods employing explicit geometric supervision.
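For reference, the reported PSNR metric is a direct function of mean squared error; a minimal sketch for images normalized to [0, 1] (SSIM and LPIPS require dedicated implementations, e.g. the `lpips` package for the latter):

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """PSNR in dB for images normalized to [0, 1]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(1.0 / mse)
```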
7. Scalability and Comparisons
The elimination of ground-truth pose dependency makes SPFSplatV2 highly scalable: large and diverse datasets lacking reliable camera annotations can be leveraged for training, with further gains in robustness and generalization as data volume and viewpoint diversity increase. Compared to recent self-supervised methods such as PF3plat and SelfSplat, which often separate the reconstruction and pose modules, SPFSplatV2's unified pipeline demonstrates enhanced geometric consistency and training stability.
In summary, SPFSplatV2 represents a substantial advance in 3D scene reconstruction via 3D Gaussian splatting from unposed, sparse-view imagery. Its innovations in feed-forward joint reasoning, masked attention, pixel-level geometric supervision, and flexible architectural implementation collectively position the framework as a scalable and high-performing solution for pose-free, self-supervised 3D scene reconstruction and novel view synthesis (Huang et al., 21 Sep 2025).