SPFSplatV2: Self-Supervised 3D Gaussian Splatting
- SPFSplatV2 is a self-supervised framework that jointly predicts 3D Gaussian primitives and camera poses from sparse, unposed multi-view images.
- It uses a unified vision transformer with masked attention and a pixel-wise reprojection loss to ensure robust geometric alignment and rapid inference.
- The method scales to diverse datasets and outperforms pose-dependent approaches in novel view synthesis benchmarks.
SPFSplatV2 is a state-of-the-art feed-forward, self-supervised 3D Gaussian splatting framework for novel view synthesis from sparse multi-view inputs that requires no ground-truth camera poses during either training or inference. From unposed images it predicts both the 3D scene representation, expressed as sets of Gaussian primitives in a canonical coordinate frame, and the corresponding camera poses, offering substantial advantages in scalability, generalization under severe viewpoint variation, and geometric consistency over pose-required approaches and pose-free approaches that still rely on pose supervision.
1. Architectural Overview
SPFSplatV2 reconstructs 3D scenes by inferring two tightly coupled outputs: a set of 3D Gaussian primitives encoding local geometry and appearance, and the corresponding camera poses in a canonical space, typically anchored to the first input view. All predictions are made from raw, unposed multi-view images. At the method's core is a shared vision transformer (ViT) backbone that encodes each view into feature tokens, followed by specialized heads for Gaussian parameter prediction and pose estimation. This unified backbone enables joint optimization of geometric and camera parameters, improving consistency and eliminating the need for explicit pose annotation or Structure-from-Motion pre-processing.
2. Masked Attention and Feed-Forward Pipeline
SPFSplatV2 introduces a masked multi-view attention mechanism within its decoder design. During joint processing of context (geometry-forming) and target (rendering/supervision) views, masked attention restricts context-image tokens to attend only to other context tokens, ensuring that the reconstructed scene geometry is independent of the target views. Target tokens, in contrast, can integrate both context and target information, enabling robust target pose estimation. All stages execute in a fully feed-forward manner, producing the reconstruction and poses in a single pass for rapid inference even with sparse input.
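This masking pattern can be illustrated with a minimal PyTorch-style sketch (hypothetical helper, assuming tokens from all views are concatenated along the sequence dimension; not the released implementation):

```python
import torch

def build_masked_attention_mask(n_ctx: int, n_tgt: int) -> torch.Tensor:
    """Boolean mask where True marks an allowed query-key pair."""
    n = n_ctx + n_tgt
    mask = torch.zeros(n, n, dtype=torch.bool)
    # Context queries attend only to context keys, so the reconstructed
    # geometry cannot depend on the target views.
    mask[:n_ctx, :n_ctx] = True
    # Target queries attend to all tokens (context + target), supporting
    # target pose estimation.
    mask[n_ctx:, :] = True
    return mask

# Example: 2 context views and 1 target view, 256 tokens each.
attn_mask = build_masked_attention_mask(2 * 256, 1 * 256)
# True entries are attended when passed as `attn_mask` to
# torch.nn.functional.scaled_dot_product_attention.
```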
3. Joint Gaussian and Pose Prediction
The feed-forward ViT backbone outputs feature sequences for each input image. These are routed to separate branches for:
- Gaussian Prediction: Generates a collection of 3D Gaussian primitives (centers, covariances, and appearance parameters) in the canonical reference frame.
- Pose Estimation: Predicts each input view's camera pose relative to the reference view.
The two branches share the ViT-extracted features, enabling geometric reasoning that links predicted scene structure and camera placement, significantly improving alignment and robustness compared to modular or two-stage pipelines.
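A minimal sketch of this shared-feature routing follows, with hypothetical module names and an assumed Gaussian and pose parameterization (the released heads may differ):

```python
import torch
import torch.nn as nn

class JointHeads(nn.Module):
    """Routes shared ViT features to a Gaussian head and a pose head."""

    def __init__(self, feat_dim: int = 768):
        super().__init__()
        # Per-token Gaussian parameters, assumed here as 3 (center) +
        # 3 (scale) + 4 (rotation quaternion) + 1 (opacity) + 3 (color) = 14.
        self.gaussian_head = nn.Linear(feat_dim, 14)
        # Per-view pose relative to the reference view, assumed here as
        # 3 (translation) + 4 (rotation quaternion) = 7.
        self.pose_head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.GELU(), nn.Linear(feat_dim, 7)
        )

    def forward(self, tokens: torch.Tensor):
        # tokens: (num_views, tokens_per_view, feat_dim) from the shared backbone.
        gaussians = self.gaussian_head(tokens)   # (V, N, 14) pixel-aligned Gaussians
        pooled = tokens.mean(dim=1)              # (V, feat_dim) per-view summary
        poses = self.pose_head(pooled)           # (V, 7) relative poses
        return gaussians, poses
```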
4. Pixel-Wise Reprojection Loss
SPFSplatV2 leverages a pixel-level reprojection loss to enforce fine-grained geometric correspondence between the predicted Gaussians and the image evidence. For each Gaussian center $\mu_p$ projected into a target image and its corresponding pixel $p$, the reprojection penalty is

$$
\mathcal{L}_{\text{reproj}} = \frac{1}{|\mathcal{P}|} \sum_{p \in \mathcal{P}} \left\lVert \pi\big(K, \hat{T}, \mu_p\big) - p \right\rVert,
$$

where $\pi(K, \hat{T}, \cdot)$ denotes projection via the estimated camera intrinsics $K$ and pose $\hat{T}$, and $\mathcal{P}$ is the set of pixels with an associated Gaussian. This loss directly anchors Gaussians to the observed pixels, suppressing drift and instability and yielding strong geometric constraints without explicit 3D supervision.
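A minimal sketch of such a loss under assumed tensor conventions (pinhole intrinsics $K$, world-to-camera pose estimate $\hat{T}$; not the authors' code):

```python
import torch

def reprojection_loss(centers, K, T, pixels):
    """centers: (N, 3) predicted Gaussian centers in the canonical frame.
    K: (3, 3) estimated intrinsics; T: (4, 4) estimated world-to-camera pose.
    pixels: (N, 2) pixel coordinates the centers should project onto.
    """
    ones = torch.ones(centers.shape[0], 1, device=centers.device)
    cam = (T @ torch.cat([centers, ones], dim=1).T).T[:, :3]  # camera-frame points
    proj = (K @ cam.T).T                                      # homogeneous image coords
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)           # perspective divide
    return (uv - pixels).norm(dim=-1).mean()                  # mean 2D reprojection error
```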
5. Model Variants and Backbone Compatibility
SPFSplatV2 supports multiple architectural variants for its reconstruction pipeline:
| Variant | Decoder Structure | Reference View Handling |
|---|---|---|
| SPFSplatV2 | Asymmetric (MASt3R) | Distinct |
| SPFSplatV2-L | Unified (VGGT) | Shared |
Both variants retain the core approach of joint Gaussian and pose prediction via feed-forward masked attention, but differ in decoder design and reference-view handling. This provides flexibility in trading off computational budget against reconstruction fidelity.
6. Performance Evaluation
Empirical results show SPFSplatV2 achieves state-of-the-art scores on in-domain and out-of-domain novel view synthesis benchmarks, including scenarios with extreme viewpoint changes and limited image overlap. Quality metrics reported include PSNR, SSIM, and LPIPS, indicating superior pixel fidelity, structural similarity, and perceptual alignment over previous pose-required and pose-supervised frameworks. Notably, even with estimated target poses, SPFSplatV2 performs on par with or better than recent methods employing explicit geometric supervision.
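For reference, the reported PSNR metric is a direct function of mean squared error; a minimal sketch for images normalized to [0, 1] (SSIM and LPIPS require dedicated implementations, e.g. the `lpips` package for the latter):

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """PSNR in dB for images normalized to [0, 1]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(1.0 / mse)
```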
7. Scalability and Comparisons
The elimination of ground-truth pose dependency makes SPFSplatV2 highly scalable: large and diverse datasets lacking reliable camera annotations can be leveraged for training, with further gains in robustness and generalization as data volume and viewpoint diversity increase. Compared to recent self-supervised methods such as PF3plat and SelfSplat, which often separate the reconstruction and pose modules, SPFSplatV2's unified pipeline demonstrates enhanced geometric consistency and training stability.
In summary, SPFSplatV2 represents a substantial advance in 3D scene reconstruction via 3D Gaussian splatting from unposed, sparse-view imagery. Its innovations in feed-forward joint reasoning, masked attention, pixel-level geometric supervision, and flexible architectural implementation collectively position the framework as a scalable and high-performing solution for pose-free, self-supervised 3D scene reconstruction and novel view synthesis (Huang et al., 21 Sep 2025).