- The paper presents a self-supervised approach that unifies 3D Gaussian prediction and camera pose estimation without requiring ground-truth poses.
- It leverages a transformer-based architecture with masked attention and learnable pose tokens to achieve state-of-the-art novel view synthesis and 3D reconstruction.
- Experimental results demonstrate superior cross-dataset generalization, efficient training, and high-fidelity 3D geometry even under sparse inputs and extreme viewpoint changes.
SPFSplatV2: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views
Introduction and Motivation
SPFSplatV2 addresses the challenge of 3D scene reconstruction and novel view synthesis (NVS) from sparse, unposed multi-view images, eliminating the need for ground-truth camera poses during both training and inference. This is a significant departure from prior 3D Gaussian Splatting (3DGS) and NeRF-based pipelines, which typically rely on accurate pose information, often obtained via Structure-from-Motion (SfM), a process that is computationally expensive and unreliable in sparse-view or low-overlap scenarios. The method is designed to be scalable, efficient, and robust, enabling the exploitation of large, diverse, and unposed datasets for 3D reconstruction.
Figure 1: Comparison of three typical training pipelines for sparse-view 3D reconstruction in novel view synthesis. SPFSplatV2 (c) eliminates the need for ground-truth poses by leveraging estimated target poses for both reconstruction and rendering loss.
Methodology
Unified Feed-Forward Architecture
SPFSplatV2 employs a unified feed-forward transformer-based architecture with a shared backbone for both 3D Gaussian primitive prediction and camera pose estimation. The network processes N context images and M target images, predicting 3D Gaussians in a canonical space (with the first view as reference) and estimating all camera poses relative to this reference.
Figure 2: Training pipeline of SPFSplatV2. A shared backbone with three specialized heads predicts Gaussian centers, additional Gaussian parameters, and camera poses from unposed images. Masked attention ensures independence between context and target information during Gaussian reconstruction.
Key architectural components include:
- A shared transformer backbone that jointly processes tokens from the context and target views.
- Three specialized prediction heads for Gaussian centers, the additional Gaussian parameters (e.g., opacity, covariance, and color), and camera poses.
- Learnable pose tokens that drive the camera pose estimation.
- Masked attention that keeps the Gaussian reconstruction independent of target-view information.
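The exact network definition is not reproduced here; the following is a minimal PyTorch-style sketch, under assumed module names and tensor shapes, of how a shared backbone, learnable pose tokens, and a context-to-target attention mask could be wired together in a single forward pass.

```python
# Minimal sketch (not the official implementation): a unified feed-forward pass that
# predicts Gaussian parameters and camera poses from N context + M target views.
# Module names, head layouts, and tensor shapes are illustrative assumptions.
import torch
import torch.nn as nn


def build_context_target_mask(n_ctx_tok: int, n_tgt_tok: int) -> torch.Tensor:
    """Boolean attention mask (True = blocked) so that context tokens never attend
    to target tokens, while target tokens may still attend to context tokens."""
    total = n_ctx_tok + n_tgt_tok
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:n_ctx_tok, n_ctx_tok:] = True  # block context -> target attention
    return mask


class UnifiedSparseViewModel(nn.Module):
    """Shared transformer backbone with heads for Gaussian centers, the remaining
    Gaussian parameters, and per-view camera poses (hypothetical layout)."""

    def __init__(self, dim: int = 256, patch: int = 16, n_heads: int = 8, depth: int = 4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pose_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable pose token per view
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.center_head = nn.Linear(dim, 3)             # per-patch Gaussian centers
        self.param_head = nn.Linear(dim, 3 + 4 + 1 + 3)  # scale, rotation, opacity, color
        self.pose_head = nn.Linear(dim, 7)               # translation + quaternion per view

    def forward(self, context: torch.Tensor, target: torch.Tensor):
        # context: (B, N, 3, H, W), target: (B, M, 3, H, W)
        B, N = context.shape[:2]
        M = target.shape[1]

        def tokenize(views):
            v = views.flatten(0, 1)                               # (B*V, 3, H, W)
            tok = self.patch_embed(v).flatten(2).transpose(1, 2)  # (B*V, P, D)
            pose_tok = self.pose_token.expand(tok.shape[0], -1, -1)
            tok = torch.cat([pose_tok, tok], dim=1)               # prepend a pose token
            return tok.reshape(views.shape[0], views.shape[1] * tok.shape[1], -1)

        ctx_tok, tgt_tok = tokenize(context), tokenize(target)
        tokens = torch.cat([ctx_tok, tgt_tok], dim=1)
        mask = build_context_target_mask(ctx_tok.shape[1], tgt_tok.shape[1]).to(tokens.device)
        feat = self.backbone(tokens, mask=mask)

        per_view = ctx_tok.shape[1] // N
        ctx_feat = feat[:, : ctx_tok.shape[1]].reshape(B, N, per_view, -1)
        tgt_feat = feat[:, ctx_tok.shape[1]:].reshape(B, M, per_view, -1)

        # Gaussians are predicted only from context views, in the frame of the first view.
        centers = self.center_head(ctx_feat[:, :, 1:])   # skip the pose token
        params = self.param_head(ctx_feat[:, :, 1:])
        # Poses for all views are read out from their learnable pose tokens.
        poses = self.pose_head(torch.cat([ctx_feat[:, :, 0], tgt_feat[:, :, 0]], dim=1))
        return centers, params, poses
```

Blocking only the context-to-target direction keeps the predicted Gaussians independent of target-view content, while target tokens can still gather the context information needed for pose estimation, which is the role the paper attributes to masked attention.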
Loss Functions and Optimization
- Rendering Loss: Combines L2 and LPIPS losses between the rendered and ground-truth target images, using the predicted target poses (a sketch of the combined objective follows this list).
- Reprojection Loss: Enforces geometric consistency by minimizing the reprojection error between predicted 3D Gaussian centers and their corresponding 2D pixels, using the estimated context poses.
- Multi-View Dropout: Randomly drops intermediate context views during training, improving generalization to varying input configurations.
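The paper's exact loss weights and renderer are not reproduced here; the snippet below is a hedged sketch of the objective described above, assuming the rendered target images come from an external differentiable 3DGS rasterizer, LPIPS from the third-party `lpips` package, and illustrative values for `lambda_lpips` and the view-dropout keep probability.

```python
# Sketch of the self-supervised objective (illustrative, not the official code).
import torch
import torch.nn.functional as F
import lpips

lpips_fn = lpips.LPIPS(net="vgg")  # perceptual metric


def rendering_loss(rendered, target_images, lambda_lpips: float = 0.05):
    """L2 + LPIPS between images rendered under the *predicted* target poses and the
    ground-truth target pixels; both tensors are (B, 3, H, W) in [-1, 1]."""
    l2 = F.mse_loss(rendered, target_images)
    perceptual = lpips_fn(rendered, target_images).mean()
    return l2 + lambda_lpips * perceptual


def project_points(points_world, w2c, K):
    """Pinhole projection of 3D points into a camera.
    points_world: (B, P, 3); w2c: (B, 4, 4) world-to-camera; K: (B, 3, 3)."""
    ones = torch.ones_like(points_world[..., :1])
    cam = torch.einsum("bij,bpj->bpi", w2c, torch.cat([points_world, ones], dim=-1))[..., :3]
    uv = torch.einsum("bij,bpj->bpi", K, cam)
    return uv[..., :2] / uv[..., 2:].clamp(min=1e-6)


def reprojection_loss(centers, pixel_coords, context_w2c_pred, K):
    """Geometric consistency: predicted Gaussian centers, projected with the
    *estimated* context poses, should land on the pixels they originate from."""
    projected = project_points(centers, context_w2c_pred, K)
    return F.l1_loss(projected, pixel_coords)


def drop_intermediate_views(context_views, keep_prob: float = 0.5):
    """Multi-view dropout: randomly drop intermediate context views during training,
    always keeping the first (reference) and last views. context_views: (B, N, 3, H, W)."""
    keep = torch.rand(context_views.shape[1]) < keep_prob
    keep[0] = keep[-1] = True
    return context_views[:, keep]
```

Note that neither loss touches ground-truth poses: the rendering loss uses the predicted target poses and the reprojection loss uses the estimated context poses, which is what makes the pipeline fully self-supervised with respect to pose.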
Model Variants
Two model variants are evaluated in the experiments: SPFSplatV2 and the larger SPFSplatV2-L.
Experimental Results
Novel View Synthesis
SPFSplatV2 and SPFSplatV2-L achieve state-of-the-art results on RE10K and ACID, outperforming both pose-required and pose-free baselines, including those with ground-truth pose supervision. Notably, the models maintain high performance even under minimal input overlap and extreme viewpoint changes.
Figure 5: Qualitative comparison on RE10K and ACID. SPFSplatV2 better handles extreme viewpoint changes, preserves fine details, and reduces ghosting artifacts compared to baselines.
Cross-Dataset Generalization
Both variants generalize robustly to out-of-domain datasets (ACID, DTU, DL3DV, ScanNet++), demonstrating strong zero-shot performance and geometric consistency, even when trained solely on RE10K.
Figure 6: Qualitative comparison on cross-dataset generalization. SPFSplatV2 and SPFSplatV2-L yield more accurate reconstructions than prior methods.
Relative Pose Estimation
SPFSplatV2 achieves superior pose estimation accuracy compared to both classical SfM pipelines and recent learning-based methods, despite the absence of geometric supervision. Both direct regression and PnP-based strategies yield consistent results, indicating strong alignment between predicted poses and reconstructed 3D points.
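The PnP-based strategy can be illustrated by pairing the predicted per-pixel Gaussian centers with their pixel coordinates and handing them to a standard RANSAC-PnP solver. The sketch below uses OpenCV's `solvePnPRansac`; the function name, thresholds, and tensor layout are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative PnP-based pose recovery from predicted Gaussian centers
# (not the official implementation).
import cv2
import numpy as np


def pose_from_gaussian_centers(centers: np.ndarray, pixels: np.ndarray, K: np.ndarray):
    """centers: (P, 3) predicted 3D Gaussian centers in the reference (first-view) frame.
    pixels:  (P, 2) corresponding pixel coordinates in the query view.
    K:       (3, 3) camera intrinsics.
    Returns a 4x4 world-to-camera matrix for the query view, or None on failure."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        centers.astype(np.float64),
        pixels.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
        reprojectionError=4.0,   # inlier threshold in pixels (illustrative)
        iterationsCount=100,
        flags=cv2.SOLVEPNP_ITERATIVE,
    )
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)   # axis-angle -> rotation matrix
    w2c = np.eye(4)
    w2c[:3, :3] = R
    w2c[:3, 3] = tvec.squeeze()
    return w2c
```

Agreement between this kind of geometric solve and the directly regressed poses is what the evaluation cites as evidence that the predicted poses and the reconstructed 3D points are well aligned.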
3D Geometry Quality
The method produces high-fidelity 3D Gaussians and sharper renderings, with improved structural accuracy and reduced artifacts relative to prior approaches.
Figure 7: Comparison of 3D Gaussians and rendered results. SPFSplatV2 produces higher-quality 3D Gaussians and better renderings than the baselines.
Scalability and Efficiency
- Inference: Runtime and FLOPs comparable to or lower than prior pose-free methods, with significant speedups over methods that rely on explicit geometric operations.
- Training: Masked attention reduces training FLOPs and memory usage compared to dual-branch designs.
- Scalability: The framework scales efficiently to large, unposed datasets, with performance improving as training data increases.
Ablation Studies
- Masked Attention: Reduces computational cost and improves geometric alignment.
- Learnable Pose Tokens: Enhance pose estimation accuracy.
- Reprojection Loss: Critical for stable training and accurate geometry.
- Intrinsic Embedding: Improves scale alignment but is not essential for strong performance.
- Initialization: Pretraining on MASt3R or VGGT is beneficial; random initialization leads to a performance drop but does not preclude learning.
Real-World and Failure Cases
SPFSplatV2 generalizes to in-the-wild mobile phone images without intrinsics, but failure cases occur in occluded, textureless, or highly ambiguous regions.
Figure 8: 3D Gaussians and rendered images reconstructed from smartphone captures without known intrinsics.
Figure 9: Failure cases of SPFSplatV2. Blurriness and artifacts occur in occluded or texture-less regions and under extreme viewpoint changes.
Implications and Future Directions
SPFSplatV2 demonstrates that self-supervised, pose-free 3D Gaussian splatting is feasible and effective for sparse-view NVS, even in the absence of any geometric supervision. The approach enables scalable 3D reconstruction from large, unposed, and diverse datasets, removing a major bottleneck in data collection and annotation. The joint optimization of geometry and pose within a unified architecture leads to improved geometric consistency and stability.
However, the method still benefits from pretrained geometric priors, and its non-generative nature limits the reconstruction of unseen regions. Future work should explore integrating generative models for hallucinating occluded content, further improving robustness to ambiguous or textureless regions, and leveraging even larger and more diverse datasets for enhanced generalization.
Conclusion
SPFSplatV2 establishes a new paradigm for pose-free 3D scene reconstruction and novel view synthesis from sparse, unposed images. By unifying scene and pose estimation in a single feed-forward transformer with masked attention and geometric constraints, it achieves state-of-the-art performance in both in-domain and out-of-domain settings, with strong efficiency and scalability. The framework's independence from ground-truth poses paves the way for large-scale, real-world 3D reconstruction and generalizable NVS, with broad implications for computer vision, robotics, and AR/VR applications.