PF3plat: Pose-Free Feed-Forward 3D Gaussian Splatting (2410.22128v1)

Published 29 Oct 2024 in cs.CV

Abstract: We consider the problem of novel view synthesis from unposed images in a single feed-forward pass. Our framework capitalizes on the fast speed, scalability, and high-quality 3D reconstruction and view synthesis capabilities of 3DGS, which we extend to offer a practical solution that relaxes common assumptions such as dense image views, accurate camera poses, and substantial image overlaps. We achieve this by identifying and addressing the unique challenges that arise from the use of pixel-aligned 3DGS: misaligned 3D Gaussians across different views induce noisy or sparse gradients that destabilize training and hinder convergence, especially when the above assumptions are not met. To mitigate this, we employ pre-trained monocular depth estimation and visual correspondence models to achieve coarse alignment of the 3D Gaussians. We then introduce lightweight, learnable modules to refine the depth and pose estimates from the coarse alignment, improving the quality of 3D reconstruction and novel view synthesis. Furthermore, the refined estimates are leveraged to estimate geometry confidence scores, which assess the reliability of the 3D Gaussian centers and condition the prediction of Gaussian parameters accordingly. Extensive evaluations on large-scale real-world datasets demonstrate that PF3plat sets a new state of the art across all benchmarks, supported by comprehensive ablation studies validating our design choices.

Summary

  • The paper introduces a pose-free, feed-forward 3D Gaussian splatting method that mitigates misalignment issues through pre-trained depth and visual correspondence models.
  • The paper incorporates lightweight refinement modules to enhance coarse alignment and improve the accuracy of depth and pose estimates.
  • The paper demonstrates state-of-the-art performance on benchmarks like RealEstate10K and ACID, enabling real-time novel view synthesis in challenging scenarios.

Overview of PF3plat: Pose-Free Feed-Forward 3D Gaussian Splatting

This paper addresses the problem of novel view synthesis from unposed images using a framework named PF3plat, which leverages 3D Gaussian Splatting (3DGS) to enhance 3D reconstruction and view synthesis without relying on dense image views, accurate camera poses, or significant image overlaps. The authors propose a method that incorporates depth and pose estimation refinements, offering a robust pose-free, feed-forward framework.
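
To make the overall flow concrete, here is a schematic sketch of how such a pose-free feed-forward pipeline fits together. It is not the authors' implementation: the module names (`depth_net`, `matcher`, `coarse_align`, `refiner`, `gaussian_head`, `renderer`) and their interfaces are hypothetical placeholders, injected as parameters so the sketch stays self-contained.

```python
def pf3plat_forward(images, target_pose, depth_net, matcher,
                    coarse_align, refiner, gaussian_head, renderer):
    """Single feed-forward pass from unposed images to a novel view.

    The heavy priors (depth_net, matcher) are frozen pre-trained models;
    only the lightweight refiner and gaussian_head are learned.
    All module names here are hypothetical placeholders.
    """
    # 1. Per-view monocular depth from a frozen pre-trained estimator.
    depths = [depth_net(img) for img in images]

    # 2. Dense 2D correspondences between views from a frozen matcher.
    matches = matcher(images)

    # 3. Coarse relative poses: back-project matches with the depths and
    #    rigidly align them (see the alignment sketch after the
    #    contributions list below).
    poses = coarse_align(matches, depths)

    # 4. Lightweight learnable refinement of the depth and pose estimates.
    depths, poses = refiner(images, depths, poses)

    # 5. Pixel-aligned Gaussian parameters (centers, opacity, covariance,
    #    color), conditioned on geometry confidence scores.
    gaussians = gaussian_head(images, depths, poses)

    # 6. Splat the Gaussians into the requested novel viewpoint.
    return renderer(gaussians, target_pose)
```

The property this sketch highlights is that, unlike per-scene optimization methods, nothing is optimized at test time: a single forward pass yields depths, poses, Gaussians, and the rendered view.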

Key Contributions

The authors identify and resolve challenges posed by pixel-aligned 3DGS, in particular misaligned 3D Gaussian centers across views, which produce noisy or sparse gradients that destabilize training. To mitigate these challenges, the framework incorporates several key components:

  1. Coarse Alignment Using Pre-trained Models: Pre-trained monocular depth estimation and visual correspondence models provide an initial coarse alignment that stabilizes learning. This step is critical for promoting convergence in the absence of ground-truth camera poses (a minimal alignment sketch follows this list).
  2. Refinement Modules: Lightweight, learnable modules refine the coarse alignments, improving the accuracy of the depth and pose estimates. These modules exploit geometric cues without requiring extensive computational resources, keeping both training and inference efficient.
  3. Geometry Confidence Scores: By estimating geometry confidence scores, the framework can assess the reliability of 3D Gaussian centers, conditioning the prediction of Gaussian parameters such as opacity, covariance, and color.
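
To make item 1 concrete, below is a minimal sketch of one standard way to obtain such a coarse alignment: matched pixels are back-projected to 3D using the monocular depths, and the relative pose is recovered by a confidence-weighted Procrustes (Kabsch) alignment. This is an illustrative reconstruction, not the authors' code; the function name `coarse_align` and its interface are assumptions.

```python
import torch

def coarse_align(pts_a, pts_b, weights=None):
    """Confidence-weighted rigid alignment (Kabsch/Procrustes). Hypothetical sketch.

    pts_a, pts_b: (N, 3) matched 3D points, obtained by back-projecting
                  2D correspondences with the monocular depth maps.
    weights:      optional (N,) match-confidence weights.
    Returns R (3, 3) and t (3,) such that pts_b ~= pts_a @ R.T + t.
    """
    if weights is None:
        weights = torch.ones(pts_a.shape[0])
    w = weights / weights.sum()

    # Weighted centroids and centered point sets.
    ca = (w[:, None] * pts_a).sum(dim=0)
    cb = (w[:, None] * pts_b).sum(dim=0)
    A = pts_a - ca
    B = pts_b - cb

    # Weighted cross-covariance between the two point sets, then SVD.
    H = (A * w[:, None]).T @ B
    U, S, Vt = torch.linalg.svd(H)

    # Reflection fix keeps the result a proper rotation (det = +1).
    d = torch.sign(torch.linalg.det(Vt.T @ U.T))
    D = torch.diag(torch.tensor([1.0, 1.0, d.item()]))
    R = Vt.T @ D @ U.T
    t = cb - R @ ca
    return R, t
```

In the full pipeline, `pts_a` and `pts_b` would come from back-projecting the matcher's correspondences with the estimated depths, and `weights` could incorporate the geometry confidence scores of item 3, so that unreliable Gaussian centers contribute less to the pose estimate.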

Empirical Evaluation

PF3plat sets a new state of the art across benchmarks including RealEstate10K and ACID, demonstrating superior performance in challenging scenarios with sparse image overlap or unposed imagery. Extensive evaluations highlight the benefits of this approach, and ablation studies confirm the efficacy of the proposed design choices.

Implications and Future Work

The proposed PF3plat framework marks a notable advance in 3D novel view synthesis, particularly in scenarios where conventional photogrammetry is infeasible due to sparse views or missing camera poses. Its single feed-forward design makes it attractive for real-time applications and practical deployment.

Theoretically, the framework expands the applicability of 3D Gaussian representations, showing that accurate alignment and rendering can be achieved even when conventional assumptions are relaxed. It paves the way for future exploration into more complex scene reconstructions where obtaining camera poses is either challenging or impossible.

Future advancements may focus on enhancing the scalability and robustness of PF3plat for dynamic and large-scale environments. Moreover, integrating more sophisticated machine learning models for depth or visual correspondence could further improve its performance and extend its applicability to complex scenes, enriching the field of AI-driven visual computing.

Overall, the PF3plat framework is a significant contribution to the AI and computer vision landscape, providing practical solutions for novel view synthesis while posing intriguing questions for future research.