Epipolar-Free 3D Gaussian Splatting for Generalizable Novel View Synthesis (2410.22817v2)

Published 30 Oct 2024 in cs.CV

Abstract: Generalizable 3D Gaussian splatting (3DGS) can reconstruct new scenes from sparse-view observations in a feed-forward inference manner, eliminating the need for scene-specific retraining required in conventional 3DGS. However, existing methods rely heavily on epipolar priors, which can be unreliable in complex real-world scenes, particularly in non-overlapping and occluded regions. In this paper, we propose eFreeSplat, an efficient feed-forward 3DGS-based model for generalizable novel view synthesis that operates independently of epipolar line constraints. To enhance multiview feature extraction with 3D perception, we employ a self-supervised Vision Transformer (ViT) with cross-view completion pre-training on large-scale datasets. Additionally, we introduce an Iterative Cross-view Gaussians Alignment method to ensure consistent depth scales across different views. Our eFreeSplat represents an innovative approach for generalizable novel view synthesis. Different from the existing pure geometry-free methods, eFreeSplat focuses more on achieving epipolar-free feature matching and encoding by providing 3D priors through cross-view pretraining. We evaluate eFreeSplat on wide-baseline novel view synthesis tasks using the RealEstate10K and ACID datasets. Extensive experiments demonstrate that eFreeSplat surpasses state-of-the-art baselines that rely on epipolar priors, achieving superior geometry reconstruction and novel view synthesis quality. Project page: https://tatakai1.github.io/efreesplat/.


Summary

  • The paper presents eFreeSplat, an epipolar-free model that leverages a self-supervised Vision Transformer to bypass traditional epipolar geometry for robust novel view synthesis.
  • It employs an iterative cross-view Gaussians alignment technique to harmonize depth scales, reducing artifacts and improving rendering quality in sparse, non-overlapping views.
  • Evaluations on the RealEstate10K and ACID datasets show that eFreeSplat outperforms conventional epipolar-based methods in PSNR and SSIM while remaining computationally competitive.

Overview of Epipolar-Free 3D Gaussian Splatting for Generalizable Novel View Synthesis

The paper "Epipolar-Free 3D Gaussian Splatting for Generalizable Novel View Synthesis" presents an innovative approach to improving the quality and generalizability of novel view synthesis (NVS) by leveraging a novel method that eschews reliance on epipolar geometry. The proposed model, named eFreeSplat, addresses critical shortcomings in existing 3D Gaussian Splatting (3DGS) techniques that depend on epipolar priors. This reliance often fails under real-world conditions with sparse views, occluded regions, and non-overlapping images. By integrating a self-supervised learned Vision Transformer (ViT) for multiview feature extraction, eFreeSplat provides an efficient and effective solution for generalizable NVS tasks.

The research addresses several theoretical and practical gaps in 3D vision and rendering. Notably, it challenges the conventional dependency on epipolar geometry for determining pixel correspondences across images, a mechanism that proves fragile under non-ideal conditions. By utilizing a ViT backbone pre-trained on large-scale data for cross-view completion, eFreeSplat synthesizes novel views without degrading quality in challenging scenarios, such as when images have minimal overlap or are taken from vastly different angles.
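
For context, the constraint such methods exploit is standard two-view geometry: a pixel $\mathbf{x}_1$ in one image and its correspondence $\mathbf{x}_2$ in another must satisfy $\mathbf{x}_2^\top F \mathbf{x}_1 = 0$, where $F$ is the fundamental matrix, so candidate matches are searched only along the epipolar line $F\mathbf{x}_1$. When the true correspondence is occluded or falls outside the shared field of view, no valid match exists on that line, which is exactly the failure mode eFreeSplat is designed to sidestep.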

Methodological Advancements

  1. Epipolar-Free Approach: The core advancement is the departure from the epipolar line constraints traditionally used to establish geometric correspondences between images. Instead, a ViT-based feature extractor enables the model to infer consistent 3D structural information by analyzing cross-view features at a high level, making the method robust in scenarios where epipolar priors are ineffective.
  2. Iterative Cross-View Gaussians Alignment Method: A second component is the Iterative Cross-view Gaussians Alignment (ICGA) technique, designed to harmonize depth scales across multiple views. By iteratively refining the Gaussians' attributes through a feedback loop over warped view features, the method reduces discrepancies in the scale of predicted depth maps. This yields more accurate rendering and fewer of the artifacts commonly introduced by scale inconsistencies when aggregating multiview information (a minimal sketch of this refinement loop follows the list).
  3. Self-Supervised Pre-training: By employing a self-supervised Vision Transformer pre-trained on extensive cross-view data, eFreeSplat naturally incorporates 3D priors without explicit geometric constraints. The pre-training provides a robust understanding of global spatial relations across views, which is crucial for overcoming the challenges posed by non-overlapping and occluded regions.
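
To make the alignment idea concrete, the following is a minimal PyTorch sketch of an iterative depth-refinement loop in the spirit of ICGA. Every name, shape, and the identity warp placeholder here is an illustrative assumption rather than the authors' implementation; a real system would reproject source-view features into the reference view using camera poses and the current depth estimate.

    import torch
    import torch.nn as nn

    class IterativeAligner(nn.Module):
        """Toy cross-view depth refiner; names and shapes are assumptions."""
        def __init__(self, feat_dim=64, iters=3):
            super().__init__()
            self.iters = iters
            # Maps concatenated (reference, warped) features to a depth residual.
            self.refine = nn.Sequential(
                nn.Conv2d(2 * feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
                nn.Conv2d(feat_dim, 1, 3, padding=1),
            )

        def forward(self, feats_ref, feats_src, depth_ref, warp_fn):
            # feats_*: (B, C, H, W) per-view features; depth_ref: (B, 1, H, W).
            for _ in range(self.iters):
                # Warp source-view features into the reference view using the
                # current depth estimate (placeholder warp in the demo below).
                warped = warp_fn(feats_src, depth_ref)
                # Predict a residual update from the feature mismatch and refine.
                depth_ref = depth_ref + self.refine(
                    torch.cat([feats_ref, warped], dim=1))
            return depth_ref

    # Demo with an identity "warp" placeholder.
    B, C, H, W = 1, 64, 32, 32
    aligner = IterativeAligner(feat_dim=C)
    depth = aligner(torch.randn(B, C, H, W), torch.randn(B, C, H, W),
                    torch.ones(B, 1, H, W), warp_fn=lambda f, d: f)
    print(depth.shape)  # torch.Size([1, 1, 32, 32])

The key design point the sketch illustrates is that depth is updated from feature disagreement after warping, rather than from matches constrained to an epipolar line.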

Numerical Results and Evaluation

eFreeSplat was evaluated against leading state-of-the-art approaches such as pixelSplat and MVSplat on wide-baseline NVS tasks using the RealEstate10K and ACID datasets. The results demonstrate superior performance, with higher geometric fidelity and rendering quality:

  • The model achieved significant improvements over epipolar-based models, as reflected in metrics such as PSNR and SSIM (a brief example of computing these metrics follows this list).
  • It reduced artifacts and inaccuracies in both depth and color reconstructions, particularly effective in scenes with minimal reference input overlap.
  • In terms of computational efficiency, eFreeSplat provided competitive rendering times, showing promise for real-time applications.
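
For readers unfamiliar with these metrics, the snippet below computes PSNR and SSIM with scikit-image. The random arrays are placeholders standing in for a rendered view and its ground-truth target; nothing here is specific to eFreeSplat.

    import numpy as np
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    rng = np.random.default_rng(0)
    # Placeholder ground-truth view in [0, 1] and a slightly noisy "rendering".
    target = rng.random((256, 256, 3)).astype(np.float32)
    render = np.clip(
        target + 0.05 * rng.standard_normal(target.shape).astype(np.float32),
        0.0, 1.0)

    psnr = peak_signal_noise_ratio(target, render, data_range=1.0)
    # channel_axis tells SSIM which axis holds the RGB channels.
    ssim = structural_similarity(target, render, channel_axis=-1, data_range=1.0)
    print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}")

Higher is better for both: PSNR measures per-pixel error on a log scale (in dB), while SSIM measures local structural agreement on a scale of 0 to 1.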

Implications and Future Directions

The implications of eFreeSplat are far-reaching. Practically, it simplifies deploying NVS systems in unconstrained environments, such as augmented reality or autonomous driving, where input images often have irregular overlaps and obstructions. Theoretically, it proposes a shift towards data-driven geometric understanding, emphasizing feature-based correspondences over hard-coded geometric rules.

Future research could explore expanding the training datasets to enhance the model's robustness further, possibly integrating multi-modal data inputs such as LiDAR alongside visual data to enrich scene understanding. Additionally, exploring the interplay between this model's architecture and generative capabilities of diffusion models could unlock new paradigms in 3D scene synthesis and manipulation.

In conclusion, eFreeSplat marks a promising advance in the field of 3D novel view synthesis with its epipolar-free methodology, opening avenues for more robust and versatile applications in AI and computer vision.
