
FreeSplat: Generalizable 3D Gaussian Splatting Towards Free-View Synthesis of Indoor Scenes (2405.17958v3)

Published 28 May 2024 in cs.CV

Abstract: Empowering 3D Gaussian Splatting with generalization ability is appealing. However, existing generalizable 3D Gaussian Splatting methods are largely confined to narrow-range interpolation between stereo images due to their heavy backbones, and thus lack the ability to accurately localize 3D Gaussians and support free-view synthesis across a wide view range. In this paper, we present FreeSplat, a novel framework capable of reconstructing geometrically consistent 3D scenes from long sequence inputs for free-view synthesis. Specifically, we first introduce Low-cost Cross-View Aggregation, achieved by constructing adaptive cost volumes among nearby views and aggregating features using a multi-scale structure. Subsequently, we present Pixel-wise Triplet Fusion to eliminate the redundancy of 3D Gaussians in overlapping view regions and to aggregate features observed across multiple views. Additionally, we propose a simple but effective free-view training strategy that ensures robust view synthesis across a broader view range regardless of the number of views. Our empirical results demonstrate state-of-the-art novel view synthesis performance in both the quality of rendered color maps and the accuracy of rendered depth maps across different numbers of input views. We also show that FreeSplat performs inference more efficiently and can effectively reduce redundant Gaussians, offering the possibility of feed-forward large scene reconstruction without depth priors.

References (50)
  1. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. arXiv preprint arXiv:2312.12337, 2023.
  2. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. arXiv preprint arXiv:2403.14627, 2024.
  3. Nerf: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pages 405–421. Springer, 2020.
  4. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021.
  5. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4):1–15, 2022.
  6. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5470–5479, 2022.
  7. Ibrnet: Learning multi-view image-based rendering. In CVPR, 2021.
  8. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14124–14133, 2021.
  9. Neural rays for occlusion-aware image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7824–7833, 2022.
  10. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1–14, 2023.
  11. Mip-splatting: Alias-free 3d gaussian splatting. arXiv preprint arXiv:2311.16493, 2023.
  12. Octree-gs: Towards consistent real-time rendering with lod-structured 3d gaussians. arXiv preprint arXiv:2403.17898, 2024.
  13. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713, 2023.
  14. latentsplat: Autoencoding variational gaussians for fast generalizable 3d reconstruction. arXiv preprint arXiv:2403.16292, 2024.
  15. Ggrt: Towards generalizable 3d gaussians without pose priors in real-time. arXiv preprint arXiv:2403.10147, 2024.
  16. Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers. arXiv preprint arXiv:2312.09147, 2023.
  17. Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. arXiv preprint arXiv:2312.02155, 2023.
  18. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.
  19. The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019.
  20. Neural volumes: Learning dynamic renderable volumes from images. arXiv preprint arXiv:1906.07751, 2019.
  21. Deepvoxels: Learning persistent 3d feature embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2437–2446, 2019.
  22. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018.
  23. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5855–5864, 2021.
  24. Zip-nerf: Anti-aliased grid-based neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19697–19705, 2023.
  25. 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528, 2023.
  26. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021.
  27. Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5480–5490, 2022.
  28. Convolutional occupancy networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 523–540. Springer, 2020.
  29. Neuralrecon: Real-time coherent 3d reconstruction from monocular video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15598–15607, 2021.
  30. Vortx: Volumetric 3d reconstruction with transformers for voxelwise view selection and fusion. In 2021 International Conference on 3D Vision (3DV), pages 320–330. IEEE, 2021.
  31. Simplerecon: 3d reconstruction without 3d convolutions. In European Conference on Computer Vision, pages 1–19. Springer, 2022.
  32. Nice-slam: Neural implicit scalable encoding for slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12786–12796, 2022.
  33. Gs-slam: Dense visual slam with 3d gaussian splatting. arXiv preprint arXiv:2311.11700, 2023.
  34. Splatam: Splat, track & map 3d gaussians for dense rgb-d slam. arXiv preprint arXiv:2312.02126, 2023.
  35. Volume rendering of neural implicit surfaces. Advances in Neural Information Processing Systems, 34:4805–4815, 2021.
  36. Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction. Advances in neural information processing systems, 35:25018–25032, 2022.
  37. Neuralangelo: High-fidelity neural surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8456–8465, 2023.
  38. Surfelnerf: Neural surfel radiance fields for online photorealistic reconstruction of indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 108–118, 2023.
  39. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  40. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  41. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  42. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019.
  43. Robert T Collins. A space-sweep approach to true multi-image matching. In Proceedings CVPR IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 358–363. IEEE, 1996.
  44. Dpsnet: End-to-end deep plane sweep stereo. arXiv preprint arXiv:1905.00538, 2019.
  45. Deepvideomvs: Multi-view stereo on video with recurrent spatio-temporal fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15324–15333, 2021.
  46. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, pages 3–11. Springer, 2018.
  47. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  48. Nerfusion: Fusing radiance fields for large-scale scene reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5449–5458, 2022.
  49. In-place scene labelling and understanding with implicit scene representation. 2021.
  50. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Summary

  • The paper introduces FreeSplat, a framework that reconstructs consistent 3D Gaussians from long image sequences to enable free-view indoor scene synthesis.
  • It leverages low-cost cross-view aggregation, pixel-wise triplet fusion, and free-view training to optimize feature alignment and eliminate redundancy.
  • Empirical results on ScanNet and Replica show marked PSNR improvements and more accurate novel-view depth rendering, demonstrating broad applicability.

FreeSplat: Generalizable 3D Gaussian Splatting for Real-Time Long Sequence Reconstruction and Free-View Synthesis

Introduction

The paper presents FreeSplat, a framework that addresses the limitations of existing generalizable 3D Gaussian Splatting (3DGS) methods. Its primary contributions center on reconstructing geometrically consistent 3D Gaussians from long sequences of input images and supporting free-view synthesis across a wide range of viewpoints. Unlike previous methods constrained to narrow-range interpolation between stereo images, FreeSplat is designed to localize 3D Gaussians efficiently and accurately and to handle extensive view ranges.

Technical Overview

The proposed FreeSplat framework consists of three core components: Low-cost Cross-View Aggregation, Pixel-wise Triplet Fusion (PTF), and a novel Free-View Training (FVT) strategy. Each of these components contributes to FreeSplat's capability to localize 3D Gaussians accurately and synthesize views from arbitrary poses.

  1. Low-cost Cross-View Aggregation:
    • This component involves constructing adaptive cost volumes among nearby views and aggregating features using a multi-scale structure.
    • It employs efficient CNN-based backbones to balance feature extraction and matching with computational feasibility.
    • The method integrates pose information by building cost volumes between nearby views, yielding broader receptive fields and robust feature aggregation (see the first sketch after this list).
  2. Pixel-wise Triplet Fusion (PTF):
    • PTF is used to eliminate redundancy of 3D Gaussians in overlapping view regions and to aggregate features observed across multiple views. This is particularly crucial for real-time rendering and reducing computational load.
    • A pixel-wise alignment strategy that matches local and global Gaussian triplets facilitates this fusion (see the second sketch after this list).
    • The approach progressively integrates Gaussian triplets under geometric constraints with learnable feature updates, ensuring efficient aggregation of observations across multiple views.
  3. Free-View Training (FVT):
    • The FVT strategy decouples generalizable 3DGS performance from the specific number of input views by supervising rendered images in a broader view-interpolation setting (a sampling sketch follows this list).
    • This training strategy ensures robust view synthesis across broader view ranges, contributing significantly to novel view depth rendering accuracy.
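
To make the cost-volume idea concrete, below is a minimal PyTorch sketch of plane-sweep feature matching between two nearby views. The function name, tensor shapes, and the fixed set of depth planes are illustrative assumptions; FreeSplat's adaptive cost volumes and multi-scale aggregation are more involved than this.

```python
# Hedged sketch: plane-sweep cost volume between a reference and a nearby
# source view. Shapes and the fixed depth planes are assumptions for
# illustration, not the paper's implementation.
import torch
import torch.nn.functional as F

def plane_sweep_cost_volume(ref_feat, src_feat, K, rel_pose, depths):
    """ref_feat, src_feat: (B, C, H, W) features of two nearby views.
    K: (B, 3, 3) intrinsics; rel_pose: (B, 4, 4) ref-to-src transform;
    depths: (D,) candidate depth planes (an adaptive per-pair range
    would replace this fixed set in the paper's framework)."""
    B, C, H, W = ref_feat.shape
    D = depths.numel()
    # Homogeneous pixel grid of the reference view: (B, 3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()
    pix = pix.view(1, 3, -1).expand(B, -1, -1)
    cam = torch.linalg.inv(K) @ pix  # back-projected rays in the ref camera
    cost = ref_feat.new_zeros(B, D, H, W)
    for i, d in enumerate(depths):
        # Lift pixels to depth d, move them into the source camera, project
        pts = rel_pose[:, :3, :3] @ (cam * d) + rel_pose[:, :3, 3:]
        uv = K @ pts
        uv = uv[:, :2] / uv[:, 2:].clamp(min=1e-6)
        # Normalize to [-1, 1] and sample source features at those locations
        grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                            uv[:, 1] / (H - 1) * 2 - 1], dim=-1).view(B, H, W, 2)
        warped = F.grid_sample(src_feat, grid, align_corners=True)
        # Correlation between reference and warped source features
        cost[:, i] = (ref_feat * warped).mean(dim=1)
    return cost  # (B, D, H, W), fed to a lightweight CNN in a full pipeline
```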
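The fusion step can be pictured as a per-pixel gate: project the global Gaussians already in the scene into the current view, and wherever one lands on a pixel whose newly predicted (local) depth agrees, blend the pair instead of keeping both. The depth threshold, the fixed blend weight, and the flat tensor layout below are simplifying assumptions; the paper uses learnable feature updates over local/global Gaussian triplets.

```python
# Hedged sketch of pixel-wise fusion of local and global Gaussians.
# tau and the fixed weight w stand in for the paper's learnable updates.
import torch

def pixelwise_triplet_fusion(local_mu, local_f, local_depth,
                             global_mu, global_f, global_pix, global_depth,
                             tau=0.1, w=0.5):
    """local_mu (P, 3), local_f (P, C), local_depth (P,): one Gaussian per
    pixel predicted from the current view (P = H*W).
    global_mu (N, 3), global_f (N, C): Gaussians accumulated so far;
    global_pix (N,), global_depth (N,): pixel index and depth of each
    global Gaussian projected into the current view."""
    # Geometric gate: a global Gaussian matches the local Gaussian at its
    # pixel only if their depths agree within tau. (If several globals hit
    # one pixel, the last write wins in this simplified sketch.)
    close = (global_depth - local_depth[global_pix]).abs() < tau
    idx = global_pix[close]
    # Fuse matched pairs into the local set instead of duplicating them
    local_mu[idx] = w * local_mu[idx] + (1.0 - w) * global_mu[close]
    local_f[idx] = w * local_f[idx] + (1.0 - w) * global_f[close]
    # Unmatched global Gaussians survive unchanged; redundancy in
    # overlapping regions is removed.
    return (torch.cat([local_mu, global_mu[~close]], dim=0),
            torch.cat([local_f, global_f[~close]], dim=0))
```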
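Finally, the gist of free-view training can be sketched as a sampling rule: draw a variable number of input views from a long sequence and supervise at a target view anywhere inside the range they span, so training never commits to a fixed view count or a narrow interpolation gap. The sampling scheme below is an assumption for illustration, not the paper's exact protocol.

```python
# Hedged sketch of a free-view training sampler; the view-count bounds
# and target-selection rule are illustrative assumptions.
import random

def sample_fvt_batch(num_frames, min_views=2, max_views=8):
    """Pick a variable number of input views from a sequence of
    num_frames frames (num_frames must be >= max_views), plus a target
    view anywhere inside the range the inputs span."""
    n_views = random.randint(min_views, max_views)
    inputs = sorted(random.sample(range(num_frames), n_views))
    # Target may be any unobserved frame between the first and last input
    candidates = [i for i in range(inputs[0], inputs[-1] + 1)
                  if i not in inputs]
    target = random.choice(candidates) if candidates else inputs[0]
    return inputs, target
```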

Empirical Results

The empirical evaluation of FreeSplat was conducted on the ScanNet and Replica datasets. Various experimental settings, including 2-view, 3-view, and 10-view sequences, were employed to assess the performance comprehensively.

  • View Interpolation on ScanNet:
    • FreeSplat significantly outperformed existing methods, with PSNR improvements of over 1.67 dB in the 2-view setting and 1.48 dB in the 3-view setting.
    • In terms of efficiency, FreeSplat exhibited a much faster inference speed compared to NeuRay while delivering superior image synthesis quality.
  • Long Sequence Reconstruction on ScanNet:
    • FreeSplat demonstrated marked improvements in both view interpolation and view extrapolation when provided with 10 input views. Results showed a PSNR gain of over 1.80 dB compared to previous 3DGS methods.
    • The FVT strategy provided an additional boost, enhancing PSNR by over 1.85 dB compared to the FreeSplat model trained without FVT.
  • Novel View Depth Rendering:
    • FreeSplat achieved markedly better depth accuracy on novel views than competing methods: the δ < 1.25 inlier ratio rose by over 27.0%, highlighting the framework's capacity to support accurate unsupervised depth estimation (both metrics are defined in the sketch below).
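
For reference, here is a minimal sketch of the two headline metrics: PSNR for rendered color and the δ < 1.25 inlier ratio for rendered depth. These are the standard definitions; the masking convention is common practice rather than an implementation detail confirmed by the paper.

```python
# Standard metric definitions used for novel view color and depth quality.
import torch

def psnr(pred, target):
    """pred, target: images in [0, 1], e.g. (B, 3, H, W)."""
    mse = torch.mean((pred - target) ** 2)
    return -10.0 * torch.log10(mse)

def delta_125(pred_depth, gt_depth):
    """Fraction of valid pixels where max(pred/gt, gt/pred) < 1.25;
    pixels with no ground-truth depth are masked out (common practice)."""
    valid = gt_depth > 0
    ratio = torch.maximum(pred_depth[valid] / gt_depth[valid],
                          gt_depth[valid] / pred_depth[valid])
    return (ratio < 1.25).float().mean()
```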

Zero-Shot Transfer to Replica

FreeSplat's generalizability was further validated through zero-shot evaluations on the Replica dataset, where the framework maintained superior performance in view interpolation and novel view depth rendering. Despite some performance degradation in long sequence reconstruction due to the domain gap and depth estimation inaccuracies, FreeSplat's flexible architecture and FVT strategy provide a strong foundation for future improvements in cross-domain applications.

Conclusion and Future Directions

FreeSplat contributes significantly to the field of 3D scene reconstruction and novel view synthesis. Its efficient feature aggregation, redundancy elimination, and adaptable training strategy offer a compelling solution for real-time rendering and large scene reconstruction. Future research may explore enhancements in zero-shot depth estimation and further optimizations to reduce computational overhead while maintaining high visual fidelity across diverse datasets. Additional studies could also focus on integrating depth priors or leveraging advanced neural architectures to improve cross-domain generalizability and real-time performance.

Implications

FreeSplat's advancements have noteworthy implications for various applications, including virtual reality, augmented reality, and photorealistic scene reconstruction. The framework's efficiency and adaptability make it particularly suited for interactive systems that require rapid rendering of high-quality views from multiple perspectives. The elimination of redundant Gaussians and robust handling of long sequences highlight FreeSplat's potential to impact real-world scenarios where computational resources and real-time performance are critical.
