Gamba: Marry Gaussian Splatting with Mamba for single view 3D reconstruction

Published 27 Mar 2024 in cs.CV and cs.AI | arXiv:2403.18795v3

Abstract: We tackle the challenge of efficiently reconstructing a 3D asset from a single image at millisecond speed. Existing methods for single-image 3D reconstruction are primarily based on Score Distillation Sampling (SDS) with Neural 3D representations. Despite promising results, these approaches encounter practical limitations due to lengthy optimizations and significant memory consumption. In this work, we introduce Gamba, an end-to-end 3D reconstruction model from a single-view image, emphasizing two main insights: (1) Efficient Backbone Design: introducing a Mamba-based GambaFormer network to model 3D Gaussian Splatting (3DGS) reconstruction as sequential prediction with linear scalability of token length, thereby accommodating a substantial number of Gaussians; (2) Robust Gaussian Constraints: deriving radial mask constraints from multi-view masks to eliminate the need for warmup supervision of 3D point clouds in training. We trained Gamba on Objaverse and assessed it against existing optimization-based and feed-forward 3D reconstruction approaches on the GSO Dataset, among which Gamba is the only end-to-end trained single-view reconstruction model with 3DGS. Experimental results demonstrate its competitive generation capabilities both qualitatively and quantitatively and highlight its remarkable speed: Gamba completes reconstruction within 0.05 seconds on a single NVIDIA A100 GPU, which is about $1,000\times$ faster than optimization-based methods. Please see our project page at https://florinshen.github.io/gamba-project.

Summary

  • The paper introduces Gamba, which integrates 3D Gaussian Splatting with the Mamba architecture to rapidly generate high-quality 3D assets from a single image.
  • It employs an end-to-end network that turns a single image into tokens processed, together with learnable 3D Gaussian embeddings, by sequential Mamba-based blocks.
  • Evaluations on OmniObject3D show competitive quality with reconstruction times of roughly 0.6 seconds on an NVIDIA A100 GPU, underscoring its practical efficiency.

Gamba: A Novel Approach for Single-View 3D Reconstruction via Amortized 3D Gaussian Splatting and Mamba

Introduction to Gamba

In 3D content creation, the ability to reconstruct 3D assets efficiently from a single image is increasingly important, driven by growing demand in industries such as AR/VR and autonomous navigation. Despite significant advances, existing methods, which predominantly rely on Score Distillation Sampling (SDS) with neural 3D representations such as NeRF, are limited by lengthy optimization and heavy memory consumption. To address these challenges, the paper introduces Gamba, an end-to-end model that integrates 3D Gaussian Splatting (3DGS) with the Mamba architecture for single-view 3D reconstruction. Gamba combines the efficient 3D representation of 3DGS with the linear scalability of Mamba, enabling fast, high-quality 3D asset generation.

Key Contributions

  • 3D Representation with 3D Gaussian Splatting: Gamba represents each asset with a large set of 3D Gaussians and reconstructs it via 3D Gaussian splatting, a memory-efficient, high-fidelity rendering formulation well suited to practical applications (a parameter sketch follows this list).
  • Mamba-Based Backbone Design: At the heart of Gamba lies a Mamba-based sequential network that supports context-dependent reasoning and scales linearly with sequence length. This accommodates a substantial number of Gaussians and sidesteps the quadratic cost in token count that makes transformer-based architectures ill-suited to generating 3DGS at this scale.
  • Robust Data Preprocessing and Regularization: Training relies on careful data preprocessing and regularization, including radial mask constraints derived from multi-view masks, which remove the need for warmup supervision with 3D point clouds and improve the stability and quality of reconstruction.
  • Efficient and High-Quality Reconstruction: On the OmniObject3D dataset, Gamba generates high-quality 3D assets with competitive fidelity while completing reconstruction in roughly 0.6 seconds on a single NVIDIA A100 GPU.
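
To make the first contribution concrete, here is a minimal sketch of how a feed-forward 3DGS predictor typically converts a raw per-Gaussian output vector into valid splatting parameters. This is a hypothetical illustration rather than code from the paper: the 14-channel layout and the activation choices (bounded positions, softplus scales, sigmoid opacity, normalized quaternion) are common 3DGS conventions assumed here for clarity.

```python
# Hypothetical post-processing of raw per-Gaussian predictions into valid
# 3D Gaussian Splatting parameters. Layout and activations are assumptions.
import torch
import torch.nn.functional as F

def activate_gaussians(raw: torch.Tensor) -> dict:
    """raw: (B, N, 14) unconstrained network outputs, one row per Gaussian."""
    xyz, scale, rot, opacity, rgb = raw.split([3, 3, 4, 1, 3], dim=-1)
    return {
        "xyz":     torch.tanh(xyz),           # keep centers inside a bounded cube
        "scale":   F.softplus(scale),         # strictly positive extents
        "rot":     F.normalize(rot, dim=-1),  # unit quaternion for orientation
        "opacity": torch.sigmoid(opacity),    # alpha in (0, 1)
        "rgb":     torch.sigmoid(rgb),        # color in (0, 1)
    }
```

The activated Gaussians would then be rasterized with a differentiable 3DGS renderer and supervised against multi-view renderings and masks, as described in the abstract.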

The Gamba Architecture

Gamba casts single-view 3D reconstruction as sequential prediction. The model first encodes the input image and its camera pose into a sequence of tokens, which are processed together with a set of learnable 3DGS embeddings by the GambaFormer, a stack of Mamba-based blocks. This sequential processing lets the model generate a large number of Gaussians efficiently. A Gaussian Decoder then maps the output tokens to explicit 3D Gaussian parameters, which are rendered into multi-view images for direct supervision during training.
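
The following is a minimal, hypothetical sketch of this pipeline in PyTorch, included only to make the data flow concrete. It is not the authors' implementation: the class name, dimensions, conditioning scheme, and the use of the `mamba_ssm` package's `Mamba` block (which requires a CUDA build) are assumptions; the paper's actual tokenizer, block design, and Gaussian count may differ.

```python
# Hypothetical GambaFormer-style pipeline: image and camera tokens conditioning
# a long sequence of learnable Gaussian tokens. All names and dims are illustrative.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # selective state-space block (CUDA build required)


class GambaFormerSketch(nn.Module):
    def __init__(self, d_model=512, n_layers=8, img_feat_dim=768, n_gaussians=16384):
        super().__init__()
        # One learnable embedding per predicted Gaussian.
        self.gs_tokens = nn.Parameter(0.02 * torch.randn(n_gaussians, d_model))
        self.img_proj = nn.Linear(img_feat_dim, d_model)  # e.g. DINOv2 patch features
        self.cam_proj = nn.Linear(16, d_model)            # flattened 4x4 camera pose
        # Mamba blocks cost linear time in token count, so the long Gaussian
        # sequence stays affordable (norms and MLPs omitted for brevity).
        self.blocks = nn.ModuleList(Mamba(d_model=d_model) for _ in range(n_layers))
        # 14 channels per Gaussian: 3 position, 3 scale, 4 quaternion, 1 opacity, 3 color.
        self.decoder = nn.Linear(d_model, 14)

    def forward(self, img_feats, camera):
        # img_feats: (B, n_img_tokens, img_feat_dim); camera: (B, 16)
        batch = img_feats.shape[0]
        cond = torch.cat([self.cam_proj(camera)[:, None], self.img_proj(img_feats)], dim=1)
        x = torch.cat([cond, self.gs_tokens.expand(batch, -1, -1)], dim=1)
        for blk in self.blocks:
            x = x + blk(x)  # residual Mamba block
        raw = self.decoder(x[:, cond.shape[1]:])  # keep only the Gaussian tokens
        return raw  # (B, n_gaussians, 14); activate and splat as sketched above
```

In this reading, the condition tokens are simply prepended to the Gaussian tokens so that the sequential state-space scan can propagate image and camera information into every Gaussian prediction; the activated Gaussians would then be rendered from multiple viewpoints and supervised with photometric and mask losses.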

Evaluation and Implications

Gamba was evaluated against existing optimization-based and feed-forward methods on the OmniObject3D dataset. It delivers competitive generation quality in both qualitative and quantitative comparisons while offering a large speed advantage, roughly three orders of magnitude faster than optimization-based approaches. These results underscore Gamba's practicality for automating 3D content creation pipelines that demand fast, high-quality single-view reconstruction.
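
Because the headline result is wall-clock latency, it is worth noting how such numbers are typically measured. The snippet below is a generic, hypothetical timing harness rather than the authors' benchmark: `model`, its inputs, and the iteration counts are placeholders, and CUDA synchronization is used so that asynchronous kernel launches do not understate the true cost.

```python
# Hypothetical latency harness for a feed-forward reconstructor on a CUDA GPU.
import time
import torch

@torch.no_grad()
def measure_latency(model, img_feats, camera, warmup=10, iters=100):
    model.eval()
    for _ in range(warmup):             # warm up kernels and the allocator
        model(img_feats, camera)
    torch.cuda.synchronize()            # flush queued GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(img_feats, camera)
    torch.cuda.synchronize()            # wait for the last forward pass
    return (time.perf_counter() - start) / iters  # seconds per reconstruction
```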

Future Directions and Speculations

The introduction of Gamba opens several avenues for future research and development in AI and 3D modeling. One potential direction is exploring the adaptability of the Gamba architecture to other forms of 3D representations and reconstruction tasks. Additionally, the scalability benefits offered by the combination of 3DGS and Mamba suggest possibilities for extending Gamba to more complex scenes and objects, further enhancing the realism and utility of generated 3D assets.
