Gamba: Marry Gaussian Splatting with Mamba for single view 3D reconstruction

Published 27 Mar 2024 in cs.CV and cs.AI | arXiv:2403.18795v3

Abstract: We tackle the challenge of efficiently reconstructing a 3D asset from a single image at millisecond speed. Existing methods for single-image 3D reconstruction are primarily based on Score Distillation Sampling (SDS) with Neural 3D representations. Despite promising results, these approaches encounter practical limitations due to lengthy optimizations and significant memory consumption. In this work, we introduce Gamba, an end-to-end 3D reconstruction model from a single-view image, emphasizing two main insights: (1) Efficient Backbone Design: introducing a Mamba-based GambaFormer network to model 3D Gaussian Splatting (3DGS) reconstruction as sequential prediction with linear scalability of token length, thereby accommodating a substantial number of Gaussians; (2) Robust Gaussian Constraints: deriving radial mask constraints from multi-view masks to eliminate the need for warmup supervision of 3D point clouds in training. We trained Gamba on Objaverse and assessed it against existing optimization-based and feed-forward 3D reconstruction approaches on the GSO Dataset, among which Gamba is the only end-to-end trained single-view reconstruction model with 3DGS. Experimental results demonstrate its competitive generation capabilities both qualitatively and quantitatively and highlight its remarkable speed: Gamba completes reconstruction within 0.05 seconds on a single NVIDIA A100 GPU, which is about $1,000\times$ faster than optimization-based methods. Please see our project page at https://florinshen.github.io/gamba-project.

Summary

  • The paper introduces Gamba, which integrates 3D Gaussian Splatting with the Mamba architecture to rapidly generate high-quality 3D assets from a single image.
  • It employs an end-to-end network that turns a single image into tokens processed, together with learnable 3D Gaussian embeddings, by sequential Mamba-based blocks.
  • Evaluations on OmniObject3D show competitive quality with reconstruction times of roughly 0.6 seconds on an NVIDIA A100 GPU, underscoring its practical efficiency.

Gamba: A Novel Approach for Single-View 3D Reconstruction via Amortized 3D Gaussian Splatting and Mamba

Introduction to Gamba

In 3D content creation, the ability to reconstruct 3D assets efficiently from a single image is increasingly important, driven by growing demand in industries such as AR/VR and autonomous navigation. Despite significant advances, existing methods, which predominantly rely on Score Distillation Sampling (SDS) with neural 3D representations such as NeRF, are limited by lengthy optimization and heavy memory consumption. To address these challenges, the paper introduces Gamba, an end-to-end model that integrates 3D Gaussian Splatting (3DGS) with the Mamba architecture for single-view 3D reconstruction. Gamba combines the efficient 3D representation of 3DGS with the linear scalability of Mamba, enabling fast, high-quality 3D asset generation.

Key Contributions

  • 3D Representation with 3D Gaussian Splatting: Gamba represents each asset with a large set of 3D Gaussians and reconstructs it via 3D Gaussian splatting, a memory-efficient, high-fidelity rendering formulation well suited to practical applications (a parameter sketch follows this list).
  • Mamba-Based Backbone Design: At the heart of Gamba lies a Mamba-based sequential network that supports context-dependent reasoning and scales linearly with sequence length. This accommodates a substantial number of Gaussians and sidesteps the quadratic cost in token count that makes transformer-based architectures ill-suited to generating 3DGS at this scale.
  • Robust Data Preprocessing and Regularization: Training relies on careful data preprocessing and regularization, including radial mask constraints derived from multi-view masks, which remove the need for warmup supervision with 3D point clouds and improve the stability and quality of reconstruction.
  • Efficient and High-Quality Reconstruction: On the OmniObject3D dataset, Gamba generates high-quality 3D assets with competitive fidelity while completing reconstruction in roughly 0.6 seconds on a single NVIDIA A100 GPU.
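
To make the first contribution concrete, here is a minimal sketch of how a feed-forward 3DGS predictor typically converts a raw per-Gaussian output vector into valid splatting parameters. This is a hypothetical illustration rather than code from the paper: the 14-channel layout and the activation choices (bounded positions, softplus scales, sigmoid opacity, normalized quaternion) are common 3DGS conventions assumed here for clarity.

```python
# Hypothetical post-processing of raw per-Gaussian predictions into valid
# 3D Gaussian Splatting parameters. Layout and activations are assumptions.
import torch
import torch.nn.functional as F

def activate_gaussians(raw: torch.Tensor) -> dict:
    """raw: (B, N, 14) unconstrained network outputs, one row per Gaussian."""
    xyz, scale, rot, opacity, rgb = raw.split([3, 3, 4, 1, 3], dim=-1)
    return {
        "xyz":     torch.tanh(xyz),           # keep centers inside a bounded cube
        "scale":   F.softplus(scale),         # strictly positive extents
        "rot":     F.normalize(rot, dim=-1),  # unit quaternion for orientation
        "opacity": torch.sigmoid(opacity),    # alpha in (0, 1)
        "rgb":     torch.sigmoid(rgb),        # color in (0, 1)
    }
```

The activated Gaussians would then be rasterized with a differentiable 3DGS renderer and supervised against multi-view renderings and masks, as described in the abstract.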

The Gamba Architecture

Gamba casts single-view 3D reconstruction as sequential prediction. The model first encodes the input image and its camera pose into a sequence of tokens, which are processed together with a set of learnable 3DGS embeddings by the GambaFormer, a stack of Mamba-based blocks. This sequential processing lets the model generate a large number of Gaussians efficiently. A Gaussian Decoder then maps the output tokens to explicit 3D Gaussian parameters, which are rendered into multi-view images for direct supervision during training.
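
The following is a minimal, hypothetical sketch of this pipeline in PyTorch, included only to make the data flow concrete. It is not the authors' implementation: the class name, dimensions, conditioning scheme, and the use of the `mamba_ssm` package's `Mamba` block (which requires a CUDA build) are assumptions; the paper's actual tokenizer, block design, and Gaussian count may differ.

```python
# Hypothetical GambaFormer-style pipeline: image and camera tokens conditioning
# a long sequence of learnable Gaussian tokens. All names and dims are illustrative.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # selective state-space block (CUDA build required)


class GambaFormerSketch(nn.Module):
    def __init__(self, d_model=512, n_layers=8, img_feat_dim=768, n_gaussians=16384):
        super().__init__()
        # One learnable embedding per predicted Gaussian.
        self.gs_tokens = nn.Parameter(0.02 * torch.randn(n_gaussians, d_model))
        self.img_proj = nn.Linear(img_feat_dim, d_model)  # e.g. DINOv2 patch features
        self.cam_proj = nn.Linear(16, d_model)            # flattened 4x4 camera pose
        # Mamba blocks cost linear time in token count, so the long Gaussian
        # sequence stays affordable (norms and MLPs omitted for brevity).
        self.blocks = nn.ModuleList(Mamba(d_model=d_model) for _ in range(n_layers))
        # 14 channels per Gaussian: 3 position, 3 scale, 4 quaternion, 1 opacity, 3 color.
        self.decoder = nn.Linear(d_model, 14)

    def forward(self, img_feats, camera):
        # img_feats: (B, n_img_tokens, img_feat_dim); camera: (B, 16)
        batch = img_feats.shape[0]
        cond = torch.cat([self.cam_proj(camera)[:, None], self.img_proj(img_feats)], dim=1)
        x = torch.cat([cond, self.gs_tokens.expand(batch, -1, -1)], dim=1)
        for blk in self.blocks:
            x = x + blk(x)  # residual Mamba block
        raw = self.decoder(x[:, cond.shape[1]:])  # keep only the Gaussian tokens
        return raw  # (B, n_gaussians, 14); activate and splat as sketched above
```

In this reading, the condition tokens are simply prepended to the Gaussian tokens so that the sequential state-space scan can propagate image and camera information into every Gaussian prediction; the activated Gaussians would then be rendered from multiple viewpoints and supervised with photometric and mask losses.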

Evaluation and Implications

Gamba was evaluated against existing optimization-based and feed-forward methods on the OmniObject3D dataset. It delivers competitive generation quality in both qualitative and quantitative comparisons while offering a large speed advantage, roughly three orders of magnitude faster than optimization-based approaches. These results underscore Gamba's practicality for automating 3D content creation pipelines that demand fast, high-quality single-view reconstruction.
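
Because the headline result is wall-clock latency, it is worth noting how such numbers are typically measured. The snippet below is a generic, hypothetical timing harness rather than the authors' benchmark: `model`, its inputs, and the iteration counts are placeholders, and CUDA synchronization is used so that asynchronous kernel launches do not understate the true cost.

```python
# Hypothetical latency harness for a feed-forward reconstructor on a CUDA GPU.
import time
import torch

@torch.no_grad()
def measure_latency(model, img_feats, camera, warmup=10, iters=100):
    model.eval()
    for _ in range(warmup):             # warm up kernels and the allocator
        model(img_feats, camera)
    torch.cuda.synchronize()            # flush queued GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(img_feats, camera)
    torch.cuda.synchronize()            # wait for the last forward pass
    return (time.perf_counter() - start) / iters  # seconds per reconstruction
```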

Future Directions and Speculations

The introduction of Gamba opens several avenues for future research and development in AI and 3D modeling. One potential direction is exploring the adaptability of the Gamba architecture to other forms of 3D representations and reconstruction tasks. Additionally, the scalability benefits offered by the combination of 3DGS and Mamba suggest possibilities for extending Gamba to more complex scenes and objects, further enhancing the realism and utility of generated 3D assets.
