- The paper introduces Long-LRM, which reduces 3D scene reconstruction time from 13 minutes to 1.3 seconds using 32 high-resolution images.
- It employs a hybrid architecture that interleaves linear-complexity Mamba2 blocks with transformer blocks to handle very long token sequences efficiently.
- Token merging and Gaussian pruning techniques balance computational efficiency with high-quality reconstruction on large-scale datasets.
Overview of Long-LRM: A Generalizable 3D Gaussian Reconstruction Model
The paper introduces Long-LRM, a feed-forward 3D Gaussian reconstruction model designed to efficiently reconstruct large scenes from a long sequence of input images. It stands out by processing up to 32 high-resolution source images on a single A100 80G GPU in just 1.3 seconds.
Core Contributions
Long-LRM departs from prior feed-forward reconstruction models by integrating both Mamba2 and transformer blocks into its architecture. This hybrid design is crucial: it lets the model process far more tokens than preceding models, improving both efficiency and quality. The authors further optimize performance with token merging and Gaussian pruning techniques, which balance reconstruction quality against computational efficiency.
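As a rough illustration of such a hybrid stack (not the paper's exact configuration), the sketch below interleaves an occasional full-attention transformer block among linear-complexity Mamba2 blocks. It assumes the `mamba_ssm` package's `Mamba2` layer is available; the depth, model dimension, and 1-in-8 interleaving ratio are placeholder values.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba2  # assumed dependency; provides the Mamba2 SSM layer


class TransformerBlock(nn.Module):
    """Pre-norm self-attention + MLP block: full global attention, quadratic in token count."""
    def __init__(self, dim, num_heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class MambaBlock(nn.Module):
    """Pre-norm residual wrapper around a Mamba2 mixer: linear in token count."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mixer = Mamba2(d_model=dim)

    def forward(self, x):
        return x + self.mixer(self.norm(x))


class HybridStack(nn.Module):
    """Mostly Mamba2 blocks, with a transformer block every `attn_every` layers
    to restore exact all-to-all token interactions (ratio is illustrative only)."""
    def __init__(self, dim=1024, depth=24, attn_every=8):
        super().__init__()
        self.blocks = nn.ModuleList([
            TransformerBlock(dim) if (i + 1) % attn_every == 0 else MambaBlock(dim)
            for i in range(depth)
        ])

    def forward(self, tokens):  # tokens: (batch, num_tokens, dim)
        for blk in self.blocks:
            tokens = blk(tokens)
        return tokens
```

The design intuition is that the Mamba2 blocks perform most of the sequence mixing at linear cost, while the sparse attention blocks periodically re-establish exact global interactions across the whole token sequence.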
The model distinguishes itself by reconstructing an entire large-scale scene in a single feed-forward pass, whereas previous feed-forward models were restricted to a handful of images and could only reconstruct a small portion of a scene. Long-LRM generalizes across large-scale scene datasets such as DL3DV-140 and Tanks and Temples, improving efficiency by two orders of magnitude while matching the quality of optimization-based reconstruction.
Technical Approach
The authors adopt an approach in which input images are patchified into sequences of patch tokens, a strategy inspired by GS-LRM. They frame GS reconstruction as sequence-to-sequence translation, regressing pixel-aligned Gaussian primitives from the token sequence. The context length is challenging: 32 high-resolution images generate a very long token sequence. Interleaving Mamba2 blocks, which scale linearly with sequence length, with transformer blocks that provide full global attention is what makes these long sequences tractable while preserving global context.
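To make the sequence-to-sequence framing concrete, here is a minimal patchify/unpatchify sketch in PyTorch: posed images become patch tokens, and output tokens are decoded back into per-pixel Gaussian parameters. The 8-pixel patch size and the 12-dimensional Gaussian layout (depth, rotation, scale, opacity, color) are GS-LRM-style assumptions for illustration, not figures taken from the paper.

```python
import torch
import torch.nn as nn

PATCH = 8       # assumed patch size
GAUSS_DIM = 12  # assumed per-pixel Gaussian params: 1 depth + 4 quaternion + 3 scale + 1 opacity + 3 color


class Patchify(nn.Module):
    """Turn (B, V, C, H, W) posed images into one token sequence of length
    V * (H/PATCH) * (W/PATCH). Input channels would typically include camera
    information (e.g. Plücker ray embeddings) alongside RGB."""
    def __init__(self, in_ch, dim):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=PATCH, stride=PATCH)

    def forward(self, images):
        b, v, c, h, w = images.shape
        x = self.proj(images.flatten(0, 1))      # (B*V, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B*V, tokens_per_view, dim)
        return x.reshape(b, -1, x.shape[-1])     # (B, V * tokens_per_view, dim)


class GaussianHead(nn.Module):
    """Map each output token back to a PATCH x PATCH grid of per-pixel Gaussians."""
    def __init__(self, dim):
        super().__init__()
        self.out = nn.Linear(dim, PATCH * PATCH * GAUSS_DIM)

    def forward(self, tokens, num_views, h, w):
        b, n, _ = tokens.shape
        g = self.out(tokens)                     # (B, N, P*P*GAUSS_DIM)
        g = g.reshape(b, num_views, h // PATCH, w // PATCH, PATCH, PATCH, GAUSS_DIM)
        g = g.permute(0, 1, 2, 4, 3, 5, 6)       # restore pixel ordering within each view
        return g.reshape(b, num_views * h * w, GAUSS_DIM)  # one Gaussian per input pixel
```

For a sense of scale, 32 views at 512×512 with an 8-pixel patch already yield 32 × 64 × 64 = 131,072 tokens, which is why the hybrid architecture's near-linear scaling matters.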
The model employs token merging modules to further reduce the number of tokens midway through the network. In addition, Gaussian pruning trims the dense set of per-pixel Gaussians, improving both rendering efficiency and training speed.
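A minimal sketch of both mechanisms under assumed design choices: tokens on a per-view grid are merged 2×2 (a 4× sequence reduction), and Gaussians are pruned by opacity, one plausible criterion; the paper's exact merge factor, placement, and pruning rule may differ.

```python
import torch
import torch.nn as nn


class TokenMerge2x2(nn.Module):
    """Merge every 2x2 block of spatially arranged tokens into one token,
    shrinking the sequence 4x mid-network (merge factor is illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(4 * dim, dim)

    def forward(self, tokens, grid_h, grid_w):
        # tokens: (B, grid_h * grid_w, dim), a single view's token grid for simplicity
        b, n, d = tokens.shape
        x = tokens.reshape(b, grid_h // 2, 2, grid_w // 2, 2, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, -1, 4 * d)
        return self.proj(x)                       # (B, n // 4, dim)


def prune_by_opacity(gaussian_params, opacity, keep_ratio=0.5):
    """Keep only the highest-opacity fraction of per-pixel Gaussians.
    gaussian_params: (N, D) parameters, opacity: (N,) in [0, 1]."""
    num_keep = max(1, int(keep_ratio * opacity.numel()))
    keep = torch.topk(opacity, num_keep).indices
    return gaussian_params[keep], opacity[keep]
```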
Numerical Results and Implications
The authors compare Long-LRM against optimization-based 3D Gaussian splatting: it achieves comparable novel-view-synthesis quality while reducing per-scene reconstruction time from roughly 13 minutes to just 1.3 seconds. This efficiency translates into significant practical benefits for applications requiring rapid scene reconstruction.
The integration of Mamba2 enhances the model’s ability to handle long-sequence contexts, a common hurdle in large-scale scene reconstruction. The model’s performance on DL3DV-140 and Tanks and Temples datasets emphasizes its scalability and potential for wide-ranging applications in 3D content creation, virtual and augmented reality, and beyond.
Future Directions
While Long-LRM achieves notable advances, further research could explore scaling up the context length to support even more input views and higher resolutions without compromising efficiency. Additionally, expanding the model’s applicability across diverse datasets, especially those with varying field-of-view parameters, could enhance its generalizability.
Long-LRM sets a new benchmark in the field of feed-forward 3D Gaussian splatting, opening avenues for future explorations in AI-driven scene reconstruction. The hybrid architectural approach offers a template that could be extended to other complex visual tasks, contributing to the ongoing evolution of computer vision.