- The paper introduces Long-LRM, which reduces 3D scene reconstruction time from 13 minutes to 1.3 seconds using 32 high-resolution images.
- It employs a hybrid architecture that interleaves linear-complexity Mamba2 blocks with transformer blocks to handle very long token sequences efficiently.
- Token merging and Gaussian pruning techniques balance computational efficiency with high-quality reconstruction on large-scale datasets.
Overview of Long-LRM: A Generalizable 3D Gaussian Reconstruction Model
The paper introduces Long-LRM, a feed-forward 3D Gaussian reconstruction model designed to efficiently reconstruct large scenes from a long sequence of input images. It stands out by processing up to 32 high-resolution source images on a single A100 80G GPU in just 1.3 seconds.
Core Contributions
Long-LRM departs from prior feed-forward reconstruction models by integrating both Mamba2 and transformer blocks into its architecture. This hybrid design is crucial: it lets the model process far more tokens than preceding models, improving both efficiency and quality. The authors further optimize performance with token merging and Gaussian pruning techniques, which balance reconstruction quality against computational efficiency.
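As a rough illustration of such a hybrid stack (not the paper's exact configuration), the sketch below interleaves an occasional full-attention transformer block among linear-complexity Mamba2 blocks. It assumes the `mamba_ssm` package's `Mamba2` layer is available; the depth, model dimension, and 1-in-8 interleaving ratio are placeholder values.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba2  # assumed dependency; provides the Mamba2 SSM layer


class TransformerBlock(nn.Module):
    """Pre-norm self-attention + MLP block: full global attention, quadratic in token count."""
    def __init__(self, dim, num_heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class MambaBlock(nn.Module):
    """Pre-norm residual wrapper around a Mamba2 mixer: linear in token count."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mixer = Mamba2(d_model=dim)

    def forward(self, x):
        return x + self.mixer(self.norm(x))


class HybridStack(nn.Module):
    """Mostly Mamba2 blocks, with a transformer block every `attn_every` layers
    to restore exact all-to-all token interactions (ratio is illustrative only)."""
    def __init__(self, dim=1024, depth=24, attn_every=8):
        super().__init__()
        self.blocks = nn.ModuleList([
            TransformerBlock(dim) if (i + 1) % attn_every == 0 else MambaBlock(dim)
            for i in range(depth)
        ])

    def forward(self, tokens):  # tokens: (batch, num_tokens, dim)
        for blk in self.blocks:
            tokens = blk(tokens)
        return tokens
```

The design intuition is that the Mamba2 blocks perform most of the sequence mixing at linear cost, while the sparse attention blocks periodically re-establish exact global interactions across the whole token sequence.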
The model distinguishes itself by reconstructing an entire large-scale scene in a single feed-forward pass, whereas previous feed-forward models were restricted to a handful of images and could only reconstruct a small portion of a scene. Long-LRM generalizes across large-scale scene datasets such as DL3DV-140 and Tanks and Temples, improving efficiency by two orders of magnitude while matching the quality of optimization-based reconstruction.
Technical Approach
The authors adopt an approach in which input images are patchified into sequences of patch tokens, a strategy inspired by GS-LRM. They frame GS reconstruction as sequence-to-sequence translation, regressing pixel-aligned Gaussian primitives from the token sequence. The context length is challenging: 32 high-resolution images generate a very long token sequence. Interleaving Mamba2 blocks, which scale linearly with sequence length, with transformer blocks that provide full global attention is what makes these long sequences tractable while preserving global context.
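To make the sequence-to-sequence framing concrete, here is a minimal patchify/unpatchify sketch in PyTorch: posed images become patch tokens, and output tokens are decoded back into per-pixel Gaussian parameters. The 8-pixel patch size and the 12-dimensional Gaussian layout (depth, rotation, scale, opacity, color) are GS-LRM-style assumptions for illustration, not figures taken from the paper.

```python
import torch
import torch.nn as nn

PATCH = 8       # assumed patch size
GAUSS_DIM = 12  # assumed per-pixel Gaussian params: 1 depth + 4 quaternion + 3 scale + 1 opacity + 3 color


class Patchify(nn.Module):
    """Turn (B, V, C, H, W) posed images into one token sequence of length
    V * (H/PATCH) * (W/PATCH). Input channels would typically include camera
    information (e.g. Plücker ray embeddings) alongside RGB."""
    def __init__(self, in_ch, dim):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=PATCH, stride=PATCH)

    def forward(self, images):
        b, v, c, h, w = images.shape
        x = self.proj(images.flatten(0, 1))      # (B*V, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B*V, tokens_per_view, dim)
        return x.reshape(b, -1, x.shape[-1])     # (B, V * tokens_per_view, dim)


class GaussianHead(nn.Module):
    """Map each output token back to a PATCH x PATCH grid of per-pixel Gaussians."""
    def __init__(self, dim):
        super().__init__()
        self.out = nn.Linear(dim, PATCH * PATCH * GAUSS_DIM)

    def forward(self, tokens, num_views, h, w):
        b, n, _ = tokens.shape
        g = self.out(tokens)                     # (B, N, P*P*GAUSS_DIM)
        g = g.reshape(b, num_views, h // PATCH, w // PATCH, PATCH, PATCH, GAUSS_DIM)
        g = g.permute(0, 1, 2, 4, 3, 5, 6)       # restore pixel ordering within each view
        return g.reshape(b, num_views * h * w, GAUSS_DIM)  # one Gaussian per input pixel
```

For a sense of scale, 32 views at 512×512 with an 8-pixel patch already yield 32 × 64 × 64 = 131,072 tokens, which is why the hybrid architecture's near-linear scaling matters.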
The model employs token merging modules to further reduce the number of tokens midway through the network. In addition, Gaussian pruning trims the dense set of per-pixel Gaussians, improving both rendering efficiency and training speed.
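A minimal sketch of both mechanisms under assumed design choices: tokens on a per-view grid are merged 2×2 (a 4× sequence reduction), and Gaussians are pruned by opacity, one plausible criterion; the paper's exact merge factor, placement, and pruning rule may differ.

```python
import torch
import torch.nn as nn


class TokenMerge2x2(nn.Module):
    """Merge every 2x2 block of spatially arranged tokens into one token,
    shrinking the sequence 4x mid-network (merge factor is illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(4 * dim, dim)

    def forward(self, tokens, grid_h, grid_w):
        # tokens: (B, grid_h * grid_w, dim), a single view's token grid for simplicity
        b, n, d = tokens.shape
        x = tokens.reshape(b, grid_h // 2, 2, grid_w // 2, 2, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, -1, 4 * d)
        return self.proj(x)                       # (B, n // 4, dim)


def prune_by_opacity(gaussian_params, opacity, keep_ratio=0.5):
    """Keep only the highest-opacity fraction of per-pixel Gaussians.
    gaussian_params: (N, D) parameters, opacity: (N,) in [0, 1]."""
    num_keep = max(1, int(keep_ratio * opacity.numel()))
    keep = torch.topk(opacity, num_keep).indices
    return gaussian_params[keep], opacity[keep]
```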
Numerical Results and Implications
The authors compare Long-LRM against optimization-based 3D Gaussian splatting: it achieves comparable novel-view-synthesis quality while reducing per-scene reconstruction time from roughly 13 minutes to just 1.3 seconds. This efficiency translates into significant practical benefits for applications requiring rapid scene reconstruction.
The integration of Mamba2 enhances the model’s ability to handle long-sequence contexts, a common hurdle in large-scale scene reconstruction. The model’s performance on DL3DV-140 and Tanks and Temples datasets emphasizes its scalability and potential for wide-ranging applications in 3D content creation, virtual and augmented reality, and beyond.
Future Directions
While Long-LRM achieves notable advances, further research could explore scaling up the context length to support even more input views and higher resolutions without compromising efficiency. Additionally, expanding the model’s applicability across diverse datasets, especially those with varying field-of-view parameters, could enhance its generalizability.
Long-LRM sets a new benchmark in the field of feed-forward 3D Gaussian splatting, opening avenues for future explorations in AI-driven scene reconstruction. The hybrid architectural approach offers a template that could be extended to other complex visual tasks, contributing to the ongoing evolution of computer vision.