
LRM: Large Reconstruction Model for Single Image to 3D (2311.04400v2)

Published 8 Nov 2023 in cs.CV, cs.AI, cs.GR, and cs.LG

Abstract: We propose the first Large Reconstruction Model (LRM) that predicts the 3D model of an object from a single input image within just 5 seconds. In contrast to many previous methods that are trained on small-scale datasets such as ShapeNet in a category-specific fashion, LRM adopts a highly scalable transformer-based architecture with 500 million learnable parameters to directly predict a neural radiance field (NeRF) from the input image. We train our model in an end-to-end manner on massive multi-view data containing around 1 million objects, including both synthetic renderings from Objaverse and real captures from MVImgNet. This combination of a high-capacity model and large-scale training data empowers our model to be highly generalizable and produce high-quality 3D reconstructions from various testing inputs, including real-world in-the-wild captures and images created by generative models. Video demos and interactable 3D meshes can be found on our LRM project webpage: https://yiconghong.me/LRM.

Citations (250)

Summary

  • The paper demonstrates that a transformer-based encoder-decoder architecture with 500M parameters efficiently translates 2D image features into detailed 3D structures.
  • LRM leverages a hybrid dataset of synthetic and real images to achieve scalable and rapid 3D reconstructions in under five seconds per image.
  • The model shifts from traditional 3D-aware regularization to 2D multi-view losses, paving the way for versatile AI-driven applications in AR/VR and industrial design.

Overview of LRM: Large Reconstruction Model for Single Image to 3D

The paper introduces a novel approach to 3D reconstruction from a single image using a model termed LRM, the Large Reconstruction Model. It addresses a long-standing challenge in the field: predicting the 3D structure of an object from a single 2D viewpoint. The core methodology hinges on a large-scale transformer architecture that translates 2D image features into a 3D representation.

Key Contributions

LRM makes several contributions that advance the state of the art in image-to-3D reconstruction:

  1. Model Architecture: LRM employs a transformer-based encoder-decoder architecture with 500 million parameters. The encoder is the pre-trained vision transformer DINO, which extracts 2D features from the input image; a novel image-to-triplane transformer decoder then projects these features into a 3D triplane representation from which a NeRF is decoded (see the sketch after this list).
  2. Data Utilization: A significant strength of LRM is its use of large-scale training data: synthetic renderings from Objaverse combined with real captures from MVImgNet, totaling around one million objects. This breadth of data gives the model high generalizability and robust performance across a wide range of input images.
  3. Scalability and Efficiency: The model generates a high-fidelity 3D reconstruction in under five seconds per image on modern GPU hardware. Such rapid, feed-forward processing is pivotal for real-world applications spanning industrial design, AR/VR, and gaming.
  4. Training Paradigm: LRM is trained with 2D multi-view reconstruction losses rather than conventional 3D-aware regularization, which streamlines the training pipeline (a sketch of such a loss also follows below).
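
To make the decoder concrete, below is a minimal PyTorch sketch of an image-to-triplane transformer in the spirit of LRM's design. The dimensions, layer counts, and the omission of the paper's camera-feature modulation are illustrative simplifications, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ImageToTriplane(nn.Module):
    """Illustrative LRM-style decoder: learnable triplane tokens cross-attend
    to DINO image features, then project to triplane feature channels."""

    def __init__(self, d_model=1024, n_layers=16, n_heads=16,
                 tri_res=32, tri_dim=80):
        super().__init__()
        self.tri_res, self.tri_dim = tri_res, tri_dim
        # One learnable token per cell of each of the three planes (XY, XZ, YZ).
        self.triplane_tokens = nn.Parameter(
            torch.randn(3 * tri_res * tri_res, d_model))
        # Each decoder layer cross-attends triplane tokens to image features.
        layer = nn.TransformerDecoderLayer(d_model, nhead=n_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.to_tri = nn.Linear(d_model, tri_dim)

    def forward(self, image_feats):
        # image_feats: (B, N_patches, d_model) patch features from a DINO ViT.
        B = image_feats.shape[0]
        tokens = self.triplane_tokens.unsqueeze(0).expand(B, -1, -1)
        tokens = self.decoder(tokens, image_feats)  # cross- and self-attention
        tri = self.to_tri(tokens)                   # (B, 3*R*R, tri_dim)
        return tri.view(B, 3, self.tri_res, self.tri_res, self.tri_dim)
```

Point features for the downstream NeRF MLP are then obtained by projecting each 3D query point onto the three planes and bilinearly sampling the corresponding feature maps.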
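
Likewise, a minimal sketch of the 2D multi-view reconstruction objective from item 4: renders of the predicted NeRF at supervision views are compared against ground-truth views with a pixel loss plus a perceptual term. The lpips package and the loss weight here are illustrative choices, not necessarily the paper's exact setup.

```python
import torch
import lpips  # pip install lpips

# Perceptual similarity network used for the image-space loss term.
perceptual = lpips.LPIPS(net='vgg')

def multiview_recon_loss(rendered, target, lam=2.0):
    """rendered, target: (V, 3, H, W) renders in [-1, 1], one per supervision view.
    Averages a pixel-wise MSE and an LPIPS perceptual loss over all views;
    the weight lam is illustrative."""
    mse = torch.mean((rendered - target) ** 2)
    perc = perceptual(rendered, target).mean()
    return mse + lam * perc
```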

Theoretical and Practical Implications

LRM's methodology holds substantial theoretical implications. It challenges the traditional reliance on category-specific models by offering a generic solution that generalizes across object categories. This shift towards a single high-capacity transformer trained on large-scale data parallels the scaling trends seen in NLP and image generation, suggesting that similar scaling behavior may carry over to 3D vision.

Practically, LRM's efficiency and generalizability make it highly suitable for real-time applications that require on-the-fly 3D modeling from limited input. Moreover, because reconstruction is a single feed-forward pass with no per-instance optimization or hand-tuned regularization, the model is comparatively easy to deploy across diverse computing environments.

Numerical Results and Impact

The quantitative evaluations presented in the paper underscore LRM's efficacy. It outperforms prior state-of-the-art methods, most notably on FID, CLIP-Similarity, and Chamfer Distance when evaluated on the Google Scanned Objects dataset. These metrics highlight LRM's ability to preserve texture and geometric detail in reconstructed 3D models.
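
For reference, the Chamfer distance between two point clouds is the average nearest-neighbor distance computed in both directions. A brute-force NumPy sketch follows; the O(N·M) pairwise approach is for illustration only, as practical evaluations typically use KD-trees.

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3)."""
    # Pairwise Euclidean distances, shape (N, M).
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    # Mean nearest-neighbor distance in each direction, summed.
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```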

Future Directions

The paper outlines two promising future trajectories: scaling LRM with an even larger dataset and network size, and extending its functionalities towards multimodal 3D generative models. These pathways hint at expansive possibilities for integrating LRM into broader AI systems, potentially enabling seamless text-to-3D synthesis and interaction.

Conclusion

LRM represents a significant stride towards generalized, scalable, and efficient models for 3D reconstruction from single images. It sets a foundation that challenges existing paradigms by leveraging the power of transformer architectures, and signals a pivotal shift in how machine learning approaches can address complex 3D modeling tasks. As computing resources and accessible datasets grow, LRM's foundational principles will likely guide future innovations in the rapidly evolving field of AI-driven 3D reconstruction.
