LATTE3D: Large-scale Amortized Text-to-Enhanced 3D Synthesis (2403.15385v1)

Published 22 Mar 2024 in cs.CV, cs.AI, cs.GR, and cs.LG

Abstract: Recent text-to-3D generation approaches produce impressive 3D results but require time-consuming optimization that can take up to an hour per prompt. Amortized methods like ATT3D optimize multiple prompts simultaneously to improve efficiency, enabling fast text-to-3D synthesis. However, they cannot capture high-frequency geometry and texture details and struggle to scale to large prompt sets, so they generalize poorly. We introduce LATTE3D, addressing these limitations to achieve fast, high-quality generation on a significantly larger prompt set. Key to our method is 1) building a scalable architecture and 2) leveraging 3D data during optimization through 3D-aware diffusion priors, shape regularization, and model initialization to achieve robustness to diverse and complex training prompts. LATTE3D amortizes both neural field and textured surface generation to produce highly detailed textured meshes in a single forward pass. LATTE3D generates 3D objects in 400ms, and can be further enhanced with fast test-time optimization.

Overview of LATTE3D: A Large-Scale Amortized Text-to-3D Model

The paper, "LATTE3D," introduces a novel approach to text-to-3D synthesis that significantly improves efficiency and scalability while maintaining high-quality output. LATTE3D, or Large-scale Amortized Text-to-Enhanced 3D synthesis, focuses on overcoming limitations related to text-to-3D models by addressing issues in geometry and texture detail, optimization time, and prompt generalization.

Methodology

The authors propose a scalable architecture built on two key ideas: amortized text-to-3D optimization and the use of 3D data during training. The architecture operates in two stages:

  • Stage-1: A volumetric geometry generation step that employs a triplane representation, supervised via score distillation sampling (SDS) with MVDream's 3D-aware diffusion prior. The design leverages pretraining on shape reconstruction and an input point-cloud annealing strategy to stabilize training.
  • Stage-2: Refines and upscales the generated texture using depth-conditional ControlNet guidance, enhancing image fidelity and maintaining geometric alignment with the inputs from the previous stage.
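The two-stage flow above can be sketched as a single forward pipeline. Everything in the snippet below is a toy stand-in under stated assumptions: `encode_text`, `stage1_geometry`, and `stage2_texture` are mocked illustrations of the pipeline's structure, not the paper's implementation, and the diffusion priors are omitted entirely.

```python
# Toy sketch of a two-stage amortized text-to-3D forward pass.
# All functions are illustrative stand-ins, not LATTE3D's actual API.
import numpy as np

rng = np.random.default_rng(0)

def encode_text(prompt: str, dim: int = 64) -> np.ndarray:
    """Stand-in text encoder: hash the prompt into a fixed embedding."""
    seed = abs(hash(prompt)) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

def stage1_geometry(text_emb: np.ndarray, res: int = 32) -> np.ndarray:
    """Stage 1: an amortized network maps the text embedding to a triplane
    (three axis-aligned feature planes); here, a random linear map."""
    w = rng.standard_normal((3 * res * res, text_emb.size)) * 0.01
    return (w @ text_emb).reshape(3, res, res)

def stage2_texture(triplane: np.ndarray) -> np.ndarray:
    """Stage 2: refine/upscale texture with geometry held fixed (the paper
    uses depth-conditioned ControlNet guidance; mocked here as a 2x
    nearest-neighbour upsample)."""
    return triplane.repeat(2, axis=1).repeat(2, axis=2)

def generate(prompt: str) -> np.ndarray:
    """Single forward pass: text -> geometry -> refined texture."""
    return stage2_texture(stage1_geometry(encode_text(prompt)))

out = generate("a ceramic lion figurine")
print(out.shape)  # (3, 64, 64): three upscaled feature planes
```

The key structural point is that generation is a single feed-forward call per prompt, rather than a per-prompt optimization loop.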

LATTE3D trains on a large-scale prompt dataset, enabling real-time generation that remains robust across diverse inputs. The method cuts generation time to ~400 ms on a single GPU and also supports fast test-time optimization for further quality gains.

Results and Implications

Quantitative results show that LATTE3D is competitive with state-of-the-art methods while significantly reducing computational expense. The model achieves near real-time 3D generation, suitable for rapid iteration in content creation, with practical applications in industries reliant on 3D content, from video game development to digital art production.

Furthermore, the incorporation of 3D-aware priors and shape regularization demonstrates the value of exploiting 3D data during optimization, contributing to the broader discourse on improving robustness and efficiency in 3D synthesis tasks.
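The diffusion-prior supervision referenced here follows the standard score-distillation pattern. For context, the widely used SDS gradient introduced by DreamFusion (generic notation; not necessarily the paper's exact formulation) is:

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)
      \,\frac{\partial x}{\partial \theta}
    \right]
```

where x is the image rendered from the 3D parameters θ, x_t its noised version at timestep t, ε the injected noise, ε̂_φ the diffusion model's noise prediction conditioned on the text prompt y, and w(t) a timestep weighting. A 3D-aware prior such as MVDream supplies multi-view-consistent ε̂_φ, which is what improves geometric robustness.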

Challenges and Future Directions

Despite its success, LATTE3D has notable limitations:

  1. Prompt Compositionality: Difficulties persist in generating accurate results from complex or multi-object prompts.
  2. Geometry-Freezing in Stage-2: Though advantageous for stability, this might limit flexibility in correcting geometry issues post Stage-1 processing.

Future work could refine geometry during the texture-refinement phase, improve prompt interpretation in complex scenes, and extend generalization to even larger prompt sets. Research could also investigate deeper integration of multi-modal data, augmenting text inputs with visual context.

Conclusion

LATTE3D presents a significant step forward in text-to-3D synthesis, proving that large-scale, amortized learning frameworks can reconcile speed with quality. The work's clear focus on scalability and efficiency is highly relevant for applications needing rapid, high-fidelity 3D outputs and sets a precedent for future research within the space of text-driven 3D content generation.

Authors (8)
  1. Kevin Xie
  2. Jonathan Lorraine
  3. Tianshi Cao
  4. Jun Gao
  5. James Lucas
  6. Antonio Torralba
  7. Sanja Fidler
  8. Xiaohui Zeng
Citations (24)