Instant3D: Instant Text-to-3D Generation
The paper presents Instant3D, a framework for fast text-to-3D generation that avoids the substantial computational cost of optimizing a neural field from scratch for every text prompt. Its key contribution is the ability to generate a 3D object for a previously unseen prompt in under one second with a single feedforward pass, a significant efficiency gain over existing per-prompt optimization methods.
Methodology and Innovations
Instant3D introduces a specialized network architecture designed to construct a triplane representation of a 3D object directly from a text prompt. The fundamental innovation lies in the effective integration of text conditions into the network through three key mechanisms: cross-attention, style injection, and token-to-plane transformation. These mechanisms collectively ensure that the generated 3D model aligns accurately with the text input.
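To make these mechanisms concrete, the sketch below implements minimal NumPy versions of the three conditioning paths. All dimensions, weight matrices, and function names are illustrative assumptions of ours, not the paper's actual architecture; a real implementation would use learned parameters, multi-head attention, and plane features at multiple resolutions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(feat, tokens, Wq, Wk, Wv):
    # feat: (N, d) plane features; tokens: (T, d_t) text-encoder tokens.
    # Plane features attend to text tokens, pulling in prompt content.
    q = feat @ Wq                                # (N, d_a) queries
    k = tokens @ Wk                              # (T, d_a) keys
    v = tokens @ Wv                              # (T, d)   values
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return feat + attn @ v                       # residual text-conditioned update

def style_injection(feat, text_emb, Ws, Wb):
    # AdaIN-style modulation: per-channel scale and shift predicted
    # from a pooled text embedding (our stand-in for style injection).
    scale, shift = text_emb @ Ws, text_emb @ Wb
    mu, sigma = feat.mean(0), feat.std(0) + 1e-5
    return scale * (feat - mu) / sigma + shift

def token_to_plane(tokens, Wp, res=8, d=16):
    # Project pooled text tokens into a coarse (res*res, d) feature plane
    # that can seed one plane of the triplane representation.
    pooled = tokens.mean(0)                      # (d_t,)
    return (pooled @ Wp).reshape(res * res, d)
```

In this toy data flow, `token_to_plane` seeds a plane from the prompt, `cross_attention` refines it against individual tokens, and `style_injection` modulates its global statistics; the paper combines the three inside its generator network.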
A noteworthy feature of Instant3D is its scaled-sigmoid activation function, which replaces the standard sigmoid and accelerates training convergence by more than ten times. Another key advance is the adaptive Perp-Neg algorithm, which dynamically adjusts concept-negation scales to combat the Janus problem: the tendency to generate objects with multiple redundant faces or heads, typically caused by viewpoint bias in the training data. By tuning the negation scale to the severity of the artifact during training, the adaptive approach yields more geometrically coherent 3D outputs.
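The two training-time techniques above can be sketched as follows. The exact functional forms are our assumptions: the paper defines its own scaled sigmoid, and Perp-Neg in practice operates on diffusion score estimates rather than the toy vectors used here.

```python
import numpy as np

def scaled_sigmoid(x, k=10.0):
    # Sigmoid stretched along the input axis: with k > 1 the transition
    # region widens, so gradients stay informative over a larger input
    # range and training saturates less quickly.  (Illustrative form;
    # the paper's exact definition may differ.)
    return 1.0 / (1.0 + np.exp(-x / k))

def perp_component(neg, pos):
    # Component of the negative-prompt direction orthogonal to the
    # positive one, so negation removes unwanted views without
    # cancelling the desired content.
    return neg - (neg @ pos) / (pos @ pos) * pos

def adaptive_perp_neg(pos, negs, severity, w_min=0.5, w_max=3.0):
    # Scale the negation weight with a measured Janus severity in [0, 1]:
    # mild artifacts get gentle negation, severe ones get strong negation.
    # w_min and w_max are made-up bounds, not values from the paper.
    w = w_min + severity * (w_max - w_min)
    return pos - w * sum(perp_component(n, pos) for n in negs)
```

The orthogonal projection is the heart of Perp-Neg; the adaptive part is only the severity-dependent weight `w`, which replaces the fixed negation scale of the original algorithm.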
Experimental Results
Extensive experiments show that Instant3D performs favorably against state-of-the-art text-to-3D generation methods on several benchmark datasets, both qualitatively and quantitatively. The framework produces high-quality renderings with significantly improved efficiency, requiring dramatically fewer iterations than the thousands needed by previous per-prompt optimization methods.
Moreover, Instant3D shows robust generalization capabilities, successfully generating 3D objects across diverse prompts, from structured compositions like those in the Animals and Portraits datasets to the complex, varied sentences of the Daily Life set. This adaptability highlights Instant3D's capacity to leverage shared 3D priors across different objects, a feature not available in models that optimize separately for each prompt.
Implications and Future Directions
Instant3D significantly advances the practical application of text-driven 3D generation by alleviating computational burdens and enabling near-instantaneous object synthesis. It paves the way for more efficient integration of 3D modeling in areas like virtual reality and animation, where speed is paramount. Additionally, the architectural principles outlined could inspire further refinement of generative models, particularly in exploring the balance between condition mechanisms and their computational costs.
Future work could refine the condition mechanisms to further improve quality and efficiency, integrate the framework with real-time rendering pipelines, and examine broader applications in interactive media. As computational resources grow, Instant3D's framework could evolve to handle more complex generation tasks, richer datasets, and multi-modal inputs.
In summary, Instant3D offers a substantial leap forward in text-to-3D generation, providing a foundation for efficient, high-quality 3D content creation driven by textual input.