Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting
The paper "Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting" addresses the under-explored field of controllable text-to-3D generation. This approach combines developments in image diffusion models and 3D representations to provide a comprehensive solution for generating high-quality 3D models from text prompts and additional conditions such as depth or normal maps. The research introduces several technical innovations, particularly in the architecture and pipeline design, to achieve this goal efficiently.
The core innovation of this work is the Multi-view ControlNet (MVControl), a neural network architecture designed for controllable text-to-multi-view image generation. MVControl builds on a pre-trained multi-view diffusion model and augments it with additional spatial conditioning signals. Unlike conventional models that handle single-view text-to-image generation, MVControl injects both local and global embeddings computed from the condition image and the camera poses of the views, enabling finer-grained control over the 3D generation process. This conditioning strategy sidesteps the difficulty of keeping the generated views consistent with one another and correctly aligned with their relative camera poses.
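To make the conditioning idea concrete, the sketch below (plain PyTorch, with hypothetical module names and dimensions) shows one way a condition image and per-view camera parameters could be turned into a local spatial feature map and a global embedding. It illustrates the general pattern rather than the paper's actual implementation.

```python
# Illustrative sketch of MVControl-style conditioning: a condition image (e.g. a
# depth or normal map) yields a *local* spatial feature map plus a *global*
# embedding that also encodes the camera pose. All module names and dimensions
# here are hypothetical, not taken from the paper's code.
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Hypothetical encoder producing local and global condition signals."""
    def __init__(self, cond_channels=3, feat_channels=64, embed_dim=256, pose_dim=16):
        super().__init__()
        # Local branch: convolutional features aligned with the diffusion latent grid.
        self.local = nn.Sequential(
            nn.Conv2d(cond_channels, feat_channels, 3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, stride=2, padding=1),
        )
        # Global branch: pooled condition statistics fused with the camera pose.
        self.global_mlp = nn.Sequential(
            nn.Linear(cond_channels + pose_dim, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, cond_image, camera_pose):
        # cond_image:  (B*V, C, H, W) condition maps, one per generated view
        # camera_pose: (B*V, pose_dim) flattened camera parameters per view
        local_feat = self.local(cond_image)              # spatial control signal
        pooled = cond_image.mean(dim=(2, 3))             # (B*V, C) global summary
        global_emb = self.global_mlp(torch.cat([pooled, camera_pose], dim=-1))
        return local_feat, global_emb

# Usage sketch: four views of a 256x256 condition map with per-view camera poses.
encoder = ConditionEncoder()
cond = torch.rand(4, 3, 256, 256)
poses = torch.rand(4, 16)
local_feat, global_emb = encoder(cond, poses)
print(local_feat.shape, global_emb.shape)  # (4, 64, 64, 64) and (4, 256)
```

In a full system, the local features would be added to the UNet's intermediate activations (as in ControlNet) while the global embedding would join the text and timestep conditioning; the frozen multi-view diffusion backbone itself is omitted here.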
Furthermore, the paper proposes an efficient multi-stage 3D generation pipeline that combines the MVControl architecture with score distillation and large reconstruction models. Key to this pipeline is the adoption of Gaussian splatting, an explicit and more efficient 3D representation, in place of computationally intensive implicit representations. The explicit representation allows the 3D Gaussians to be bound to the mesh surface, giving precise control over geometry while preserving fine-grained detail.
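The surface binding can be illustrated with a small sketch: each Gaussian is attached to a mesh triangle via barycentric coordinates and a signed offset along the face normal, so optimizing or deforming the mesh moves the Gaussians with it. The NumPy code below is a toy illustration of this SuGaR-style attachment under those assumptions, not the paper's implementation.

```python
# Toy illustration of binding 3D Gaussians to a mesh surface: each Gaussian is
# parameterised by a triangle index, barycentric coordinates within that triangle,
# and a small offset along the face normal. Function and argument names are
# illustrative only.
import numpy as np

def gaussian_centers(vertices, faces, face_ids, barycentric, normal_offset):
    """Compute Gaussian centers from their surface attachment.

    vertices:      (V, 3) mesh vertex positions
    faces:         (F, 3) vertex indices per triangle
    face_ids:      (N,)   triangle index each Gaussian is bound to
    barycentric:   (N, 3) barycentric coordinates (rows sum to 1)
    normal_offset: (N,)   signed offset along the face normal
    """
    tri = vertices[faces[face_ids]]                    # (N, 3, 3) triangle corners
    base = np.einsum("nk,nkd->nd", barycentric, tri)   # point on the triangle
    n = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    n /= np.linalg.norm(n, axis=1, keepdims=True) + 1e-8
    return base + normal_offset[:, None] * n           # lift slightly off the surface

# Usage: two Gaussians attached to a single triangle.
verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2]])
centers = gaussian_centers(
    verts, faces,
    face_ids=np.array([0, 0]),
    barycentric=np.array([[1/3, 1/3, 1/3], [0.5, 0.25, 0.25]]),
    normal_offset=np.array([0.0, 0.02]),
)
print(centers)
```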
The research reports strong empirical results: experiments validate MVControl's generalization across different condition types, underlining the robustness and versatility of the approach for generating controllable, high-quality 3D content. Notably, comparisons against other methods substantiate the superior quality and fidelity of the generated 3D assets in both geometry and texture.
The paper has significant implications for both practical applications and theoretical advances in AI-driven 3D content generation. Practically, it offers a pathway for efficiently creating detailed 3D models tailored to specific textual descriptions and visual conditions, with potential uses in gaming, digital content creation, and virtual reality. Theoretically, it advances the discussion of effective diffusion-model conditioning in multi-view settings and encourages further exploration of hybrid representations such as SuGaR.
Future developments in AI could draw on this research's use of hybrid diffusion guidance for other multimedia generation tasks. Further work might also investigate scaling MVControl or incorporating richer condition types to broaden its applicability.
In conclusion, the paper demonstrates the feasibility of controllable text-to-3D generation with substantial improvements in pipeline efficiency and output quality, marking a concrete step toward more interactive and precise generative models in 3D space.