Overview of GALA3D: Layout-Guided Text-to-3D Generation
The paper presents GALA3D, a framework for generating complex 3D scenes from textual descriptions. It builds on generative 3D Gaussian representations and is guided by layout information derived from large language models (LLMs). The approach is distinctive in combining layout priors with adaptive geometry control to improve the fidelity and consistency of the generated scenes.
GALA3D introduces several contributions to the text-to-3D generation landscape. Using LLMs, the framework extracts objects and their spatial relationships from the input text to construct coarse initial scene layouts, which are then refined to better satisfy scene constraints. This replaces manual layout creation, making 3D scene authoring more efficient and accessible.
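To make the LLM-driven layout step concrete, here is a minimal sketch of how a text prompt might be turned into per-object bounding boxes. The prompt wording, JSON schema, and mocked reply are illustrative assumptions, not the paper's actual interface; a real pipeline would send `LAYOUT_PROMPT` to an LLM and parse its reply.

```python
import json

# Hypothetical prompt template asking an LLM for a coarse scene layout.
# Field names and units are assumptions for illustration only.
LAYOUT_PROMPT = (
    "List each object in the scene below as a JSON array with fields "
    "'name', 'center' (x, y, z), 'size' (w, h, d), and 'rotation_y' (degrees).\n"
    "Scene: {description}"
)

def parse_layout(llm_response: str) -> list:
    """Parse the LLM's JSON reply into per-object layout boxes."""
    objects = json.loads(llm_response)
    for obj in objects:
        # Sanity-check that each box is a full 3D specification.
        assert len(obj["center"]) == 3 and len(obj["size"]) == 3
    return objects

# A mocked reply standing in for an actual LLM call:
mock_reply = json.dumps([
    {"name": "table", "center": [0, 0.4, 0], "size": [1.2, 0.8, 0.8], "rotation_y": 0},
    {"name": "vase", "center": [0, 0.9, 0], "size": [0.2, 0.3, 0.2], "rotation_y": 0},
])
layout = parse_layout(mock_reply)
```

The resulting boxes then serve as the coarse layout that the later stages refine and populate with Gaussians.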
Methodological Contributions
- Layout-Guided Gaussian Representation: The framework employs a layout-guided Gaussian representation that adapts to the geometric constraints of the scene. This technique optimizes shapes and distributions of Gaussians, ensuring high-quality geometric and textural outputs.
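One simple way to picture "layout-guided" Gaussians is to constrain each object's Gaussian centers to stay inside its layout box. The reparameterization below (unconstrained coordinates squashed through `tanh` into the box) is a sketch of that idea under assumed notation, not the paper's exact formulation.

```python
import numpy as np

def constrain_to_box(raw_xyz, box_center, box_size):
    """Map unconstrained 3D parameters into an axis-aligned layout box.

    tanh bounds each coordinate to (-1, 1), which is then scaled by the
    box half-extents and shifted to the box center, so optimized centers
    can never leave the layout box.
    """
    raw_xyz = np.asarray(raw_xyz, dtype=float)
    half = np.asarray(box_size, dtype=float) / 2.0
    return np.asarray(box_center, dtype=float) + half * np.tanh(raw_xyz)

rng = np.random.default_rng(0)
raw = rng.normal(size=(1000, 3)) * 5.0  # arbitrary unconstrained parameters
centers = constrain_to_box(raw, box_center=[0.0, 1.0, 0.0], box_size=[2.0, 2.0, 2.0])
```

Because the constraint is built into the parameterization, gradient updates during optimization automatically respect the layout.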
- Compositional Optimization with Diffusion Priors: The paper proposes a two-level generative approach: multi-view diffusion guides instance-level optimization of each object, while conditioned diffusion enforces scene-wide coherence. This dual approach lets GALA3D model interactions between multiple objects while maintaining semantic and spatial consistency with the textual prompt.
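The two-level objective can be pictured as a weighted sum of per-object and whole-scene terms. The mock loss functions below stand in for score-distillation losses from a multi-view diffusion model (per object) and a layout-conditioned diffusion model (whole scene); the weights and function bodies are illustrative assumptions only.

```python
import numpy as np

def instance_loss(obj_render):
    # Placeholder for a multi-view diffusion (SDS-style) loss on one object.
    return float(np.mean(np.asarray(obj_render) ** 2))

def scene_loss(scene_render):
    # Placeholder for a layout-conditioned diffusion loss on the full scene.
    return float(np.mean(np.abs(np.asarray(scene_render))))

def total_loss(object_renders, scene_render, w_inst=1.0, w_scene=0.5):
    """Instance-level terms keep each object detailed; the scene-level
    term enforces global arrangement and inter-object coherence."""
    per_object = sum(instance_loss(r) for r in object_renders)
    return w_inst * per_object + w_scene * scene_loss(scene_render)

loss = total_loss([np.ones((2, 2))], np.ones((2, 2)))
```

The key design point is that both terms backpropagate into the same Gaussian parameters, so objects stay individually sharp while fitting the composed scene.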
- Layout Refinement: Addressing the discrepancies between LLM-derived layouts and actual scene requirements, GALA3D iteratively refines layouts during the generative process. This ensures the resulting 3D scenes are coherent with real-world geometries and textures.
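A toy illustration of iterative layout refinement: nudge two box centers apart until a simple pairwise-overlap penalty vanishes. The real method optimizes layout parameters jointly with the scene generation; this sketch only conveys the idea that LLM-proposed boxes are adjusted, and all names and step sizes here are assumptions.

```python
import numpy as np

def overlap_1d(c1, s1, c2, s2):
    """Overlap length of two 1D intervals with centers c and sizes s."""
    return max(0.0, (s1 + s2) / 2.0 - abs(c1 - c2))

def refine_centers(c1, c2, size1, size2, step=0.05, iters=100):
    """Push two boxes apart along their least-overlapping axis."""
    c1, c2 = np.array(c1, dtype=float), np.array(c2, dtype=float)
    for _ in range(iters):
        ov = [overlap_1d(c1[i], size1[i], c2[i], size2[i]) for i in range(3)]
        if min(ov) <= 0.0:  # boxes already disjoint on some axis
            break
        axis = int(np.argmin(ov))
        direction = float(np.sign(c1[axis] - c2[axis])) or 1.0
        c1[axis] += step * direction
        c2[axis] -= step * direction
    return c1, c2

# Two unit boxes that start almost fully overlapping on the x-axis:
a, b = refine_centers([0, 0, 0], [0.1, 0, 0], [1, 1, 1], [1, 1, 1])
```

In the paper's setting the analogous updates are driven by the generative losses rather than a hand-written penalty, so refinement also corrects scale and orientation mismatches from the LLM layout.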
Experimental Results and Analysis
The authors conducted comprehensive experiments benchmarked against state-of-the-art text-to-3D generation methods. GALA3D demonstrates superior performance across various scenarios, including single-object and multi-object scene generation. Its ability to generate texturally detailed and geometrically consistent 3D scenes underlines the effectiveness of its novel layout-guided approach.
- Quantitative Excellence: GALA3D outperforms contemporary models in terms of CLIP Scores, indicative of its superior text-image alignment and scene quality. This is evident across different object count scenarios, from single objects to scenes comprising upwards of ten distinct entities.
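For reference, the CLIP Score used in such comparisons is the cosine similarity between CLIP embeddings of a rendered view and the text prompt, often scaled by 100. The sketch below shows the computation with random vectors standing in for real CLIP features (which would come from a pretrained CLIP model).

```python
import numpy as np

def clip_score(image_emb, text_emb):
    """Cosine similarity between image and text embeddings, scaled by 100."""
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return 100.0 * float(image_emb @ text_emb)

rng = np.random.default_rng(0)
img, txt = rng.normal(size=512), rng.normal(size=512)  # stand-ins for CLIP features
score = clip_score(img, txt)
```

Higher scores indicate renders whose CLIP embedding aligns more closely with the prompt, which is why the metric serves as a proxy for text-scene alignment.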
- Qualitative Enhancements: The visual quality of GALA3D-generated scenes is markedly improved, demonstrating lifelike textures and well-defined geometries. The discussed methods notably address common issues in existing frameworks, such as multi-view inconsistencies and geometry distortions.
- User Study: User studies indicate a strong preference for GALA3D's outputs, with participants citing greater scene realism, geometric precision, and fidelity to the text prompts.
Implications and Future Directions
GALA3D sets a noteworthy precedent in employing LLMs to direct 3D scene generation, moving away from purely manual design dependencies. This significantly reduces the technical barrier for 3D content creators and provides a more interactive interface for 3D scene design.
From a theoretical perspective, integrating layout priors and adaptive geometry control unveils new pathways in representation learning, particularly in merging natural language processing with computer vision for scene understanding.
Practically, GALA3D opens avenues for applications in virtual reality, gaming, and digital content creation where high-quality 3D assets are paramount. Future research could further explore refinement strategies for layout interpretation and investigate real-world application integration to harness GALA3D’s capabilities fully.
In conclusion, GALA3D exemplifies a significant step in text-to-3D generation, pushing the boundaries of how natural language and 3D scene synthesis can coalesce to produce intricate digital spaces.