
Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation (2412.21117v2)

Published 30 Dec 2024 in cs.CV

Abstract: In this work, we introduce Prometheus, a 3D-aware latent diffusion model for text-to-3D generation at both object and scene levels in seconds. We formulate 3D scene generation as multi-view, feed-forward, pixel-aligned 3D Gaussian generation within the latent diffusion paradigm. To ensure generalizability, we build our model upon pre-trained text-to-image generation model with only minimal adjustments, and further train it using a large number of images from both single-view and multi-view datasets. Furthermore, we introduce an RGB-D latent space into 3D Gaussian generation to disentangle appearance and geometry information, enabling efficient feed-forward generation of 3D Gaussians with better fidelity and geometry. Extensive experimental results demonstrate the effectiveness of our method in both feed-forward 3D Gaussian reconstruction and text-to-3D generation. Project page: https://freemty.github.io/project-prometheus/

Summary

  • The paper introduces a novel feed-forward approach that uses 3D-aware latent diffusion with 2D image priors to generate 3D scenes from textual descriptions in roughly 8 seconds.
  • The method employs a two-stage training process combining a 3D Gaussian VAE for multi-view fusion and a geometry-aware denoiser initialized from Stable Diffusion.
  • Practical applications include rapid 3D asset creation and virtual environment prototyping, though challenges remain in achieving full multi-view consistency and fine detail rendering.

Prometheus (2412.21117) introduces a novel approach for efficient, feed-forward text-to-3D scene generation using a 3D-aware latent diffusion model built upon 2D image priors. The core idea is to leverage large-scale 2D data and pre-trained text-to-image models (specifically Stable Diffusion) to enable rapid 3D generation in a feed-forward manner, contrasting with computationally intensive optimization-based methods.

The method formulates 3D scene generation as predicting pixel-aligned 3D Gaussian primitives from multiple views within a latent diffusion framework. This is achieved through a two-stage training process:

  1. Stage 1: 3D Gaussian Variational Autoencoder (GS-VAE): This stage trains an autoencoder to compress multi-view (or single-view) RGB-D images into a compact latent space and decode them into pixel-aligned 3D Gaussians.
    • Encoding: Given multi-view RGB images and their estimated monocular depth maps (obtained using an off-the-shelf model such as DepthAnything-V2), a pre-trained and frozen Stable Diffusion image encoder encodes the RGB and depth information independently into multi-view latent representations $\mathbf{Z}$. The encoded RGB and depth latents are concatenated.
    • Multi-View Fusion: To integrate information across views, a multi-view transformer is employed. Camera poses for each view, represented using Plücker coordinates, are injected into this transformer along with the multi-view latent codes. The transformer outputs a fused latent code $\tilde{\mathbf{Z}}$ that aggregates cross-view context. Its weights are initialized from a pre-trained model (e.g., RayDiff).
    • Decoding: A decoder, adapted from a pre-trained Stable Diffusion image decoder by adjusting channel numbers, takes the raw multi-view latent codes ($\mathbf{Z}$), the fused latent code ($\tilde{\mathbf{Z}}$), and the camera ray maps ($\mathbf{R}$) as input. It outputs pixel-aligned 3D Gaussians $\mathbf{F}$ for each view. Each Gaussian is parameterized by depth, rotation (quaternion), scale, opacity, and spherical harmonics coefficients, totaling 12 channels ($C_G = 12$).
    • Aggregation & Loss: The pixel-aligned Gaussians from multiple views are aggregated into a single scene-level 3D Gaussian representation GG. This 3D representation can be rendered from arbitrary viewpoints to obtain RGB images (I^\hat{I}) and depth maps (D^\hat{D}). The GS-VAE is trained using a loss function combining MSE and VGG perceptual loss on the rendered RGB images (LrenderL_{render}) and a scale-invariant depth loss on the rendered depth maps (LdepthL_{depth}) against the estimated depth maps. The total loss is L(Ï•)=λ1Lmse+λ2Lvgg+λ3LdepthL(\phi) = \lambda_{1} L_{mse} + \lambda_{2} L_{vgg} +\lambda_{3} L_{depth}. The GS-VAE is trained on 8 A800 GPUs for approximately 4 days.
  2. Stage 2: Geometry-Aware Multi-View Denoiser (MV-LDM): This stage trains a latent diffusion model that generates the multi-view RGB-D latent codes $\mathbf{Z}$ conditioned on text prompts and camera poses.
    • Training: A continuous-time denoising diffusion process is used. A learnable denoiser $G_\theta$ predicts the clean latent $\mathbf{Z}_0$ from noisy latents $\mathbf{Z}_t$. The model is trained with a denoising score matching objective $L(\theta) = \mathbb{E}_{\mathbf{Z}, \mathbf{R}, y, \sigma_t} \left[ \lambda(\sigma_t) \, \| \hat{\mathbf{Z}}_0 - \mathbf{Z}_0 \|_2^2 \right]$ (sketched in code after this list).
    • Architecture: The denoiser is based on a UNet initialized from a pre-trained Stable Diffusion 2.1 UNet. Self-attention blocks are replaced with 3D cross-view self-attention blocks to handle multi-view inputs. Text conditioning ($y$) is incorporated via cross-attention, and camera pose conditioning ($\mathbf{R}$) is performed by concatenating ray maps with the noisy latents.
    • Noise Levels: The paper highlights the importance of noise levels for learning global structure and multi-view consistency. Different noise distribution parameters ($P_{mean}$, $P_{std}$) are used for multi-view training than for single-view training.
    • Dataset: The model is trained on a large, diverse dataset combining single-view images (like SAM-1B with captions) and various multi-view datasets spanning object-centric, indoor, outdoor, and driving scenes (MVImgNet, DL3DV-10K, Objaverse, ACID, RealEstate10K, KITTI, KITTI-360, nuScenes, Waymo). This training utilizes 32 A800 GPUs for about 7 days.
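
The two training objectives above can be summarized in a short PyTorch-style sketch. This is a minimal illustration, not the authors' implementation: the tensor shapes (batch, views, channels, height, width), the `vgg_loss` callable, the particular scale-invariant depth formulation, and the loss weights are all assumptions.

```python
import torch
import torch.nn.functional as F

def scale_invariant_depth_loss(pred, target, eps=1e-6):
    """One common scale-invariant formulation (log-depth variant);
    the paper's exact formulation may differ."""
    d = torch.log(pred.clamp_min(eps)) - torch.log(target.clamp_min(eps))
    return (d ** 2).mean() - 0.5 * d.mean() ** 2

def gs_vae_loss(rendered_rgb, target_rgb, rendered_depth, pseudo_depth,
                vgg_loss, lambdas=(1.0, 0.5, 0.5)):
    """Stage 1 (GS-VAE): L(phi) = l1*L_mse + l2*L_vgg + l3*L_depth.
    `vgg_loss` is an assumed perceptual-loss callable; `lambdas` are placeholders."""
    l_mse = F.mse_loss(rendered_rgb, target_rgb)
    l_vgg = vgg_loss(rendered_rgb, target_rgb)
    l_depth = scale_invariant_depth_loss(rendered_depth, pseudo_depth)
    l1, l2, l3 = lambdas
    return l1 * l_mse + l2 * l_vgg + l3 * l_depth

def mv_ldm_loss(denoiser, z0, ray_maps, text_emb, sigma, weight_fn):
    """Stage 2 (MV-LDM) denoising score matching:
    L(theta) = E[ lambda(sigma_t) * ||Z0_hat - Z0||_2^2 ].
    `z0` is assumed to have shape (B, V, C, H, W)."""
    noise = torch.randn_like(z0)
    zt = z0 + sigma.view(-1, 1, 1, 1, 1) * noise        # noisy multi-view latents Z_t
    z0_hat = denoiser(zt, sigma, ray_maps, text_emb)    # predict the clean latents Z_0
    return (weight_fn(sigma).view(-1, 1, 1, 1, 1) * (z0_hat - z0) ** 2).mean()
```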

Feed-Forward Text-to-3D Generation:

During inference, the process is feed-forward:

  1. Start with randomly sampled Gaussian noise $\mathbf{Z}_T$.
  2. Use the trained MV-LDM denoiser $G_\theta$ to iteratively denoise the latents, conditioned on the text prompt ($y$) and desired camera poses ($\mathbf{R}$), yielding the multi-view RGB-D latents $\mathbf{Z}$.
  3. Feed $\mathbf{Z}$ and $\mathbf{R}$ to the cross-view transformer ($C$) to obtain the fused latent $\tilde{\mathbf{Z}}$.
  4. Use the GS-VAE decoder ($D$) with $\mathbf{Z}$, $\tilde{\mathbf{Z}}$, and $\mathbf{R}$ to output the pixel-aligned 3D Gaussians.
  5. Aggregate the pixel-aligned Gaussians into the final scene-level 3D Gaussian representation $G$.

This pipeline allows generating a 3D scene in approximately 8 seconds, significantly faster than optimization-based methods.
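
A schematic version of this feed-forward loop is sketched below. The module names (`mv_ldm`, `cross_view_transformer`, `gs_vae_decoder`), the latent shape, the Euler-style sampler, and the noise schedule handling are assumptions for illustration, not the paper's exact sampler.

```python
import torch

@torch.no_grad()
def text_to_3d(mv_ldm, cross_view_transformer, gs_vae_decoder,
               text_emb, ray_maps, sigmas, latent_shape=(8, 32, 32)):
    """Feed-forward generation: denoise multi-view RGB-D latents, fuse across
    views, then decode pixel-aligned 3D Gaussians. `sigmas` is assumed to be a
    decreasing noise schedule ending at 0; `latent_shape` is an assumed (C, H, W)."""
    B, V = ray_maps.shape[:2]
    z = torch.randn(B, V, *latent_shape) * sigmas[0]          # 1. random noise Z_T

    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):    # 2. iterative denoising
        z0_hat = mv_ldm(z, sigma, ray_maps, text_emb)         #    predict clean latents
        z = z0_hat + (sigma_next / sigma) * (z - z0_hat)      #    Euler-style update

    z_fused = cross_view_transformer(z, ray_maps)             # 3. fused latent Z~
    gaussians = gs_vae_decoder(z, z_fused, ray_maps)          # 4. pixel-aligned Gaussians,
                                                              #    assumed shape (B, V, H*W, 12)
    return gaussians.flatten(1, 2)                            # 5. aggregate into scene G
```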

Inference Strategy:

  • Classifier-Free Guidance (CFG) is used during sampling to guide the generation towards the text prompt and camera poses.
  • A hybrid sampling guidance strategy is adopted (inspired by HarmonyView) to balance text alignment and multi-view consistency, addressing the issue where naive CFG can compromise consistency.
  • CFG-rescale is also applied to prevent over-saturation (a minimal sketch of the guidance computation follows below).
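
The sketch below shows plain classifier-free guidance combined with the commonly used CFG-rescale trick (shrinking the guided prediction's standard deviation back toward that of the conditional prediction). The hybrid guidance split from HarmonyView is not reproduced here, and the guidance and rescale factors are illustrative assumptions.

```python
import torch

def guided_prediction(z0_cond, z0_uncond, guidance_scale=7.5, rescale=0.7):
    """Classifier-free guidance with CFG-rescale on the predicted clean latents.
    `guidance_scale` and `rescale` are placeholder values."""
    z0_cfg = z0_uncond + guidance_scale * (z0_cond - z0_uncond)

    # CFG-rescale: match the guided prediction's per-sample std to the
    # conditional prediction's std to counteract over-saturation, then blend.
    reduce_dims = tuple(range(1, z0_cond.dim()))
    std_cond = z0_cond.std(dim=reduce_dims, keepdim=True)
    std_cfg = z0_cfg.std(dim=reduce_dims, keepdim=True)
    z0_rescaled = z0_cfg * (std_cond / (std_cfg + 1e-8))
    return rescale * z0_rescaled + (1.0 - rescale) * z0_cfg
```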

Implementation Considerations:

  • Computational Resources: Training requires substantial GPU resources (e.g., 8 A800s for GS-VAE, 32 A800s for MV-LDM) and time (several days for each stage).
  • Data Pipeline: Processing a large, diverse dataset including multi-view data and generating pseudo-depth maps on the fly requires a robust data pipeline.
  • Pre-trained Models: Relies heavily on pre-trained Stable Diffusion components for both encoder and decoder, and a multi-view transformer initialized from another model (RayDiff). This leverages existing strong 2D priors but means performance is dependent on the quality and generalizability of these base models.
  • Memory: Handling multi-view latents and pixel-aligned Gaussians ($N \times H \times W \times C_G$) can be memory intensive, especially at high resolutions or with many views. The latent-space approach helps manage this compared to working directly in pixel space.
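
As a rough back-of-the-envelope estimate (view counts and resolutions below are assumed for illustration, not taken from the paper), the raw pixel-aligned Gaussian tensor alone scales linearly with the number of views and quadratically with resolution:

```python
def gaussian_mib(n_views, h, w, c_g=12, bytes_per_float=4):
    """Size in MiB of an N x H x W x C_G float32 Gaussian parameter tensor."""
    return n_views * h * w * c_g * bytes_per_float / 2**20

print(gaussian_mib(8, 256, 256))    # 24.0 MiB
print(gaussian_mib(16, 512, 512))   # 192.0 MiB
```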

Practical Applications:

  • Rapid 3D Asset Generation: Enables artists and developers to quickly generate initial 3D models or scene layouts from text prompts for use in games, VR/AR, simulations, or creative content creation pipelines.
  • Populating Virtual Environments: Can be used to quickly generate diverse objects and scenes to populate large virtual worlds or synthetic datasets.
  • Concept Prototyping: Facilitates rapid prototyping of 3D concepts based on textual descriptions.

Limitations:

  • Multi-view inconsistency: Despite efforts to improve it, the model can still exhibit inconsistencies in rendered views, especially under large rotations or extreme camera poses. This is attributed to generating in a latent space without explicit 3D structure constraints during diffusion.
  • Text misalignment: Occasional failure to follow specific text details (e.g., object color). The authors suggest this might be due to the joint training process interfering with the pre-trained text embedding layer.
  • High-frequency details: May struggle to render fine, high-frequency geometric structures accurately.

Overall, Prometheus offers a significant step towards efficient and generalizable feed-forward text-to-3D generation by effectively combining strong 2D priors from latent diffusion models with multi-view learning and 3D Gaussian Splatting representation. Its speed makes it particularly appealing for applications requiring rapid 3D content creation.
