- The paper introduces a novel feed-forward approach that uses 3D-aware latent diffusion with 2D image priors to generate 3D scenes from textual descriptions in roughly 8 seconds.
- The method employs a two-stage training process combining a 3D Gaussian VAE for multi-view fusion and a geometry-aware denoiser initialized from Stable Diffusion.
- Practical applications include rapid 3D asset creation and virtual environment prototyping, though challenges remain in achieving full multi-view consistency and fine detail rendering.
Prometheus (2412.21117) introduces a novel approach for efficient, feed-forward text-to-3D scene generation using a 3D-aware latent diffusion model built upon 2D image priors. The core idea is to leverage large-scale 2D data and pre-trained text-to-image models (specifically Stable Diffusion) to enable rapid 3D generation in a feed-forward manner, contrasting with computationally intensive optimization-based methods.
The method formulates 3D scene generation as predicting pixel-aligned 3D Gaussian primitives from multiple views within a latent diffusion framework. This is achieved through a two-stage training process:
- Stage 1: 3D Gaussian Variational Autoencoder (GS-VAE): This stage trains an autoencoder to compress multi-view (or single-view) RGB-D images into a compact latent space and decode them into pixel-aligned 3D Gaussians.
- Encoding: Given multi-view RGB images and their estimated monocular depth maps (obtained using an off-the-shelf model like DepthAnything-V2), a pre-trained and frozen Stable Diffusion image encoder is used to encode the RGB and depth information independently into multi-view latent representations $\mathbf{Z}$. The encoded RGB and depth latents are concatenated.
- Multi-View Fusion: To integrate information across views, a multi-view transformer is employed. Camera poses for each view, represented using Plücker coordinates, are injected into this transformer along with the multi-view latent codes. The transformer outputs a fused latent code $\tilde{\mathbf{Z}}$ that aggregates cross-view context. Its weights are initialized from a pre-trained model (e.g., RayDiff).
- Decoding: A decoder, adapted from a pre-trained Stable Diffusion image decoder by adjusting channel numbers, takes the raw multi-view latent codes ($\mathbf{Z}$), the fused latent code ($\tilde{\mathbf{Z}}$), and the camera ray maps ($\mathbf{R}$) as input. It outputs pixel-aligned 3D Gaussians $\mathbf{F}$ for each view. Each Gaussian is parameterized by depth, rotation (quaternion), scale, opacity, and spherical harmonics coefficients, totaling 12 channels ($C_G = 12$).
- Aggregation & Loss: The pixel-aligned Gaussians from multiple views are aggregated into a single scene-level 3D Gaussian representation G. This 3D representation can be rendered from arbitrary viewpoints to obtain RGB images (I^) and depth maps (D^). The GS-VAE is trained using a loss function combining MSE and VGG perceptual loss on the rendered RGB images (Lrender) and a scale-invariant depth loss on the rendered depth maps (Ldepth) against the estimated depth maps. The total loss is L(ϕ)=λ1Lmse+λ2Lvgg+λ3Ldepth. The GS-VAE is trained on 8 A800 GPUs for approximately 4 days.
- Stage 2: Geometry-Aware Multi-View Denoiser (MV-LDM): This stage trains a latent diffusion model that generates the multi-view RGB-D latent codes $\mathbf{Z}$ conditioned on text prompts and camera poses.
- Training: A continuous-time denoising diffusion process is used. A learnable denoiser $G_\theta$ predicts the clean latent $\hat{\mathbf{Z}}_0 = G_\theta(\mathbf{Z}_t, \sigma_t, y, \mathbf{R})$ from noisy latents $\mathbf{Z}_t$. The model is trained using a denoising score matching objective $\mathcal{L}(\theta)=\mathbb{E}_{\mathbf{Z}, \mathbf{R}, y, \sigma_t} \left[\lambda(\sigma_t) \, \| \hat{\mathbf{Z}}_0 - \mathbf{Z}_0 \|_2^2 \right]$ (a minimal training-step sketch follows this list).
- Architecture: The denoiser $G_\theta$ is based on a UNet architecture, initialized from a pre-trained Stable Diffusion 2.1 UNet. Self-attention blocks are replaced with 3D cross-view self-attention blocks to handle multi-view inputs. Text conditioning ($y$) is incorporated via cross-attention, and camera pose conditioning ($\mathbf{R}$) is done by concatenating ray maps with the noisy latents.
- Noise Levels: The paper highlights the importance of noise levels for learning global structure and multi-view consistency. Different noise distribution parameters ($P_\text{mean}$, $P_\text{std}$) are used for multi-view training than for single-view training.
- Dataset: The model is trained on a large, diverse dataset combining single-view images (like SAM-1B with captions) and various multi-view datasets spanning object-centric, indoor, outdoor, and driving scenes (MVImgNet, DL3DV-10K, Objaverse, ACID, RealEstate10K, KITTI, KITTI-360, nuScenes, Waymo). This training utilizes 32 A800 GPUs for about 7 days.
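As a rough illustration of the Stage 2 objective, the sketch below shows a single denoising-score-matching training step with EDM-style log-normal noise-level sampling. The $(P_\text{mean}, P_\text{std})$ defaults, the weighting $\lambda(\sigma_t)$, and the denoiser interface are placeholders rather than the paper's exact settings.

```python
import torch

def sample_sigmas(batch_size, p_mean=-1.2, p_std=1.2, device="cpu"):
    """Log-normal noise-level sampling (EDM-style); the defaults are placeholders."""
    return torch.exp(p_mean + p_std * torch.randn(batch_size, device=device))

def mvldm_training_step(denoiser, z0, ray_maps, text_emb):
    """One denoising-score-matching step: predict the clean multi-view latent Z_0
    from the noised Z_t, conditioned on text embeddings and camera ray maps.
    z0 is assumed to have shape (B, V, C, H, W)."""
    b = z0.shape[0]
    sigma = sample_sigmas(b, device=z0.device).view(b, 1, 1, 1, 1)
    noise = torch.randn_like(z0)
    zt = z0 + sigma * noise                           # noisy latents Z_t
    z0_hat = denoiser(zt, sigma, ray_maps, text_emb)  # predicted clean latents
    weight = 1.0 / sigma ** 2                         # stand-in for lambda(sigma_t)
    return (weight * (z0_hat - z0) ** 2).mean()
```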
Feed-Forward Text-to-3D Generation:
During inference, the process is feed-forward:
- Start with randomly sampled Gaussian noise $\mathbf{Z}_T$.
- Use the trained MV-LDM denoiser $G_\theta$ to iteratively denoise the latents, conditioned on the text prompt ($y$) and desired camera poses ($\mathbf{R}$), yielding the multi-view RGB-D latents $\mathbf{Z}$.
- Feed $\mathbf{Z}$ and $\mathbf{R}$ to the multi-view transformer to get the fused latent $\tilde{\mathbf{Z}}$.
- Use the GS-VAE decoder with $\mathbf{Z}$, $\tilde{\mathbf{Z}}$, and $\mathbf{R}$ to output the pixel-aligned 3D Gaussians.
- Aggregate the pixel-aligned Gaussians into the final scene-level 3D Gaussian representation $\mathbf{G}$.
This pipeline allows generating a 3D scene in approximately 8 seconds, significantly faster than optimization-based methods.
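A high-level sketch of how these feed-forward steps might be wired together is shown below; the module interfaces (`denoise_step`, `fuse`, `decode`, `aggregate`), the latent shape, and the sampler schedule are hypothetical placeholders, not the released implementation.

```python
import torch

@torch.no_grad()
def text_to_3d(mvldm, gs_vae, text_emb, ray_maps, sigmas):
    """Feed-forward inference: denoise multi-view RGB-D latents with the MV-LDM,
    then decode them into scene-level 3D Gaussians with the GS-VAE.
    All module interfaces here are illustrative placeholders."""
    b, v = ray_maps.shape[:2]

    # 1. Start from pure Gaussian noise in the multi-view latent space
    #    (8 latent channels = 4 RGB + 4 depth latents; resolution assumed).
    z = sigmas[0] * torch.randn(b, v, 8, 32, 32, device=ray_maps.device)

    # 2. Iteratively denoise, conditioned on text and camera ray maps.
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        z = mvldm.denoise_step(z, sigma, sigma_next, text_emb, ray_maps)

    # 3. Fuse cross-view information with the multi-view transformer.
    z_fused = gs_vae.fuse(z, ray_maps)

    # 4. Decode raw + fused latents into pixel-aligned 3D Gaussians per view.
    gaussians_per_view = gs_vae.decode(z, z_fused, ray_maps)

    # 5. Aggregate per-view Gaussians into one scene-level representation.
    return gs_vae.aggregate(gaussians_per_view)
```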
Inference Strategy:
- Classifier-Free Guidance (CFG) is used during sampling to guide the generation towards the text prompt and camera poses.
- A hybrid sampling guidance strategy is adopted (inspired by HarmonyView) to balance text alignment and multi-view consistency, addressing the issue where naive CFG can compromise consistency.
- CFG-rescale is also applied to prevent over-saturation.
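The CFG-rescale step can be illustrated with a short sketch: the guided prediction's per-sample standard deviation is pulled back toward that of the conditional prediction and then blended with the plain CFG output. The guidance scale and rescale factor below are placeholders, and the hybrid text/camera guidance weighting is omitted for brevity.

```python
import torch

def cfg_with_rescale(eps_cond, eps_uncond, guidance_scale=7.5, rescale=0.7):
    """Classifier-free guidance followed by CFG-rescale (values are placeholders)."""
    # Standard CFG: extrapolate from the unconditional toward the conditional prediction.
    eps_cfg = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # Rescale the guided prediction's std to match the conditional prediction's std,
    # counteracting the over-saturation that large guidance scales can cause.
    dims = list(range(1, eps_cond.ndim))
    std_cond = eps_cond.std(dim=dims, keepdim=True)
    std_cfg = eps_cfg.std(dim=dims, keepdim=True)
    eps_rescaled = eps_cfg * (std_cond / (std_cfg + 1e-8))

    # Blend rescaled and plain CFG outputs to avoid over-correcting.
    return rescale * eps_rescaled + (1.0 - rescale) * eps_cfg
```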
Implementation Considerations:
- Computational Resources: Training requires substantial GPU resources (e.g., 8 A800s for GS-VAE, 32 A800s for MV-LDM) and time (several days for each stage).
- Data Pipeline: Processing a large, diverse dataset including multi-view data and generating pseudo-depth maps on the fly requires a robust data pipeline.
- Pre-trained Models: Relies heavily on pre-trained Stable Diffusion components for both encoder and decoder, and a multi-view transformer initialized from another model (RayDiff). This leverages existing strong 2D priors but means performance is dependent on the quality and generalizability of these base models.
- Memory: Handling multi-view latents and pixel-aligned Gaussians ($N \times H \times W \times C_G$) can be memory intensive, especially at high resolutions or with many views. The latent space approach helps manage this compared to working directly in pixel space.
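For a rough sense of scale, here is a back-of-the-envelope estimate of the pixel-aligned Gaussian parameter maps alone, assuming 8 views at 256×256 with $C_G = 12$ in fp32 (illustrative numbers, not the paper's configuration):

```python
# Memory for the per-view Gaussian parameter maps (N x H x W x C_G, fp32).
# The view count and resolution are assumptions chosen for illustration.
n_views, height, width, c_g, bytes_per_float = 8, 256, 256, 12, 4
gaussian_map_bytes = n_views * height * width * c_g * bytes_per_float
print(f"{gaussian_map_bytes / 2**20:.1f} MiB")  # -> 24.0 MiB per scene
```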
Practical Applications:
- Rapid 3D Asset Generation: Enables artists and developers to quickly generate initial 3D models or scene layouts from text prompts for use in games, VR/AR, simulations, or creative content creation pipelines.
- Populating Virtual Environments: Can be used to quickly generate diverse objects and scenes to populate large virtual worlds or synthetic datasets.
- Concept Prototyping: Facilitates rapid prototyping of 3D concepts based on textual descriptions.
Limitations:
- Multi-view inconsistency: Despite efforts to improve it, the model can still exhibit inconsistencies in rendered views, especially under large rotations or extreme camera poses. This is attributed to generating in a latent space without explicit 3D structure constraints during diffusion.
- Text misalignment: Occasional failure to follow specific text details (e.g., object color). The authors suggest this might be due to the joint training process interfering with the pre-trained text embedding layer.
- High-frequency details: May struggle to render fine, high-frequency geometric structures accurately.
Overall, Prometheus offers a significant step towards efficient and generalizable feed-forward text-to-3D generation by effectively combining strong 2D priors from latent diffusion models with multi-view learning and a 3D Gaussian Splatting representation. Its speed makes it particularly appealing for applications requiring rapid 3D content creation.