Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation (2412.21117v2)

Published 30 Dec 2024 in cs.CV

Abstract: In this work, we introduce Prometheus, a 3D-aware latent diffusion model for text-to-3D generation at both object and scene levels in seconds. We formulate 3D scene generation as multi-view, feed-forward, pixel-aligned 3D Gaussian generation within the latent diffusion paradigm. To ensure generalizability, we build our model upon pre-trained text-to-image generation model with only minimal adjustments, and further train it using a large number of images from both single-view and multi-view datasets. Furthermore, we introduce an RGB-D latent space into 3D Gaussian generation to disentangle appearance and geometry information, enabling efficient feed-forward generation of 3D Gaussians with better fidelity and geometry. Extensive experimental results demonstrate the effectiveness of our method in both feed-forward 3D Gaussian reconstruction and text-to-3D generation. Project page: https://freemty.github.io/project-prometheus/

Summary

  • The paper introduces a novel feed-forward approach that uses 3D-aware latent diffusion with 2D image priors to generate 3D scenes from textual descriptions in roughly 8 seconds.
  • The method employs a two-stage training process combining a 3D Gaussian VAE for multi-view fusion and a geometry-aware denoiser initialized from Stable Diffusion.
  • Practical applications include rapid 3D asset creation and virtual environment prototyping, though challenges remain in achieving full multi-view consistency and fine detail rendering.

Prometheus (2412.21117) introduces a novel approach for efficient, feed-forward text-to-3D scene generation using a 3D-aware latent diffusion model built upon 2D image priors. The core idea is to leverage large-scale 2D data and pre-trained text-to-image models (specifically Stable Diffusion) to enable rapid 3D generation in a feed-forward manner, contrasting with computationally intensive optimization-based methods.

The method formulates 3D scene generation as predicting pixel-aligned 3D Gaussian primitives from multiple views within a latent diffusion framework. This is achieved through a two-stage training process:

  1. Stage 1: 3D Gaussian Variational Autoencoder (GS-VAE): This stage trains an autoencoder to compress multi-view (or single-view) RGB-D images into a compact latent space and decode them into pixel-aligned 3D Gaussians.
    • Encoding: Given multi-view RGB images and their estimated monocular depth maps (obtained using an off-the-shelf model like DepthAnything-V2), a pre-trained and frozen Stable Diffusion image encoder is used to encode both RGB and depth information independently into multi-view latent representations $\mathbf{Z}$. The encoded RGB and depth latents are concatenated.
    • Multi-View Fusion: To integrate information across views, a multi-view transformer is employed. Camera poses for each view, represented using Plücker coordinates, are injected into this transformer along with the multi-view latent codes. This transformer outputs a fused latent code $\tilde{\mathbf{Z}}$ that aggregates cross-view context. The weights of this transformer are initialized from a pre-trained model (e.g., RayDiff).
    • Decoding: A decoder, adapted from a pre-trained Stable Diffusion image decoder by adjusting channel numbers, takes the raw multi-view latent codes ($\mathbf{Z}$), the fused latent code ($\tilde{\mathbf{Z}}$), and the camera ray maps ($\mathbf{R}$) as input. It outputs pixel-aligned 3D Gaussians $\mathbf{F}$ for each view. Each Gaussian is parameterized by depth, rotation (quaternion), scale, opacity, and spherical harmonics coefficients, totaling 12 channels ($C_G = 12$).
    • Aggregation & Loss: The pixel-aligned Gaussians from multiple views are aggregated into a single scene-level 3D Gaussian representation GG. This 3D representation can be rendered from arbitrary viewpoints to obtain RGB images (I^\hat{I}) and depth maps (D^\hat{D}). The GS-VAE is trained using a loss function combining MSE and VGG perceptual loss on the rendered RGB images (LrenderL_{render}) and a scale-invariant depth loss on the rendered depth maps (LdepthL_{depth}) against the estimated depth maps. The total loss is L(ϕ)=λ1Lmse+λ2Lvgg+λ3LdepthL(\phi) = \lambda_{1} L_{mse} + \lambda_{2} L_{vgg} +\lambda_{3} L_{depth}. The GS-VAE is trained on 8 A800 GPUs for approximately 4 days.
  2. Stage 2: Geometry-Aware Multi-View Denoiser (MV-LDM): This stage trains a latent diffusion model that generates the multi-view RGB-D latent codes $\mathbf{Z}$ conditioned on text prompts and camera poses.
    • Training: A continuous-time denoising diffusion process is used. A learnable denoiser $G_\theta$ predicts the clean latent $\mathbf{Z}_0$ from noisy latents $\mathbf{Z}_t$. The model is trained using a denoising score matching objective $L(\theta)=\mathbb{E}_{\mathbf{Z}, \mathbf{R}, y, \sigma_t} \left[\lambda(\sigma_t) \| \hat{\mathbf{Z}}_0 - \mathbf{Z}_0 \|_2^2 \right]$ (a sketch of this objective also follows the list).
    • Architecture: The denoiser $G_\theta$ is based on a UNet architecture, initialized from a pre-trained Stable Diffusion 2.1 UNet. Self-attention blocks are replaced with 3D cross-view self-attention blocks to handle multi-view inputs. Text conditioning ($y$) is incorporated via cross-attention, and camera pose conditioning ($\mathbf{R}$) is done by concatenating ray maps with the noisy latents.
    • Noise Levels: The paper highlights the importance of noise levels for learning global structure and multi-view consistency. Different noise distribution parameters ($P_{mean}$, $P_{std}$) are used for multi-view training compared to single-view training.
    • Dataset: The model is trained on a large, diverse dataset combining single-view images (like SAM-1B with captions) and various multi-view datasets spanning object-centric, indoor, outdoor, and driving scenes (MVImgNet, DL3DV-10K, Objaverse, ACID, RealEstate10K, KITTI, KITTI-360, nuScenes, Waymo). This training utilizes 32 A800 GPUs for about 7 days.
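
To make the two training objectives concrete, here is a minimal PyTorch-style sketch of the Stage 1 GS-VAE loss. The helper `vgg_loss_fn`, the λ weights, and the scale-invariant depth formulation are illustrative assumptions, not the paper's settings; renders and pseudo-depths are assumed to be batched tensors of matching shape.

```python
import torch
import torch.nn.functional as F

def gs_vae_loss(rendered_rgb, target_rgb, rendered_depth, pseudo_depth,
                vgg_loss_fn, lambdas=(1.0, 0.5, 0.5)):
    """Stage 1 objective: L(phi) = l1 * L_mse + l2 * L_vgg + l3 * L_depth.

    rendered_rgb / rendered_depth come from splatting the predicted
    pixel-aligned Gaussians at the training views; pseudo_depth is the
    monocular estimate (e.g. from DepthAnything-V2). The lambda values
    here are placeholders.
    """
    l1, l2, l3 = lambdas
    # Photometric MSE on the rendered RGB views.
    loss_mse = F.mse_loss(rendered_rgb, target_rgb)
    # Perceptual (VGG-feature) loss on the same renders.
    loss_vgg = vgg_loss_fn(rendered_rgb, target_rgb).mean()
    # Scale-invariant depth loss: compare log-depths up to a per-image shift.
    log_diff = torch.log(rendered_depth.clamp(min=1e-6)) \
             - torch.log(pseudo_depth.clamp(min=1e-6))
    loss_depth = (log_diff - log_diff.mean(dim=(-2, -1), keepdim=True)).pow(2).mean()
    return l1 * loss_mse + l2 * loss_vgg + l3 * loss_depth
```

And a similarly hedged sketch of the Stage 2 denoising score-matching objective. The log-normal noise sampling, the λ(σ) weighting, and the denoiser interface (ray maps concatenated with the noisy latents, text conditioning passed separately) follow the description above, but the exact signature and constants are assumptions.

```python
import torch

def mv_ldm_loss(denoiser, z0, ray_maps, text_emb, p_mean=-1.2, p_std=1.2):
    """Denoising score matching: predict clean latents Z_0 from noisy Z_t.

    z0:       (B, N_views, C, H, W) multi-view RGB-D latents from the frozen encoder.
    ray_maps: per-view camera ray maps, concatenated with the noisy latents.
    p_mean / p_std parameterize a log-normal noise-level distribution
    (placeholder values, not the paper's multi-view settings).
    """
    # Sample one noise level sigma_t per batch element.
    sigma = (p_mean + p_std * torch.randn(z0.shape[0], device=z0.device)).exp()
    sigma = sigma.view(-1, 1, 1, 1, 1)
    # Z_t = Z_0 + sigma * eps.
    zt = z0 + sigma * torch.randn_like(z0)
    # Denoiser predicts Z_0-hat from (Z_t, ray maps), text embedding, and sigma.
    z0_hat = denoiser(torch.cat([zt, ray_maps], dim=2), text_emb, sigma)
    # lambda(sigma)-weighted squared error against the clean latents.
    weight = sigma.flatten() ** -2 + 1.0          # illustrative weighting
    per_sample = (z0_hat - z0).pow(2).flatten(1).mean(dim=1)
    return (weight * per_sample).mean()
```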

Feed-Forward Text-to-3D Generation:

During inference, the process is feed-forward:

  1. Start with randomly sampled Gaussian noise $\mathbf{Z}_T$.
  2. Use the trained MV-LDM ($G_\theta$) to iteratively denoise the latents, conditioned on the text prompt ($y$) and desired camera poses ($\mathbf{R}$), yielding the multi-view RGB-D latents $\mathbf{Z}$.
  3. Feed $\mathbf{Z}$ and $\mathbf{R}$ to the cross-view transformer ($C$) to get the fused latent $\tilde{\mathbf{Z}}$.
  4. Use the GS-VAE decoder ($D$) with $\mathbf{Z}$, $\tilde{\mathbf{Z}}$, and $\mathbf{R}$ to output the pixel-aligned 3D Gaussians.
  5. Aggregate the pixel-aligned Gaussians into the final scene-level 3D Gaussian representation $\mathbf{G}$ (a condensed sketch of this pipeline follows).
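
A condensed, hedged sketch of this feed-forward pipeline; all component functions here are hypothetical stand-ins for the trained MV-LDM sampler, cross-view transformer $C$, GS-VAE decoder $D$, and Gaussian aggregation.

```python
import torch

@torch.no_grad()
def text_to_3d(text_emb, ray_maps, latent_shape,
               mv_ldm_sample, cross_view_transformer, gs_vae_decoder,
               aggregate_gaussians):
    """Feed-forward generation: noise -> multi-view latents -> scene Gaussians G."""
    # 1. Start from randomly sampled Gaussian noise Z_T over all views.
    z_T = torch.randn(latent_shape, device=text_emb.device)
    # 2. Iteratively denoise with the MV-LDM, conditioned on text and camera rays.
    z = mv_ldm_sample(z_T, text_emb, ray_maps)       # multi-view RGB-D latents Z
    # 3. Fuse cross-view context with the multi-view transformer C.
    z_fused = cross_view_transformer(z, ray_maps)    # fused latent Z~
    # 4. Decode per-view, pixel-aligned 3D Gaussians (C_G = 12 channels per pixel).
    gaussians_per_view = gs_vae_decoder(z, z_fused, ray_maps)
    # 5. Aggregate all views into a single scene-level Gaussian representation G.
    return aggregate_gaussians(gaussians_per_view, ray_maps)
```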

This pipeline allows generating a 3D scene in approximately 8 seconds, significantly faster than optimization-based methods.

Inference Strategy:

  • Classifier-Free Guidance (CFG) is used during sampling to guide the generation towards the text prompt and camera poses.
  • A hybrid sampling guidance strategy is adopted (inspired by HarmonyView) to balance text alignment and multi-view consistency, addressing the issue where naive CFG can compromise consistency.
  • CFG-rescale is also applied to prevent over-saturation (see the guidance sketch below).
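
A hedged sketch of the guidance computation at one sampling step: plain text classifier-free guidance plus CFG-rescale. The paper's HarmonyView-inspired hybrid guidance would add a second guidance branch to trade off text alignment against multi-view consistency; it is omitted here, and the scale values are arbitrary.

```python
import torch

def guided_prediction(denoiser, zt, ray_maps, text_emb, null_emb, sigma,
                      cfg_scale=7.5, rescale=0.7):
    """Text CFG plus CFG-rescale for one denoising step (sketch)."""
    x = torch.cat([zt, ray_maps], dim=2)         # pose conditioning via ray maps
    cond = denoiser(x, text_emb, sigma)          # text-conditional prediction
    uncond = denoiser(x, null_emb, sigma)        # unconditional prediction
    guided = uncond + cfg_scale * (cond - uncond)
    # CFG-rescale: match the guided prediction's std to the conditional one,
    # then blend, to counteract over-saturation at high guidance scales.
    dims = list(range(1, cond.dim()))
    std_cond = cond.std(dim=dims, keepdim=True)
    std_guided = guided.std(dim=dims, keepdim=True)
    rescaled = guided * (std_cond / (std_guided + 1e-8))
    return rescale * rescaled + (1.0 - rescale) * guided
```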

Implementation Considerations:

  • Computational Resources: Training requires substantial GPU resources (e.g., 8 A800s for GS-VAE, 32 A800s for MV-LDM) and time (several days for each stage).
  • Data Pipeline: Processing a large, diverse dataset including multi-view data and generating pseudo-depth maps on the fly requires a robust data pipeline.
  • Pre-trained Models: Relies heavily on pre-trained Stable Diffusion components for both encoder and decoder, and a multi-view transformer initialized from another model (RayDiff). This leverages existing strong 2D priors but means performance is dependent on the quality and generalizability of these base models.
  • Memory: Handling multi-view latents and pixel-aligned Gaussians ($N \times H \times W \times C_G$) can be memory intensive, especially for high resolutions or many views. The latent space approach helps manage this compared to working directly in pixel space (a rough estimate follows).
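
As a back-of-the-envelope illustration of the memory point above (the view count, resolution, and precision here are arbitrary assumptions, not the paper's settings):

```python
def gaussian_param_footprint(n_views=8, height=256, width=256,
                             c_g=12, bytes_per_value=4):
    """Raw size of pixel-aligned Gaussian parameters, N x H x W x C_G."""
    n_gaussians = n_views * height * width
    megabytes = n_gaussians * c_g * bytes_per_value / 1e6
    return n_gaussians, megabytes

# 8 views at 256 x 256 with 12 float32 channels -> 524,288 Gaussians and
# roughly 25 MB of raw parameters, before activations, gradients, and
# rendering buffers, which dominate actual training memory.
```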

Practical Applications:

  • Rapid 3D Asset Generation: Enables artists and developers to quickly generate initial 3D models or scene layouts from text prompts for use in games, VR/AR, simulations, or creative content creation pipelines.
  • Populating Virtual Environments: Can be used to quickly generate diverse objects and scenes to populate large virtual worlds or synthetic datasets.
  • Concept Prototyping: Facilitates rapid prototyping of 3D concepts based on textual descriptions.

Limitations:

  • Multi-view inconsistency: Despite efforts to improve it, the model can still exhibit inconsistencies in rendered views, especially under large rotations or extreme camera poses. This is attributed to generating in a latent space without explicit 3D structure constraints during diffusion.
  • Text misalignment: Occasional failure to follow specific text details (e.g., object color). The authors suggest this might be due to the joint training process interfering with the pre-trained text embedding layer.
  • High-frequency details: May struggle to render fine, high-frequency geometric structures accurately.

Overall, Prometheus offers a significant step towards efficient and generalizable feed-forward text-to-3D generation by effectively combining strong 2D priors from latent diffusion models with multi-view learning and a 3D Gaussian Splatting representation. Its speed makes it particularly appealing for applications requiring rapid 3D content creation.
