
Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation (2412.21117v2)

Published 30 Dec 2024 in cs.CV

Abstract: In this work, we introduce Prometheus, a 3D-aware latent diffusion model for text-to-3D generation at both object and scene levels in seconds. We formulate 3D scene generation as multi-view, feed-forward, pixel-aligned 3D Gaussian generation within the latent diffusion paradigm. To ensure generalizability, we build our model upon pre-trained text-to-image generation model with only minimal adjustments, and further train it using a large number of images from both single-view and multi-view datasets. Furthermore, we introduce an RGB-D latent space into 3D Gaussian generation to disentangle appearance and geometry information, enabling efficient feed-forward generation of 3D Gaussians with better fidelity and geometry. Extensive experimental results demonstrate the effectiveness of our method in both feed-forward 3D Gaussian reconstruction and text-to-3D generation. Project page: https://freemty.github.io/project-prometheus/

Summary

  • The paper introduces a novel feed-forward approach that uses 3D-aware latent diffusion with 2D image priors to generate 3D scenes from textual descriptions in roughly 8 seconds.
  • The method employs a two-stage training process combining a 3D Gaussian VAE for multi-view fusion and a geometry-aware denoiser initialized from Stable Diffusion.
  • Practical applications include rapid 3D asset creation and virtual environment prototyping, though challenges remain in achieving full multi-view consistency and fine detail rendering.

Prometheus (2412.21117) introduces a novel approach for efficient, feed-forward text-to-3D scene generation using a 3D-aware latent diffusion model built upon 2D image priors. The core idea is to leverage large-scale 2D data and pre-trained text-to-image models (specifically Stable Diffusion) to enable rapid 3D generation in a feed-forward manner, contrasting with computationally intensive optimization-based methods.

The method formulates 3D scene generation as predicting pixel-aligned 3D Gaussian primitives from multiple views within a latent diffusion framework. This is achieved through a two-stage training process:

  1. Stage 1: 3D Gaussian Variational Autoencoder (GS-VAE): This stage trains an autoencoder to compress multi-view (or single-view) RGB-D images into a compact latent space and decode them into pixel-aligned 3D Gaussians.
    • Encoding: Given multi-view RGB images and their estimated monocular depth maps (obtained using an off-the-shelf model such as DepthAnything-V2), a pre-trained and frozen Stable Diffusion image encoder encodes the RGB and depth information independently into multi-view latent representations $\mathbf{Z}$. The encoded RGB and depth latents are concatenated.
    • Multi-View Fusion: To integrate information across views, a multi-view transformer is employed. Camera poses for each view, represented using Plücker coordinates, are injected into this transformer along with the multi-view latent codes. The transformer outputs a fused latent code $\tilde{\mathbf{Z}}$ that aggregates cross-view context. Its weights are initialized from a pre-trained model (e.g., RayDiff).
    • Decoding: A decoder, adapted from a pre-trained Stable Diffusion image decoder by adjusting channel numbers, takes the raw multi-view latent codes ($\mathbf{Z}$), the fused latent code ($\tilde{\mathbf{Z}}$), and the camera ray maps ($\mathbf{R}$) as input. It outputs pixel-aligned 3D Gaussians $\mathbf{F}$ for each view. Each Gaussian is parameterized by depth, rotation (quaternion), scale, opacity, and spherical harmonics coefficients, totaling 12 channels ($C_G = 12$).
    • Aggregation & Loss: The pixel-aligned Gaussians from multiple views are aggregated into a single scene-level 3D Gaussian representation GG. This 3D representation can be rendered from arbitrary viewpoints to obtain RGB images (I^\hat{I}) and depth maps (D^\hat{D}). The GS-VAE is trained using a loss function combining MSE and VGG perceptual loss on the rendered RGB images (LrenderL_{render}) and a scale-invariant depth loss on the rendered depth maps (LdepthL_{depth}) against the estimated depth maps. The total loss is L(Ï•)=λ1Lmse+λ2Lvgg+λ3LdepthL(\phi) = \lambda_{1} L_{mse} + \lambda_{2} L_{vgg} +\lambda_{3} L_{depth}. The GS-VAE is trained on 8 A800 GPUs for approximately 4 days.
  2. Stage 2: Geometry-Aware Multi-View Denoiser (MV-LDM): This stage trains a latent diffusion model that generates the multi-view RGB-D latent codes $\mathbf{Z}$ conditioned on text prompts and camera poses.
    • Training: A continuous-time denoising diffusion process is used. A learnable denoiser $G_\theta$ predicts the clean latent $\mathbf{Z}_0$ from noisy latents $\mathbf{Z}_t$. The model is trained with a denoising score matching objective $L(\theta) = \mathbb{E}_{\mathbf{Z}, \mathbf{R}, y, \sigma_t} \left[ \lambda(\sigma_t) \, \| \hat{\mathbf{Z}}_0 - \mathbf{Z}_0 \|_2^2 \right]$ (sketched in code after this list).
    • Architecture: The denoiser is based on a UNet initialized from a pre-trained Stable Diffusion 2.1 UNet. Self-attention blocks are replaced with 3D cross-view self-attention blocks to handle multi-view inputs. Text conditioning ($y$) is incorporated via cross-attention, and camera pose conditioning ($\mathbf{R}$) is performed by concatenating ray maps with the noisy latents.
    • Noise Levels: The paper highlights the importance of noise levels for learning global structure and multi-view consistency. Different noise distribution parameters ($P_{mean}$, $P_{std}$) are used for multi-view training than for single-view training.
    • Dataset: The model is trained on a large, diverse dataset combining single-view images (like SAM-1B with captions) and various multi-view datasets spanning object-centric, indoor, outdoor, and driving scenes (MVImgNet, DL3DV-10K, Objaverse, ACID, RealEstate10K, KITTI, KITTI-360, nuScenes, Waymo). This training utilizes 32 A800 GPUs for about 7 days.
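
The two training objectives above can be summarized in a short PyTorch-style sketch. This is a minimal illustration, not the authors' implementation: the tensor shapes (batch, views, channels, height, width), the `vgg_loss` callable, the particular scale-invariant depth formulation, and the loss weights are all assumptions.

```python
import torch
import torch.nn.functional as F

def scale_invariant_depth_loss(pred, target, eps=1e-6):
    """One common scale-invariant formulation (log-depth variant);
    the paper's exact formulation may differ."""
    d = torch.log(pred.clamp_min(eps)) - torch.log(target.clamp_min(eps))
    return (d ** 2).mean() - 0.5 * d.mean() ** 2

def gs_vae_loss(rendered_rgb, target_rgb, rendered_depth, pseudo_depth,
                vgg_loss, lambdas=(1.0, 0.5, 0.5)):
    """Stage 1 (GS-VAE): L(phi) = l1*L_mse + l2*L_vgg + l3*L_depth.
    `vgg_loss` is an assumed perceptual-loss callable; `lambdas` are placeholders."""
    l_mse = F.mse_loss(rendered_rgb, target_rgb)
    l_vgg = vgg_loss(rendered_rgb, target_rgb)
    l_depth = scale_invariant_depth_loss(rendered_depth, pseudo_depth)
    l1, l2, l3 = lambdas
    return l1 * l_mse + l2 * l_vgg + l3 * l_depth

def mv_ldm_loss(denoiser, z0, ray_maps, text_emb, sigma, weight_fn):
    """Stage 2 (MV-LDM) denoising score matching:
    L(theta) = E[ lambda(sigma_t) * ||Z0_hat - Z0||_2^2 ].
    `z0` is assumed to have shape (B, V, C, H, W)."""
    noise = torch.randn_like(z0)
    zt = z0 + sigma.view(-1, 1, 1, 1, 1) * noise        # noisy multi-view latents Z_t
    z0_hat = denoiser(zt, sigma, ray_maps, text_emb)    # predict the clean latents Z_0
    return (weight_fn(sigma).view(-1, 1, 1, 1, 1) * (z0_hat - z0) ** 2).mean()
```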

Feed-Forward Text-to-3D Generation:

During inference, the process is feed-forward:

  1. Start with randomly sampled Gaussian noise $\mathbf{Z}_T$.
  2. Use the trained MV-LDM denoiser $G_\theta$ to iteratively denoise the latents, conditioned on the text prompt ($y$) and desired camera poses ($\mathbf{R}$), yielding the multi-view RGB-D latents $\mathbf{Z}$.
  3. Feed $\mathbf{Z}$ and $\mathbf{R}$ to the cross-view transformer ($C$) to obtain the fused latent $\tilde{\mathbf{Z}}$.
  4. Use the GS-VAE decoder ($D$) with $\mathbf{Z}$, $\tilde{\mathbf{Z}}$, and $\mathbf{R}$ to output the pixel-aligned 3D Gaussians.
  5. Aggregate the pixel-aligned Gaussians into the final scene-level 3D Gaussian representation $G$.

This pipeline allows generating a 3D scene in approximately 8 seconds, significantly faster than optimization-based methods.
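
A schematic version of this feed-forward loop is sketched below. The module names (`mv_ldm`, `cross_view_transformer`, `gs_vae_decoder`), the latent shape, the Euler-style sampler, and the noise schedule handling are assumptions for illustration, not the paper's exact sampler.

```python
import torch

@torch.no_grad()
def text_to_3d(mv_ldm, cross_view_transformer, gs_vae_decoder,
               text_emb, ray_maps, sigmas, latent_shape=(8, 32, 32)):
    """Feed-forward generation: denoise multi-view RGB-D latents, fuse across
    views, then decode pixel-aligned 3D Gaussians. `sigmas` is assumed to be a
    decreasing noise schedule ending at 0; `latent_shape` is an assumed (C, H, W)."""
    B, V = ray_maps.shape[:2]
    z = torch.randn(B, V, *latent_shape) * sigmas[0]          # 1. random noise Z_T

    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):    # 2. iterative denoising
        z0_hat = mv_ldm(z, sigma, ray_maps, text_emb)         #    predict clean latents
        z = z0_hat + (sigma_next / sigma) * (z - z0_hat)      #    Euler-style update

    z_fused = cross_view_transformer(z, ray_maps)             # 3. fused latent Z~
    gaussians = gs_vae_decoder(z, z_fused, ray_maps)          # 4. pixel-aligned Gaussians,
                                                              #    assumed shape (B, V, H*W, 12)
    return gaussians.flatten(1, 2)                            # 5. aggregate into scene G
```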

Inference Strategy:

  • Classifier-Free Guidance (CFG) is used during sampling to guide the generation towards the text prompt and camera poses.
  • A hybrid sampling guidance strategy is adopted (inspired by HarmonyView) to balance text alignment and multi-view consistency, addressing the issue where naive CFG can compromise consistency.
  • CFG-rescale is also applied to prevent over-saturation (a minimal sketch of the guidance computation follows below).
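
The sketch below shows plain classifier-free guidance combined with the commonly used CFG-rescale trick (shrinking the guided prediction's standard deviation back toward that of the conditional prediction). The hybrid guidance split from HarmonyView is not reproduced here, and the guidance and rescale factors are illustrative assumptions.

```python
import torch

def guided_prediction(z0_cond, z0_uncond, guidance_scale=7.5, rescale=0.7):
    """Classifier-free guidance with CFG-rescale on the predicted clean latents.
    `guidance_scale` and `rescale` are placeholder values."""
    z0_cfg = z0_uncond + guidance_scale * (z0_cond - z0_uncond)

    # CFG-rescale: match the guided prediction's per-sample std to the
    # conditional prediction's std to counteract over-saturation, then blend.
    reduce_dims = tuple(range(1, z0_cond.dim()))
    std_cond = z0_cond.std(dim=reduce_dims, keepdim=True)
    std_cfg = z0_cfg.std(dim=reduce_dims, keepdim=True)
    z0_rescaled = z0_cfg * (std_cond / (std_cfg + 1e-8))
    return rescale * z0_rescaled + (1.0 - rescale) * z0_cfg
```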

Implementation Considerations:

  • Computational Resources: Training requires substantial GPU resources (e.g., 8 A800s for GS-VAE, 32 A800s for MV-LDM) and time (several days for each stage).
  • Data Pipeline: Processing a large, diverse dataset including multi-view data and generating pseudo-depth maps on the fly requires a robust data pipeline.
  • Pre-trained Models: Relies heavily on pre-trained Stable Diffusion components for both encoder and decoder, and a multi-view transformer initialized from another model (RayDiff). This leverages existing strong 2D priors but means performance is dependent on the quality and generalizability of these base models.
  • Memory: Handling multi-view latents and pixel-aligned Gaussians ($N \times H \times W \times C_G$) can be memory intensive, especially at high resolutions or with many views. The latent-space approach helps manage this compared to working directly in pixel space.
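
As a rough back-of-the-envelope estimate (view counts and resolutions below are assumed for illustration, not taken from the paper), the raw pixel-aligned Gaussian tensor alone scales linearly with the number of views and quadratically with resolution:

```python
def gaussian_mib(n_views, h, w, c_g=12, bytes_per_float=4):
    """Size in MiB of an N x H x W x C_G float32 Gaussian parameter tensor."""
    return n_views * h * w * c_g * bytes_per_float / 2**20

print(gaussian_mib(8, 256, 256))    # 24.0 MiB
print(gaussian_mib(16, 512, 512))   # 192.0 MiB
```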

Practical Applications:

  • Rapid 3D Asset Generation: Enables artists and developers to quickly generate initial 3D models or scene layouts from text prompts for use in games, VR/AR, simulations, or creative content creation pipelines.
  • Populating Virtual Environments: Can be used to quickly generate diverse objects and scenes to populate large virtual worlds or synthetic datasets.
  • Concept Prototyping: Facilitates rapid prototyping of 3D concepts based on textual descriptions.

Limitations:

  • Multi-view inconsistency: Despite efforts to improve it, the model can still exhibit inconsistencies in rendered views, especially under large rotations or extreme camera poses. This is attributed to generating in a latent space without explicit 3D structure constraints during diffusion.
  • Text misalignment: Occasional failure to follow specific text details (e.g., object color). The authors suggest this might be due to the joint training process interfering with the pre-trained text embedding layer.
  • High-frequency details: May struggle to render fine, high-frequency geometric structures accurately.

Overall, Prometheus offers a significant step towards efficient and generalizable feed-forward text-to-3D generation by effectively combining strong 2D priors from latent diffusion models with multi-view learning and 3D Gaussian Splatting representation. Its speed makes it particularly appealing for applications requiring rapid 3D content creation.
