Bolt3D: Generating 3D Scenes in Seconds (2503.14445v1)

Published 18 Mar 2025 in cs.CV

Abstract: We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300 times.

Summary

  • The paper presents a novel feed-forward latent diffusion method that generates complete 3D scenes directly from limited input images.
  • It leverages a specialized Geometry VAE and Gaussian head to efficiently predict detailed 3D representations using multi-view data.
  • Bolt3D achieves competitive visual quality compared to optimization methods while reducing inference time by up to 300x.

Bolt3D is a novel approach for generating 3D scenes from one or more input images in a fast, feed-forward manner, outputting a renderable 3D representation in seconds. The ability to generate 3D content directly, rather than just 2D images, is crucial for interactive applications such as visualization and editing. However, scaling generative models, particularly diffusion models, to detailed 3D scenes is challenging due to the scarcity of large-scale 3D training data and the difficulty of representing and training on such data effectively. Existing methods often rely on slow per-scene optimization or are limited to synthetic objects or partial scenes.

Bolt3D addresses these challenges by leveraging scalable 2D diffusion network architectures adapted for 3D generation and introducing a large-scale, multi-view consistent 3D dataset. The core idea is to use a latent diffusion model (LDM) to generate a 3D scene representation directly, avoiding the need for time-consuming test-time optimization.

The method represents 3D scenes using sets of 3D Gaussians organized into multiple "Splatter Images" [szymanowicz24splatter]. Each Splatter Image corresponds to a view and stores parameters (color, 3D position, opacity, covariance) for pixel-aligned Gaussians; a minimal data-layout sketch follows the numbered list below. The generation process is factorized into two stages:

  1. Latent Diffusion Model: This model takes one or more posed input images and target camera poses and jointly generates per-view latent appearance (for target images) and latent geometry (for target and source pointmaps). It is based on a multi-view image diffusion model, fine-tuned to handle geometry. Images and geometry (pointmaps and camera raymaps) are encoded into separate latent spaces using VAEs.
  2. Gaussian Head: This feed-forward network takes as input the full-resolution images and pointmaps decoded from the diffusion model's latents, together with the camera poses. It then predicts the remaining 3D Gaussian parameters for each pixel: refined color, opacity, and covariance (scale and rotation), forming the Splatter Images.
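To make the representation concrete, here is a minimal sketch of how a single Splatter Image could be laid out as a data structure. The field names, shapes, and the quaternion-plus-scale covariance parameterization are illustrative assumptions, not the paper's actual code.

```python
# Hedged sketch of one per-view "Splatter Image": pixel-aligned 3D Gaussians.
from dataclasses import dataclass
import numpy as np

@dataclass
class SplatterImage:
    """Pixel-aligned 3D Gaussians for one view of resolution (H, W)."""
    positions: np.ndarray   # (H, W, 3) 3D mean of each Gaussian (from the pointmap)
    colors: np.ndarray      # (H, W, 3) RGB color
    opacities: np.ndarray   # (H, W, 1) opacity in [0, 1]
    scales: np.ndarray      # (H, W, 3) per-axis scale of the covariance
    rotations: np.ndarray   # (H, W, 4) unit quaternion orienting the covariance

    def flatten(self) -> dict:
        """Concatenate per-pixel Gaussians into flat arrays for a Gaussian rasterizer."""
        return {
            "means": self.positions.reshape(-1, 3),
            "colors": self.colors.reshape(-1, 3),
            "opacities": self.opacities.reshape(-1),
            "scales": self.scales.reshape(-1, 3),
            "quats": self.rotations.reshape(-1, 4),
        }
```

A generated scene is then simply the union of the Gaussians from all generated views, rendered with a standard 3D Gaussian splatting rasterizer.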

A key contribution is the Geometry Variational Auto-Encoder (VAE) trained specifically to encode and decode 3D pointmaps and camera raymaps into a lower-dimensional latent space. Unlike standard VAEs pre-trained on images, which are shown to generalize poorly to unbounded 3D geometry, this geometry-specific VAE is trained from scratch on 3D data. Its training objective includes standard VAE losses (reconstruction, KL divergence) augmented with a geometry-specific loss that re-weights pointmap reconstruction based on distance from the camera and includes a gradient loss to improve sharpness. The geometry VAE uses a convolutional encoder and a transformer decoder.
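The following PyTorch-style sketch illustrates the shape such a geometry VAE training objective could take. The distance re-weighting function, the finite-difference gradient term, and the loss coefficients are assumptions for illustration; only the ingredients (re-weighted pointmap reconstruction, gradient loss, KL term) come from the description above.

```python
# Hedged sketch of a geometry-VAE loss: re-weighted pointmap reconstruction,
# a gradient term for sharpness, and the usual KL regularizer.
import torch

def geometry_vae_loss(pred_pointmap, gt_pointmap, mu, logvar,
                      kl_weight=1e-4, grad_weight=0.1):
    # pred_pointmap, gt_pointmap: (B, 3, H, W) per-pixel 3D points in the camera frame.
    dist = gt_pointmap.norm(dim=1, keepdim=True)        # (B, 1, H, W) distance from camera
    weight = 1.0 / (1.0 + dist)                         # assumed form: down-weight far points

    # Distance-re-weighted pointmap reconstruction term.
    recon = (weight * (pred_pointmap - gt_pointmap).abs()).mean()

    # Gradient loss: match finite differences along x and y to encourage sharp edges.
    def grads(x):
        return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]
    (pgx, pgy), (ggx, ggy) = grads(pred_pointmap), grads(gt_pointmap)
    grad = (pgx - ggx).abs().mean() + (pgy - ggy).abs().mean()

    # Standard KL divergence toward a unit Gaussian prior on the latents.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    return recon + grad_weight * grad + kl_weight * kl
```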

To train the model, Bolt3D relies on a large-scale dataset of dense, multi-view consistent 3D geometry and appearance. This dataset is created by applying a state-of-the-art dense Structure-from-Motion framework (MASt3R [mast3r]) to existing large-scale multi-view image datasets, namely CO3D [reizenstein21co3d], MVImgNet [yu2023mvimgnet], RealEstate10K (RE10K) [zhou2018stereo], and DL3DV-7K [ling2024dl3dv], yielding approximately 300k scenes. A synthetic object dataset (Objaverse [deitke2023objaverse]) is also used. The 3D data is normalized per scene based on the mean depth of the scene as seen from the first camera.
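A minimal sketch of this per-scene normalization is shown below. The function and variable names, and the exact convention (e.g. whether camera translations are rescaled by the same factor), are assumptions; the source only states that scenes are normalized by the mean depth from the first camera.

```python
# Hedged sketch: rescale a reconstructed scene so that its mean depth,
# measured from the first camera, becomes 1.
import numpy as np

def normalize_scene(points_world, world_from_cam0):
    """points_world: (N, 3) reconstructed points; world_from_cam0: (4, 4) pose of camera 0."""
    cam0_from_world = np.linalg.inv(world_from_cam0)
    # Transform points into the first camera's frame and take the mean depth (z in front).
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    pts_cam0 = (cam0_from_world @ pts_h.T).T[:, :3]
    mean_depth = pts_cam0[:, 2].clip(min=1e-6).mean()
    # Camera translations would be rescaled by the same 1/mean_depth factor (assumption).
    return points_world / mean_depth, 1.0 / mean_depth
```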

The training process involves three stages: training the Geometry VAE, training the Gaussian head (supervised with rendering losses), and finally training the latent diffusion model (initialized from a multi-view image diffusion model like CAT3D [gao2024cat3d]).

Bolt3D demonstrates high-quality 3D scene generation from few input images, outperforming prior feed-forward regression-based methods (e.g., Flash3D [szymanowicz2025flash3d], DepthSplat [xu2025depthsplat]) and other feed-forward generative methods (e.g., LatentSplat [wewer24latentsplat], Wonderland [liang2024wonderland]). Its generative nature allows it to synthesize realistic content in unobserved or ambiguous regions. Compared to state-of-the-art optimization-based methods (e.g., CAT3D), Bolt3D achieves competitive visual quality while reducing inference cost by up to 300x (taking about 7 seconds on an NVIDIA H100 GPU).

Implementation details include using a U-Net architecture for the diffusion model adapted to jointly handle image and geometry latents, using FlashAttention [dao2022flashattention] for efficient attention in the Gaussian head, and applying DDIM [song2020denoising] sampling (50 steps) during inference for speed.
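For reference, a deterministic 50-step DDIM sampling loop of the kind mentioned above could look like the sketch below. The denoiser signature, conditioning format, and noise schedule handling are assumptions; only the use of 50 DDIM steps over jointly generated appearance and geometry latents is taken from the text.

```python
# Hedged sketch of deterministic DDIM sampling (eta = 0) over joint latents.
import torch

@torch.no_grad()
def ddim_sample(denoiser, cond, latent_shape, alphas_cumprod, num_steps=50, device="cuda"):
    # Evenly spaced timesteps from the end of the diffusion schedule down to 0.
    T = len(alphas_cumprod)
    timesteps = torch.linspace(T - 1, 0, num_steps).long().tolist()

    x = torch.randn(latent_shape, device=device)  # joint appearance + geometry latents
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)

        eps = denoiser(x, t, cond)                             # predicted noise given conditioning
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()         # predicted clean latent
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps     # deterministic DDIM update
    return x  # decode afterwards with the image and geometry VAE decoders
```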

Limitations include difficulty with very thin structures (narrower than about 8 pixels) and with highly transparent or non-Lambertian surfaces (partly due to limitations of the dataset creation process), as well as sensitivity to the distribution and scale of the target cameras.

In summary, Bolt3D takes a significant step toward practical 3D content generation from limited views: it enables fast, feed-forward scene synthesis through a novel LDM architecture trained on a large-scale derived 3D dataset, combined with a specialized geometry representation and VAE.
