Here is a detailed summary of the paper "Zero-1-to-G: Taming Pretrained 2D Diffusion Model for Direct 3D Generation" (Meng et al., 9 Jan 2025).
The paper addresses the challenge of generating high-quality 3D objects directly from a single image, a task hampered by the relative scarcity of 3D datasets compared to 2D data and the difficulty of maintaining 3D consistency from 2D inputs. Existing methods include per-scene optimization via score distillation (e.g., DreamFusion), multi-view image generation followed by reconstruction (two-stage methods such as LGM and InstantMesh), and direct diffusion on 3D representations such as meshes or neural fields. Optimization-based methods can produce high-quality results but are slow and prone to inconsistencies. Two-stage methods are faster but suffer from geometric inaccuracies and blurry textures caused by inconsistencies in the generated multi-view images. Direct 3D diffusion methods typically must be trained from scratch on limited 3D datasets, which restricts generalization and requires significant resources.
Zero-1-to-G proposes a novel single-stage approach that bridges the gap between powerful 2D diffusion models and 3D generation by operating directly on a 3D representation: Gaussian splats. The key insight is to represent a 3D object as a set of multi-view Splatter Images [Szymanowicz23splatter]. A Splatter Image is a 2D grid encoding the attributes (RGB, scale, rotation, opacity, position) of Gaussian splats. By decomposing the 14 channels of a multi-view splatter image into five 3-channel images corresponding to these attributes for multiple views, the task is reframed as generating a set of 2D images. This allows Zero-1-to-G to leverage the rich priors of large-scale pretrained 2D diffusion models, specifically by fine-tuning a Stable Diffusion model.
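As a rough illustration of this reshaping, the sketch below splits a 14-channel splatter-image tensor into 3-channel attribute images. The channel ordering, the grouping of the leftover quaternion and opacity channels, and the zero padding are assumptions for illustration, not the paper's exact layout.

```python
import torch

def split_splatter_image(splat: torch.Tensor) -> dict:
    """Split a 14-channel splatter image into five 3-channel attribute images.

    splat: (B, 14, H, W), channels assumed to be ordered as RGB (3),
    scale (3), rotation quaternion (4), opacity (1), position (3).
    The grouping and zero padding below are illustrative assumptions.
    """
    rgb      = splat[:, 0:3]    # colour of each Gaussian
    scale    = splat[:, 3:6]    # per-axis Gaussian scale
    rot_xyz  = splat[:, 6:9]    # first three quaternion components
    rot_w    = splat[:, 9:10]   # remaining quaternion component
    opacity  = splat[:, 10:11]
    # pack the two leftover scalars plus zero padding into one 3-channel image
    misc     = torch.cat([rot_w, opacity, torch.zeros_like(opacity)], dim=1)
    position = splat[:, 11:14]  # 3D centre offsets
    return {"rgb": rgb, "scale": scale, "rotation": rot_xyz,
            "misc": misc, "position": position}
```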
To ensure that the generated multi-view, multi-attribute images are 3D-consistent and represent the same object, the authors augment the pretrained diffusion UNet with:
- Cross-view attention: This mechanism facilitates information exchange between tokens corresponding to the same attribute but from different viewpoints. This helps the model learn a consistent multi-view distribution for each Gaussian attribute.
- Cross-attribute attention: Building on prior work (Wonder3D [long2023wonder3d]), this mechanism enables interaction between tokens representing different attributes from the same viewpoint, maintaining coherence across the generated attribute images for a single view. (A sketch of both attention patterns follows this list.)
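The sketch below illustrates how both attention patterns can be realized by reshaping a grid of latent tokens so that standard self-attention mixes either all views of one attribute (cross-view) or all attribute images of one view (cross-attribute). The tensor layout, module interface, and residual connection are assumptions; the paper's actual layers sit inside the Stable Diffusion UNet blocks.

```python
import torch
import torch.nn as nn

class AxisAttention(nn.Module):
    """Self-attention along one axis of a (B, V, A, N, C) grid of latent tokens.

    V = views, A = attribute images, N = spatial tokens per image, C = channels.
    axis="view" lets tokens of the same attribute attend across views
    (cross-view attention); axis="attribute" lets tokens of the same view
    attend across attribute images (cross-attribute attention).
    """
    def __init__(self, dim: int, heads: int = 8, axis: str = "view"):
        super().__init__()
        self.axis = axis
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        B, V, A, N, C = tokens.shape
        if self.axis == "view":
            # group by attribute: all views' spatial tokens attend jointly
            x = tokens.permute(0, 2, 1, 3, 4).reshape(B * A, V * N, C)
        else:
            # group by view: all attribute images' spatial tokens attend jointly
            x = tokens.reshape(B * V, A * N, C)
        out, _ = self.attn(x, x, x)
        if self.axis == "view":
            out = out.reshape(B, A, V, N, C).permute(0, 2, 1, 3, 4)
        else:
            out = out.reshape(B, V, A, N, C)
        return tokens + out  # residual connection around the attention layer
```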
During training, the diffusion process and loss function are extended to handle batches containing multiple views and attributes, with noise sampled independently for each view and attribute. The model learns to predict the noise added to these multi-view, multi-attribute latent representations.
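A minimal sketch of such a training step, assuming standard DDPM noise prediction on VAE latents arranged per view and attribute; the UNet call signature and the latent layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def diffusion_training_loss(unet, latents, alphas_cumprod, cond):
    """One noise-prediction training step on multi-view, multi-attribute latents.

    latents: (B, V, A, C, H, W) VAE latents of the attribute images
             (V views, A attribute images per view).
    alphas_cumprod: (T,) cumulative products of the DDPM noise schedule.
    cond: conditioning features from the input image (shape assumed).
    """
    B = latents.shape[0]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B,), device=latents.device)          # one timestep per object
    noise = torch.randn_like(latents)                              # independent noise per view and attribute
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1, 1, 1)
    noisy = a_bar.sqrt() * latents + (1.0 - a_bar).sqrt() * noise  # forward process q(x_t | x_0)
    pred = unet(noisy, t, cond)                                    # UNet predicts the added noise
    return F.mse_loss(pred, noise)
```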
A crucial aspect is the preparation of ground truth Splatter Images for training. Instead of fitting Gaussian splats to each object individually (which can introduce high-frequency artifacts problematic for VAE encoding), the authors fine-tune a reconstruction model (based on LGM [tang2024lgm]) to generate smoother Splatter Images from ground truth multi-view renderings. This ensures the ground truth data is well-suited for encoding and decoding with the pretrained VAE.
Furthermore, the VAE decoder of the pretrained Stable Diffusion model, originally trained on natural images, is fine-tuned using rendering losses. This is necessary because small pixel variations in Splatter Images can significantly affect the final 3D rendering quality, so the VAE must accurately reconstruct high-frequency details. The decoder fine-tuning loss combines a reconstruction loss on the splatter image itself with rendering losses (MSE and LPIPS on RGB, cosine similarity on normals, binary cross-entropy on masks) computed from the rendered Gaussian splats.
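A minimal sketch of this composite objective, assuming the splatter images have already been decoded and the corresponding Gaussian splats rendered; the loss weights, tensor layouts, and the placement of the LPIPS term are assumptions.

```python
import torch
import torch.nn.functional as F

def decoder_finetune_loss(pred_splat, gt_splat, pred_render, gt_render,
                          w_splat=1.0, w_rgb=1.0, w_normal=0.1, w_mask=0.1):
    """Composite loss for fine-tuning the VAE decoder (sketch; weights assumed).

    pred_splat / gt_splat: decoded vs. ground-truth splatter images.
    pred_render / gt_render: dicts with 'rgb' (B,3,H,W), 'normal' (B,3,H,W),
    and 'mask' (B,1,H,W) obtained by rendering the Gaussian splats recovered
    from each splatter image (the rendering call is outside this sketch).
    """
    loss = w_splat * F.mse_loss(pred_splat, gt_splat)                  # splatter-image reconstruction
    loss += w_rgb * F.mse_loss(pred_render["rgb"], gt_render["rgb"])   # rendered RGB, MSE term
    # an LPIPS perceptual term on the rendered RGB would be added here,
    # e.g. via the `lpips` package
    cos = F.cosine_similarity(pred_render["normal"], gt_render["normal"], dim=1)
    loss += w_normal * (1.0 - cos).mean()                              # normal consistency
    loss += w_mask * F.binary_cross_entropy(
        pred_render["mask"].clamp(1e-5, 1 - 1e-5), gt_render["mask"])  # silhouette term
    return loss
```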
For implementation, Zero-1-to-G is initialized from Stable Diffusion Image Variations. Training proceeds in two stages: the UNet is first fine-tuned with cross-view attention, after which cross-attribute attention is added and fine-tuned jointly with the cross-view attention. The entire training process is notably efficient, requiring only 3 days on 8 NVIDIA L40 GPUs.
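A hypothetical sketch of the two-stage schedule in terms of which parameters are trained at each stage; the parameter-name filtering, optimizer, and learning rate are assumptions, not the paper's exact recipe.

```python
import torch

def configure_stage(unet, stage: int, lr: float = 1e-5):
    """Build an optimizer for the two-stage fine-tuning schedule.

    Stage 1: fine-tune the UNet together with the newly added cross-view
    attention layers. Stage 2: also train the cross-attribute attention
    layers jointly. Names, optimizer, and lr are illustrative assumptions.
    """
    for p in unet.parameters():
        p.requires_grad_(True)
    if stage == 1:
        # cross-attribute attention is only introduced and trained in stage 2
        for name, p in unet.named_parameters():
            if "cross_attribute" in name:
                p.requires_grad_(False)
    trainable = [p for p in unet.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```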
The model is evaluated on the Google Scanned Objects (GSO) dataset and on in-the-wild images, and compared against reconstruction-based, two-stage, and other direct 3D generation baselines. Quantitative results on GSO show superior performance across standard image metrics (PSNR, SSIM, LPIPS) and geometry assessment (Chamfer Distance), outperforming TriplaneGaussian [zou2023triplane], TripoSR [tochilkin2024triposr], LGM [tang2024lgm], InstantMesh [xu2024instantmesh], and LN3Diff [lan2024ln3diff]. Qualitatively, Zero-1-to-G generates more accurate geometry and more consistent renderings on in-the-wild inputs than two-stage methods, which struggle with multi-view consistency, and direct 3D methods, which may lack detail or generalization.
The ablation study confirms the importance of key components: fine-tuning the VAE decoder significantly improves rendering quality, cross-attribute attention helps maintain coherence and prevent artifacts such as "floaters," and the pretrained diffusion prior is essential for convergence and for achieving meaningful results.
Inference speed is 8.7 seconds per object on a single NVIDIA L40 GPU (using 35 denoising steps), competitive with or faster than some baselines but slower than purely regression-based models.
Limitations of Zero-1-to-G include its slower inference relative to non-generative regression models and the fact that material and lighting effects are currently baked into the Gaussian splat textures. Future work could explore diffusion distillation for faster inference and inverse rendering to disentangle material properties.
In summary, Zero-1-to-G presents an effective strategy for direct single-view to 3D Gaussian splat generation by cleverly adapting pretrained 2D diffusion models. By decomposing 3D data into a multi-view 2D image format, incorporating architectural modifications for 3D consistency, and fine-tuning the VAE for the target domain, it achieves state-of-the-art generation quality and generalization ability while being computationally more efficient than many prior 3D generation methods.