- The paper introduces a novel approach for single-image 3D Gaussian generation that ensures multi-view geometric consistency using a VAE-based latent space.
- It leverages an Anchor-GS VAE to encode an object's geometry and texture into a compact latent space and employs a seed-point-driven flow model for generation and drag-based editing.
- Evaluations on Objaverse and GSO datasets demonstrate state-of-the-art quality and efficient generation compared to methods relying on 2D diffusion priors.
Dragen3D (2502.16475) introduces a novel framework for single-image 3D generation using 3D Gaussian Splatting (3DGS) that emphasizes multi-view geometric consistency and offers drag-based editing control. The paper addresses key limitations of existing methods, such as geometric inconsistencies from 2D diffusion priors and lack of intuitive shape control during generation.
The core of Dragen3D is built upon two main components: an Anchor-Gaussian Variational Autoencoder (Anchor-GS VAE) and a Seed-Point-Driven generation and editing strategy.
The Anchor-GS VAE is designed to efficiently encode the complex 3D information of an object into a compact latent space and decode it back into a 3DGS representation.
- Encoding: It utilizes anchor points, a sparse set of points sampled from the object's surface point cloud with Farthest Point Sampling (FPS; see the sampling sketch after this list). A Geometry-Texture Encoder takes the anchor points and their corresponding features, obtained by projecting them onto the input image, processes them through Transformer blocks with cross-attention to the full point cloud and image tokens, and outputs fixed-length anchor latents (Z). This captures both geometric and texture information.
- Decoding: The decoder follows a coarse-to-fine approach. The anchor latents Z are first processed by a Transformer: an intermediate layer's output predicts coarse positions for the anchor points, and the final layer's output, along with offsets, determines the fine-grained positions of the Gaussian points assigned to each anchor. Attributes such as color, opacity, scale, and rotation for each Gaussian are then decoded from features interpolated from nearby anchor latents (see the interpolation sketch below).
- Training: The VAE is trained end to end without requiring a pre-existing 3DGS dataset. Supervision combines rendering losses (MSE, SSIM, LPIPS) between images rendered from the decoded 3DGS and ground-truth views, 3D point cloud losses (Chamfer Distance, Earth Mover's Distance) on both the reconstructed anchor points and the Gaussian points against ground-truth object points, and a KL-divergence regularization on the anchor latents (see the loss sketch below).
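A minimal PyTorch sketch of the farthest point sampling step used to pick anchor points; the anchor count and point-cloud size below are illustrative, not values from the paper.

```python
import torch

def farthest_point_sampling(points: torch.Tensor, num_samples: int) -> torch.Tensor:
    """Greedy FPS: repeatedly pick the point farthest from those already selected.

    points: (N, 3) surface point cloud of the object.
    Returns indices of the selected anchor points, shape (num_samples,).
    """
    num_points = points.shape[0]
    selected = torch.zeros(num_samples, dtype=torch.long)
    min_dist = torch.full((num_points,), float("inf"))   # distance to nearest selected point
    selected[0] = 0                                       # start from an arbitrary point
    for i in range(1, num_samples):
        last = points[selected[i - 1]]                    # (3,) most recently selected point
        dist = ((points - last) ** 2).sum(dim=-1)         # (N,) squared distances to it
        min_dist = torch.minimum(min_dist, dist)
        selected[i] = torch.argmax(min_dist)              # farthest remaining point
    return selected

# Illustrative usage: sample a sparse anchor set from a dense surface point cloud.
surface = torch.rand(100_000, 3)                          # placeholder point cloud
anchors = surface[farthest_point_sampling(surface, 512)]  # (512, 3) anchor points
```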
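The attribute-decoding step interpolates features from nearby anchor latents onto each Gaussian; below is one plausible reading as inverse-distance-weighted KNN interpolation (the value of k and the exact weighting scheme are assumptions).

```python
import torch

def interpolate_anchor_features(gauss_xyz, anchor_xyz, anchor_feat, k=3):
    """Gather features for each Gaussian from its k nearest anchors (inverse-distance weights).

    gauss_xyz:   (M, 3) fine-grained Gaussian positions
    anchor_xyz:  (A, 3) decoded anchor positions
    anchor_feat: (A, C) per-anchor latent features
    Returns:     (M, C) interpolated features fed to color/opacity/scale/rotation heads.
    """
    d2 = torch.cdist(gauss_xyz, anchor_xyz).pow(2)         # (M, A) squared distances
    knn_d2, knn_idx = d2.topk(k, dim=-1, largest=False)    # k nearest anchors per Gaussian
    w = 1.0 / (knn_d2 + 1e-8)
    w = w / w.sum(dim=-1, keepdim=True)                    # normalized inverse-distance weights
    neigh = anchor_feat[knn_idx]                           # (M, k, C) neighbor features
    return (w.unsqueeze(-1) * neigh).sum(dim=1)            # (M, C)
```

The per-Gaussian attribute heads would then be small networks applied to these interpolated features.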
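A sketch of how the listed supervision terms could be combined; the loss weights are placeholders, the SSIM/LPIPS callables are assumed to come from external packages (e.g. `pytorch_msssim`, `lpips`), and the Earth Mover's Distance term is omitted because it requires an approximate matching solver.

```python
import torch
import torch.nn.functional as F

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance between two point sets of shape (N, 3) and (M, 3)."""
    d2 = torch.cdist(pred, gt).pow(2)
    return d2.min(dim=1).values.mean() + d2.min(dim=0).values.mean()

def kl_regularization(mu, logvar):
    """KL(q(z|x) || N(0, I)) on the anchor latents."""
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

def vae_loss(render, gt_image, pred_anchors, pred_gaussians, gt_points, mu, logvar,
             ssim_fn, lpips_fn, w_ssim=0.2, w_lpips=0.2, w_cd=1.0, w_kl=1e-4):
    """Hypothetical weighting of the supervision terms; Dragen3D's actual weights are not given here."""
    loss = F.mse_loss(render, gt_image)
    loss = loss + w_ssim * (1.0 - ssim_fn(render, gt_image))   # SSIM is a similarity, so use 1 - SSIM
    loss = loss + w_lpips * lpips_fn(render, gt_image).mean()
    loss = loss + w_cd * (chamfer_distance(pred_anchors, gt_points)
                          + chamfer_distance(pred_gaussians, gt_points))
    loss = loss + w_kl * kl_regularization(mu, logvar)
    return loss
```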
The Seed-Point-Driven strategy enables generation from a single image and supports drag-based editing:
- Seed Points Generation: From a single input image, the model first generates a sparse set of seed points (e.g., 256 points) representing a rough geometry. This is done with a Rectified Flow model conditioned on the input image, trained to map Gaussian noise to the distribution of seed points (a rectified-flow sketch follows this list). The sparsity makes this distribution easier to learn and ensures geometric consistency.
- Seed-Anchor Mapping Module: This module maps the sparse seed points to the dense anchor latents. It's formulated as a flow matching problem between seed point latents (ZS) and anchor latents (Z).
- Dimension Alignment: Seed points are encoded into latents (ZS) using the frozen Anchor-GS VAE encoder, conditioned on the image. This aligns their dimensionality with the target anchor latents and provides spatial-semantic information.
- Token Alignment: To establish a correspondence for flow matching, a cluster-based strategy is used. Based on the encoded positions, each seed latent token serves as a center that gathers a cluster of semantically similar anchor latent tokens via KNN. Seed latents are then repeated to match the number of anchor latents, preserving the semantic correspondence within each cluster (see the alignment sketch after this list).
- A Rectified Flow model, implemented with Transformer blocks, learns the mapping from aligned seed latents (ZS) to aligned anchor latents (Z), conditioned on the input image. Noise augmentation is applied during training for robustness.
- Seed-Points-Driven Deformation: This enables intuitive geometric editing. Users drag the sparse seed points in 3D space using standard 3D tools. The edited seed points are re-encoded, and a mask preserves the latents of unedited regions. The resulting modified seed latents are passed through the Seed-Anchor Mapping module to generate new anchor latents, which the VAE decodes into the deformed 3DGS (sketched below). The whole process is efficient, typically taking only a couple of seconds.
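Both the seed-point generator and the seed-anchor mapping are rectified-flow models; the sketch below shows the generic flow-matching objective and Euler-step sampling, with `velocity_net` standing in for the conditional Transformer (its interface is an assumption).

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(velocity_net, x0, x1, cond):
    """Flow-matching objective: regress the constant velocity along the straight path x0 -> x1."""
    t = torch.rand(x0.shape[0], device=x0.device)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))               # broadcastable time
    x_t = (1.0 - t_) * x0 + t_ * x1                        # linear interpolation between endpoints
    return F.mse_loss(velocity_net(x_t, t, cond), x1 - x0)

@torch.no_grad()
def rectified_flow_sample(velocity_net, x0, cond, num_steps=32):
    """Euler integration of the learned ODE: for seed-point generation x0 is Gaussian noise,
    for seed-to-anchor mapping x0 is the aligned seed latents; cond holds the image tokens."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full(x.shape[:1], i * dt, device=x.device)
        x = x + dt * velocity_net(x, t, cond)
    return x
```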
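One plausible reading of the cluster-based token alignment, using the encoded positions for KNN; a real implementation would likely enforce a one-to-one partition of the anchor tokens, which this naive version does not.

```python
import torch

def align_seed_to_anchor_tokens(seed_pos, seed_lat, anchor_pos, anchor_lat, k):
    """Pair each seed latent token with a cluster of nearby anchor latent tokens.

    seed_pos:   (S, 3) positions associated with the seed latents
    seed_lat:   (S, C) seed latent tokens ZS
    anchor_pos: (A, 3) positions associated with the anchor latents
    anchor_lat: (A, C) anchor latent tokens Z, with A = S * k
    Returns token-aligned sequences of length S * k for flow matching.
    """
    d = torch.cdist(seed_pos, anchor_pos)                   # (S, A) pairwise distances
    knn_idx = d.topk(k, dim=-1, largest=False).indices      # (S, k) anchors claimed by each seed
    anchor_aligned = anchor_lat[knn_idx.reshape(-1)]        # (S*k, C) anchors regrouped by cluster
    seed_aligned = seed_lat.repeat_interleave(k, dim=0)     # (S*k, C) each seed repeated k times
    return seed_aligned, anchor_aligned
```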
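A high-level sketch of the editing path described above; all function names are placeholders, and the exact masking rule (here, re-using the original latents for unmoved seeds) is an assumption.

```python
import torch

@torch.no_grad()
def drag_edit(encode_seeds, map_seed_to_anchor, decode_gaussians,
              seed_points, edited_seed_points, image_tokens):
    """Drag-based deformation: re-encode edited seeds, keep latents of untouched ones,
    map to anchor latents with the flow model, and decode the deformed 3DGS."""
    moved = (edited_seed_points - seed_points).norm(dim=-1) > 1e-6   # (S,) which seeds were dragged
    z_orig = encode_seeds(seed_points, image_tokens)                 # (S, C) original seed latents
    z_edit = encode_seeds(edited_seed_points, image_tokens)          # (S, C) latents of edited seeds
    z_seed = torch.where(moved.unsqueeze(-1), z_edit, z_orig)        # mask preserves unedited regions
    z_anchor = map_seed_to_anchor(z_seed, image_tokens)              # Seed-Anchor Mapping (flow model)
    return decode_gaussians(z_anchor)                                # deformed 3D Gaussians
```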
The implementation uses Transformers for the VAE encoder/decoder and the flow models. DINOv2 is used for image feature extraction. Training involves multiple stages with specific datasets, hardware, and hyperparameters.
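For the image conditioning, a minimal sketch of extracting DINOv2 patch tokens through the torch.hub entry point published with the DINOv2 repository; the backbone variant and input resolution are assumptions, not details from the paper.

```python
import torch
import torchvision.transforms as T
from PIL import Image

# Backbone variant is illustrative; the summary only states that DINOv2 features are used.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

preprocess = T.Compose([
    T.Resize((224, 224)),                                   # side length must be a multiple of 14
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

img = preprocess(Image.open("input.png").convert("RGB")).unsqueeze(0)   # (1, 3, 224, 224)
with torch.no_grad():
    feats = dinov2.forward_features(img)
image_tokens = feats["x_norm_patchtokens"]                  # (1, 256, 768) tokens for cross-attention
```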
Evaluations on the Objaverse and Google Scanned Objects (GSO) datasets show that Dragen3D achieves state-of-the-art quality, particularly in multi-view geometric consistency, outperforming methods that rely on 2D diffusion priors such as LGM and LaRa while keeping inference times competitive with methods like TriplaneGS.
Ablation studies highlight the importance of the Rectified Flow model for seed point generation (versus a simple feed-forward Transformer), of encoding seed points for dimension alignment (versus positional encoding), and of the proposed cluster-based token alignment for effective seed-anchor mapping.
In summary, Dragen3D offers a practical approach to controllable and geometrically consistent 3D object generation from a single image by combining a VAE for latent 3D representation with a flow-based mechanism that maps sparse, editable seed points to dense Gaussian representations, enabling efficient generation and intuitive drag-based deformation. The method avoids computationally expensive scene-by-scene optimization and does not depend on 2D diffusion priors, which are prone to multi-view inconsistency, for the final 3D representation.