- The paper introduces a novel framework that synthesizes high-fidelity 3D shapes from a single image using large-scale rectified flow models and an advanced data pipeline.
- It leverages a transformer-based VAE trained with SDF, surface-normal, eikonal, and KL losses to achieve precise and detailed geometric reconstruction.
- Scaling strategies including latent resolution upscaling and a Mixture-of-Experts architecture enable state-of-the-art performance validated by Normal-FID and GPTEval3D metrics.
TripoSG is a novel framework for high-fidelity 3D shape synthesis from a single image, leveraging large-scale rectified flow models. The paper addresses the challenges in 3D shape generation, specifically limitations in data scale and quality, the complexity of 3D processing, and the need for advanced generative techniques that yield high-quality outputs with precise alignment to input conditions.
The core of TripoSG consists of two main components: a robust Data-Building System and a large-scale Rectified Flow Transformer model trained on latent 3D representations.
Data-Building System
A key contribution is the development of a sophisticated data processing pipeline to curate and prepare a large dataset of 2 million high-quality 3D samples from diverse sources like Objaverse and ShapeNet. This system emphasizes the critical role of data quality and quantity for training 3D generative models. The pipeline involves four stages:
- Data Scoring: Each model is scored for visual quality by a linear regression model over CLIP and DINOv2 features extracted from multi-view normal-map renders, trained against ratings from professional annotators.
- Data Filtering: Automated filters remove models with issues like large planar bases, rendering errors, or multiple objects based on geometric and visual properties.
- Data Fixing and Augmentation: This stage includes orientation fixing for character models (using a trained orientation estimation model based on multi-view DINOv2 features) and generating multi-view RGB data for untextured models using ControlNet++ conditioned on normal maps.
- Field Data Production: Non-watertight meshes are converted into a watertight representation suitable for neural implicit fields. This is done by constructing a 512³ Unsigned Distance Function (UDF) grid, extracting an iso-surface with Marching Cubes, removing interior structures by resetting invisible grid values, and filtering small/invisible components. Finally, dense surface points with normals, and random volume/near-surface points are sampled for VAE training.
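A minimal sketch of this conversion and sampling step, using trimesh, SciPy, and scikit-image as stand-ins for the authors' tooling; the grid resolution is reduced for illustration (the paper uses 512³), and the interior-removal step is approximated by keeping the largest connected component:

```python
# Illustrative watertight conversion + sampling for VAE supervision.
# Library choices, iso-level, resolutions, and sample counts are assumptions.
import numpy as np
import trimesh
from scipy.spatial import cKDTree
from skimage.measure import marching_cubes

def make_watertight_and_sample(mesh: trimesh.Trimesh, res=128, iso=2.0 / 128):
    # 1. Unsigned distance grid from densely sampled surface points
    #    (the paper builds this at 512^3 resolution).
    surf, _ = trimesh.sample.sample_surface(mesh, 500_000)
    tree = cKDTree(surf)
    lin = np.linspace(-0.5, 0.5, res)
    grid = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1).reshape(-1, 3)
    udf = tree.query(grid)[0].reshape(res, res, res)

    # 2. Extract an iso-surface slightly off the zero level (a UDF carries no sign).
    verts, faces, _, _ = marching_cubes(udf, level=iso)
    verts = verts / (res - 1) - 0.5                      # map back to the unit cube
    wt = trimesh.Trimesh(verts, faces)

    # 3. Stand-in for interior removal: keep only the largest connected component.
    wt = max(wt.split(only_watertight=False), key=lambda c: c.area)

    # 4. Sample VAE training data: dense surface points with normals, plus random
    #    volume points with signed distances (sign from a containment test).
    surf_pts, face_idx = trimesh.sample.sample_surface(wt, 20_480)
    surf_normals = wt.face_normals[face_idx]
    vol_pts = np.random.uniform(-0.5, 0.5, size=(100_000, 3))
    dist = cKDTree(wt.vertices).query(vol_pts)[0]
    sdf = np.where(wt.contains(vol_pts), -dist, dist)
    return wt, (surf_pts, surf_normals), (vol_pts, sdf)
```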
3D Variational Autoencoder (VAE)
TripoSG employs a transformer-based VAE inspired by 3DShape2Vecset to compress 3D shapes into a compact latent token representation (L×C, where L can be 512, 2048, or 4096 tokens and C=64). The VAE is designed for high-quality 3D reconstruction using neural Signed Distance Function (SDF) representation, which is preferred over occupancy grids for its superior geometric expressiveness and ability to avoid aliasing artifacts.
The VAE training incorporates a novel hybrid supervision strategy (a combined loss sketch in code follows the list):
- SDF Loss: Standard loss comparing predicted SDF values $\hat{s}$ to ground truth $s$, combining L1 and squared-L2 terms: $|s-\hat{s}| + \|s-\hat{s}\|_2^2$.
- Surface Normal Loss ($L_{sn}$): Applied only to surface points, this loss supervises the gradient direction of the implicit field, aligning the predicted normal $\frac{\nabla D(x,f)}{\|\nabla D(x,f)\|}$ with the ground-truth normal $\hat{n}$ via cosine similarity: $L_{sn} = 1 - \left\langle \frac{\nabla D(x,f)}{\|\nabla D(x,f)\|}, \hat{n} \right\rangle$. This is crucial for learning finer geometric details.
- Eikonal Regularization ($L_{eik}$): Enforces a unit-norm gradient of the SDF field, $L_{eik} = \left(\|\nabla D(x,f)\|_2 - 1\right)^2$, to ensure it is a valid distance function and to mitigate artifacts introduced by normal supervision.
- KL-Regularization ($L_{kl}$): Standard KL-divergence regularization of the VAE latent space.
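A combined sketch of these four terms, assuming a PyTorch decoder `decoder(latents, points)` that returns SDF values; the loss weights here are placeholders, not the paper's values:

```python
# Hybrid VAE supervision sketch (illustrative; weights and decoder signature are assumptions).
import torch
import torch.nn.functional as F

def vae_loss(decoder, latents, vol_pts, vol_sdf, surf_pts, surf_normals, mu, logvar):
    # SDF loss on volume / near-surface samples: L1 plus squared-L2 against ground truth.
    pred = decoder(latents, vol_pts)
    l_sdf = (pred - vol_sdf).abs().mean() + ((pred - vol_sdf) ** 2).mean()

    # Gradient of the predicted SDF at surface points, obtained with autograd.
    surf_pts = surf_pts.detach().requires_grad_(True)
    grad = torch.autograd.grad(decoder(latents, surf_pts).sum(), surf_pts, create_graph=True)[0]

    # Surface-normal loss: align the normalized gradient with the ground-truth normal.
    l_sn = (1.0 - F.cosine_similarity(grad, surf_normals, dim=-1)).mean()

    # Eikonal regularization: the gradient norm should equal 1.
    l_eik = ((grad.norm(dim=-1) - 1.0) ** 2).mean()

    # KL regularization on the latent tokens.
    l_kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()

    # Placeholder weights; the paper's weighting is not reproduced here.
    return l_sdf + 1.0 * l_sn + 0.1 * l_eik + 1e-3 * l_kl
```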
The VAE architecture uses a transformer with an 8-layer encoder and a 16-layer decoder. It takes a dense set of surface points (20,480 in experiments) as input for the encoder, uses cross-attention to integrate this information into latent queries (subsampled to 512 or 2048 tokens during training), and decodes the SDF using cross-attention between positional embeddings of query points in 3D space and the latent tokens. Multi-resolution training with shared weights enables the VAE to extrapolate to higher latent resolutions (4096 tokens) during flow model training without explicit retraining.
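A compact VecSet-style sketch of this encode/decode data flow; the hidden width, attention details, and module names are assumptions, while the token counts and encoder/decoder depths follow the description above:

```python
# Minimal transformer VAE data-flow sketch (PyTorch, illustrative).
import torch
import torch.nn as nn

class AttnBlock(nn.Module):
    """Pre-norm attention + MLP; acts as self-attention when kv is None."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, kv=None):
        kv = x if kv is None else kv
        x = x + self.attn(self.n1(x), kv, kv, need_weights=False)[0]
        return x + self.mlp(self.n2(x))

class ShapeVAE(nn.Module):
    def __init__(self, dim=512, n_latent=2048, c_latent=64):
        super().__init__()
        self.embed_pts = nn.Linear(6, dim)                          # xyz + normal per surface point
        self.encoder = nn.ModuleList([AttnBlock(dim) for _ in range(8)])
        self.to_latent = nn.Linear(dim, 2 * c_latent)               # -> (mu, logvar)
        self.from_latent = nn.Linear(c_latent, dim)
        self.decoder = nn.ModuleList([AttnBlock(dim) for _ in range(16)])
        self.embed_query = nn.Linear(3, dim)                        # 3D query positions
        self.query_attn = AttnBlock(dim)                            # query -> latent cross-attention
        self.sdf_head = nn.Linear(dim, 1)
        self.n_latent = n_latent

    def encode(self, surface):                                      # surface: (B, 20480, 6)
        feats = self.embed_pts(surface)
        queries = feats[:, : self.n_latent]                         # latent queries from subsampled points
        for blk in self.encoder:
            queries = blk(queries, kv=feats)                        # cross-attend to all surface points
        return self.to_latent(queries).chunk(2, dim=-1)             # mu, logvar: (B, L, 64)

    def decode(self, z, query_xyz):                                 # z: (B, L, 64), query_xyz: (B, P, 3)
        tokens = self.from_latent(z)
        for blk in self.decoder:
            tokens = blk(tokens)                                    # self-attention over latent tokens
        q = self.query_attn(self.embed_query(query_xyz), kv=tokens)
        return self.sdf_head(q).squeeze(-1)                         # predicted SDF at the query points
```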
Rectified Flow Transformer
The core generative model is a transformer-based rectified flow model trained on the VAE's latent space. Inspired by DiT, it adopts a U-Net-like structure with $N=10$ encoder blocks, $N=10$ decoder blocks, and a middle block ($2N+1 = 21$ blocks in total), featuring long skip connections between corresponding encoder and decoder blocks to enhance feature fusion.
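A minimal sketch of this U-Net-like layout with long skip connections; the block internals are simplified to stock transformer layers, and the concatenation-plus-linear fusion is an assumption:

```python
# Skip-connected flow transformer layout sketch (PyTorch, illustrative).
import torch
import torch.nn as nn

class SkipFlowTransformer(nn.Module):
    def __init__(self, dim=1024, heads=16, n=10):
        super().__init__()
        def block():
            return nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim, batch_first=True)
        self.encoder_blocks = nn.ModuleList([block() for _ in range(n)])
        self.middle_block = block()
        self.decoder_blocks = nn.ModuleList([block() for _ in range(n)])
        # Fuse each decoder input with the matching encoder output (long skip connection).
        self.skip_fuse = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(n)])

    def forward(self, x):                     # x: (B, L, dim) noisy latent tokens
        skips = []
        for blk in self.encoder_blocks:
            x = blk(x)
            skips.append(x)                   # save encoder output for its mirrored decoder block
        x = self.middle_block(x)
        for blk, fuse in zip(self.decoder_blocks, self.skip_fuse):
            x = fuse(torch.cat([x, skips.pop()], dim=-1))
            x = blk(x)
        return x
```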
Image conditioning is injected simultaneously using separate cross-attention mechanisms for global CLIP features and local DINOv2 features in each transformer block. This allows the model to attend to both coarse and fine image details, improving consistency between the generated shape and the input image.
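A sketch of a single block carrying both conditioning paths; the feature dimensions and the exact placement of the cross-attention layers are assumptions:

```python
# Flow-transformer block with dual image conditioning (PyTorch, illustrative).
import torch
import torch.nn as nn

class DualCondBlock(nn.Module):
    def __init__(self, dim=1024, heads=16, clip_dim=768, dino_dim=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.clip_attn = nn.MultiheadAttention(dim, heads, kdim=clip_dim, vdim=clip_dim, batch_first=True)
        self.dino_attn = nn.MultiheadAttention(dim, heads, kdim=dino_dim, vdim=dino_dim, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, x, clip_tokens, dino_tokens):
        # x: (B, L, dim) latent shape tokens; clip/dino tokens: image features.
        x = x + self.self_attn(self.norms[0](x), x, x, need_weights=False)[0]
        # Cross-attention to global CLIP features (coarse image semantics).
        x = x + self.clip_attn(self.norms[1](x), clip_tokens, clip_tokens, need_weights=False)[0]
        # Cross-attention to local DINOv2 features (fine-grained image details).
        x = x + self.dino_attn(self.norms[2](x), dino_tokens, dino_tokens, need_weights=False)[0]
        return x + self.mlp(self.norms[3](x))
```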
The model is trained using the Rectified Flow formulation, which models a linear trajectory between the noise and data distributions: $x_t = t x_0 + (1-t)\epsilon$. This linear path is noted for simplifying network training and improving stability compared to DDPM or EDM. Timesteps are sampled from a logit-normal distribution to bias training toward intermediate noise levels, and a resolution-dependent timestep shift adjusts the sampling schedule with latent resolution so that effective noise levels remain consistent.
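A sketch of one training step under this formulation; the logit-normal parameters and the particular shift rule (borrowed from SD3-style timestep shifting) are assumptions rather than the paper's exact choices:

```python
# Rectified-flow training step sketch (PyTorch, illustrative).
import torch

def shift_timestep(t, shift=2.0):
    # Larger latent resolutions use a larger shift so noise levels stay comparable.
    return shift * t / (1.0 + (shift - 1.0) * t)

def rectified_flow_step(model, x0, cond, shift=2.0):
    # x0: clean VAE latents (B, L, C); cond: image-conditioning features.
    b = x0.shape[0]
    t = torch.sigmoid(torch.randn(b, device=x0.device))     # logit-normal timestep sampling
    t = shift_timestep(t, shift).view(b, 1, 1)

    eps = torch.randn_like(x0)                               # Gaussian noise sample
    xt = t * x0 + (1.0 - t) * eps                            # x_t = t*x0 + (1-t)*eps
    target = x0 - eps                                        # constant velocity along the linear path

    pred = model(xt, t.view(b), cond)                        # model predicts the velocity
    return ((pred - target) ** 2).mean()
```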
Model and Resolution Scale-up
TripoSG employs a progressive scaling strategy:
- Latent Resolution: The flow model is trained initially at lower latent resolutions (512, then 2048 tokens). Due to the VAE's design, it can encode/decode at 4096 tokens without retraining, allowing the flow model training to scale directly to this higher resolution. RMSNorm is used in transformer blocks for stability during mixed-precision training at higher resolutions.
- Model Size (MoE): To scale the model parameters from 1.5B to approximately 4B while maintaining efficient inference, a Mixture-of-Experts (MoE) architecture is applied to the Feed-Forward Network (FFN) layers in the transformer blocks. Specifically, MoE is used in the final six decoder layers. Each MoE layer consists of multiple parallel FFN experts, a gating mechanism that activates the top-K experts per token (top-2 plus a shared expert in TripoSG), and an auxiliary loss for expert balancing. The MoE model is initialized from a pre-trained dense model for training stability.
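A sketch of such an MoE FFN layer with top-2 routing plus an always-active shared expert; the expert count, dimensions, and the simplified balancing term are assumptions, not the paper's configuration:

```python
# Mixture-of-Experts FFN layer sketch (PyTorch, illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

def ffn(dim, hidden):
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

class MoEFFN(nn.Module):
    def __init__(self, dim=1024, hidden=4096, n_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([ffn(dim, hidden) for _ in range(n_experts)])
        self.shared = ffn(dim, hidden)                       # shared expert, active for every token
        self.top_k = top_k

    def forward(self, x):
        tokens = x.reshape(-1, x.shape[-1])                  # (T, dim)
        probs = self.gate(tokens).softmax(dim=-1)            # routing probabilities
        weights, idx = probs.topk(self.top_k, dim=-1)        # top-K experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)

        routed = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):            # dispatch token subsets to experts
            for k in range(self.top_k):
                mask = idx[:, k] == e
                if mask.any():
                    routed[mask] += weights[mask, k, None] * expert(tokens[mask])

        # Simplified Switch-style balancing loss: top-1 token fraction per expert
        # times its mean routing probability, scaled by the number of experts.
        frac = F.one_hot(idx[:, 0], probs.shape[-1]).float().mean(dim=0)
        aux_loss = probs.shape[-1] * (frac * probs.mean(dim=0)).sum()

        out = self.shared(tokens) + routed
        return out.reshape_as(x), aux_loss
```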
Implementation and Evaluation
The full TripoSG model (4B parameters, 4096 latent resolution) is trained progressively on the 2M high-quality dataset for approximately 3 weeks on 160 A100 GPUs. Ablation studies on smaller datasets and models confirm the effectiveness of the data pipeline (quality outweighs raw quantity, and quantity helps once quality is high), the VAE improvements (SDF + normal + eikonal supervision), the flow architecture enhancements (skip connections, Rectified Flow), and the scaling strategies (resolution, MoE).
Evaluation uses Normal-FID (comparing normal maps of generated vs. ground-truth shapes rendered from the input viewpoint) and GPTEval3D (using an LMM such as Claude 3.5 to rate generated 3D models across dimensions like plausibility, detail, and alignment). TripoSG demonstrates state-of-the-art performance across various complex structures, styles, and details compared to previous methods.
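For reference, a minimal FID computation over two sets of image features extracted from rendered normal maps; the choice of feature extractor is an assumption, and this is not the authors' exact evaluation code:

```python
# Minimal FID sketch between generated and ground-truth normal-map features.
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_a, feats_b):
    # feats_*: (N, D) feature arrays from an image backbone applied to normal-map renders.
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real              # drop tiny imaginary parts from numerical error
    return float(((mu_a - mu_b) ** 2).sum() + np.trace(cov_a + cov_b - 2.0 * covmean))
```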
Texture Generation
While TripoSG primarily generates geometry (meshes via Marching Cubes from the decoded SDF), texture can be added. The high-quality normal maps rendered from the generated geometry can serve as conditional input to existing multi-view texture generation methods (e.g., Meta 3D TextureGen) to produce consistent texture images, which are then projected onto the mesh.
In conclusion, TripoSG advances 3D generation by combining a carefully curated large-scale dataset, an improved SDF-based VAE, and a scalable rectified flow transformer architecture with effective conditioning and MoE scaling, achieving significantly higher fidelity and generalization capabilities than prior methods. This aligns 3D generation more closely with the success seen in large-scale 2D and video generation models.