TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models (2502.06608v3)

Published 10 Feb 2025 in cs.CV and cs.AI

Abstract: Recent advancements in diffusion techniques have propelled image and video generation to unprecedented levels of quality, significantly accelerating the deployment and application of generative AI. However, 3D shape generation technology has so far lagged behind, constrained by limitations in 3D data scale, complexity of 3D data processing, and insufficient exploration of advanced techniques in the 3D domain. Current approaches to 3D shape generation face substantial challenges in terms of output quality, generalization capability, and alignment with input conditions. We present TripoSG, a new streamlined shape diffusion paradigm capable of generating high-fidelity 3D meshes with precise correspondence to input images. Specifically, we propose: 1) A large-scale rectified flow transformer for 3D shape generation, achieving state-of-the-art fidelity through training on extensive, high-quality data. 2) A hybrid supervised training strategy combining SDF, normal, and eikonal losses for 3D VAE, achieving high-quality 3D reconstruction performance. 3) A data processing pipeline to generate 2 million high-quality 3D samples, highlighting the crucial rules for data quality and quantity in training 3D generative models. Through comprehensive experiments, we have validated the effectiveness of each component in our new framework. The seamless integration of these parts has enabled TripoSG to achieve state-of-the-art performance in 3D shape generation. The resulting 3D shapes exhibit enhanced detail due to high-resolution capabilities and demonstrate exceptional fidelity to input images. Moreover, TripoSG demonstrates improved versatility in generating 3D models from diverse image styles and contents, showcasing strong generalization capabilities. To foster progress and innovation in the field of 3D generation, we will make our model publicly available.

Summary

  • The paper introduces a novel framework that synthesizes high-fidelity 3D shapes from a single image using large-scale rectified flow models and an advanced data pipeline.
  • It leverages a transformer-based VAE with supervised SDF, surface normal, eikonal, and KL losses to achieve precise and detailed geometric reconstruction.
  • Scaling strategies, including latent resolution upscaling and a Mixture-of-Experts architecture, enable state-of-the-art performance, validated by Normal-FID and GPTEval3D metrics.

TripoSG is a novel framework for high-fidelity 3D shape synthesis from a single image, leveraging large-scale rectified flow models. The paper addresses the challenges in 3D shape generation, specifically limitations in data scale and quality, the complexity of 3D processing, and the need for advanced generative techniques that yield high-quality outputs with precise alignment to input conditions.

The core of TripoSG consists of two main components: a robust Data-Building System and a large-scale Rectified Flow Transformer model trained on latent 3D representations.

Data-Building System

A key contribution is the development of a sophisticated data processing pipeline to curate and prepare a large dataset of 2 million high-quality 3D samples from diverse sources like Objaverse and ShapeNet. This system emphasizes the critical role of data quality and quantity for training 3D generative models. The pipeline involves four stages:

  1. Data Scoring: Models are scored for visual quality by a linear regression model trained on CLIP and DINOv2 features extracted from multi-view normal maps, fitted to manual ratings by professionals.
  2. Data Filtering: Automated filters remove models with issues like large planar bases, rendering errors, or multiple objects based on geometric and visual properties.
  3. Data Fixing and Augmentation: This stage includes orientation fixing for character models (using a trained orientation estimation model based on multi-view DINOv2 features) and generating multi-view RGB data for untextured models using ControlNet++ conditioned on normal maps.
  4. Field Data Production: Non-watertight meshes are converted into a watertight representation suitable for neural implicit fields. This is done by constructing a $512^3$ Unsigned Distance Function (UDF) grid, extracting an iso-surface with Marching Cubes, removing interior structures by resetting invisible grid values, and filtering small/invisible components. Finally, dense surface points with normals, plus random volume and near-surface points, are sampled for VAE training (see the sketch after this list).
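A hedged sketch of the core of the field-data step: building a UDF grid from a (possibly non-watertight) mesh and extracting a closed iso-surface. The grid resolution and iso level are illustrative (the paper uses $512^3$), and the interior-removal and component-filtering stages are omitted for brevity:

```python
import numpy as np
import trimesh
from skimage import measure

def mesh_to_watertight(mesh, res=128, iso=2.0 / 128):
    # Assumes the mesh is pre-normalized into the [-1, 1] cube; the paper uses a
    # 512^3 grid, reduced here for illustration.
    lin = np.linspace(-1.0, 1.0, res)
    grid = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1).reshape(-1, 3)

    # Unsigned distance from every grid point to the (possibly non-watertight) surface.
    _, udf, _ = trimesh.proximity.ProximityQuery(mesh).on_surface(grid)
    udf = udf.reshape(res, res, res)

    # Marching Cubes at a small positive level turns the UDF shell into a closed mesh.
    verts, faces, _, _ = measure.marching_cubes(udf, level=iso)
    verts = verts / (res - 1) * 2.0 - 1.0          # grid indices back to [-1, 1] coords
    return trimesh.Trimesh(vertices=verts, faces=faces)
```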

3D Variational Autoencoder (VAE)

TripoSG employs a transformer-based VAE inspired by 3DShape2Vecset to compress 3D shapes into a compact latent token representation (L×CL \times C, where LL can be 512, 2048, or 4096 tokens and C=64C=64). The VAE is designed for high-quality 3D reconstruction using neural Signed Distance Function (SDF) representation, which is preferred over occupancy grids for its superior geometric expressiveness and ability to avoid aliasing artifacts.

The VAE training incorporates a novel hybrid supervised strategy (a code sketch follows the list):

  • SDF Loss: Standard loss comparing predicted SDF values to ground truth, combining L1 and L2 terms: $|s - \hat{s}| + \|s - \hat{s}\|_2^2$.
  • Surface Normal Loss ($\mathcal{L}_{\text{sn}}$): Applied only to surface points, this loss supervises the gradient direction of the implicit field, aligning the predicted normal $\frac{\nabla \mathcal{D}(x, f)}{\|\nabla \mathcal{D}(x, f)\|}$ with the ground-truth normal $\hat{n}$ via the cosine-similarity loss $1 - \left\langle \frac{\nabla \mathcal{D}(x, f)}{\|\nabla \mathcal{D}(x, f)\|}, \hat{n} \right\rangle$. This is crucial for learning finer geometric details.
  • Eikonal Regularization ($\mathcal{L}_{\text{eik}}$): Enforces a unit-norm gradient of the SDF field, $\left(\|\nabla \mathcal{D}(x, f)\|_2 - 1\right)^2$, to ensure it is a valid distance function and to mitigate artifacts introduced by normal supervision.
  • KL-Regularization ($\mathcal{L}_{\text{kl}}$): Standard VAE KL-divergence loss in the latent space.
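A minimal PyTorch sketch of this hybrid loss. Here `decoder(query_pts, latents)` is a stand-in for the paper's SDF decoder $\mathcal{D}(x, f)$, and the loss weights are illustrative, not the paper's values:

```python
import torch
import torch.nn.functional as F

def hybrid_vae_loss(decoder, latents, vol_pts, vol_sdf, surf_pts, surf_normals,
                    mu, logvar, w_sn=0.1, w_eik=0.01, w_kl=1e-4):
    # SDF loss on random volume / near-surface points: L1 + L2 terms.
    pred_sdf = decoder(vol_pts, latents)
    l_sdf = (pred_sdf - vol_sdf).abs().mean() + ((pred_sdf - vol_sdf) ** 2).mean()

    # Gradient of the implicit field at surface points (autograd through x).
    surf_pts = surf_pts.detach().requires_grad_(True)
    s_surf = decoder(surf_pts, latents)
    grad = torch.autograd.grad(s_surf.sum(), surf_pts, create_graph=True)[0]

    # Surface normal loss: align the normalized gradient with ground-truth normals.
    l_sn = (1.0 - F.cosine_similarity(grad, surf_normals, dim=-1)).mean()

    # Eikonal regularization: the gradient norm of a valid SDF is 1.
    l_eik = ((grad.norm(dim=-1) - 1.0) ** 2).mean()

    # Standard KL regularization on the latent distribution.
    l_kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()

    return l_sdf + w_sn * l_sn + w_eik * l_eik + w_kl * l_kl
```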

The VAE architecture uses a transformer with an 8-layer encoder and a 16-layer decoder. It takes a dense set of surface points (20,480 in experiments) as input for the encoder, uses cross-attention to integrate this information into latent queries (subsampled to 512 or 2048 tokens during training), and decodes the SDF using cross-attention between positional embeddings of query points in 3D space and the latent tokens. Multi-resolution training with shared weights enables the VAE to extrapolate to higher latent resolutions (4096 tokens) during flow model training without explicit retraining.
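A minimal sketch of the decode step under assumed shapes and module names (the paper's positional embedding and block layout may differ): 3D query points cross-attend to the latent tokens to produce per-point SDF values.

```python
import torch
import torch.nn as nn

class SDFQueryDecoder(nn.Module):
    def __init__(self, c_latent=64, d=512, n_heads=8):
        super().__init__()
        self.point_embed = nn.Linear(3, d)          # stand-in for a positional embedding
        self.latent_proj = nn.Linear(c_latent, d)
        self.cross_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.to_sdf = nn.Linear(d, 1)

    def forward(self, query_pts, latents):
        # query_pts: (B, Q, 3) points in space; latents: (B, L, c_latent) tokens.
        q = self.point_embed(query_pts)
        kv = self.latent_proj(latents)
        h, _ = self.cross_attn(q, kv, kv)           # queries attend to latent tokens
        return self.to_sdf(h).squeeze(-1)           # (B, Q) predicted SDF values
```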

Rectified Flow Transformer

The core generative model is a transformer-based rectified flow model trained on the VAE's latent space. Inspired by DiT, it utilizes a U-Net-like structure with $N = 10$ encoder blocks, $N = 10$ decoder blocks, and a middle block ($2N + 1 = 21$ blocks in total), featuring long skip connections between corresponding encoder and decoder blocks to enhance feature fusion.
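A sketch of this skip-connected layout, with a hypothetical plain `Block` standing in for the paper's DiT-style blocks (timestep and image conditioning omitted here):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Plain pre-norm transformer block (stand-in for the paper's DiT-style block)."""
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, 8, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        h = self.n1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.ff(self.n2(x))

class SkipFlowTransformer(nn.Module):
    """N encoder blocks, a middle block, and N decoder blocks with long skips."""
    def __init__(self, d=512, n=10):
        super().__init__()
        self.encoder = nn.ModuleList([Block(d) for _ in range(n)])
        self.middle = Block(d)
        self.decoder = nn.ModuleList([Block(d) for _ in range(n)])
        # Each decoder block fuses its input with the matching encoder output.
        self.skip_proj = nn.ModuleList([nn.Linear(2 * d, d) for _ in range(n)])

    def forward(self, x):
        skips = []
        for blk in self.encoder:
            x = blk(x)
            skips.append(x)
        x = self.middle(x)
        for blk, proj in zip(self.decoder, self.skip_proj):
            x = proj(torch.cat([x, skips.pop()], dim=-1))  # long skip, last-in first-out
            x = blk(x)
        return x
```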

Image conditioning is injected simultaneously using separate cross-attention mechanisms for global CLIP features and local DINOv2 features in each transformer block. This allows the model to attend to both coarse and fine image details, improving consistency between the generated shape and the input image.
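A hedged sketch of this dual conditioning, with assumed dimensions and module names: two parallel cross-attention paths, one over global CLIP tokens and one over local DINOv2 tokens, whose outputs are both added to the residual stream.

```python
import torch.nn as nn

class DualCondAttention(nn.Module):
    """Parallel cross-attention over global CLIP and local DINOv2 features.

    Assumes both feature sets were already projected to width d upstream.
    """
    def __init__(self, d=512, n_heads=8):
        super().__init__()
        self.ca_clip = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ca_dino = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, x, clip_tokens, dino_tokens):
        h = self.norm(x)
        # Coarse (CLIP) and fine (DINOv2) image cues are attended separately
        # and both summed into the residual stream of every block.
        return x + self.ca_clip(h, clip_tokens, clip_tokens)[0] \
                 + self.ca_dino(h, dino_tokens, dino_tokens)[0]
```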

The model is trained using the Rectified Flow formulation, which models a linear trajectory between the noise and data distributions: $x_t = t x_0 + (1 - t)\epsilon$. This linear path is noted for simplifying network training and improving stability compared to DDPM or EDM. Logit-normal timestep sampling biases training toward intermediate timesteps, and resolution-dependent timestep shifting adjusts the sampled timesteps based on latent resolution to maintain consistent noise levels.
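A minimal rectified-flow training step under the interpolation above (so the velocity target is $x_0 - \epsilon$), with logit-normal timestep sampling; `model` is a placeholder, and the resolution-dependent shift is omitted:

```python
import torch

def rectified_flow_loss(model, x0, cond):
    # Linear interpolation x_t = t * x0 + (1 - t) * eps; velocity target is x0 - eps.
    eps = torch.randn_like(x0)
    # Logit-normal sampling: t = sigmoid(n), n ~ N(0, 1), concentrating mid-trajectory.
    t = torch.sigmoid(torch.randn(x0.shape[0], device=x0.device))
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))       # broadcast t over token/channel dims
    x_t = t_ * x0 + (1 - t_) * eps
    v_pred = model(x_t, t, cond)                   # placeholder flow-transformer call
    return ((v_pred - (x0 - eps)) ** 2).mean()
```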

Model and Resolution Scale-up

TripoSG employs a progressive scaling strategy:

  1. Latent Resolution: The flow model is trained initially at lower latent resolutions (512, then 2048 tokens). Due to the VAE's design, it can encode/decode at 4096 tokens without retraining, allowing the flow model training to scale directly to this higher resolution. RMSNorm is used in transformer blocks for stability during mixed-precision training at higher resolutions.
  2. Model Size (MoE): To scale the model parameters from 1.5B to approximately 4B while maintaining efficient inference, a Mixture-of-Experts (MoE) architecture is applied to the Feed-Forward Network (FFN) layers in the transformer blocks; specifically, MoE is used in the final six decoder layers. Each MoE layer consists of multiple parallel FFN experts, a gating mechanism that activates the top-K experts per token (top-2 plus a shared expert in TripoSG), and an auxiliary loss for expert balancing; a minimal sketch follows this list. The MoE model is initialized from a pre-trained dense model for training stability.
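A hedged sketch of such an MoE FFN with top-2 routing plus an always-active shared expert; the expert count and the simplified load-balancing penalty are assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn as nn

def ffn(d):
    return nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

class MoEFFN(nn.Module):
    def __init__(self, d=512, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([ffn(d) for _ in range(n_experts)])
        self.shared = ffn(d)                       # shared expert, always active
        self.gate = nn.Linear(d, n_experts)
        self.top_k = top_k

    def forward(self, x):                          # x: (B, T, d)
        probs = self.gate(x).softmax(dim=-1)       # (B, T, E) routing probabilities
        w, idx = probs.topk(self.top_k, dim=-1)    # route each token to its top-k experts
        out = self.shared(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e            # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] = out[mask] + w[..., k][mask].unsqueeze(-1) * expert(x[mask])
        # Simplified load-balancing penalty: minimized when expert usage is uniform.
        aux = (probs.mean(dim=(0, 1)) ** 2).sum() * len(self.experts)
        return out, aux
```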

Implementation and Evaluation

The full TripoSG model (4B parameters, 4096-token latent resolution) is trained progressively on the 2M-sample high-quality dataset for approximately 3 weeks on 160 A100 GPUs. Ablation studies on smaller datasets and models confirm the effectiveness of the data pipeline (quality matters more than raw quantity, and quantity helps once quality is high), the VAE improvements (SDF plus normal and eikonal supervision), the flow architecture enhancements (skip connections, Rectified Flow), and the scaling strategies (resolution, MoE).

Evaluation uses Normal-FID (comparing normal maps of generated vs. ground-truth shapes rendered from the input viewpoint) and GPTEval3D (using an LMM such as Claude 3.5 to rate generated 3D models across dimensions like plausibility, detail, and alignment). TripoSG demonstrates state-of-the-art performance across varied complex structures, styles, and details compared to previous methods.
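A hedged sketch of a Normal-FID computation, i.e., FID over normal maps rendered from the input viewpoint; torchmetrics' FrechetInceptionDistance is our choice of backend, as the paper does not specify its FID implementation:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def normal_fid(gt_normal_maps, gen_normal_maps):
    # Both inputs: uint8 tensors of shape (N, 3, H, W), normal vectors rendered
    # from the input viewpoint and mapped to [0, 255] RGB.
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(gt_normal_maps, real=True)
    fid.update(gen_normal_maps, real=False)
    return fid.compute()
```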

Texture Generation

While TripoSG primarily generates geometry (meshes via Marching Cubes from the decoded SDF), texture can be added. The high-quality normal maps rendered from the generated geometry can serve as conditional input to existing multi-view texture generation methods (e.g., Meta 3D TextureGen) to produce consistent texture images, which are then projected onto the mesh.

In conclusion, TripoSG advances 3D generation by combining a carefully curated large-scale dataset, an improved SDF-based VAE, and a scalable rectified flow transformer architecture with effective conditioning and MoE scaling, achieving significantly higher fidelity and stronger generalization than prior methods. This brings 3D generation closer to the success seen in large-scale 2D image and video generation models.