- The paper introduces a two-stage VAE and diffusion-based approach to generate high-fidelity 3D textured assets from single-view imagery.
- It employs a robust data curation and mesh processing pipeline that filters millions of assets and refines geometry for consistent texture mapping.
- Experimental results demonstrate superior geometry-image alignment with high CLIP scores and effective user-controlled generation via LoRA.
Step1X-3D presents an open framework designed to address the challenges in 3D asset generation, specifically data scarcity, algorithmic limitations, and ecosystem fragmentation. The framework aims for high-fidelity, controllable generation of textured 3D assets from single images and promotes reproducibility through open-source release of data, models, and training code.
The framework consists of three main components:
- Data Curation Pipeline:
- Processes over 5 million 3D assets from public (Objaverse, Objaverse-XL, ABO, 3D-FUTURE) and proprietary collections.
- Implements a multi-stage filtering process to remove low-quality data based on texture quality (using rendered albedo maps, HSV analysis), single-surface detection (using Canonical Coordinate Maps and checking pixel matches), small object size, transparent materials (alpha channel detection), incorrect normals, and mesh type/name filters.
- Ensures geometric consistency by converting non-watertight meshes to watertight representations. An enhanced mesh-to-SDF conversion method is introduced, incorporating the winding number concept to improve the conversion success rate, particularly for non-manifold objects (a minimal winding-number sketch follows this list).
- Samples points and normals for VAE training, including a Sharp Edge Sampling (SES) strategy from Dora (Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders, 23 Dec 2024) to capture details. Samples different point sets (volume, near-surface, on-surface) with TSDF values for supervision.
- Prepares data for the diffusion model by rendering 20 random viewpoints (with varying elevation, azimuth, focal length) for each model, applying data augmentations (flipping, color jitter, rotation).
- Results in a curated dataset of approximately 2 million high-quality assets, with around 800,000 derived from public data planned for open release.
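The paper only names the winding-number idea for robust mesh-to-SDF conversion; the following self-contained numpy sketch (my own illustration, not the authors' code) shows how the generalized winding number decides inside/outside even for open or non-manifold triangle soups, which supplies the sign needed to turn unsigned distances into a TSDF.

```python
import numpy as np

def generalized_winding_number(query, verts, faces):
    """Generalized winding number of `query` (N, 3) w.r.t. a triangle soup.

    Uses the van Oosterom-Strackee signed solid-angle formula; values near 1
    mean "inside", near 0 mean "outside", and the measure degrades gracefully
    for open or non-manifold meshes (the property exploited for mesh-to-SDF).
    """
    # Triangle corners relative to each query point: shape (N, F, 3).
    a = verts[faces[:, 0]][None] - query[:, None]
    b = verts[faces[:, 1]][None] - query[:, None]
    c = verts[faces[:, 2]][None] - query[:, None]
    la, lb, lc = (np.linalg.norm(x, axis=-1) for x in (a, b, c))
    det = np.einsum('nfi,nfi->nf', a, np.cross(b, c))
    denom = (la * lb * lc
             + np.einsum('nfi,nfi->nf', a, b) * lc
             + np.einsum('nfi,nfi->nf', b, c) * la
             + np.einsum('nfi,nfi->nf', c, a) * lb)
    solid_angles = 2.0 * np.arctan2(det, denom)
    return solid_angles.sum(axis=1) / (4.0 * np.pi)

# Toy example: unit tetrahedron (outward-oriented faces),
# one interior and one exterior query point.
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
faces = np.array([[0, 2, 1], [0, 1, 3], [0, 3, 2], [1, 2, 3]])
queries = np.array([[0.1, 0.1, 0.1], [2.0, 2.0, 2.0]])
w = generalized_winding_number(queries, verts, faces)
sign = np.where(w > 0.5, -1.0, 1.0)  # inside => negative TSDF, by convention
print(w)     # ~[1.0, 0.0]
print(sign)  # [-1.  1.]
```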
- Step1X-3D Geometry Generation:
- Employs a two-stage approach: a 3D Shape Variational Autoencoder (VAE) and a Rectified Flow Transformer diffusion model.
- 3D Shape VAE: Based on the latent vector set representation from 3DShape2VecSet (3DShape2VecSet: A 3D Shape Representation for Neural Fields and Generative Diffusion Models, 2023), adapted for scalability using a perceiver-based encoder-decoder architecture (Perceiver: General Perception with Iterative Attention, 2021). Utilizes Farthest Point Sampling (FPS) on both uniform and salient points to initialize latent queries. Incorporates Sharp Edge Sampling and Dual Cross Attention from Dora (Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders, 23 Dec 2024) to preserve geometric details. The encoder processes point positions (with Fourier positional encoding) and normals. The decoder uses cross-attention to predict TSDF values for query points in 3D space. Trained with an MSE loss on TSDF values and a KL divergence term for latent space regularization. Reconstructs meshes from the TSDF using Marching Cubes (Marching Cubes: A High Resolution 3D Surface Construction Algorithm, 1987) and uses Hierarchical Volume Decoding (Unleashing Vecset Diffusion Model for Fast Shape Generation, 20 Mar 2025) for faster inference.
- Rectified Flow Transformer: Adapts the MMDiT architecture from FLUX (Black Forest Labs, 2024) for 1D latent set processing. Uses a hybrid dual-stream/single-stream block structure for handling latent and condition tokens. Omits spatial positional embeddings for latent sets but keeps timestep embeddings. Uses DINOv2 (DINOv2: Learning Robust Visual Features without Supervision, 2023) and CLIP (Learning Transferable Visual Models From Natural Language Supervision, 2021) image encoders to extract conditional tokens from preprocessed single-view images, injected via parallel cross-attention.
- Training: Trained using a flow matching objective (Rectified Flow) (Flow Matching for Generative Modeling, 2022; Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow, 2022), predicting the velocity field. Uses a logit-normal sampling strategy to upweight intermediate timesteps. Maintains an Exponential Moving Average (EMA) of model parameters. Trained in two phases: initially with a smaller latent set size (512) for fast convergence, then scaled up (2048) with a reduced learning rate for capacity and precision (a minimal training-step sketch follows this list).
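As a concrete illustration of the training objective just described, here is a minimal PyTorch sketch of one flow-matching step with logit-normal timestep sampling and an EMA update. The `model(x_t, t, cond)` signature and the interpolation convention (data at t=0, noise at t=1) are my assumptions, not the released training code.

```python
import torch

def rectified_flow_loss(model, x0, cond):
    """One flow-matching training step (sketch, conventions assumed).

    x0:   clean VAE latent set, shape (B, num_latents, C)
    cond: image-conditioning tokens, shape (B, num_tokens, C_cond)
    """
    B = x0.shape[0]
    # Logit-normal timestep sampling: upweights intermediate t values.
    t = torch.sigmoid(torch.randn(B, device=x0.device))   # (B,)
    t_ = t.view(B, 1, 1)
    noise = torch.randn_like(x0)
    # Rectified-flow (linear) interpolation between data and noise.
    x_t = (1.0 - t_) * x0 + t_ * noise
    # The network regresses the constant velocity field noise - x0.
    v_pred = model(x_t, t, cond)
    return torch.mean((v_pred - (noise - x0)) ** 2)

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """Exponential moving average of parameters for the inference model."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.lerp_(p, 1.0 - decay)
```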
- Step1X-3D Texture Generation:
- Follows the geometry generation stage.
- Geometry Postprocess: Uses the trimesh toolkit for mesh refinement: ensuring watertightness (with hole-filling), remeshing for uniform topology (subdivision and Laplacian smoothing), and UV parameterization using xAtlas.
- Texture Dataset Preparation: Curates a 30,000-asset subset from the cleaned Objaverse data, rendered from six canonical views to produce albedo, normal, and position maps at 768x768 resolution.
- Geometry-guided Multi-view Images Generation:
- Uses a diffusion model fine-tuned from MV-Adapter (MV-Adapter: Multi-view Consistent Image Generation Made Easy, 4 Dec 2024), which is built on SD-XL (SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, 2023), to generate consistent multi-view images from a single input view and target camera poses. MV-Adapter's epipolar attention enables high-resolution generation (768x768), and its attention architecture balances generalization, multi-view consistency, and condition adherence.
- Injects geometric guidance (normal and 3D position maps from the generated geometry) via image-based encoders and cross-attention mechanisms to improve detail synthesis and texture-geometry alignment.
- Implements a texture-space synchronization module during inference within the diffusion model. This involves unprojecting latent representations to UV space, fusing information from multiple views based on view direction and normal cosine similarity, and re-projecting back to latent space, which maintains cross-view coherence and reduces artifacts (a simplified fusion sketch follows this list).
- Bake Texture: Upsamples the generated multi-view images to 2048x2048, back-projects them onto the mesh's UV map, and uses continuity-aware texture inpainting (similar to techniques in Hunyuan3D 2.0 (Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation, 21 Jan 2025) and Paint3D (Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models, 2023)) to address occlusions and discontinuities, producing seamless texture maps.
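To make the cross-view fusion step concrete, here is a simplified PyTorch sketch of cosine-weighted blending in UV space. It assumes the unprojection has already produced per-view UV maps and per-texel view/normal cosines; the helper name, the weighting exponent, and the omission of the re-projection step are my simplifications, not the paper's exact module.

```python
import torch

def fuse_views_in_uv(uv_colors, uv_cos, power=4.0, eps=1e-6):
    """Fuse per-view unprojected UV maps with visibility-aware weights.

    uv_colors: (V, H, W, C) latent/color values unprojected from each view
    uv_cos:    (V, H, W) cosine between view direction and surface normal;
               <= 0 where a texel is back-facing or occluded in that view
    """
    w = torch.clamp(uv_cos, min=0.0) ** power        # favor head-on views
    w = w / (w.sum(dim=0, keepdim=True) + eps)       # normalize over views
    return (w.unsqueeze(-1) * uv_colors).sum(dim=0)  # (H, W, C)

# Toy usage with random tensors standing in for real unprojections.
V, H, W, C = 6, 64, 64, 4
fused = fuse_views_in_uv(torch.rand(V, H, W, C), torch.rand(V, H, W) * 2 - 1)
print(fused.shape)  # torch.Size([64, 64, 4])
```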
Controllable Generation:
Step1X-3D leverages the structural similarity between its VAE+Diffusion architecture and 2D image generation models (like Stable Diffusion) to enable direct transfer of 2D control techniques (e.g., ControlNet (Adding Conditional Control to Text-to-Image Diffusion Models, 2023), IP-Adapter (IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models, 2023)) and parameter-efficient adaptation methods (like LoRA (LoRA: Low-Rank Adaptation of Large Language Models, 2021)) to 3D. As part of the open-source release, LoRA is implemented for geometric shape control based on labels (e.g., symmetry, detail level). This allows fine-tuning specific aspects of generation without retraining the entire model: a small LoRA module is trained on a condition branch, enabling efficient injection of conditional signals (a minimal LoRA sketch follows). Future updates are planned for skeleton, bounding box, caption, and image prompt conditions.
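For reference, this is a generic sketch of the LoRA technique named above: a frozen pretrained linear layer plus a trainable low-rank update, initialized so training starts from the unmodified model. The wrapper class and the "condition branch" projection it adapts are illustrative, not the released Step1X-3D implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W x + (alpha / r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # freeze pretrained weights
        self.down = nn.Linear(base.in_features, r, bias=False)
        self.up = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # zero init: identity update at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# Example: adapt one projection inside a (hypothetical) condition branch.
proj = LoRALinear(nn.Linear(1024, 1024))
out = proj(torch.randn(2, 77, 1024))
print(out.shape)  # torch.Size([2, 77, 1024])
```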
Experiments and Results:
Evaluations are conducted on a diverse benchmark of 110 images, comparing Step1X-3D against state-of-the-art open-source (Trellis (Structured 3D Latents for Scalable and Versatile 3D Generation, 2 Dec 2024), Hunyuan3D 2.0 (Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation, 21 Jan 2025), TripoSG (TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models, 10 Feb 2025)) and proprietary (Tripo-v2.5, Rodin-v1.5, Meshy-4) methods.
- Visual Quality: Step1X-3D generates plausible geometry (shown with normal maps) and texture (shown with multi-view renders) that maintain strong similarity to input images, complete occluded regions, and exhibit good texture-geometry alignment across various styles and complexities.
- Controllable Generation: Demonstrated through LoRA fine-tuning for symmetry and geometric detail control (sharp, normal, smooth), showing the model adheres to these controls, particularly in frontal views.
- Quantitative Comparisons: Uses multimodal embeddings (Uni3D (Uni3D: Exploring Unified 3D Representation at Scale, 2023) and OpenShape (OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding, 2023) with SparseConv and Point-BERT (Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling, 2021) backbones) for geometry-image alignment (Uni3D-I, OpenShape-sc-I, OpenShape-pb-I) and CLIP-Score (Learning Transferable Visual Models From Natural Language Supervision, 2021) for texture semantic alignment (a CLIP-Score sketch follows this list). Step1X-3D achieves the highest CLIP-Score and ranks among the top methods in geometric metrics, demonstrating robust performance.
- User Study: Participants rated objects on geometric plausibility, similarity to input, texture clarity, and texture-geometry alignment on a 5-point scale. Step1X-3D performs comparably to the best methods but the paper indicates that all methods still have significant room for improvement towards production-ready quality.
- Visual Comparisons: Side-by-side comparisons with SOTA methods, using pose-normalized renders in Unreal Engine, show Step1X-3D produces comparable or superior geometric and textural results.
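As an illustration of the CLIP-Score metric referenced above, here is a minimal sketch of an image-to-render similarity computation using the Hugging Face transformers CLIP API. The paper's exact protocol (number of views, averaging scheme, CLIP variant) is not specified in this summary, so treat this as an assumed setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed model variant; the paper may use a different CLIP checkpoint.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(input_image: Image.Image, rendered_views: list) -> float:
    """Mean cosine similarity between the input image embedding and the
    embeddings of multi-view renders of the generated asset."""
    with torch.no_grad():
        pix = processor(images=[input_image] + rendered_views,
                        return_tensors="pt")["pixel_values"]
        feats = model.get_image_features(pixel_values=pix)
        feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[1:] @ feats[0]).mean().item()
```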
Limitations:
Current limitations include the 256³ grid resolution used for mesh-to-TSDF conversion, which limits geometric detail, and the texture pipeline's current focus on albedo generation, which lacks support for relighting and Physically Based Rendering (PBR) materials. Future work aims to address these aspects.
By combining a high-quality dataset, a novel two-stage 3D-native architecture, and mechanisms for controllable generation rooted in 2D paradigms, Step1X-3D provides a strong foundation and open resources for advancing the field of 3D asset generation.