Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets (2505.07747v1)

Published 12 May 2025 in cs.CV

Abstract: While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing >5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with a diffusion-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail preservation. The diffusion-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving competitive quality with proprietary solutions. Notably, the framework uniquely bridges the 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques (e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.

Summary

  • The paper introduces a two-stage approach, pairing a hybrid VAE-DiT geometry generator with diffusion-based texture synthesis, to generate high-fidelity textured 3D assets from single-view imagery.
  • It employs a robust data curation and mesh processing pipeline that filters millions of assets and refines geometry for consistent texture mapping.
  • Experimental results demonstrate superior geometry-image alignment with high CLIP scores and effective user-controlled generation via LoRA.

Step1X-3D presents an open framework designed to address the challenges in 3D asset generation, specifically data scarcity, algorithmic limitations, and ecosystem fragmentation. The framework aims for high-fidelity, controllable generation of textured 3D assets from single images and promotes reproducibility through open-source release of data, models, and training code.

The framework consists of three main components:

  1. Data Curation Pipeline:
    • Processes over 5 million 3D assets from public (Objaverse, Objaverse-XL, ABO, 3D-FUTURE) and proprietary collections.
    • Implements a multi-stage filtering process that removes low-quality data based on texture quality (HSV analysis of rendered albedo maps), single-surface detection (Canonical Coordinate Maps with pixel-match checks), small object size, transparent materials (alpha-channel detection), incorrect normals, and mesh type/name filters (a texture-filter sketch appears after this list).
    • Ensures geometric consistency by converting non-watertight meshes to watertight representations. An enhanced mesh-to-SDF conversion method incorporates the generalized winding number to improve the conversion success rate, particularly for non-manifold objects (a winding-number sketch follows this list).
    • Samples points and normals for VAE training, adopting the Sharp Edge Sampling (SES) strategy from Dora (Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders, 23 Dec 2024) to capture fine details, and samples distinct point sets (volume, near-surface, on-surface) with TSDF values for supervision.
    • Prepares data for the diffusion model by rendering 20 random viewpoints (varying elevation, azimuth, and focal length) per asset and applying data augmentations (flipping, color jitter, rotation).
    • Results in a curated dataset of approximately 2 million high-quality assets, with around 800,000 derived from public data planned for open release.
  2. Step1X-3D Geometry Generation:
    • Uses the hybrid VAE-DiT architecture: a perceiver-based VAE encodes sampled surface points (with sharp edge sampling for detail preservation) into a compact latent space, and a diffusion transformer, conditioned on the input image, generates latents that are decoded into TSDF representations.
  3. Step1X-3D Texture Generation:
    • Follows the geometry generation stage.
    • Geometry Postprocess: Uses the trimesh toolkit for mesh refinement: ensuring watertightness (with hole filling), remeshing for uniform topology (subdivision and Laplacian smoothing), and UV parameterization using xAtlas (a post-processing sketch follows this list).
    • Texture Dataset Preparation: Curates a 30,000-asset subset from the cleaned Objaverse data, rendered from six canonical views to produce albedo, normal, and position maps at 768x768 resolution.
    • Geometry-guided Multi-view Image Generation:
      • Uses a diffusion model fine-tuned from MV-Adapter (MV-Adapter: Multi-view Consistent Image Generation Made Easy, 4 Dec 2024), which is built on SDXL (SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, 2023), to generate consistent multi-view images from a single input view and target camera poses. MV-Adapter's epipolar attention enables high-resolution generation (768x768), and its attention architecture balances generalization, multi-view consistency, and condition adherence.
      • Injects geometric guidance (normal and 3D position maps from the generated geometry) via image-based encoders and cross-attention mechanisms to improve detail synthesis and texture-geometry alignment.
      • Implements a texture-space synchronization module during diffusion inference: latent representations are unprojected into UV space, information from multiple views is fused based on the cosine similarity between view direction and surface normal, and the result is re-projected back to latent space, maintaining cross-view coherence and reducing artifacts (a fusion sketch follows this list).
    • Bake Texture: Upsamples the generated multi-view images to 2048x2048, back-projects them onto the mesh's UV map, and applies continuity-aware texture inpainting (similar to techniques in Hunyuan3D 2.0 (Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation, 21 Jan 2025) and Paint3D (Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models, 2023)) to address occlusions and discontinuities, producing seamless texture maps.
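
The pipeline steps above lend themselves to short illustrative sketches. First, the texture-quality and transparency filter: a rendered albedo map whose HSV statistics are near-constant suggests an untextured or washed-out asset, and a mostly transparent alpha channel flags transparent materials. All names and thresholds below are illustrative assumptions, not the paper's released code.

```python
import cv2
import numpy as np

def passes_texture_filter(albedo_rgb, alpha=None,
                          min_sat_std=0.02, min_val_std=0.02,
                          max_transparent_frac=0.05):
    """Reject flat or transparent assets. albedo_rgb is a uint8 HxWx3 render;
    alpha (if given) is a float map in [0, 1]. Thresholds are illustrative."""
    # Transparent-material check: too many low-alpha pixels -> reject.
    if alpha is not None and (alpha < 0.5).mean() > max_transparent_frac:
        return False
    hsv = cv2.cvtColor(albedo_rgb, cv2.COLOR_RGB2HSV).astype(np.float32) / 255.0
    # Near-constant saturation/value suggests an untextured or washed-out asset.
    return hsv[..., 1].std() > min_sat_std and hsv[..., 2].std() > min_val_std
```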
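Next, the winding-number idea behind the robust mesh-to-TSDF conversion. The sketch below computes the generalized winding number via the Van Oosterom-Strackee solid-angle formula to decide inside/outside even for non-watertight meshes, then signs the unsigned distance. This is a minimal illustration of the concept, not the paper's implementation; production code would vectorize the loop or use a fast winding-number library.

```python
import numpy as np
import trimesh

def winding_numbers(vertices, faces, queries):
    """Generalized winding number of each query point w.r.t. a triangle soup,
    summing per-triangle solid angles (Van Oosterom & Strackee)."""
    tri = np.asarray(vertices)[np.asarray(faces)]          # (F, 3, 3)
    w = np.zeros(len(queries))
    for i, q in enumerate(queries):                        # naive per-query loop
        a, b, c = tri[:, 0] - q, tri[:, 1] - q, tri[:, 2] - q
        la = np.linalg.norm(a, axis=1)
        lb = np.linalg.norm(b, axis=1)
        lc = np.linalg.norm(c, axis=1)
        det = np.einsum('ij,ij->i', a, np.cross(b, c))     # triple product
        denom = (la * lb * lc
                 + np.einsum('ij,ij->i', a, b) * lc
                 + np.einsum('ij,ij->i', b, c) * la
                 + np.einsum('ij,ij->i', c, a) * lb)
        # solid angle of each face is 2*atan2(det, denom); winding = sum/(4*pi)
        w[i] = np.arctan2(det, denom).sum() / (2.0 * np.pi)
    return w

def mesh_to_tsdf(mesh: trimesh.Trimesh, queries, truncation=0.05):
    """Unsigned distance from trimesh; sign from winding number (>0.5 = inside)."""
    _, dist, _ = trimesh.proximity.ProximityQuery(mesh).on_surface(queries)
    inside = winding_numbers(mesh.vertices, mesh.faces, queries) > 0.5
    sdf = np.where(inside, -dist, dist)
    return np.clip(sdf, -truncation, truncation)           # truncated SDF
```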
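The geometry post-process maps naturally onto trimesh and the xatlas-python bindings; a minimal sketch under those assumptions (iteration counts are illustrative):

```python
import trimesh
import xatlas

def postprocess(mesh: trimesh.Trimesh):
    trimesh.repair.fill_holes(mesh)                  # close holes -> watertight
    mesh = mesh.subdivide()                          # densify toward uniform topology
    trimesh.smoothing.filter_laplacian(mesh, iterations=5)  # smooth in place
    # UV-unwrap with xAtlas: returns a vertex remapping, new faces, and UVs.
    vmapping, faces, uvs = xatlas.parametrize(mesh.vertices, mesh.faces)
    return mesh.vertices[vmapping], faces, uvs
```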
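Finally, the texture-space synchronization step. The sketch below shows only the fusion, assuming per-view latents have already been unprojected onto a shared UV grid; shapes, names, and the sharpening exponent are assumptions for illustration.

```python
import torch

def fuse_uv_latents(uv_latents: torch.Tensor,   # (V, C, H, W) unprojected latents
                    uv_normals: torch.Tensor,   # (H, W, 3) surface normals in UV space
                    view_dirs: torch.Tensor,    # (V, 3) unit vectors, camera -> surface
                    visibility: torch.Tensor):  # (V, H, W) 1 where texel is visible
    # Cosine between surface normal and direction toward the camera, clamped >= 0.
    cos = torch.einsum('hwc,vc->vhw', uv_normals, -view_dirs).clamp(min=0.0)
    w = (cos * visibility) ** 4                      # sharpen toward frontal views
    w = w / (w.sum(dim=0, keepdim=True) + 1e-8)      # normalize weights over views
    fused = (uv_latents * w.unsqueeze(1)).sum(dim=0) # (C, H, W) fused UV latent
    return fused                                     # re-projected per view afterwards
```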

Controllable Generation:

Step1X-3D leverages the structural similarity between its VAE+Diffusion architecture and 2D image generation models (such as Stable Diffusion) to enable direct transfer of 2D control techniques (e.g., ControlNet (Adding Conditional Control to Text-to-Image Diffusion Models, 2023), IP-Adapter (IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models, 2023)) and parameter-efficient adaptation methods (such as LoRA (LoRA: Low-Rank Adaptation of Large Language Models, 2021)) to 3D. As part of the open-source release, LoRA is implemented for geometric shape control based on labels (e.g., symmetry, detail level). Training a small LoRA module applied to a condition branch allows specific aspects of generation to be fine-tuned, and conditional signals to be injected efficiently, without retraining the entire model. Future updates are planned for skeleton, bounding box, caption, and image prompt conditions.
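
For concreteness, a minimal LoRA layer of the kind that could wrap a linear projection in such a condition branch; rank, scaling, and placement are assumptions for illustration, not the released adaptation module.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # only the LoRA factors train
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```

In use, such a wrapper would replace, e.g., attention projections in the condition branch, so only the low-rank factors A and B receive gradients while the base model stays frozen.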

Experiments and Results:

Evaluations are conducted on a diverse benchmark of 110 images, comparing Step1X-3D against state-of-the-art open-source (Trellis (Structured 3D Latents for Scalable and Versatile 3D Generation, 2 Dec 2024), Hunyuan3D 2.0 (Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation, 21 Jan 2025), TripoSG (TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models, 10 Feb 2025)) and proprietary (Tripo-v2.5, Rodin-v1.5, Meshy-4) methods.

Limitations:

Current limitations include the grid resolution (256³) used for mesh-to-TSDF conversion, which limits geometric detail, and the texture pipeline's current focus on albedo generation, which lacks support for relighting and physically based rendering (PBR) materials. Future work aims to address these aspects.
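
The resolution limit is easy to see in the standard extraction step: a 256³ TSDF grid is converted back to a mesh with marching cubes, so features finer than one voxel (about 1/256 of the bounding box) cannot be represented. A sketch with scikit-image, using a hypothetical input file:

```python
import numpy as np
from skimage import measure

res = 256
tsdf = np.load("tsdf_256.npy")          # hypothetical (256, 256, 256) TSDF grid
# Extract the zero level set; detail below one voxel is irrecoverably lost.
verts, faces, normals, _ = measure.marching_cubes(tsdf, level=0.0)
verts = verts / (res - 1) * 2.0 - 1.0   # map voxel coordinates to [-1, 1]^3
```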

By combining a high-quality dataset, a novel two-stage 3D-native architecture, and mechanisms for controllable generation rooted in 2D paradigms, Step1X-3D provides a strong foundation and open resources for advancing the field of 3D asset generation.
