- The paper introduces a training-free, per-instance optimization framework that leverages frozen 3D diffusion models as shape priors.
- It jointly optimizes geometry, texture, and joint parameters from sparse-view images to achieve high-fidelity, kinematically accurate 3D models.
- Experimental results on PartNet-Mobility demonstrate substantial improvements over state-of-the-art baselines with competitive optimization times on a single GPU.
FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion
Introduction
FreeArt3D presents a training-free framework for articulated 3D object generation, leveraging pre-trained static 3D diffusion models (notably Trellis) as shape priors. The method circumvents the need for large-scale articulated datasets and task-specific training by extending Score Distillation Sampling (SDS) into the 3D-to-4D domain, treating articulation as an additional generative dimension. Given sparse-view images of an object in different joint states, FreeArt3D jointly optimizes geometry, texture, and articulation parameters, producing high-fidelity, textured, and kinematically accurate 3D models.
Methodology
Per-Shape Optimization Paradigm
FreeArt3D operates on a per-instance optimization paradigm. For each object, the input consists of $K$ RGB images $\{I_k\}_{k=1}^{K}$ captured at different articulation states $\{\theta_k\}_{k=1}^{K}$. The object is decomposed into a static body and a movable part, each represented by a separate hash grid modeling an occupancy field. Joint parameters $J$ (e.g., axis and pivot for a revolute joint; axis for a prismatic joint) and joint states $\theta_k$ are optimized alongside geometry.
Figure 1: FreeArt3D employs a per-shape optimization strategy, jointly optimizing body and part geometries, joint parameters, and optionally joint states from sparse-view images.
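The parameterization can be pictured as a small set of learnable tensors alongside the two occupancy networks. The PyTorch sketch below is illustrative only: names such as `ArticulationParams` and `transform_points` are assumptions rather than the paper's code, and it shows a single revolute joint.

```python
import torch

class ArticulationParams(torch.nn.Module):
    """Optimizable parameters for one revolute joint (illustrative; the
    paper optimizes analogous quantities: axis, pivot, per-state angles)."""
    def __init__(self, num_states: int):
        super().__init__()
        self.axis = torch.nn.Parameter(torch.tensor([0.0, 0.0, 1.0]))   # joint axis
        self.pivot = torch.nn.Parameter(torch.zeros(3))                 # point on the axis
        self.theta = torch.nn.Parameter(torch.zeros(num_states))        # angle per input state

    def transform_points(self, x: torch.Tensor, k: int) -> torch.Tensor:
        """Map world-space query points into the part's canonical frame by
        rotating by -theta_k about the (axis, pivot) line (Rodrigues formula)."""
        a = self.axis / self.axis.norm().clamp_min(1e-8)
        t = -self.theta[k]
        p = x - self.pivot
        cos_t, sin_t = torch.cos(t), torch.sin(t)
        p_rot = (p * cos_t
                 + torch.cross(a.expand_as(p), p, dim=-1) * sin_t
                 + a * (p @ a).unsqueeze(-1) * (1.0 - cos_t))
        return p_rot + self.pivot

# Occupancy of the articulated object at state k: union of the static body
# field and the movable-part field queried in its canonical pose.
def articulated_occupancy(body_field, part_field, joints, x, k):
    occ_body = body_field(x)
    occ_part = part_field(joints.transform_points(x, k))
    return torch.maximum(occ_body, occ_part)
```

Querying the part field at back-rotated points is what lets a single canonical part geometry explain every articulation state.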
3D Diffusion Model Guidance
The optimization is guided by a frozen pre-trained 3D diffusion model (Trellis), which provides gradient signals for both geometry and joint parameters. At each iteration, the movable part is transformed according to the current joint state, merged with the static body, and the resulting occupancy grid is encoded and passed to the diffusion model. The SDS loss, computed in the latent space, distills denoising gradients from the diffusion model, while a voxel-space reconstruction loss stabilizes optimization.
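A minimal sketch of one such guidance step, assuming a generic latent-diffusion interface: `encode` and `denoiser` are stand-ins for the frozen Trellis encoder and noise predictor (their real signatures differ), and the linear noising schedule is an assumption.

```python
import torch
import torch.nn.functional as F

def sds_step(encode, denoiser, occ_grid, cond, t_range=(0.02, 0.98)):
    """One Score Distillation Sampling step in latent space (a sketch; the
    diffusion model stays frozen and only supplies gradients)."""
    z = encode(occ_grid)                                   # latent of the merged occupancy grid
    t = torch.empty(1, device=z.device).uniform_(*t_range) # random diffusion timestep
    noise = torch.randn_like(z)
    z_t = (1.0 - t) * z + t * noise                        # linear noising schedule (assumed)
    with torch.no_grad():
        noise_pred = denoiser(z_t, t, cond)                # frozen model predicts the noise
    # SDS gradient trick: dL/dz is proportional to (noise_pred - noise),
    # pushing the latent toward regions the diffusion model finds likely.
    grad = noise_pred - noise
    loss_sds = F.mse_loss(z, (z - grad).detach())
    return loss_sds
```

In the full objective this term would be combined with the voxel-space reconstruction loss described above.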
Scale Normalization and Initialization
A critical challenge is scale inconsistency across joint states due to normalization in Trellis. FreeArt3D introduces a fixed reference disk beneath the object in all views, serving as a spatial anchor to ensure consistent scale and alignment.
Figure 2: Disk normalization ensures consistent component scales across joint states, stabilizing optimization.
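One way to realize this virtually is to stamp a fixed disk into every occupancy grid before encoding, so the bounding box that drives normalization is identical across states. The sketch below assumes a [-0.5, 0.5]^3 grid and illustrative disk dimensions, neither of which is taken from the paper.

```python
import torch

def add_reference_disk(occ, radius=0.45, thickness=0.02, z0=-0.48):
    """Stamp a fixed disk beneath the object in a [-0.5, 0.5]^3 occupancy
    grid. Because the disk dominates the extent, bounding-box
    normalization becomes identical across joint states."""
    n = occ.shape[-1]
    lin = torch.linspace(-0.5, 0.5, n, device=occ.device)
    x, y, z = torch.meshgrid(lin, lin, lin, indexing="ij")
    disk = (x**2 + y**2 <= radius**2) & (z - z0).abs().le(thickness / 2)
    return torch.maximum(occ, disk.to(occ.dtype))
```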
Joint parameters are initialized via cross-state correspondences: Trellis is run independently on each input image, meshes are rendered, and 2D correspondences are detected and lifted to 3D. These are used to estimate initial joint parameters, which are then refined during optimization.
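The classical machinery behind such an initialization is a rigid Kabsch alignment of corresponding 3D points, followed by axis-angle and fixed-point extraction. The NumPy sketch below is a textbook estimator for a revolute joint, not the paper's exact procedure.

```python
import numpy as np

def init_revolute_joint(P, Q):
    """Estimate (axis, pivot, angle) of a revolute joint from 3D point
    correspondences P -> Q on the movable part in two states.
    P, Q: (N, 3) arrays of corresponding points."""
    # Kabsch: best-fit rotation R and translation t with Q ~= R @ P + t.
    Pc, Qc = P - P.mean(0), Q - Q.mean(0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = Q.mean(0) - R @ P.mean(0)
    # Angle from the trace; axis is the eigenvector of R with eigenvalue 1.
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    w, V = np.linalg.eig(R)
    axis = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    axis /= np.linalg.norm(axis)
    # Pivot: least-squares fixed point of x = R @ x + t (defined only up
    # to translation along the axis, since I - R is singular there).
    pivot, *_ = np.linalg.lstsq(np.eye(3) - R, t, rcond=None)
    return axis, pivot, angle
```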
Fine-Grained Geometry and Texture Generation
After coarse optimization, occupancy grids are denoised and cleaned. The cleaned grid is used to construct a sparse latent feature volume, which is denoised and decoded by the second-stage Trellis model to produce FlexiCubes coefficients (for mesh extraction) and Gaussian Splatting parameters (for texture synthesis). Meshes for the static and movable parts are extracted and textured, then combined with optimized joint parameters to yield the final articulated object.
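The paper does not spell out the cleaning step; a plausible reading is thresholding followed by morphological closing and largest-connected-component filtering, sketched here with SciPy.

```python
import numpy as np
from scipy import ndimage

def clean_occupancy(occ, threshold=0.5):
    """Binarize a coarse occupancy grid and strip floating artifacts
    (one plausible reading of the refinement step, not a confirmed one).
    occ: (D, D, D) float array of occupancy probabilities."""
    binary = occ > threshold
    binary = ndimage.binary_closing(binary)           # seal small holes
    labels, n = ndimage.label(binary)                 # connected components
    if n > 1:
        sizes = ndimage.sum(binary, labels, range(1, n + 1))
        binary = labels == (np.argmax(sizes) + 1)     # keep the largest one
    return binary
```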
Experimental Results
Quantitative and Qualitative Evaluation
FreeArt3D is evaluated on the PartNet-Mobility dataset across 12 categories, using metrics including Chamfer Distance (CD), F-Score, CLIP similarity, joint axis direction error, and joint pivot error. The method significantly outperforms state-of-the-art baselines (Articulate-Anything, Singapo, URDFormer, PARIS) across all metrics, demonstrating superior geometric fidelity, texture realism, and kinematic accuracy.
Figure 3: Comparison between Singapo, Articulate-Anything, and FreeArt3D. FreeArt3D reconstructs detailed geometry and textures, closely matching input images and ground truth.
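For reference, CD and F-Score have standard point-cloud definitions; the sketch below computes both with SciPy KD-trees, though the sampling density and F-Score threshold used in the paper may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_fscore(pred, gt, tau=0.02):
    """Chamfer Distance and F-Score between point clouds sampled from the
    predicted and ground-truth meshes (standard definitions; tau assumed).
    pred, gt: (N, 3) and (M, 3) arrays."""
    d_pg = cKDTree(gt).query(pred)[0]   # pred -> gt nearest distances
    d_gp = cKDTree(pred).query(gt)[0]   # gt -> pred nearest distances
    cd = d_pg.mean() + d_gp.mean()
    precision = (d_pg < tau).mean()
    recall = (d_gp < tau).mean()
    f = 2 * precision * recall / max(precision + recall, 1e-8)
    return cd, f
```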
Runtime analysis shows FreeArt3D completes per-shape optimization in approximately 10 minutes on a single NVIDIA H100 GPU, which is competitive given the quality and generalizability achieved.
Real-World Applicability
FreeArt3D generalizes to real-world images captured with consumer devices, requiring only sparse views and no camera pose information. The disk normalization can be implemented physically or virtually.
Figure 4: FreeArt3D reconstructs high-quality articulated objects from casually captured, unposed real-world images.
Multi-Joint Extension
The framework is readily extensible to objects with multiple joints and mixed joint types by representing each part with a separate hash grid and jointly optimizing all geometries and joint parameters.
Figure 5: FreeArt3D supports generation of objects with multiple articulated parts and joints, enabling flexible configuration.
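Concretely, the single-joint occupancy union generalizes to a loop over parts, each with its own field and joint. The sketch below extends the earlier illustrative PyTorch code and is likewise an assumption about structure, not the paper's implementation.

```python
import torch

def multi_joint_occupancy(body_field, part_fields, joints, x, k):
    """Occupancy of an object with several articulated parts: the union of
    the static body and every part queried in its own canonical frame.
    Each joints[i].transform_points may implement revolute or prismatic
    motion; one hash-grid field and one joint per movable part."""
    occ = body_field(x)
    for part_field, joint in zip(part_fields, joints):
        occ = torch.maximum(occ, part_field(joint.transform_points(x, k)))
    return occ
```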
Ablation Studies
Ablation studies confirm the necessity of SDS loss, voxel loss, disk normalization, joint initialization, hash grid representation, and voxel refinement. Increasing the number of input views improves performance, while the method remains robust with minimal input.
Figure 6: Ablation study demonstrating the impact of occupancy refinement after coarse geometry optimization.
Failure case analysis indicates robustness, with error rates below 10% for incorrect axis, pivot, or segmentation when using six input images and disk normalization.
Implementation Considerations
- Computational Requirements: The method is GPU-intensive, with per-shape optimization requiring several thousand iterations. A hash grid representation (TinyCudaNN) is recommended for efficiency and stability; see the sketch after this list.
- Input Requirements: Sparse-view images in different joint states; disk normalization is essential for scale consistency.
- Generalizability: The approach is not limited by predefined categories or part templates, enabling open-world applicability.
- Limitations: Optimization speed and robustness could be further improved; normalization strategies may be refined for more complex scenes.
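As referenced in the list above, a hash-grid occupancy field can be set up with tiny-cuda-nn in a few lines; the hyperparameters below are common Instant-NGP-style defaults rather than values reported by FreeArt3D.

```python
import torch
import tinycudann as tcnn

# A hash-grid occupancy field in the style the paper describes; the
# encoding/network settings are typical defaults, not the paper's values.
occupancy_field = tcnn.NetworkWithInputEncoding(
    n_input_dims=3,
    n_output_dims=1,
    encoding_config={
        "otype": "HashGrid",
        "n_levels": 16,
        "n_features_per_level": 2,
        "log2_hashmap_size": 19,
        "base_resolution": 16,
        "per_level_scale": 2.0,
    },
    network_config={
        "otype": "FullyFusedMLP",
        "activation": "ReLU",
        "output_activation": "None",
        "n_neurons": 64,
        "n_hidden_layers": 2,
    },
)

def query_occupancy(x: torch.Tensor) -> torch.Tensor:
    """Query occupancy at points x in [0, 1]^3 (tiny-cuda-nn expects inputs
    in the unit cube); sigmoid maps raw logits to [0, 1]."""
    return torch.sigmoid(occupancy_field(x))
```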
Implications and Future Directions
FreeArt3D demonstrates that training-free, per-instance optimization guided by static 3D diffusion models is a viable paradigm for articulated object generation, overcoming dataset scarcity and generalization limitations of retrieval-based and template-based methods. The approach is directly applicable to robotics, AR/VR, and digital twin pipelines, where accurate geometry, texture, and kinematic structure are critical.
Future work may focus on accelerating optimization, improving robustness to challenging input conditions, and exploring more sophisticated normalization and initialization strategies. Integration with large-scale 3D generative models and further automation of joint parameter estimation are promising directions.
Conclusion
FreeArt3D establishes a new standard for articulated 3D object generation, achieving high-fidelity geometry, realistic textures, and accurate kinematic modeling from sparse input views without task-specific training. The method's extensibility, generalizability, and real-world applicability position it as a practical solution for diverse applications in computer vision, graphics, and robotics.