FreeArt3D: Articulated 3D Reconstruction

Updated 13 November 2025

FreeArt3D is a training-free framework for articulated 3D object reconstruction, integrating high-fidelity static diffusion priors.
It extends Score Distillation Sampling into a 3D-to-4D regime by jointly optimizing geometry, texture, and joint parameters.
The method achieves state-of-the-art performance on geometry, appearance, and kinematic metrics in minutes, supporting diverse real-world applications.

FreeArt3D is a training-free, per-instance optimization framework for articulated 3D object reconstruction and generation. It is designed to recover high-fidelity geometry, realistic textures, and accurate kinematic structures of articulated objects using sparse sets of images depicting different articulation states. The framework leverages a pre-trained static 3D diffusion model (notably, Trellis) as a shape and appearance prior, circumventing the need for large, labeled datasets and task-specific retraining. FreeArt3D introduces a 3D-to-4D extension of Score Distillation Sampling (SDS), incorporating articulation as an additional generative dimension and jointly optimizing geometry, texture, and joint parameters per object instance in minutes.

1. Motivation and Key Innovations

Modeling articulated 3D objects is fundamental for applications in robotics, augmented/virtual reality, and animation. Conventional techniques rely either on optimization-based multi-view reconstruction, which typically requires dense views, or on generative models, which often yield limited geometric and appearance fidelity and rarely accommodate articulation. The success of native 3D diffusion models like Trellis for static object generation motivates FreeArt3D's approach, which sidesteps the data scarcity and retraining challenges in the articulated case.

FreeArt3D's primary innovation is the extension of the SDS mechanism into the 3D-to-4D regime, treating the articulation state $\theta$ as an additional generative variable. This is accomplished by embedding articulated pose into the optimization loop, allowing the framework to produce articulated object reconstructions with precise mesh quality and kinematic accuracy, outperforming prior state-of-the-art methods on several benchmarks.

2. Mathematical Framework

2.1 Extended Score Distillation Sampling (SDS)

Given image observations $I_k$ of an object in $K$ articulation states, the optimizable parameter vector $\psi$ comprises the continuous multi-level hash grid weights, joint parameters, and articulation states for each view. Starting from a VAE-encoded latent $z = E_{\mathrm{occ}}(x)$ for occupancy grid $x$, SDS is defined through its gradient, computed with a frozen rectified flow model $\mathcal{RF}_{\mathrm{occ}}$: $\nabla_\psi \mathcal{L}_{\mathrm{SDS}}(\psi) = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\left(\hat\epsilon_{\mathcal{RF}_{\mathrm{occ}}}(z_t;I_k,t)-\epsilon\right)\,\frac{\partial z_t}{\partial\psi} \right],$ where $z_t = \alpha_t z + \sigma_t \epsilon$, $\epsilon\sim\mathcal{N}(0,I)$, and $w(t)$ is a noise-level-dependent weight. Gradients are chained through the encoder and the occupancy grid to all optimizable parameters.
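The chain-rule structure of this gradient can be illustrated with a toy example. The sketch below is not the FreeArt3D implementation: `toy_eps_pred` is a hypothetical stand-in for the frozen model, the schedule $\alpha_t = 1 - t$, $\sigma_t = t$ and the weight $w(t) = 1$ are assumptions, image conditioning is omitted, and the latent itself plays the role of $\psi$ (so $\partial z_t / \partial \psi = \alpha_t$).

```python
import numpy as np

def toy_eps_pred(z_t, t, z_target):
    """Hypothetical stand-in for the frozen model: the noise implied by z_t
    if the clean latent were z_target (assumed schedule alpha=1-t, sigma=t)."""
    alpha_t, sigma_t = 1.0 - t, t
    return (z_t - alpha_t * z_target) / sigma_t

def sds_grad_wrt_latent(z, z_target, rng, t_range=(0.5, 0.8)):
    """One-sample SDS gradient estimate with respect to the latent z."""
    t = rng.uniform(*t_range)              # noise level (uniform sampling assumed)
    alpha_t, sigma_t = 1.0 - t, t
    eps = rng.standard_normal(z.shape)     # epsilon ~ N(0, I)
    z_t = alpha_t * z + sigma_t * eps      # forward noising
    eps_hat = toy_eps_pred(z_t, t, z_target)
    w = 1.0                                # w(t) = 1 is an assumption
    return w * (eps_hat - eps) * alpha_t   # dz_t/dz = alpha_t

rng = np.random.default_rng(0)
z_target = np.array([1.0, -2.0, 0.5])      # what the toy prior "believes in"
z = np.zeros(3)
for _ in range(500):
    z -= 0.5 * sds_grad_wrt_latent(z, z_target, rng)   # plain gradient descent
print(z)
```

In this toy setting the sampled noise cancels exactly, so each step pulls `z` toward `z_target`. With a real denoiser the estimate is noisy, and the gradient would be propagated further back through the encoder and occupancy grid to the hash grid and joint parameters.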

2.2 Voxel-space Regularization

An additional $L_2$ reconstruction loss at the voxel level,

$\mathcal{L}_{\mathrm{vox}} = \| D_{\mathrm{occ}}(\hat z_0) - x \|_2^2,$

encourages the occupancy decoded from the denoised latent, $D_{\mathrm{occ}}(\hat z_0)$, to match the grid-based estimate $x$. The total coarse-stage loss is

$\mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{SDS}} \mathcal{L}_{\mathrm{SDS}} + \lambda_{\mathrm{vox}} \mathcal{L}_{\mathrm{vox}},$

with $\lambda_{\mathrm{SDS}} = 0.1$ and $\lambda_{\mathrm{vox}} = 1.0$.
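As a small worked example of how the two terms combine, the sketch below evaluates the voxel loss on a tiny grid and forms the weighted sum. The decoded occupancy is faked by perturbing the grid, and the SDS term is a placeholder scalar (in practice SDS is applied through its gradient); both are assumptions made only for illustration.

```python
import numpy as np

LAMBDA_SDS = 0.1   # weights as stated in the text
LAMBDA_VOX = 1.0

def voxel_loss(decoded_occ, occ_grid):
    """Squared L2 distance between the decoded occupancy and the grid x."""
    diff = np.asarray(decoded_occ, dtype=float) - np.asarray(occ_grid, dtype=float)
    return float(np.sum(diff ** 2))

def total_loss(sds_term, vox_term):
    """Weighted sum of the two coarse-stage terms."""
    return LAMBDA_SDS * sds_term + LAMBDA_VOX * vox_term

# Tiny 4x4x4 grid with a solid 2x2x2 block in the middle.
x = np.zeros((4, 4, 4))
x[1:3, 1:3, 1:3] = 1.0

# Stand-in for D_occ(z0_hat): the same grid with one voxel off by 0.5.
decoded = x.copy()
decoded[0, 0, 0] = 0.5

vox = voxel_loss(decoded, x)            # 0.5**2 = 0.25
total = total_loss(2.0, vox)            # 0.1 * 2.0 + 1.0 * 0.25 = 0.45
print(vox, total)
```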

2.3 Articulated Parametrization

Objects are factorized into a static body ($\mathcal{M}_{\text{body}}$) and a movable part ($\mathcal{M}_{\text{part}}$), with articulation specified by joint parameters $\mathcal{J}$.

Revolute joint: axis $\mathbf{a}\in\mathbb{R}^3$, pivot $\mathbf{p}\in\mathbb{R}^3$, and state $\theta\in\mathbb{R}$.
Prismatic joint: axis $\mathbf{a}$ and translation $\theta$.
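The two joint types above correspond to simple rigid transformations. The following sketch shows one common way to realize them, using Rodrigues' rotation formula for the revolute case. The function names are hypothetical, and treating $\theta$ as radians (revolute) or as a distance along the unit axis (prismatic) is an assumption, not something the text specifies.

```python
import numpy as np

def revolute_transform(points, axis, pivot, theta):
    """Rotate points by theta (radians, assumed) about the line through
    `pivot` with direction `axis`, using Rodrigues' formula."""
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    K = np.array([[0.0, -a[2], a[1]],
                  [a[2], 0.0, -a[0]],
                  [-a[1], a[0], 0.0]])            # cross-product matrix of a
    R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    p = np.asarray(pivot, dtype=float)
    return (np.asarray(points, dtype=float) - p) @ R.T + p

def prismatic_transform(points, axis, theta):
    """Translate points by theta along the normalized axis."""
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    return np.asarray(points, dtype=float) + theta * a

# A point one unit from the pivot, rotated a quarter turn about the z-axis.
rotated = revolute_transform([[2.0, 0.0, 0.0]], axis=[0, 0, 1],
                             pivot=[1.0, 0.0, 0.0], theta=np.pi / 2)
# The origin slid 0.3 units along z (the axis is normalized internally).
slid = prismatic_transform([[0.0, 0.0, 0.0]], axis=[0, 0, 2], theta=0.3)
print(rotated, slid)
```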

The occupancy grid at joint state $\theta_k$ is: $x(c) = \max\Bigl( \mathrm{Occ}_{\text{body}}(c), \mathrm{Occ}_{\text{part}}(\mathcal{T}^{-1}_{\theta_k}(c)) \Bigr),$ where $\mathcal{T}^{-1}_{\theta_k}$ is the inverse articulation transformation.
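The composition rule can be sketched on a small grid. In this illustration the body and part are hard axis-aligned boxes and the joint is prismatic, so $\mathcal{T}^{-1}_{\theta}(c) = c - \theta\,\mathbf{a}$; the box helper, the grid resolution, and the binary occupancy values are all assumptions chosen for readability (the method itself optimizes a continuous hash-grid representation).

```python
import numpy as np

def box_occupancy(lo, hi):
    """Return a function giving 1.0 inside the box [lo, hi] and 0.0 outside."""
    lo, hi = np.asarray(lo, dtype=float), np.asarray(hi, dtype=float)
    def occ(c):
        return np.all((c >= lo) & (c <= hi), axis=-1).astype(float)
    return occ

def articulated_occupancy(coords, occ_body, occ_part, axis, theta):
    """x(c) = max(Occ_body(c), Occ_part(T^{-1}(c))) for a prismatic joint."""
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    canonical = coords - theta * a          # inverse transform: undo the slide
    return np.maximum(occ_body(coords), occ_part(canonical))

# Cell centers of an 8^3 grid over the unit cube.
n = 8
lin = (np.arange(n) + 0.5) / n
coords = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1)

occ_body = box_occupancy([0.0, 0.0, 0.0], [0.5, 1.0, 1.0])         # cabinet
occ_part = box_occupancy([0.5, 0.25, 0.25], [0.75, 0.75, 0.75])    # drawer, closed pose

x_closed = articulated_occupancy(coords, occ_body, occ_part, [1, 0, 0], 0.0)
x_open = articulated_occupancy(coords, occ_body, occ_part, [1, 0, 0], 0.25)
print(int(x_closed.sum()), int(x_open.sum()))
```

Querying the part at the inverse-transformed coordinate means the part's canonical shape never has to be resampled; only the lookup location changes with $\theta_k$.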

All parameters in $\psi$ are optimized jointly by following $\nabla_\psi \mathcal{L}_{\mathrm{total}}$ with Adam (learning rate $10^{-2}$, $3000$ iterations).
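For readers unfamiliar with the optimizer, here is a minimal Adam loop using the stated learning rate and iteration count. The objective is a toy quadratic standing in for $\mathcal{L}_{\mathrm{total}}$, and the moment coefficients (`beta1`, `beta2`, `eps`) are the commonly used defaults, which the text does not confirm.

```python
import math

def adam_minimize(grad_fn, x0, lr=1e-2, steps=3000,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Plain-Python Adam; beta1/beta2/eps are common defaults (assumed)."""
    x = list(x0)
    m = [0.0] * len(x)   # first-moment (mean) estimate
    v = [0.0] * len(x)   # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad_fn(x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            m_hat = m[i] / (1.0 - beta1 ** t)   # bias correction
            v_hat = v[i] / (1.0 - beta2 ** t)
            x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy objective: sum_i (x_i - target_i)^2, whose gradient is 2 * (x - target).
target = [1.0, -0.5]

def grad(x):
    return [2.0 * (xi - ti) for xi, ti in zip(x, target)]

result = adam_minimize(grad, [0.0, 0.0])
print(result)
```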

3. Algorithmic Pipeline

The FreeArt3D per-object optimization workflow comprises several distinct phases:

Step	Description	Details
1	Normalization	Insert a fixed reference disk beneath the object in all images.
2	Initialization	(a) Trellis inference per view $\rightarrow$ rough mesh; (b) 2D correspondence detection via LoFTR; (c) Lift 2D to 3D correspondence and estimate $\mathcal{J},\theta_k$ ; (d) Hash grid initialization from one view.
3	Coarse Optimization	For $3000$ steps: sample $k$ , form $x$ via articulated occupancy, encode to latent, compute SDS and voxel gradients, update $\psi$ .
4	Occupancy Refinement & Cleaning	Denoise max-joint occupancy, remove disk and outliers.
5	Fine Geometry & Texture	Build sparse feature volume, denoise, decode, mesh extraction (FlexiCubes), texture baking (Gaussian Splatting).
6	Output	Articulated mesh $\mathcal{M}_{\text{body}}$ , $\mathcal{M}_{\text{part}}$ and $\mathcal{J}$ .
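Step 2(c) estimates joint parameters from lifted 3D correspondences, but the exact procedure is not described here. One plausible way to do this for a revolute joint, shown purely as an assumption, is to fit a rigid transform between corresponding part points with the Kabsch algorithm and read off the axis, angle, and a pivot from it. The function name and the synthetic test data are hypothetical.

```python
import numpy as np

def estimate_revolute_joint(src, dst):
    """Fit dst ~= R @ src + t (Kabsch), then extract axis, pivot and angle.
    Assumes a pure rotation with angle strictly between 0 and pi."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c

    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    axis = axis / np.linalg.norm(axis)

    # Points on the axis are fixed: (I - R) c = t. The pseudo-inverse picks
    # the solution closest to the origin; rcond drops the null direction.
    pivot = np.linalg.pinv(np.eye(3) - R, rcond=1e-8) @ t
    return axis, pivot, angle

# Synthetic check: rotate random points by 0.7 rad about a vertical axis.
rng = np.random.default_rng(0)
true_pivot = np.array([1.0, 2.0, 0.0])
true_angle = 0.7
c, s = np.cos(true_angle), np.sin(true_angle)
R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
src = rng.uniform(-1.0, 1.0, size=(20, 3))
dst = (src - true_pivot) @ R_true.T + true_pivot

axis, pivot, angle = estimate_revolute_joint(src, dst)
print(axis, pivot, angle)
```

With noisy real correspondences such an estimate would only serve as an initialization, which fits the role of step 2 before the coarse optimization refines $\mathcal{J}$ and $\theta_k$.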

Key operational parameters include $K=6$ multi-state images, $3000$ optimization steps, SDS noise level $[0.5, 0.8]$ , and reference disk radius aligned with the object’s footprint. Convergence is achieved within approximately $10$ minutes on an NVIDIA H100 GPU.
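The hyperparameters quoted in this article can be gathered into a single configuration object. The sketch below is a hypothetical layout, not the project's actual configuration; in particular, drawing the view index and the noise level uniformly at each step is an assumption.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class CoarseStageConfig:
    """Values quoted in the text; field names are illustrative."""
    num_states: int = 6                 # K multi-state input images
    num_steps: int = 3000               # coarse optimization iterations
    learning_rate: float = 1e-2         # Adam learning rate
    noise_range: tuple = (0.5, 0.8)     # SDS noise level interval
    lambda_sds: float = 0.1
    lambda_vox: float = 1.0

def sample_step(cfg, rng):
    """Pick an articulation state k and a noise level t for one iteration
    (uniform sampling is an assumption)."""
    k = rng.randrange(cfg.num_states)
    t = rng.uniform(*cfg.noise_range)
    return k, t

cfg = CoarseStageConfig()
rng = random.Random(0)
samples = [sample_step(cfg, rng) for _ in range(1000)]
print(samples[:3])
```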

4. Repurposing Static 3D Diffusion Models

FreeArt3D uses only frozen Trellis rectified flow models and decoders throughout optimization; no retraining on articulated data is required. Instead, articulation is embedded directly into the SDS loop: each posed occupancy grid is guided by the model conditioned on the corresponding observation. Gradients from $\mathcal{RF}_{\mathrm{occ}}$ act as a consistent 3D shape prior, providing robust pose-conditioned guidance and mitigating the Janus ambiguities seen in 2D-to-3D SDS. This strategy exploits the rich structural prior captured by Trellis for static shapes while enabling articulation modeling via per-instance gradient-based optimization.

5. Experimental Results

FreeArt3D is evaluated on PartNet-Mobility (12 categories, 144 shapes), using the following metrics:

Geometry: Chamfer Distance (CD) on 100K points; F-Score (threshold $= 0.05$)
Appearance: CLIP-SIM (ViT-L/14@336px, 5 rendered views)
Articulation: axis-error $e_\text{axis}$, pivot-error $e_\text{pivot}$
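The geometry and articulation metrics can be sketched as follows. Exact definitions vary between papers, so these are assumptions: Chamfer Distance as the sum of mean nearest-neighbor distances in both directions, F-Score as the harmonic mean of precision and recall at the threshold, axis error as the sign-invariant angle between directions (in radians), and pivot error as the distance from the predicted pivot to the ground-truth axis line. The brute-force neighbor search is only suitable for small clouds; 100K points would call for a KD-tree.

```python
import numpy as np

def _nn_dists(a, b):
    """Distance from each point in a to its nearest neighbor in b (brute force)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1)

def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance (sum of the two mean NN distances, assumed)."""
    return float(_nn_dists(pred, gt).mean() + _nn_dists(gt, pred).mean())

def f_score(pred, gt, threshold=0.05):
    """Harmonic mean of precision and recall at the given distance threshold."""
    precision = float((_nn_dists(pred, gt) < threshold).mean())
    recall = float((_nn_dists(gt, pred) < threshold).mean())
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def axis_error(a_pred, a_gt):
    """Angle in radians between two axis directions, ignoring their sign."""
    a_pred, a_gt = np.asarray(a_pred, float), np.asarray(a_gt, float)
    cos = abs(np.dot(a_pred, a_gt)) / (np.linalg.norm(a_pred) * np.linalg.norm(a_gt))
    return float(np.arccos(np.clip(cos, 0.0, 1.0)))

def pivot_error(p_pred, p_gt, a_gt):
    """Distance from the predicted pivot to the ground-truth axis line."""
    a = np.asarray(a_gt, float) / np.linalg.norm(a_gt)
    d = np.asarray(p_pred, float) - np.asarray(p_gt, float)
    return float(np.linalg.norm(d - np.dot(d, a) * a))

gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
pred = gt + np.array([0.1, 0.0, 0.0])      # every point shifted by 0.1 along x
print(chamfer_distance(pred, gt), f_score(pred, gt))
```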

Quantitative results for five representative categories:

Method	F-Score	CD	CLIP-SIM	axis-err	pivot-err
FreeArt3D	0.883	0.026	0.882	0.208	0.092
Articulate-Anything	0.817	0.036	0.876	0.527	0.136
Singapo	0.832	0.031	0.859	0.247	0.094

FreeArt3D achieves the best score on every metric in the table. The largest margin is on axis error (0.208 versus 0.247 for Singapo and 0.527 for Articulate-Anything), while the pivot-error gain over Singapo is small (0.092 versus 0.094). Qualitatively, it reconstructs intricate details such as handle curves and textures, and accurately recovers joint axes where retrieval- and template-based methods underperform.
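The size of these margins is easier to judge as relative changes. The short script below recomputes them from the table values; the dictionary layout and the helper name are illustrative only.

```python
# Values copied from the results table above.
results = {
    "FreeArt3D": {"f_score": 0.883, "cd": 0.026, "clip_sim": 0.882,
                  "axis_err": 0.208, "pivot_err": 0.092},
    "Articulate-Anything": {"f_score": 0.817, "cd": 0.036, "clip_sim": 0.876,
                            "axis_err": 0.527, "pivot_err": 0.136},
    "Singapo": {"f_score": 0.832, "cd": 0.031, "clip_sim": 0.859,
                "axis_err": 0.247, "pivot_err": 0.094},
}
HIGHER_IS_BETTER = {"f_score", "clip_sim"}   # CD and the two errors: lower is better

def relative_gain(metric, baseline, ours="FreeArt3D"):
    """Fractional improvement of `ours` over `baseline` (positive = better)."""
    ours_v = results[ours][metric]
    base_v = results[baseline][metric]
    if metric in HIGHER_IS_BETTER:
        return (ours_v - base_v) / base_v
    return (base_v - ours_v) / base_v

for baseline in ("Articulate-Anything", "Singapo"):
    for metric in ("f_score", "cd", "clip_sim", "axis_err", "pivot_err"):
        print(f"{baseline:20s} {metric:10s} {100 * relative_gain(metric, baseline):6.1f}%")
```

For instance, the axis error is roughly 16% lower than Singapo's and roughly 61% lower than Articulate-Anything's, whereas the pivot error is only about 2% lower than Singapo's.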

6. Generalization, Real-World Applicability, and Limitations

FreeArt3D generalizes to more than 12 object categories and can readily accommodate additional types or multi-joint chains. It is robust to real-world image captures, using a physical or virtual disk for scale normalization, which allows casual phone images as input, although the disk is critical for consistent results.

Noted failure modes include inaccurate axis or state initialization, geometry collapse when disk normalization is missing, and segmentation errors from highly limited views (particularly 2-view cases). The reported success rate is approximately 77% for $K=6$ views. Future work directions target further acceleration (e.g., meta-initialization), the removal of the scale disk (through learned priors), and integration of faster inference paradigms typical of vision LLMs.

7. Significance and Impact

FreeArt3D establishes a practical, training-free methodology for articulated 3D object generation, marrying high-fidelity static shape priors from native 3D diffusion models with explicit kinematic optimization. By extending SDS into the domain of articulated structures, FreeArt3D facilitates the recovery of geometry, appearance, and articulation without requiring task-specific training, yielding high quality and broad applicability. This suggests plausible broader impact in domains where articulated 3D reconstructions are vital and annotations or large-scale multi-articulation datasets are impractical to obtain.
