RigAnything: Template-Free Autoregressive Rigging for Diverse 3D Assets
(2502.09615v1)
Published 13 Feb 2025 in cs.CV
Abstract: We present RigAnything, a novel autoregressive transformer-based model, which makes 3D assets rig-ready by probabilistically generating joints, skeleton topologies, and assigning skinning weights in a template-free manner. Unlike most existing auto-rigging methods, which rely on predefined skeleton template and are limited to specific categories like humanoid, RigAnything approaches the rigging problem in an autoregressive manner, iteratively predicting the next joint based on the global input shape and the previous prediction. While autoregressive models are typically used to generate sequential data, RigAnything extends their application to effectively learn and represent skeletons, which are inherently tree structures. To achieve this, we organize the joints in a breadth-first search (BFS) order, enabling the skeleton to be defined as a sequence of 3D locations and the parent index. Furthermore, our model improves the accuracy of position prediction by leveraging diffusion modeling, ensuring precise and consistent placement of joints within the hierarchy. This formulation allows the autoregressive model to efficiently capture both spatial and hierarchical relationships within the skeleton. Trained end-to-end on both RigNet and Objaverse datasets, RigAnything demonstrates state-of-the-art performance across diverse object types, including humanoids, quadrupeds, marine creatures, insects, and many more, surpassing prior methods in quality, robustness, generalizability, and efficiency. Please check our website for more details: https://www.liuisabella.com/RigAnything.
Summary
The paper introduces RigAnything, a template-free autoregressive model that generates complete 3D asset rigs by iteratively predicting joints, skeleton topology, and skinning weights.
It uses an autoregressive transformer architecture, representing the skeleton via a BFS order and applying diffusion modeling for precise joint position prediction, achieving state-of-the-art auto-rigging performance.
Unlike previous methods restricted to specific categories, RigAnything is trained on diverse datasets like Objaverse to rig a wide range of 3D asset types, demonstrating broad generalizability.
The paper "RigAnything: Template-Free Autoregressive Rigging \ for Diverse 3D Assets" (2502.09615) introduces RigAnything, an autoregressive transformer-based model for automatic rigging of 3D assets. The model generates joints, skeleton topologies, and skinning weights without relying on predefined templates. RigAnything addresses the limitations of existing auto-rigging methods that are often restricted to specific categories like humanoids by approaching the rigging problem in an autoregressive manner. It iteratively predicts the next joint based on the global input shape and previous predictions. The method extends the application of autoregressive models to learn and represent skeletons, which are inherently tree structures, by organizing the joints in a breadth-first search (BFS) order. The model leverages diffusion modeling to improve the accuracy of position prediction, ensuring precise and consistent placement of joints within the hierarchy.
Key aspects of the method include:
Autoregressive Skeleton Prediction: The tree-structured skeleton is represented as a sequence using a BFS order, where each joint is defined by a 3D position $j_k \in \mathbb{R}^3$ and a parent index $p_k$. The joint probability of the skeleton $J$ given the input shape $S$ is factorized using the chain rule:
$$P(J \mid S) = \prod_{k=1}^{K} P(j_k, p_k \mid J_{1:k-1}, S)$$
$J$: Skeleton
$S$: Input shape
$j_k$: 3D position of the $k$-th joint
$p_k$: Parent index of the $k$-th joint
$K$: Total number of joints
$J_{1:k-1}$: Sublist of $J$ up to the $(k-1)$-th element
The conditional distribution of each joint position $j_k$ and parent index $p_k$ is predicted iteratively, conditioned on all previously generated joints and the input shape.
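As an illustration of this sequence representation, the following minimal Python sketch (not the authors' code; the dictionary-based skeleton format and helper names are assumptions for the example) serializes a tree-structured skeleton into the BFS-ordered list of (3D position, parent index) pairs that the autoregressive model predicts one joint at a time.

```python
# Minimal sketch: flatten a tree-structured skeleton into a BFS-ordered
# sequence of (3D position, parent index) pairs.
from collections import deque

def skeleton_to_bfs_sequence(positions, children, root=0):
    """positions: {joint_id: (x, y, z)}, children: {joint_id: [child ids]}.
    Returns a list of (position, parent_index), where parent_index refers to
    the parent's position in the BFS sequence (the root gets -1)."""
    sequence = []                 # [(xyz, parent_seq_index), ...]
    seq_index = {}                # joint_id -> index in the BFS sequence
    queue = deque([(root, -1)])   # (joint_id, parent's sequence index)
    while queue:
        joint_id, parent_idx = queue.popleft()
        seq_index[joint_id] = len(sequence)
        sequence.append((positions[joint_id], parent_idx))
        for child in children.get(joint_id, []):
            queue.append((child, seq_index[joint_id]))
    return sequence

# Example: a tiny 4-joint skeleton with one branch (hypothetical data).
positions = {0: (0, 0, 0), 1: (0, 1, 0), 2: (0.5, 2, 0), 3: (-0.5, 2, 0)}
children = {0: [1], 1: [2, 3]}
print(skeleton_to_bfs_sequence(positions, children))
# [((0, 0, 0), -1), ((0, 1, 0), 0), ((0.5, 2, 0), 1), ((-0.5, 2, 0), 1)]
```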
Joint Prediction with Diffusion Model: A diffusion sampling process is employed to predict continuously valued joint positions. The forward diffusion process gradually adds Gaussian noise to the ground-truth joint $j_0$ over $M$ time steps:
$$j_m = \sqrt{\bar{\alpha}_m}\, j_0 + \sqrt{1 - \bar{\alpha}_m}\, \epsilon,$$
where $\epsilon \sim \mathcal{N}(0, I)$ and $\bar{\alpha}_m = \prod_{s=1}^{m} \alpha_s$. A noise estimator $\epsilon_\theta$ is trained to predict the added noise, conditioned on the diffusion time step $m$ and the context $Z$. The training objective is:
$$\mathcal{L}_{\text{joint}}(Z, j_0) = \mathbb{E}_{\epsilon, m}\left[\lVert \epsilon - \epsilon_\theta(j_m \mid m, Z) \rVert^2\right].$$
At inference, the reverse process iteratively removes noise to sample the next joint position $j_0 \sim p_\theta(j_0 \mid Z)$:
$$j_{m-1} = \frac{1}{\sqrt{\alpha_m}}\left(j_m - \frac{1 - \alpha_m}{\sqrt{1 - \bar{\alpha}_m}}\, \epsilon_\theta(j_m \mid m, Z)\right) + \sigma_m \delta,$$
where $\delta \sim \mathcal{N}(0, I)$ and $\sigma_m$ denotes the noise level.
$j_m$: Noisy version of the ground-truth joint at step $m$
$\bar{\alpha}_m$: Cumulative noise schedule
$\epsilon$: Gaussian noise
$\epsilon_\theta$: Noise estimator
$Z$: Context, capturing both the evolving skeleton state and the input shape
$\alpha_m$: Noise schedule parameter at step $m$
$\sigma_m$: Noise level at step $m$
$\delta$: Gaussian noise
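A minimal PyTorch sketch of this per-joint diffusion step is shown below. It assumes a standard DDPM-style linear noise schedule and uses $\sigma_m = \sqrt{\beta_m}$ for the sampling noise; `noise_estimator` is a placeholder for the paper's conditional network $\epsilon_\theta(j_m \mid m, Z)$, whose architecture is not specified here.

```python
# Sketch of the forward-noising training objective and the reverse sampling
# loop for a single 3D joint, conditioned on a context Z. Schedule values and
# the estimator interface are assumptions, not the paper's implementation.
import torch

M = 1000
betas = torch.linspace(1e-4, 0.02, M)           # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)        # \bar{alpha}_m (0-indexed steps)

def joint_diffusion_loss(noise_estimator, j0, Z):
    """L_joint: predict the Gaussian noise added to the ground-truth joint j0."""
    m = torch.randint(0, M, (1,))
    eps = torch.randn_like(j0)                                       # epsilon ~ N(0, I)
    jm = alpha_bars[m].sqrt() * j0 + (1 - alpha_bars[m]).sqrt() * eps
    return ((eps - noise_estimator(jm, m, Z)) ** 2).mean()

@torch.no_grad()
def sample_joint(noise_estimator, Z):
    """Reverse process: denoise from pure noise to a joint position j0 ~ p_theta(j0 | Z)."""
    j = torch.randn(3)                                               # start from N(0, I)
    for m in reversed(range(M)):
        eps_hat = noise_estimator(j, torch.tensor([m]), Z)
        j = (j - (1 - alphas[m]) / (1 - alpha_bars[m]).sqrt() * eps_hat) / alphas[m].sqrt()
        if m > 0:
            j = j + betas[m].sqrt() * torch.randn_like(j)            # sigma_m * delta
    return j
```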
Connectivity Prediction: After sampling the next joint position $j_k$, the model predicts its connection to ancestor joints. A connectivity module $C$ takes the updated context $Z_k'$ and each previously predicted skeleton token $T_i$ ($i < k$) to produce the parent joint probability:
$$q_k = \mathrm{Softmax}\left(\left[C(Z_k', T_i)\right]_{i=1}^{k-1}\right).$$
The connectivity is supervised with the binary cross-entropy loss
$$\mathcal{L}_{\text{conn}} = -\sum_{i=1}^{k-1}\left[\hat{y}_{k,i}\log q_{k,i} + (1 - \hat{y}_{k,i})\log(1 - q_{k,i})\right],$$
where $q_{k,i}$ is the $i$-th element of $q_k$ and $\hat{y}_{k,i} \in \{0, 1\}$ is the ground-truth label.
$Z_k'$: Updated context with the sampled joint $j_k$
$T_i$: Predicted skeleton token
$q_k$: Parent joint probability
$q_{k,i}$: $i$-th element of $q_k$
$\hat{y}_{k,i}$: Ground-truth label indicating whether joint $j_k$ is connected to joint $j_i$
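The sketch below illustrates this connectivity step under assumed shapes: `connectivity_module` stands in for the learned module $C$ and is expected to return a scalar score per (context, token) pair; it is not the paper's implementation.

```python
# Sketch: score every earlier joint as a candidate parent of joint k,
# normalize with a softmax, and supervise with binary cross-entropy.
import torch
import torch.nn.functional as F

def parent_probabilities(connectivity_module, Z_k, skeleton_tokens):
    """Z_k: (d,) updated context; skeleton_tokens: k-1 tokens T_1..T_{k-1}.
    Returns q_k, a probability over the k-1 ancestor joints."""
    scores = torch.stack([connectivity_module(Z_k, T_i) for T_i in skeleton_tokens])
    return F.softmax(scores, dim=0)

def connectivity_loss(q_k, y_hat):
    """Binary cross-entropy between q_k and the 0/1 ground-truth labels y_hat."""
    return F.binary_cross_entropy(q_k, y_hat)
```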
Skinning Prediction: Skinning weights are described by a matrix $W \in \mathbb{R}^{L \times K}$, where each element $w_{lk}$ indicates the influence of the $k$-th joint on the $l$-th surface point. A skinning prediction module $G$ takes as input the shape token $H_{s_l}$ for point $s_l$, along with the skeleton token $T_k$ for each joint $j_k$, and outputs a predicted influence score. The final skinning weight $w_l$ is computed using the softmax function:
$$w_l = \mathrm{Softmax}\left(\left[G(H_{s_l}, T_k)\right]_{k=1}^{K}\right).$$
The module is trained by minimizing a weighted cross-entropy loss between the predicted skinning weights $w_l$ and the ground-truth skinning weights $\hat{w}_l$.
$w_{lk}$: Influence of the $k$-th joint on the $l$-th surface point
$w_l$: Weight vector for the $l$-th surface point
$H_{s_l}$: Shape token for point $s_l$
$T_k$: Skeleton token for joint $j_k$
$\hat{w}_{l,k}$: Ground-truth skinning weight
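A small illustrative sketch of the skinning step follows, with `skinning_module` as a stand-in for the learned module $G$ returning a scalar influence score per (point token, joint token) pair; the explicit pairwise loop and the unweighted cross-entropy are simplifications for clarity, not the paper's implementation.

```python
# Sketch: per-point skinning weights as a softmax over per-joint influence scores.
import torch
import torch.nn.functional as F

def skinning_weights(skinning_module, shape_tokens, skeleton_tokens):
    """shape_tokens: (L, d), one token H_{s_l} per surface point;
    skeleton_tokens: (K, d), one token T_k per joint.
    Returns W: (L, K) with each row summing to 1."""
    L, K = shape_tokens.shape[0], skeleton_tokens.shape[0]
    scores = torch.stack([
        torch.stack([skinning_module(shape_tokens[l], skeleton_tokens[k]) for k in range(K)])
        for l in range(L)
    ])                                     # (L, K) influence scores
    return F.softmax(scores, dim=-1)       # w_l = softmax over the K joints

def skinning_loss(W_pred, W_true):
    """Cross-entropy between predicted and ground-truth skinning distributions
    (the paper uses a weighted variant; the plain form is shown for brevity)."""
    return -(W_true * torch.log(W_pred + 1e-12)).sum(dim=-1).mean()
```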
Autoregressive Transformer Architecture: The model uses a transformer-based architecture to output shape tokens $H \in \mathbb{R}^{L \times d}$ and skeleton tokens $T_{1:k} \in \mathbb{R}^{k \times d}$, which serve as conditional inputs for skeleton and skinning prediction. The transformer processes these tokens through a series of attention blocks with attention masking to obtain the final shape and skeleton tokens. A hybrid attention mechanism is used: shape tokens attend to each other via full self-attention, while skeleton tokens attend to all shape tokens and apply causal attention among themselves.
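This hybrid attention pattern can be written as a boolean mask over the concatenated token sequence. The sketch below assumes that shape tokens do not attend to skeleton tokens and uses the convention that `True` marks an allowed attention pair; both are assumptions of this illustration rather than details from the paper.

```python
# Sketch: attention mask for L shape tokens followed by k skeleton tokens.
import torch

def hybrid_attention_mask(L, k):
    """Rows are queries, columns are keys; True = attention allowed."""
    n = L + k
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:L, :L] = True                                             # shape -> shape: full self-attention
    mask[L:, :L] = True                                             # skeleton -> shape: attend to all shape tokens
    mask[L:, L:] = torch.tril(torch.ones(k, k, dtype=torch.bool))   # skeleton -> skeleton: causal
    return mask

print(hybrid_attention_mask(3, 2).int())
```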
The model is trained end-to-end on the RigNet (Xu et al., 2020) and Objaverse (Deitke et al., 2023) datasets. The Objaverse dataset was filtered to select 9686 high-quality rigged shapes. Input shapes are augmented with random pose variations to enhance robustness. The training data includes a wide range of object types, such as bipedal, quadrupedal, avian, marine, insectoid, and manipulable rigid objects, with diverse initial poses.
Experiments demonstrate that RigAnything achieves state-of-the-art performance in the auto-rigging task, surpassing prior methods in quality, robustness, generalizability, and efficiency. Quantitative evaluations on the RigNet dataset show improvements in Intersection over Union (IoU), Precision, Recall, and Chamfer distances between joints (CD-J2J), between bone line segments (CD-B2B), and between joints and bone line segments (CD-J2B). Ablation studies validate the impact of joint diffusion modeling, normal injection, and pose augmentation on skeleton prediction. Specifically, using a deterministic L2 loss instead of joint diffusion causes joints to collapse toward the middle axis, while the full model captures diverse modes of joint placement. Quantitative results show that joint diffusion modeling nearly doubles the skeleton IoU.