RigAnything: Template-Free Autoregressive Rigging for Diverse 3D Assets (2502.09615v1)

Published 13 Feb 2025 in cs.CV

Abstract: We present RigAnything, a novel autoregressive transformer-based model, which makes 3D assets rig-ready by probabilistically generating joints, skeleton topologies, and assigning skinning weights in a template-free manner. Unlike most existing auto-rigging methods, which rely on predefined skeleton template and are limited to specific categories like humanoid, RigAnything approaches the rigging problem in an autoregressive manner, iteratively predicting the next joint based on the global input shape and the previous prediction. While autoregressive models are typically used to generate sequential data, RigAnything extends their application to effectively learn and represent skeletons, which are inherently tree structures. To achieve this, we organize the joints in a breadth-first search (BFS) order, enabling the skeleton to be defined as a sequence of 3D locations and the parent index. Furthermore, our model improves the accuracy of position prediction by leveraging diffusion modeling, ensuring precise and consistent placement of joints within the hierarchy. This formulation allows the autoregressive model to efficiently capture both spatial and hierarchical relationships within the skeleton. Trained end-to-end on both RigNet and Objaverse datasets, RigAnything demonstrates state-of-the-art performance across diverse object types, including humanoids, quadrupeds, marine creatures, insects, and many more, surpassing prior methods in quality, robustness, generalizability, and efficiency. Please check our website for more details: https://www.liuisabella.com/RigAnything.

Summary

  • The paper introduces RigAnything, a template-free autoregressive model that generates complete 3D asset rigs by iteratively predicting joints, skeleton topology, and skinning weights.
  • It uses an autoregressive transformer architecture, representing the skeleton via a BFS order and applying diffusion modeling for precise joint position prediction, achieving state-of-the-art auto-rigging performance.
  • Unlike previous methods restricted to specific categories, RigAnything is trained on diverse datasets like Objaverse to rig a wide range of 3D asset types, demonstrating broad generalizability.

The paper "RigAnything: Template-Free Autoregressive Rigging for Diverse 3D Assets" (2502.09615) introduces RigAnything, an autoregressive transformer-based model for automatic rigging of 3D assets. The model generates joints, skeleton topologies, and skinning weights without relying on predefined templates. RigAnything addresses the limitations of existing auto-rigging methods, which are often restricted to specific categories like humanoids, by approaching the rigging problem in an autoregressive manner: it iteratively predicts the next joint based on the global input shape and the previous predictions. The method extends autoregressive models to learn and represent skeletons, which are inherently tree structures, by organizing the joints in a breadth-first search (BFS) order. The model leverages diffusion modeling to improve the accuracy of position prediction, ensuring precise and consistent placement of joints within the hierarchy.

Key aspects of the method include:

  • Autoregressive Skeleton Prediction: The tree-structured skeleton is represented as a sequence using a BFS order, where each joint is defined by a 3D position $j_k \in \mathbb{R}^3$ and a parent index $p_k$. The joint probability of the skeleton $\mathcal{J}$ given the input shape $S$ is factorized using the chain rule:

    $P(\mathcal{J} \mid S) = \prod_{k=1}^{K} P\left(j_k, p_k \mid \mathcal{J}_{1:k-1}, S\right)$

    • $\mathcal{J}$: Skeleton
    • $S$: Input shape
    • $j_k$: 3D position of the $k$-th joint
    • $p_k$: Parent index of the $k$-th joint
    • $K$: Total number of joints
    • $\mathcal{J}_{1:k-1}$: Subsequence of $\mathcal{J}$ up to the $(k-1)$-th joint

    The conditional distribution of each joint position $j_k$ and parent index $p_k$ is predicted iteratively:

    $P\left(j_k, p_k \mid \mathcal{J}_{1:k-1}, S\right) = P\left(j_k \mid \mathcal{J}_{1:k-1}, S\right)\, P\left(p_k \mid j_k, \mathcal{J}_{1:k-1}, S\right)$
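The BFS serialization above can be sketched in a few lines. This is an illustrative helper (the function and variable names are my own, not from the paper's code): it walks a joint tree breadth-first and emits each joint as a (position, parent-index-in-sequence) pair, exactly the sequence form the factorization operates on.

```python
# Hypothetical sketch: serializing a tree-structured skeleton into the
# BFS-ordered sequence of (position, parent index) pairs used by the
# autoregressive factorization. Names are illustrative.
from collections import deque

def bfs_serialize(positions, children, root=0):
    """Return joints in BFS order as (position, parent_index_in_sequence)."""
    order = []                   # sequence of (3D position, parent index)
    index_in_seq = {}            # original joint id -> BFS sequence index
    queue = deque([(root, -1)])  # (joint id, parent's sequence index)
    while queue:
        jid, parent_seq = queue.popleft()
        index_in_seq[jid] = len(order)
        order.append((positions[jid], parent_seq))
        for child in children.get(jid, []):
            queue.append((child, index_in_seq[jid]))
    return order

# Tiny skeleton with a branch: 0 -> 1, 0 -> 2, 1 -> 3
positions = {0: (0, 0, 0), 1: (0, 1, 0), 2: (1, 0, 0), 3: (0, 2, 0)}
children = {0: [1, 2], 1: [3]}
seq = bfs_serialize(positions, children)
# BFS order: joint 0 (root, parent -1), then joints 1 and 2 (parent 0), then 3 (parent 1)
```

Because siblings are emitted before any grandchildren, every joint's parent index always refers to an earlier position in the sequence, which is what makes the causal, next-joint factorization well defined.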

  • Joint Prediction with Diffusion Model: A diffusion sampling process is employed to predict continuous-valued joint positions. The forward diffusion process gradually adds Gaussian noise to the ground-truth joint $j^0$ over $M$ time steps:

    $j^m = \sqrt{\bar{\alpha}_m}\, j^0 + \sqrt{1 - \bar{\alpha}_m}\, \epsilon$,

    where $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and $\bar{\alpha}_m = \prod_{s=1}^{m} \alpha_s$. A noise estimator $\epsilon_\theta$ is trained to predict the added noise, conditioned on the diffusion time step $m$ and the context $Z$. The training objective is:

    $\mathcal{L}_{\text{joint}}(Z, j^0) = \mathbb{E}_{\epsilon, m} \big[ \| \epsilon - \epsilon_\theta(j^m \mid m, Z) \|^2 \big]$.

    At inference, the reverse process iteratively removes noise to sample the next joint position $j^0 \sim p_\theta(j^0 \mid Z)$:

    $j^{m-1} = \frac{1}{\sqrt{\alpha_m}} \big(j^m - \frac{1 - \alpha_m}{\sqrt{1 - \bar{\alpha}_m}}\, \epsilon_\theta(j^m \mid m, Z)\big) + \sigma_m \delta$,

    where $\delta \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and $\sigma_m$ denotes the noise level.

    • $j^m$: Noisy version of the ground-truth joint at step $m$
    • $\bar{\alpha}_m$: Cumulative product of the noise schedule
    • $\epsilon$: Gaussian noise
    • $\epsilon_\theta$: Noise estimator
    • $Z$: Context, capturing both the evolving skeleton state and the input shape
    • $\alpha_m$: Noise-schedule parameter at step $m$
    • $\sigma_m$: Noise level at step $m$
    • $\delta$: Gaussian noise
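The forward and reverse equations above can be checked numerically on a single scalar coordinate. In this sketch the noise "estimator" is an oracle that returns the true $\epsilon$ purely to verify the algebra (in the model, $\epsilon_\theta$ is a network conditioned on $m$ and $Z$); with the true noise and $\sigma_1 = 0$, one reverse step from $m = 1$ recovers $j^0$ exactly.

```python
# Illustrative check of the DDPM-style forward noising and reverse update
# from the joint-diffusion equations, on one scalar coordinate.
import math

def forward(j0, eps, alphas, m):
    """Closed-form forward step: j^m = sqrt(abar_m) j^0 + sqrt(1 - abar_m) eps."""
    abar = math.prod(alphas[:m])
    return math.sqrt(abar) * j0 + math.sqrt(1.0 - abar) * eps

def reverse_step(jm, eps_hat, alphas, m, sigma=0.0, delta=0.0):
    """One reverse update:
    j^{m-1} = (j^m - (1 - a_m)/sqrt(1 - abar_m) * eps_hat) / sqrt(a_m) + sigma * delta."""
    abar = math.prod(alphas[:m])
    a_m = alphas[m - 1]
    mean = (jm - (1.0 - a_m) / math.sqrt(1.0 - abar) * eps_hat) / math.sqrt(a_m)
    return mean + sigma * delta

alphas = [0.9, 0.8, 0.7]   # toy noise schedule (made up for illustration)
j0, eps = 0.25, 1.3        # ground-truth joint coordinate and sampled noise
j1 = forward(j0, eps, alphas, m=1)
j0_rec = reverse_step(j1, eps, alphas, m=1)   # recovers j0 given the true noise
```

In practice the reverse chain runs from $m = M$ down to $1$ with the learned $\epsilon_\theta$ in place of the oracle, injecting fresh noise $\sigma_m \delta$ at each step except the last.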
  • Connectivity Prediction: After sampling the next joint position $j_k$, the model predicts its connection to ancestor joints. A connectivity module $\text{C}$ takes the updated context $Z'_k$ and each previously predicted skeleton token $T_i$ ($i < k$) to produce the parent joint probability:

    $\mathbf{q}_k = \text{Softmax}\bigl([\,\text{C}(Z'_k, T_i)\,]_{i=1}^{k-1}\bigr)$.

    The connectivity is supervised with the binary cross-entropy loss:

    $\mathcal{L}_{\text{connect}} = - \sum_{i=1}^{k-1} \bigl[ \hat{y}_{k,i} \log\bigl(q_{k,i}\bigr) + \bigl(1 - \hat{y}_{k,i}\bigr) \log\bigl(1 - q_{k,i}\bigr) \bigr]$,

    where $q_{k,i}$ is the $i$-th element of $\mathbf{q}_k$ and $\hat{y}_{k,i} \in \{0, 1\}$ is the ground-truth label.

    • $Z'_k$: Context updated with the sampled joint $j_k$
    • $T_i$: Predicted skeleton token
    • $\mathbf{q}_k$: Parent joint probability distribution
    • $q_{k,i}$: $i$-th element of $\mathbf{q}_k$
    • $\hat{y}_{k,i}$: Ground-truth label indicating whether joint $j_k$ is connected to joint $j_i$
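The parent-selection step reduces to a softmax over per-ancestor compatibility scores. A minimal sketch, with invented score values standing in for the outputs of the connectivity module $\text{C}(Z'_k, T_i)$:

```python
# Sketch of parent selection: raw compatibility scores for each earlier joint
# are softmax-normalized into q_k, and the parent is the argmax.
import math

def softmax(scores):
    m = max(scores)                           # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.2, 2.5, -1.0]   # hypothetical C(Z'_k, T_i) for ancestors i = 1..k-1
q_k = softmax(scores)
parent = max(range(len(q_k)), key=q_k.__getitem__)   # most likely parent index
```

Because the softmax is taken over only the $k-1$ already-generated joints, the predicted parent always precedes the new joint in the BFS sequence, so the output remains a valid tree.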
  • Skinning Prediction: Skinning weights are described by a matrix $W \in \mathbb{R}^{L \times K}$, where each element $w_{lk}$ indicates the influence of the $k$-th joint on the $l$-th surface point. A skinning prediction module $\text{G}$ takes as input the shape token $H_{s_l}$ for point $s_l$, along with the skeleton token $T_k$ for each joint $j_k$, and outputs a predicted influence score. The final skinning weight vector $\mathbf{w}_l$ is computed using the softmax function:

    $\mathbf{w}_l = \text{Softmax}\bigl([\,\text{G}(H_{s_l}, T_k)\,]_{k=1}^{K}\bigr)$.

    The module is trained by minimizing a weighted cross-entropy loss:

    $\mathcal{L}_{\text{skinning}} = \frac{1}{L} \sum_{l=1}^{L} \Bigl(- \sum_{k=1}^{K} \hat{w}_{l,k} \log\bigl(w_{l,k}\bigr)\Bigr)$.

    • $W$: Skinning weight matrix
    • $w_{lk}$: Influence of the $k$-th joint on the $l$-th surface point
    • $\mathbf{w}_l$: Weight vector for the $l$-th surface point
    • $H_{s_l}$: Shape token for point $s_l$
    • $T_k$: Skeleton token for joint $j_k$
    • $\hat{w}_{l,k}$: Ground-truth skinning weight
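The skinning loss above is a per-point cross-entropy between the softmax weights and the ground-truth weights, averaged over surface points. A small sketch with invented scores and targets (the score values stand in for the outputs of the module $\text{G}$):

```python
# Sketch of the skinning objective: per surface point, scores over all K joints
# are softmax-normalized into weights w_l, and training minimizes the
# cross-entropy against ground-truth weights, averaged over the L points.
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def skinning_loss(pred_scores, gt_weights):
    """(1/L) * sum_l ( -sum_k gt_{l,k} * log(w_{l,k}) )."""
    total = 0.0
    for scores, gt in zip(pred_scores, gt_weights):
        w = softmax(scores)
        total += -sum(g * math.log(p) for g, p in zip(gt, w))
    return total / len(pred_scores)

scores = [[2.0, 0.0], [0.0, 2.0]]   # L=2 points, K=2 joints (made-up scores)
gt = [[1.0, 0.0], [0.0, 1.0]]       # each point fully bound to one joint
loss = skinning_loss(scores, gt)    # small when the softmax matches the targets
```

The softmax guarantees that each point's weights are non-negative and sum to one, which is exactly the partition-of-unity property linear blend skinning requires.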
  • Autoregressive Transformer Architecture: The model uses a transformer-based architecture to output shape tokens $H \in \mathbb{R}^{L \times d}$ and skeleton tokens $T_{1:k} \in \mathbb{R}^{k \times d}$, which serve as conditional inputs for skeleton and skinning prediction. The transformer processes these tokens through a series of attention blocks with attention masking to obtain the final shape and skeleton tokens. A hybrid attention mechanism is used: shape tokens attend to each other via full self-attention, while skeleton tokens attend to all shape tokens and apply causal attention among themselves.
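The hybrid attention pattern can be expressed as a boolean mask. This is a reading of the description above, not the paper's implementation: shape tokens attend only among themselves (the paper does not spell out whether they also see skeleton tokens, so this sketch assumes they do not), and skeleton tokens attend to all shape tokens plus causally to themselves.

```python
# Hypothetical hybrid attention mask: tokens 0..L-1 are shape tokens,
# tokens L..L+k-1 are skeleton tokens; mask[i][j] is True when token i
# may attend to token j.
def hybrid_mask(L, k):
    n = L + k
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < L:
                mask[i][j] = j < L            # shape -> shape: full self-attention
            else:
                mask[i][j] = j < L or j <= i  # skeleton -> all shape + causal
    return mask

m = hybrid_mask(L=2, k=2)
# The first skeleton token (row 2) sees both shape tokens and itself,
# but not the later skeleton token.
```

This masking lets a single forward pass train all next-joint predictions in parallel, while at inference skeleton tokens are generated one at a time.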

The model is trained end-to-end on the RigNet [rignet] and Objaverse [deitke2023objaverse] datasets. The Objaverse dataset was filtered to select 9686 high-quality rigged shapes. Input shapes are augmented with random pose variations to enhance robustness. The training data includes a wide range of object types, such as bipedal, quadrupedal, avian, marine, insectoid, and manipulable rigid objects, with diverse initial poses.

Experiments demonstrate that RigAnything achieves state-of-the-art performance in the auto-rigging task, surpassing prior methods in quality, robustness, generalizability, and efficiency. Quantitative evaluations on the RigNet dataset show improvements in Intersection over Union (IoU), Precision, Recall, and Chamfer distances between joints (CD-J2J), between bone line segments (CD-B2B), and from joints to bone line segments (CD-J2B). Ablation studies validate the impact of joint diffusion modeling, normal injection, and pose augmentation on skeleton prediction. Specifically, replacing joint diffusion with a deterministic L2 loss causes joints to collapse toward the middle axis, whereas the full model captures the diverse modes of plausible joint positions. Quantitatively, joint diffusion modeling nearly doubles the skeleton IoU.