RigAnything: Template-Free Autoregressive Rigging for Diverse 3D Assets
(2502.09615v1)
Published 13 Feb 2025 in cs.CV
Abstract: We present RigAnything, a novel autoregressive transformer-based model, which makes 3D assets rig-ready by probabilistically generating joints, skeleton topologies, and assigning skinning weights in a template-free manner. Unlike most existing auto-rigging methods, which rely on predefined skeleton template and are limited to specific categories like humanoid, RigAnything approaches the rigging problem in an autoregressive manner, iteratively predicting the next joint based on the global input shape and the previous prediction. While autoregressive models are typically used to generate sequential data, RigAnything extends their application to effectively learn and represent skeletons, which are inherently tree structures. To achieve this, we organize the joints in a breadth-first search (BFS) order, enabling the skeleton to be defined as a sequence of 3D locations and the parent index. Furthermore, our model improves the accuracy of position prediction by leveraging diffusion modeling, ensuring precise and consistent placement of joints within the hierarchy. This formulation allows the autoregressive model to efficiently capture both spatial and hierarchical relationships within the skeleton. Trained end-to-end on both RigNet and Objaverse datasets, RigAnything demonstrates state-of-the-art performance across diverse object types, including humanoids, quadrupeds, marine creatures, insects, and many more, surpassing prior methods in quality, robustness, generalizability, and efficiency. Please check our website for more details: https://www.liuisabella.com/RigAnything.
Summary
The paper introduces RigAnything, a template-free autoregressive model that generates complete 3D asset rigs by iteratively predicting joints, skeleton topology, and skinning weights.
It uses an autoregressive transformer architecture, representing the skeleton via a BFS order and applying diffusion modeling for precise joint position prediction, achieving state-of-the-art auto-rigging performance.
Unlike previous methods restricted to specific categories, RigAnything is trained on diverse datasets like Objaverse to rig a wide range of 3D asset types, demonstrating broad generalizability.
The paper "RigAnything: Template-Free Autoregressive Rigging \ for Diverse 3D Assets" (2502.09615) introduces RigAnything, an autoregressive transformer-based model for automatic rigging of 3D assets. The model generates joints, skeleton topologies, and skinning weights without relying on predefined templates. RigAnything addresses the limitations of existing auto-rigging methods that are often restricted to specific categories like humanoids by approaching the rigging problem in an autoregressive manner. It iteratively predicts the next joint based on the global input shape and previous predictions. The method extends the application of autoregressive models to learn and represent skeletons, which are inherently tree structures, by organizing the joints in a breadth-first search (BFS) order. The model leverages diffusion modeling to improve the accuracy of position prediction, ensuring precise and consistent placement of joints within the hierarchy.
Key aspects of the method include:
Autoregressive Skeleton Prediction: The tree-structured skeleton is represented as a sequence using a BFS order, where each joint is defined by a 3D position $j_k \in \mathbb{R}^3$ and a parent index $p_k$. The joint probability of the skeleton $J$ given the input shape $S$ is factorized using the chain rule:
$$P(J \mid S) = \prod_{k=1}^{K} P(j_k, p_k \mid J_{1:k-1}, S)$$
$J$: Skeleton
$S$: Input shape
$j_k$: 3D position of the $k$-th joint
$p_k$: Parent index of the $k$-th joint
$K$: Total number of joints
$J_{1:k-1}$: Sublist of $J$ up to the $(k-1)$-th element
The conditional distribution of each joint position $j_k$ and parent index $p_k$ is predicted iteratively, conditioned on all previously generated joints and the input shape.
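As an illustration of this sequence representation, the following minimal Python sketch (not the authors' code; the dictionary-based skeleton format and helper names are assumptions for the example) serializes a tree-structured skeleton into the BFS-ordered list of (3D position, parent index) pairs that the autoregressive model predicts one joint at a time.

```python
# Minimal sketch: flatten a tree-structured skeleton into a BFS-ordered
# sequence of (3D position, parent index) pairs.
from collections import deque

def skeleton_to_bfs_sequence(positions, children, root=0):
    """positions: {joint_id: (x, y, z)}, children: {joint_id: [child ids]}.
    Returns a list of (position, parent_index), where parent_index refers to
    the parent's position in the BFS sequence (the root gets -1)."""
    sequence = []                 # [(xyz, parent_seq_index), ...]
    seq_index = {}                # joint_id -> index in the BFS sequence
    queue = deque([(root, -1)])   # (joint_id, parent's sequence index)
    while queue:
        joint_id, parent_idx = queue.popleft()
        seq_index[joint_id] = len(sequence)
        sequence.append((positions[joint_id], parent_idx))
        for child in children.get(joint_id, []):
            queue.append((child, seq_index[joint_id]))
    return sequence

# Example: a tiny 4-joint skeleton with one branch (hypothetical data).
positions = {0: (0, 0, 0), 1: (0, 1, 0), 2: (0.5, 2, 0), 3: (-0.5, 2, 0)}
children = {0: [1], 1: [2, 3]}
print(skeleton_to_bfs_sequence(positions, children))
# [((0, 0, 0), -1), ((0, 1, 0), 0), ((0.5, 2, 0), 1), ((-0.5, 2, 0), 1)]
```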
Joint Prediction with Diffusion Model: A diffusion sampling process is employed to predict continuously valued joint positions. The forward diffusion process gradually adds Gaussian noise to the ground-truth joint $j_0$ over $M$ time steps:
$$j_m = \sqrt{\bar{\alpha}_m}\, j_0 + \sqrt{1 - \bar{\alpha}_m}\, \epsilon,$$
where $\epsilon \sim \mathcal{N}(0, I)$ and $\bar{\alpha}_m = \prod_{s=1}^{m} \alpha_s$. A noise estimator $\epsilon_\theta$ is trained to predict the added noise, conditioned on the diffusion time step $m$ and the context $Z$. The training objective is:
$$\mathcal{L}_{\text{joint}}(Z, j_0) = \mathbb{E}_{\epsilon, m}\left[\lVert \epsilon - \epsilon_\theta(j_m \mid m, Z) \rVert^2\right].$$
At inference, the reverse process iteratively removes noise to sample the next joint position $j_0 \sim p_\theta(j_0 \mid Z)$:
$$j_{m-1} = \frac{1}{\sqrt{\alpha_m}}\left(j_m - \frac{1 - \alpha_m}{\sqrt{1 - \bar{\alpha}_m}}\, \epsilon_\theta(j_m \mid m, Z)\right) + \sigma_m \delta,$$
where $\delta \sim \mathcal{N}(0, I)$ and $\sigma_m$ denotes the noise level.
$j_m$: Noisy version of the ground-truth joint at step $m$
$\bar{\alpha}_m$: Cumulative noise schedule
$\epsilon$: Gaussian noise
$\epsilon_\theta$: Noise estimator
$Z$: Context, capturing both the evolving skeleton state and the input shape
$\alpha_m$: Noise schedule parameter at step $m$
$\sigma_m$: Noise level at step $m$
$\delta$: Gaussian noise
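A minimal PyTorch sketch of this per-joint diffusion step is shown below. It assumes a standard DDPM-style linear noise schedule and uses $\sigma_m = \sqrt{\beta_m}$ for the sampling noise; `noise_estimator` is a placeholder for the paper's conditional network $\epsilon_\theta(j_m \mid m, Z)$, whose architecture is not specified here.

```python
# Sketch of the forward-noising training objective and the reverse sampling
# loop for a single 3D joint, conditioned on a context Z. Schedule values and
# the estimator interface are assumptions, not the paper's implementation.
import torch

M = 1000
betas = torch.linspace(1e-4, 0.02, M)           # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)        # \bar{alpha}_m (0-indexed steps)

def joint_diffusion_loss(noise_estimator, j0, Z):
    """L_joint: predict the Gaussian noise added to the ground-truth joint j0."""
    m = torch.randint(0, M, (1,))
    eps = torch.randn_like(j0)                                       # epsilon ~ N(0, I)
    jm = alpha_bars[m].sqrt() * j0 + (1 - alpha_bars[m]).sqrt() * eps
    return ((eps - noise_estimator(jm, m, Z)) ** 2).mean()

@torch.no_grad()
def sample_joint(noise_estimator, Z):
    """Reverse process: denoise from pure noise to a joint position j0 ~ p_theta(j0 | Z)."""
    j = torch.randn(3)                                               # start from N(0, I)
    for m in reversed(range(M)):
        eps_hat = noise_estimator(j, torch.tensor([m]), Z)
        j = (j - (1 - alphas[m]) / (1 - alpha_bars[m]).sqrt() * eps_hat) / alphas[m].sqrt()
        if m > 0:
            j = j + betas[m].sqrt() * torch.randn_like(j)            # sigma_m * delta
    return j
```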
Connectivity Prediction: After sampling the next joint position $j_k$, the model predicts its connection to ancestor joints. A connectivity module $C$ takes the updated context $Z_k'$ and each previously predicted skeleton token $T_i$ ($i < k$) to produce the parent joint probability:
$$q_k = \mathrm{Softmax}\left(\left[C(Z_k', T_i)\right]_{i=1}^{k-1}\right).$$
The connectivity is supervised with the binary cross-entropy loss
$$\mathcal{L}_{\text{conn}} = -\sum_{i=1}^{k-1}\left[\hat{y}_{k,i}\log q_{k,i} + (1 - \hat{y}_{k,i})\log(1 - q_{k,i})\right],$$
where $q_{k,i}$ is the $i$-th element of $q_k$ and $\hat{y}_{k,i} \in \{0, 1\}$ is the ground-truth label.
$Z_k'$: Updated context with the sampled joint $j_k$
$T_i$: Predicted skeleton token
$q_k$: Parent joint probability
$q_{k,i}$: $i$-th element of $q_k$
$\hat{y}_{k,i}$: Ground-truth label indicating whether joint $j_k$ is connected to joint $j_i$
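The sketch below illustrates this connectivity step under assumed shapes: `connectivity_module` stands in for the learned module $C$ and is expected to return a scalar score per (context, token) pair; it is not the paper's implementation.

```python
# Sketch: score every earlier joint as a candidate parent of joint k,
# normalize with a softmax, and supervise with binary cross-entropy.
import torch
import torch.nn.functional as F

def parent_probabilities(connectivity_module, Z_k, skeleton_tokens):
    """Z_k: (d,) updated context; skeleton_tokens: k-1 tokens T_1..T_{k-1}.
    Returns q_k, a probability over the k-1 ancestor joints."""
    scores = torch.stack([connectivity_module(Z_k, T_i) for T_i in skeleton_tokens])
    return F.softmax(scores, dim=0)

def connectivity_loss(q_k, y_hat):
    """Binary cross-entropy between q_k and the 0/1 ground-truth labels y_hat."""
    return F.binary_cross_entropy(q_k, y_hat)
```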
Skinning Prediction: Skinning weights are described by a matrix $W \in \mathbb{R}^{L \times K}$, where each element $w_{lk}$ indicates the influence of the $k$-th joint on the $l$-th surface point. A skinning prediction module $G$ takes as input the shape token $H_{s_l}$ for point $s_l$, along with the skeleton token $T_k$ for each joint $j_k$, and outputs a predicted influence score. The final skinning weight $w_l$ is computed using the softmax function:
$$w_l = \mathrm{Softmax}\left(\left[G(H_{s_l}, T_k)\right]_{k=1}^{K}\right).$$
The module is trained by minimizing a weighted cross-entropy loss between the predicted skinning weights $w_l$ and the ground-truth skinning weights $\hat{w}_l$.
$w_{lk}$: Influence of the $k$-th joint on the $l$-th surface point
$w_l$: Weight vector for the $l$-th surface point
$H_{s_l}$: Shape token for point $s_l$
$T_k$: Skeleton token for joint $j_k$
$\hat{w}_{l,k}$: Ground-truth skinning weight
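A small illustrative sketch of the skinning step follows, with `skinning_module` as a stand-in for the learned module $G$ returning a scalar influence score per (point token, joint token) pair; the explicit pairwise loop and the unweighted cross-entropy are simplifications for clarity, not the paper's implementation.

```python
# Sketch: per-point skinning weights as a softmax over per-joint influence scores.
import torch
import torch.nn.functional as F

def skinning_weights(skinning_module, shape_tokens, skeleton_tokens):
    """shape_tokens: (L, d), one token H_{s_l} per surface point;
    skeleton_tokens: (K, d), one token T_k per joint.
    Returns W: (L, K) with each row summing to 1."""
    L, K = shape_tokens.shape[0], skeleton_tokens.shape[0]
    scores = torch.stack([
        torch.stack([skinning_module(shape_tokens[l], skeleton_tokens[k]) for k in range(K)])
        for l in range(L)
    ])                                     # (L, K) influence scores
    return F.softmax(scores, dim=-1)       # w_l = softmax over the K joints

def skinning_loss(W_pred, W_true):
    """Cross-entropy between predicted and ground-truth skinning distributions
    (the paper uses a weighted variant; the plain form is shown for brevity)."""
    return -(W_true * torch.log(W_pred + 1e-12)).sum(dim=-1).mean()
```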
Autoregressive Transformer Architecture: The model uses a transformer-based architecture to output shape tokens $H \in \mathbb{R}^{L \times d}$ and skeleton tokens $T_{1:k} \in \mathbb{R}^{k \times d}$, which serve as conditional inputs for skeleton and skinning prediction. The transformer processes these tokens through a series of attention blocks with attention masking to obtain the final shape and skeleton tokens. A hybrid attention mechanism is used: shape tokens attend to each other via full self-attention, while skeleton tokens attend to all shape tokens and apply causal attention among themselves.
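This hybrid attention pattern can be written as a boolean mask over the concatenated token sequence. The sketch below assumes that shape tokens do not attend to skeleton tokens and uses the convention that `True` marks an allowed attention pair; both are assumptions of this illustration rather than details from the paper.

```python
# Sketch: attention mask for L shape tokens followed by k skeleton tokens.
import torch

def hybrid_attention_mask(L, k):
    """Rows are queries, columns are keys; True = attention allowed."""
    n = L + k
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:L, :L] = True                                             # shape -> shape: full self-attention
    mask[L:, :L] = True                                             # skeleton -> shape: attend to all shape tokens
    mask[L:, L:] = torch.tril(torch.ones(k, k, dtype=torch.bool))   # skeleton -> skeleton: causal
    return mask

print(hybrid_attention_mask(3, 2).int())
```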
The model is trained end-to-end on the RigNet (Xu et al., 2020) and Objaverse (Deitke et al., 2023) datasets. The Objaverse dataset was filtered to select 9686 high-quality rigged shapes. Input shapes are augmented with random pose variations to enhance robustness. The training data includes a wide range of object types, such as bipedal, quadrupedal, avian, marine, insectoid, and manipulable rigid objects, with diverse initial poses.
Experiments demonstrate that RigAnything achieves state-of-the-art performance in the auto-rigging task, surpassing prior methods in quality, robustness, generalizability, and efficiency. Quantitative evaluations on the RigNet dataset show improvements in Intersection over Union (IoU), Precision, Recall, and Chamfer distances between joints (CD-J2J), between bone line segments (CD-B2B), and between joints and bone line segments (CD-J2B). Ablation studies validate the impact of joint diffusion modeling, normal injection, and pose augmentation on skeleton prediction. Specifically, using a deterministic L2 loss instead of joint diffusion causes joints to collapse toward the middle axis, while the full model captures diverse modes of joint placement. Quantitative results show that joint diffusion modeling nearly doubles the skeleton IoU.