- The paper presents a novel autoregressive tokenization that generates topologically valid skeletons with a ~30% reduction in sequence length.
- It employs cross-attention between point cloud and bone features to accurately predict skin weights and relevant physics parameters.
- UniRig outperforms current methods on multiple benchmark datasets, streamlining the rigging process for diverse 3D models.
This paper introduces UniRig, a novel, unified framework designed to automatically generate both skeletal structures (rigs) and associated skinning weights for diverse 3D models (2504.12451). The core problem addressed is the bottleneck created by manual rigging in 3D animation pipelines, especially given the increasing volume of AI-generated 3D content. Existing methods often struggle with diverse model types or produce topologically invalid skeletons.
UniRig Framework Overview:
UniRig employs a two-stage process:
- Autoregressive Skeleton Tree Generation:
- Input: A point cloud (65,536 points with normals) sampled from the input 3D mesh.
- Geometry Encoding: An encoder (based on 3DShape2Vecset (2305.12714)) processes the point cloud to extract geometric features (FG).
- Skeleton Tree Tokenization: A novel method converts the hierarchical skeleton structure (joint positions J and parent relationships P) into a compact sequence of discrete tokens (S). This involves:
- Discretizing normalized bone coordinates ([−1,1]) into 256 bins.
- Using special tokens to denote bone types (`<spring_bone>`, `<mixamo:body>`), structure (`<branch_token>`), and class (`<cls>`). This reduces redundancy compared to naive coordinate concatenation.
- Employing Depth-First Search (DFS) and sorting child bones for consistent ordering.
- This optimized tokenization reduces sequence length by ~30% compared to naive methods (Table V).
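The tokenization steps above can be sketched as follows. This is a minimal illustration, not the paper's actual vocabulary: the concrete token ids, the `<cls>`/`<branch>` encodings, and the per-axis layout are assumptions for demonstration.

```python
import numpy as np

def discretize(coord, bins=256):
    """Map a normalized coordinate in [-1, 1] to an integer bin in [0, bins-1]."""
    return int(np.clip((coord + 1.0) / 2.0 * bins, 0, bins - 1))

def tokenize_skeleton(joints, children, root=0):
    """DFS over the skeleton tree, emitting discretized joint coordinates.
    Ids >= 256 stand in for special tokens (hypothetical values):
    257 ~ <cls>, 256 ~ <branch_token> marking a new branch at a fork."""
    tokens = [257]  # class token first
    def dfs(j):
        tokens.extend(discretize(c) for c in joints[j])
        for i, k in enumerate(sorted(children[j])):  # deterministic child order
            if i > 0:
                tokens.append(256)  # returning to a branching point
            dfs(k)
    dfs(root)
    return tokens
```

Sorting children before recursion is what makes the sequence unique per skeleton, so the model never has to learn several orderings of the same tree.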
- Autoregressive Prediction: An OPT-125M transformer model (2205.01068) takes the geometric features FG and the preceding tokens as input to predict the next token in the sequence S, trained using a next-token prediction loss. Because the tree structure is encoded directly in the token grammar, decoded sequences correspond to topologically valid skeletons, avoiding the invalid topologies that regression-based methods can produce.
- Output: A sequence of tokens decoded back into a skeleton tree ({J,P}).
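A greedy decoding loop for this stage might look like the sketch below, with the transformer stubbed out as a callable; the vocabulary size, end-of-sequence id, and maximum length are illustrative assumptions, not values from the paper.

```python
import numpy as np

def generate_tokens(model, geo_features, eos_id=511, max_len=64):
    """Greedy autoregressive decoding: the model conditions on the geometry
    features F_G plus all previously emitted tokens and scores the next
    token; decoding stops at a (hypothetical) end-of-sequence id."""
    tokens = []
    for _ in range(max_len):
        logits = model(geo_features, tokens)  # (vocab_size,) next-token scores
        nxt = int(np.argmax(logits))
        if nxt == eos_id:
            break
        tokens.append(nxt)
    return tokens
```

In practice sampling strategies other than argmax can be used; the key point is that each step sees both the geometry encoding and the partial skeleton sequence.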
- Skin Weight Prediction via Bone-Point Cross Attention:
- Input: The predicted skeleton tree and the input point cloud.
- Encoders:
- A Point-wise Encoder (pretrained Point Transformer V3 / SAMPart3D (2411.07184, 2404.19720)) extracts per-point features (FP).
- A Bone Encoder (MLP with positional encoding) processes bone head/tail coordinates to get bone features (FB).
- Cross Attention:
- For Skinning Weights: Point features (FP) act as queries, bone features (FB) as keys/values. The attention weights are concatenated with precomputed geodesic distances (D) and passed through an MLP (EW) and a softmax to predict skinning weights (W ∈ R^(N×J)).
- For Bone Attributes: The roles are reversed (bone features as queries, point features as keys/values), and an MLP (EA) predicts bone attributes (A, e.g., physics parameters).
- Loss: A combination of a KL-divergence loss on the skin weights (L_KL) and an L2 loss on the bone attributes (L_2).
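The skinning branch of this cross-attention can be sketched in a few lines of numpy. This is a single-head, single-layer stand-in under assumed shapes (N points, J bones, feature dim d); the real model's encoders and MLP EW are deeper, and the projection matrices here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def predict_skin_weights(Fp, Fb, geodesic, Wq, Wk, W_mlp):
    """Bone-point cross attention for skinning: point features (N, d) are
    the queries, bone features (J, d) the keys. The (N, J) attention map
    is concatenated with geodesic distances D and mapped by a stand-in
    one-layer MLP, then softmax-normalized over bones."""
    Q = Fp @ Wq                                   # (N, d) point queries
    K = Fb @ Wk                                   # (J, d) bone keys
    attn = (Q @ K.T) / np.sqrt(Q.shape[-1])      # (N, J) attention scores
    feats = np.concatenate([attn, geodesic], 1)  # (N, 2J) fused features
    return softmax(feats @ W_mlp, axis=-1)       # (N, J), rows sum to 1
```

The final softmax over the bone axis guarantees each vertex's weights form a valid convex combination, which is what the KL-divergence loss then compares against ground truth.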
Training and Datasets:
- Rig-XL Dataset: A new large-scale dataset curated by the authors, containing 14,611 diverse 3D models (filtered from Objaverse-XL (2307.15880) / Diffusion4D (2405.16645)) with cleaned skeletons and skinning weights. Preprocessing involved filtering, automated categorization using a VLM (GPT-4o (2410.21276)), and manual verification.
- VRoid Dataset: 2,061 anime-style character models from VRoidHub (2112.11686) used for fine-tuning, especially for handling details and spring bones.
- Training Strategy (Skeletal Equivalence): To handle varying bone influence and sampling density:
- Randomly freeze a subset of bones during training iterations.
- Normalize the loss contribution of each bone based on the number of vertices it influences (bone-centric loss normalization).
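The bone-centric normalization can be sketched as below; the function name and the use of a hard influence threshold are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def bone_normalized_skin_loss(per_vertex_loss, skin_weights, eps=1e-8):
    """Average each bone's loss over only the vertices it influences, then
    average over bones, so a bone driving a few vertices (a finger) weighs
    as much as one driving thousands (the torso). Randomly frozen bones
    simply drop out of the active set. Shapes: (N,) and (N, J)."""
    influence = skin_weights > 0.0          # (N, J) influence mask
    counts = influence.sum(axis=0)          # vertices per bone, (J,)
    per_bone = (per_vertex_loss[:, None] * influence).sum(axis=0) / (counts + eps)
    return per_bone[counts > 0].mean()      # ignore bones with no influence
```

Without this normalization, a vertex-averaged loss would be dominated by densely sampled regions and effectively ignore small bones.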
- Indirect Supervision (Physical Simulation): To improve motion realism, especially for spring bones:
- A differentiable Verlet integration physics simulation (based on VRM standard (2112.11686)) is used.
- Short motion sequences are applied to meshes deformed by both predicted and ground-truth parameters.
- An L2 loss between the resulting vertex positions, L_2(X_M, X_M^pred), is added to the overall loss, guiding the model toward physically plausible weights and attributes.
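A single damped Verlet step for one spring-bone point might look like the sketch below. The parameter names (stiffness, damping) and the update form are a simplified assumption, not the VRM spec's exact fields; the point is that every operation is differentiable in the physics parameters, which is what lets the simulation supervise them.

```python
import numpy as np

def verlet_spring_step(x, x_prev, rest_target, stiffness, damping, dt=1.0 / 60.0):
    """One damped Verlet integration step for a spring-bone tail point:
    inertia from the previous step (damped), plus a spring pull toward
    the rest-pose target. Returns the new (current, previous) positions."""
    velocity = (x - x_prev) * (1.0 - damping)   # damped inertia term
    accel = stiffness * (rest_target - x)       # spring force toward rest pose
    x_next = x + velocity + accel * dt * dt     # position-based Verlet update
    return x_next, x
```

During training, a short motion sequence is rolled out with both the predicted and the ground-truth parameters, and the L2 gap between the resulting vertex positions is backpropagated through the simulation.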
Implementation Details:
- Skeleton Prediction: OPT-125M + 3DShape2Vecset-style encoder, trained for 3 days on 8 A100s.
- Skin Weight Prediction: Pretrained Point Transformer V3 (frozen) + trainable Bone Encoder, Cross Attention, MLP decoders. Trained for 1 day on 8 A100s.
- Data Augmentation: Random rotation/scaling, applying random Mixamo motions or bone rotations.
- Dataset Sampling: Adjusted sampling probabilities to balance categories during training.
Results:
- Quantitative: UniRig significantly outperforms RigNet (2005.00559), NBS (2108.07061), and TA-Rig (2305.12678) on skeleton prediction metrics (J2J, J2B, B2B Chamfer Distance) and skinning weight prediction (L1 loss) across Mixamo, VRoid, and Rig-XL validation sets (Tables VI, VIII). It also shows lower mesh deformation error under animation (Table IX).
- Qualitative: Visual comparisons show UniRig produces more detailed and accurate skeletons than academic and commercial methods (Meshy, Anything World, Accurig, Tripo) across diverse models (Figures 6, 7, 8, 11, 12, 13). It accurately predicts weights for fine details (hands, hair) and handles spring bone physics for realistic motion (Figure 9).
- Ablation Studies: Confirm the benefits of the optimized tokenization, indirect physical supervision, and skeletal equivalence training strategy (Tables VII, X, XI, XII).
Practical Applications:
- Automated Rigging: Speeds up the rigging process for diverse 3D models.
- Human-Assisted Auto-Rigging: Allows users to edit the predicted skeleton (e.g., add/remove bones) and regenerate the rig, combining automation with manual control (Figure 10).
- Character Animation: Predicts physics parameters (spring bones), enabling direct use in animation software (e.g., Warudo (2407.15439)) for applications like VTubing (Figure 11).
Limitations:
- Performance depends on the diversity of the training data (Rig-XL); highly out-of-distribution models may still yield degraded rigs.
Conclusion:
UniRig presents a significant advancement in automatic rigging by using an autoregressive approach with novel tokenization for skeleton generation and a cross-attention mechanism for skinning. Combined with the large Rig-XL dataset and specialized training strategies, it achieves state-of-the-art performance and offers practical benefits: robustness across diverse model categories, human-in-the-loop refinement, and animation-ready output.