- The paper introduces an auto-regressive transformer that treats skeleton generation as a sequence modeling task to effectively handle variable bone counts.
- The paper employs a functional diffusion process with volumetric geodesic priors to accurately predict skinning weights for seamless mesh deformation.
- The paper presents the Articulation-XL dataset of over 33,000 diverse 3D models, enabling robust training and superior generalization across complex topologies.
MagicArticulate is a framework designed to automatically convert static 3D models into articulation-ready assets, enabling realistic animation. Traditional methods for this task are manual, time-consuming, and require specialized expertise. Existing automatic or learning-based approaches are often limited by the lack of large-scale, diverse datasets and struggle with varied object structures and complex mesh topologies.
To address these challenges, the paper introduces three key contributions:
- Articulation-XL: A large-scale benchmark dataset containing over 33,000 3D models with high-quality articulation annotations, curated from Objaverse-XL. This dataset is crucial for training generalizable learning-based models. The curation pipeline combines initial filtering, VLM-based quality filtering using GPT-4o, and VLM-based category annotation to ensure diversity and quality across object types, with bone counts ranging from 2 to 100.
- Auto-regressive Skeleton Generation: A novel method that formulates skeleton generation as a sequence modeling problem using a decoder-only transformer (specifically, the OPT-350M model). This approach handles the inherent variability in the number of bones and joints across different 3D models. The input mesh is first represented as a sampled point cloud (8,192 points are found effective), which is then encoded into a fixed-length feature sequence using a pre-trained shape encoder. This shape encoding is prepended to a sequence representing the skeleton's bones, where each bone is tokenized by its two joint coordinates discretized on a 128³ grid. The model learns to generate this token sequence auto-regressively, conditioned on the shape tokens. Two sequence ordering strategies for bones are explored: spatial (based on sorted joint coordinates) and hierarchical (based on parent-child relationships), with spatial ordering yielding slightly better quantitative results in experiments.
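The tokenization step above can be sketched as follows. This is an illustrative reading, not the authors' code: joint coordinates normalized to [−1, 1] are binned into 128 levels per axis (the 128³ grid), so each bone becomes six integer tokens, and one plausible "spatial ordering" sorts bones by their first joint's coordinates.

```python
import numpy as np

# Hypothetical sketch of the bone tokenization: each joint coordinate
# (assumed normalized to [-1, 1]) is binned into 128 levels per axis,
# so a bone (two joints) becomes 6 integer tokens.
NUM_BINS = 128

def discretize(coord):
    """Map a coordinate in [-1, 1] to an integer token in [0, 127]."""
    t = (coord + 1.0) / 2.0  # rescale to [0, 1]
    return int(np.clip(np.floor(t * NUM_BINS), 0, NUM_BINS - 1))

def tokenize_bone(joint_a, joint_b):
    """A bone is a pair of joints; each joint contributes (x, y, z) tokens."""
    return [discretize(c) for c in (*joint_a, *joint_b)]

def spatial_order(bones):
    """Sort bones by their first joint's (z, y, x) tuple -- one plausible
    reading of the paper's 'spatial ordering'; the exact key may differ."""
    return sorted(bones, key=lambda b: (b[0][2], b[0][1], b[0][0]))
```

The ordered, tokenized bones are then concatenated behind the fixed-length shape tokens to form the transformer's training sequence.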
- Functional Diffusion for Skinning Weight Prediction: A method to predict skinning weights, which define how mesh vertices are influenced by joints, using a functional diffusion process. Skinning weights are treated as a continuous function over the mesh surface. The model takes sampled points from the mesh surface as input and predicts an n-dimensional skinning weight vector for each point, where n is the maximum number of joints. To improve accuracy and stability, the process is conditioned on global shape features (from the same pre-trained encoder used for skeleton generation) and joint coordinates. Crucially, it incorporates volumetric geodesic distance priors between vertices and joints, learning to predict the residual between the ground truth weights and these geometric priors. A DDPM scheduler is used, and weights and geodesic priors are normalized to [−1,1] before noise is added.
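The residual setup described above can be sketched in a few lines. This is a toy illustration under stated assumptions (min-max normalization, a standard DDPM forward step), not the paper's implementation: both the ground-truth weights and the geodesic-distance prior are normalized to [−1, 1], and the diffusion model's target is their difference.

```python
import numpy as np

# Toy sketch (not the authors' code) of the residual target for the
# skinning-weight diffusion: weights and geodesic priors are normalized
# to [-1, 1], and the model learns the residual between them under a
# DDPM-style forward noising process.

def normalize(x):
    """Min-max normalize values into [-1, 1] (normalization scheme assumed)."""
    lo, hi = x.min(), x.max()
    return 2.0 * (x - lo) / (hi - lo + 1e-8) - 1.0

def ddpm_forward(x0, t, alphas_cumprod, noise):
    """Standard DDPM forward step: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps."""
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

# 8 sampled surface points, 4 joints (toy sizes for illustration)
weights = normalize(np.random.rand(8, 4))  # ground-truth skinning weights
prior = normalize(np.random.rand(8, 4))    # geodesic-distance-derived prior
residual = weights - prior                 # what the network learns to predict
```

At inference, the predicted residual is added back to the geodesic prior and de-normalized to recover per-point skinning weight vectors.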
For practical implementation, the framework operates in two stages. First, the auto-regressive transformer generates the skeleton structure (joint locations and bone connectivity). Second, the functional diffusion model predicts the skinning weights, conditioned on the input mesh geometry and the generated skeleton. The skeleton generation model is trained using a cross-entropy loss for next-token prediction. The skinning weight model is trained using an L2 loss on the predicted function output (the residual).
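The two training objectives named above reduce to standard losses. The sketch below is illustrative (toy shapes, NumPy instead of a deep-learning framework): mean next-token cross-entropy over the skeleton token sequence, and mean squared error on the predicted residual.

```python
import numpy as np

# Illustrative versions of the two losses: next-token cross-entropy for
# the skeleton transformer, and L2 on the diffusion model's residual output.

def cross_entropy(logits, targets):
    """Mean next-token cross-entropy over a (seq_len, vocab) logits array."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def l2_loss(pred_residual, true_residual):
    """Mean squared error on the predicted skinning-weight residual."""
    return np.mean((pred_residual - true_residual) ** 2)
```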
The framework is evaluated on both the newly introduced Articulation-XL dataset and the smaller ModelsResource dataset. Experiments show that MagicArticulate significantly outperforms existing methods like Pinocchio (template-based) and RigNet (learning-based) in both skeleton generation (measured by Chamfer Distance metrics: CD-J2J, CD-J2B, CD-B2B) and skinning weight prediction (measured by precision, recall, L1-norm error, and deformation error). The auto-regressive nature and functional diffusion approach allow MagicArticulate to generalize better to diverse object categories and varying mesh orientations compared to baselines that struggle with complex topologies or depend on consistent input orientation. Cross-dataset evaluation confirms its superior generalization ability on unseen data distributions, including AI-generated meshes and 3D scans.
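CD-J2J, the joint-to-joint variant of the Chamfer distance metrics above, can be sketched as a symmetric nearest-neighbor average between the predicted and ground-truth joint sets (the paper's exact normalization may differ; this is an assumption):

```python
import numpy as np

def chamfer_j2j(pred_joints, gt_joints):
    """Symmetric Chamfer distance between two (N, 3) / (M, 3) joint sets:
    mean nearest-neighbor distance in each direction, summed. Illustrative
    form; the paper's normalization convention may differ."""
    d = np.linalg.norm(pred_joints[:, None, :] - gt_joints[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

CD-J2B and CD-B2B follow the same pattern but measure point-to-segment and segment-to-segment distances over bones rather than joints.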
Ablation studies validate the importance of key components, including VLM-based data filtering for skeleton generation quality, the number of sampled points for shape conditioning, the incorporation of volumetric geodesic distance priors, normalization strategies, and the use of global shape features for skinning weight prediction accuracy.
The resulting articulated models can be exported in standard formats (like FBX, GLB) and used directly in animation software such as Blender or Autodesk Maya, making the framework valuable for large-scale 3D content creation pipelines in gaming, VR/AR, and robotics. While robust, the method currently faces limitations with very coarse mesh inputs and dataset coverage for certain common articulated objects, suggesting avenues for future work.