- The paper introduces a unified pipeline automating both rigging and animation of diverse 3D models.
- It employs transformer-based auto-regressive skeleton generation and topology-aware skinning weight prediction to enhance accuracy and efficiency.
- Differentiable optimization with video guidance produces artifact-free animations that outperform previous state-of-the-art methods.
Puppeteer: Unified Automatic Rigging and Animation for 3D Models
Motivation and Context
The transformation of static 3D models into animated assets remains a major bottleneck in digital content creation pipelines, despite recent advances in generative AI for static 3D geometry and texture synthesis. Manual rigging (skeleton setup and skinning weight assignment) and animation require significant expertise and time, limiting scalability and accessibility. Existing automated rigging methods either rely on template fitting, which generalizes poorly, or on template-free approaches that often produce impractical skeletons. Deep learning-based methods such as RigNet and MagicArticulate have advanced the field, but remain limited in handling complex topologies, computational efficiency, and generalization. Critically, prior work has focused almost exclusively on rigging, leaving animation as a separate, manual process.
Contributions
Puppeteer introduces a unified, fully automatic pipeline for both rigging and animation of diverse 3D models. The framework is built on four key innovations:
- Articulation-XL2.0 Dataset: An expanded dataset of 59.4k rigged models, including 11.4k diverse-pose examples, enabling robust learning and generalization to varied object categories and poses.
- Auto-regressive Skeleton Generation: A transformer-based model with joint-based tokenization and hierarchical sequence ordering with randomization, yielding compact, structurally coherent skeletons without template dependencies.
- Attention-based Skinning Weight Prediction: An architecture incorporating topology-aware joint attention, explicitly encoding skeletal graph structure for robust, efficient skinning weight inference.
- Differentiable Optimization-based Animation: A parameter-free optimization pipeline that leverages reference video guidance to produce stable, high-fidelity animations, eliminating the jittering artifacts common in learning-based approaches.
Technical Approach
Dataset Construction
Articulation-XL2.0 is curated from Objaverse-XL, focusing on high-quality rigged assets and diverse pose configurations. The diverse-pose subset is constructed by extracting frames with maximal deviation from rest pose in animation data and supplementing with synthetic animal poses via SMALR parameterizations. This dataset supports both robust training and rigorous evaluation of generalization.
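The frame-selection heuristic described above, picking poses that deviate most from the rest pose, can be sketched as follows. This is a minimal illustration assuming per-frame joint positions are available; the function name and deviation metric are illustrative, not the paper's exact implementation.

```python
import numpy as np

def pick_diverse_pose(frames, rest):
    """Select the animation frame whose joints deviate most from the rest pose.

    frames: (T, J, 3) joint positions for T frames of J joints
    rest:   (J, 3) rest-pose joint positions
    Returns the index of the most "diverse" frame.
    """
    # Mean per-joint Euclidean deviation from rest, one scalar per frame
    dev = np.linalg.norm(frames - rest[None], axis=-1).mean(axis=1)  # (T,)
    return int(dev.argmax())
```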
Skeleton Generation
Skeleton generation is formulated as a shape-conditioned sequence modeling problem. The pipeline consists of:
- Joint-based Tokenization: Each joint is represented by its 3D coordinates, discretized into a $128^3$ grid, plus its parent index, yielding a compact sequence of length $4j$ for $j$ joints.
- Hierarchical Sequence Ordering with Randomization: Tokens are ordered via breadth-first traversal of the skeletal tree, with randomization applied to enhance bidirectional learning. Target-aware positional indicators guide the generation process.
- Shape-conditioned Auto-regressive Transformer: Surface points with normals are encoded as shape tokens, preceding skeleton tokens in the transformer. The OPT-350M architecture is used, trained with cross-entropy loss for next-token prediction.
This approach avoids template dependencies and produces structurally valid skeletons across diverse object categories.
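The tokenization and ordering steps above can be sketched as follows: coordinates are quantized to grid bins, joints are emitted in breadth-first order so parents precede children, and each joint contributes four tokens (three coordinate bins plus a parent index). This is a simplified sketch under assumed conventions (coordinates normalized to $[-1, 1]$, root parent marked as $-1$); the paper's exact token vocabulary may differ.

```python
import numpy as np

def tokenize_skeleton(joints, parents, grid=128):
    """Tokenize a skeleton as 4 tokens per joint: (x_bin, y_bin, z_bin, parent).

    joints:  (j, 3) coordinates, assumed normalized to [-1, 1]
    parents: length-j list; parents[i] is the parent joint index, -1 for root
    """
    # Quantize coordinates into grid^3 bins
    bins = np.clip(((joints + 1.0) / 2.0 * grid).astype(int), 0, grid - 1)

    # Breadth-first traversal of the skeletal tree (parents before children)
    children = {i: [] for i in range(len(parents))}
    root = 0
    for i, p in enumerate(parents):
        if p < 0:
            root = i
        else:
            children[p].append(i)
    order, queue = [], [root]
    while queue:
        n = queue.pop(0)
        order.append(n)
        queue.extend(children[n])

    # Remap parent indices to BFS positions and flatten into a token sequence
    remap = {old: new for new, old in enumerate(order)}
    tokens = []
    for n in order:
        x, y, z = bins[n]
        par = remap[parents[n]] if parents[n] >= 0 else -1
        tokens.extend([int(x), int(y), int(z), par])
    return tokens
```

The resulting sequence has length $4j$, matching the compact representation described above; the randomized ordering variant would shuffle sibling visit order during the BFS.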
Skinning Weight Prediction
The attention-based network for skinning weight prediction operates as follows:
- Feature Extraction: Mesh points are encoded with part-aware features (via PartField) and positional encoding. Bone embeddings are constructed by concatenating each joint’s own position with its parent’s position.
- Attention Operations: The network applies topology-aware joint self-attention, global context integration via cross-attention with shape latents, bone-point cross-attention, and point feature refinement.
- Topology-aware Joint Attention (TAJA): Self-attention is augmented with relative positional encodings derived from skeletal graph distances, improving the modeling of inter-joint relationships.
- Skinning Weight Computation: Cosine similarity and softmax normalization yield per-vertex skinning weights.
This architecture achieves robust, generalizable skinning weight prediction, outperforming GNN-based and functional diffusion methods in both accuracy and efficiency.
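The final skinning weight computation, cosine similarity between point and bone features followed by softmax normalization, can be sketched as below. The temperature parameter is an illustrative assumption, not taken from the paper.

```python
import numpy as np

def skinning_weights(point_feats, bone_feats, temperature=0.1):
    """Per-vertex skinning weights from feature similarity.

    point_feats: (V, d) refined per-point features
    bone_feats:  (B, d) bone embeddings after attention
    Returns (V, B) non-negative weights summing to 1 per vertex.
    """
    # Cosine similarity = dot product of L2-normalized features
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    b = bone_feats / np.linalg.norm(bone_feats, axis=1, keepdims=True)
    sim = p @ b.T                                 # (V, B), values in [-1, 1]

    # Softmax over bones, with a temperature and max-subtraction for stability
    logits = sim / temperature
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```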
Video-guided Animation
Animation is performed via a differentiable optimization framework:
- Reference Video Generation: The rigged mesh is rendered as the initial frame, and text-to-video models (e.g., Kling AI, JiMeng AI) generate plausible motion sequences.
- Optimization Variables: For each frame, root motion (quaternion and translation) and joint-specific rotations are optimized.
- Loss Functions: Rendering losses (RGB, mask, optical flow, depth) are computed via differentiable rendering (PyTorch3D). Tracking losses use Cotracker3 to align projected 3D joints and vertices with tracked 2D keypoints, with visibility masks ensuring consistency. Regularization enforces motion smoothness.
- Forward Kinematics and Linear Blend Skinning: Optimized transformations are applied via FK and LBS to produce the final animated mesh sequence.
This approach is computationally efficient (20–90 minutes per object, depending on mesh complexity and frame count) and scales linearly with video length and mesh size.
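The final deformation step, forward kinematics followed by linear blend skinning, is standard and can be sketched as below. This is a minimal NumPy illustration (the paper's pipeline runs these steps inside a differentiable renderer); joint transforms are assumed to be $4 \times 4$ homogeneous matrices with parents listed before children.

```python
import numpy as np

def forward_kinematics(local, parents):
    """Compose local 4x4 joint transforms into global transforms.

    local:   (J, 4, 4) per-joint local transforms
    parents: length-J list, parents[i] < i (topologically sorted), -1 for root
    """
    glob = np.empty_like(local)
    for i, p in enumerate(parents):
        glob[i] = local[i] if p < 0 else glob[p] @ local[i]
    return glob

def lbs(vertices, weights, joint_transforms):
    """Linear blend skinning: blend per-joint transforms by skinning weights.

    vertices:         (V, 3) rest-pose vertex positions
    weights:          (V, J) skinning weights (rows sum to 1)
    joint_transforms: (J, 4, 4) global joint transforms from FK
    """
    V_h = np.hstack([vertices, np.ones((len(vertices), 1))])       # homogeneous
    blended = np.einsum('vj,jab->vab', weights, joint_transforms)  # (V, 4, 4)
    out = np.einsum('vab,vb->va', blended, V_h)
    return out[:, :3]
```

In the full pipeline, the per-frame root motion and joint rotations being optimized would parameterize `local` each frame, and the loss gradients flow back through these two operations.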
Experimental Results
Rigging
- Skeleton Generation: Puppeteer achieves lower Chamfer Distance metrics (CD-J2J, CD-J2B, CD-B2B) than Pinocchio, RigNet, MagicArticulate, and UniRig across Articulation-XL2.0, ModelsResource, and diverse-pose benchmarks. Incorporation of the diverse-pose subset further improves generalization.
- Skinning Weight Prediction: Puppeteer outperforms GVB, RigNet, and MagicArticulate in precision, recall, and L1-norm error. Inference is significantly faster (up to 59× over GVB). Deformation error analysis confirms superior practical efficacy.
Animation
- Video-guided Animation: Compared to L4GM and MotionDreamer, Puppeteer produces temporally consistent, artifact-free animations, preserving geometry and motion quality. User studies confirm superior video-animation alignment, motion realism, and geometry preservation.
Ablation Studies
- Skeleton Generation: Pose augmentation, order randomization, joint-based tokenization, and hierarchical ordering are all critical for optimal performance.
- Skinning Weight Prediction: Bone embeddings, TAJA, part-aware features, and pose augmentation each contribute significantly to accuracy and generalization.
Limitations and Future Directions
Puppeteer does not capture fine-scale deformations (e.g., hair, cloth) due to sparse joint density in highly deformable regions. The animation stage relies on per-scene optimization, precluding real-time deployment. Future work may explore animation-driven joint refinement, end-to-end feed-forward animation models, adaptive frame sampling for complex motion, and multi-view priors to address occlusion.
Implications
Puppeteer democratizes 3D animation by removing the need for specialized expertise, enabling scalable, accessible content creation for diverse users. The release of Articulation-XL2.0 will facilitate further research in automated rigging and animation. However, increased accessibility raises concerns about potential misuse and impacts on traditional animation employment, necessitating responsible deployment.
Conclusion
Puppeteer presents a unified, scalable solution for automatic rigging and animation of 3D models, integrating transformer-based skeleton generation, topology-aware skinning weight prediction, and efficient optimization-based animation. Extensive empirical results demonstrate state-of-the-art performance in skeleton fidelity, skinning accuracy, and animation smoothness across diverse benchmarks. The framework and dataset set a new standard for automated 3D content creation and open avenues for future research in real-time, fine-grained animation and broader application domains.