- The paper introduces a unified pipeline automating both rigging and animation of diverse 3D models.
- It employs transformer-based auto-regressive skeleton generation and topology-aware skinning weight prediction to enhance accuracy and efficiency.
- Differentiable optimization with video guidance produces artifact-free animations that outperform previous state-of-the-art methods.
Puppeteer: Unified Automatic Rigging and Animation for 3D Models
Motivation and Context
The transformation of static 3D models into animated assets remains a major bottleneck in digital content creation pipelines, despite recent advances in generative AI for static 3D geometry and texture synthesis. Manual rigging (skeleton setup and skinning weight assignment) and animation require significant expertise and time, limiting scalability and accessibility. Existing automated rigging methods either rely on template fitting, which generalizes poorly, or on template-free approaches that often produce impractical skeletons. Deep learning-based methods such as RigNet and MagicArticulate have advanced the field, but remain limited in handling complex topologies, computational efficiency, and generalization. Critically, prior work has focused almost exclusively on rigging, leaving animation as a separate, manual process.
Contributions
Puppeteer introduces a unified, fully automatic pipeline for both rigging and animation of diverse 3D models. The framework is built on four key innovations:
- Articulation-XL2.0 Dataset: An expanded dataset of 59.4k rigged models, including 11.4k diverse-pose examples, enabling robust learning and generalization to varied object categories and poses.
- Auto-regressive Skeleton Generation: A transformer-based model with joint-based tokenization and hierarchical sequence ordering with randomization, yielding compact, structurally coherent skeletons without template dependencies.
- Attention-based Skinning Weight Prediction: An architecture incorporating topology-aware joint attention, explicitly encoding skeletal graph structure for robust, efficient skinning weight inference.
- Differentiable Optimization-based Animation: A parameter-free optimization pipeline that leverages reference video guidance to produce stable, high-fidelity animations, eliminating the jittering artifacts common in learning-based approaches.
Technical Approach
Dataset Construction
Articulation-XL2.0 is curated from Objaverse-XL, focusing on high-quality rigged assets and diverse pose configurations. The diverse-pose subset is constructed by extracting frames with maximal deviation from rest pose in animation data and supplementing with synthetic animal poses via SMALR parameterizations. This dataset supports both robust training and rigorous evaluation of generalization.
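The frame-selection heuristic described above, picking poses that deviate most from the rest pose, can be sketched as follows. This is a minimal illustration assuming per-frame joint positions are available; the function name and deviation metric are illustrative, not the paper's exact implementation.

```python
import numpy as np

def pick_diverse_pose(frames, rest):
    """Select the animation frame whose joints deviate most from the rest pose.

    frames: (T, J, 3) joint positions for T frames of J joints
    rest:   (J, 3) rest-pose joint positions
    Returns the index of the most "diverse" frame.
    """
    # Mean per-joint Euclidean deviation from rest, one scalar per frame
    dev = np.linalg.norm(frames - rest[None], axis=-1).mean(axis=1)  # (T,)
    return int(dev.argmax())
```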
Skeleton Generation
Skeleton generation is formulated as a shape-conditioned sequence modeling problem. The pipeline consists of:
- Joint-based Tokenization: Each joint is represented by its 3D coordinates, discretized into a $128^3$ grid, plus its parent index, yielding a compact sequence of length $4j$ for $j$ joints.
- Hierarchical Sequence Ordering with Randomization: Tokens are ordered via breadth-first traversal of the skeletal tree, with randomization applied to enhance bidirectional learning. Target-aware positional indicators guide the generation process.
- Shape-conditioned Auto-regressive Transformer: Surface points with normals are encoded as shape tokens, preceding skeleton tokens in the transformer. The OPT-350M architecture is used, trained with cross-entropy loss for next-token prediction.
This approach avoids template dependencies and produces structurally valid skeletons across diverse object categories.
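The tokenization and ordering steps above can be sketched as follows: coordinates are quantized to grid bins, joints are emitted in breadth-first order so parents precede children, and each joint contributes four tokens (three coordinate bins plus a parent index). This is a simplified sketch under assumed conventions (coordinates normalized to $[-1, 1]$, root parent marked as $-1$); the paper's exact token vocabulary may differ.

```python
import numpy as np

def tokenize_skeleton(joints, parents, grid=128):
    """Tokenize a skeleton as 4 tokens per joint: (x_bin, y_bin, z_bin, parent).

    joints:  (j, 3) coordinates, assumed normalized to [-1, 1]
    parents: length-j list; parents[i] is the parent joint index, -1 for root
    """
    # Quantize coordinates into grid^3 bins
    bins = np.clip(((joints + 1.0) / 2.0 * grid).astype(int), 0, grid - 1)

    # Breadth-first traversal of the skeletal tree (parents before children)
    children = {i: [] for i in range(len(parents))}
    root = 0
    for i, p in enumerate(parents):
        if p < 0:
            root = i
        else:
            children[p].append(i)
    order, queue = [], [root]
    while queue:
        n = queue.pop(0)
        order.append(n)
        queue.extend(children[n])

    # Remap parent indices to BFS positions and flatten into a token sequence
    remap = {old: new for new, old in enumerate(order)}
    tokens = []
    for n in order:
        x, y, z = bins[n]
        par = remap[parents[n]] if parents[n] >= 0 else -1
        tokens.extend([int(x), int(y), int(z), par])
    return tokens
```

The resulting sequence has length $4j$, matching the compact representation described above; the randomized ordering variant would shuffle sibling visit order during the BFS.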
Skinning Weight Prediction
The attention-based network for skinning weight prediction operates as follows:
- Feature Extraction: Mesh points are encoded with part-aware features (via PartField) and positional encoding. Bone embeddings are constructed by concatenating each joint’s own position with its parent’s position.
- Attention Operations: The network applies topology-aware joint self-attention, global context integration via cross-attention with shape latents, bone-point cross-attention, and point feature refinement.
- Topology-aware Joint Attention (TAJA): Self-attention is augmented with relative positional encodings derived from skeletal graph distances, improving the modeling of inter-joint relationships.
- Skinning Weight Computation: Cosine similarity and softmax normalization yield per-vertex skinning weights.
This architecture achieves robust, generalizable skinning weight prediction, outperforming GNN-based and functional diffusion methods in both accuracy and efficiency.
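The final skinning weight computation, cosine similarity between point and bone features followed by softmax normalization, can be sketched as below. The temperature parameter is an illustrative assumption, not taken from the paper.

```python
import numpy as np

def skinning_weights(point_feats, bone_feats, temperature=0.1):
    """Per-vertex skinning weights from feature similarity.

    point_feats: (V, d) refined per-point features
    bone_feats:  (B, d) bone embeddings after attention
    Returns (V, B) non-negative weights summing to 1 per vertex.
    """
    # Cosine similarity = dot product of L2-normalized features
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    b = bone_feats / np.linalg.norm(bone_feats, axis=1, keepdims=True)
    sim = p @ b.T                                 # (V, B), values in [-1, 1]

    # Softmax over bones, with a temperature and max-subtraction for stability
    logits = sim / temperature
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```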
Video-guided Animation
Animation is performed via a differentiable optimization framework:
- Reference Video Generation: The rigged mesh is rendered as the initial frame, and text-to-video models (e.g., Kling AI, JiMeng AI) generate plausible motion sequences.
- Optimization Variables: For each frame, root motion (quaternion and translation) and joint-specific rotations are optimized.
- Loss Functions: Rendering losses (RGB, mask, optical flow, depth) are computed via differentiable rendering (PyTorch3D). Tracking losses use Cotracker3 to align projected 3D joints and vertices with tracked 2D keypoints, with visibility masks ensuring consistency. Regularization enforces motion smoothness.
- Forward Kinematics and Linear Blend Skinning: Optimized transformations are applied via FK and LBS to produce the final animated mesh sequence.
This approach is computationally efficient (20–90 minutes per object, depending on mesh complexity and frame count) and scales linearly with video length and mesh size.
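The final deformation step, forward kinematics followed by linear blend skinning, is standard and can be sketched as below. This is a minimal NumPy illustration (the paper's pipeline runs these steps inside a differentiable renderer); joint transforms are assumed to be $4 \times 4$ homogeneous matrices with parents listed before children.

```python
import numpy as np

def forward_kinematics(local, parents):
    """Compose local 4x4 joint transforms into global transforms.

    local:   (J, 4, 4) per-joint local transforms
    parents: length-J list, parents[i] < i (topologically sorted), -1 for root
    """
    glob = np.empty_like(local)
    for i, p in enumerate(parents):
        glob[i] = local[i] if p < 0 else glob[p] @ local[i]
    return glob

def lbs(vertices, weights, joint_transforms):
    """Linear blend skinning: blend per-joint transforms by skinning weights.

    vertices:         (V, 3) rest-pose vertex positions
    weights:          (V, J) skinning weights (rows sum to 1)
    joint_transforms: (J, 4, 4) global joint transforms from FK
    """
    V_h = np.hstack([vertices, np.ones((len(vertices), 1))])       # homogeneous
    blended = np.einsum('vj,jab->vab', weights, joint_transforms)  # (V, 4, 4)
    out = np.einsum('vab,vb->va', blended, V_h)
    return out[:, :3]
```

In the full pipeline, the per-frame root motion and joint rotations being optimized would parameterize `local` each frame, and the loss gradients flow back through these two operations.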
Experimental Results
Rigging
- Skeleton Generation: Puppeteer achieves lower Chamfer Distance metrics (CD-J2J, CD-J2B, CD-B2B) than Pinocchio, RigNet, MagicArticulate, and UniRig across Articulation-XL2.0, ModelsResource, and diverse-pose benchmarks. Incorporation of the diverse-pose subset further improves generalization.
- Skinning Weight Prediction: Puppeteer outperforms GVB, RigNet, and MagicArticulate in precision, recall, and L1-norm error. Inference is significantly faster (up to 59× over GVB). Deformation error analysis confirms superior practical efficacy.
Animation
- Video-guided Animation: Compared to L4GM and MotionDreamer, Puppeteer produces temporally consistent, artifact-free animations, preserving geometry and motion quality. User studies confirm superior video-animation alignment, motion realism, and geometry preservation.
Ablation Studies
- Skeleton Generation: Pose augmentation, order randomization, joint-based tokenization, and hierarchical ordering are all critical for optimal performance.
- Skinning Weight Prediction: Bone embeddings, TAJA, part-aware features, and pose augmentation each contribute significantly to accuracy and generalization.
Limitations and Future Directions
Puppeteer does not capture fine-scale deformations (e.g., hair, cloth) due to sparse joint density in highly deformable regions. The animation stage relies on per-scene optimization, precluding real-time deployment. Future work may explore animation-driven joint refinement, end-to-end feed-forward animation models, adaptive frame sampling for complex motion, and multi-view priors to address occlusion.
Implications
Puppeteer democratizes 3D animation by removing the need for specialized expertise, enabling scalable, accessible content creation for diverse users. The release of Articulation-XL2.0 will facilitate further research in automated rigging and animation. However, increased accessibility raises concerns about potential misuse and impacts on traditional animation employment, necessitating responsible deployment.
Conclusion
Puppeteer presents a unified, scalable solution for automatic rigging and animation of 3D models, integrating transformer-based skeleton generation, topology-aware skinning weight prediction, and efficient optimization-based animation. Extensive empirical results demonstrate state-of-the-art performance in skeleton fidelity, skinning accuracy, and animation smoothness across diverse benchmarks. The framework and dataset set a new standard for automated 3D content creation and open avenues for future research in real-time, fine-grained animation and broader application domains.