
AnimaX: 3D Diffusion & Skeleton Animation

Updated 14 July 2025
  • AnimaX is a 3D animation framework that fuses video diffusion motion priors with skeleton-based control to animate arbitrary 3D meshes.
  • It employs a joint video–pose diffusion method with shared positional encodings and modality-aware embeddings to ensure accurate temporal-spatial alignment.
  • Trained on over 160,000 rigged animations, AnimaX achieves state-of-the-art fidelity and efficiency for scalable applications in gaming, film, and VR.

AnimaX is a 3D animation framework that fuses the motion priors of large-scale video diffusion models with the controllability of skeleton-based animation, providing a feed-forward, category-agnostic solution for animating arbitrary 3D articulated meshes. Unlike traditional techniques that are constrained by fixed skeletal topologies or are computationally expensive due to high-dimensional optimization, AnimaX effectively transfers expressive motion knowledge from 2D video domains into physically consistent 3D mesh deformations. The framework introduces a joint video–pose diffusion methodology, shared positional encodings for cross-modal temporal-spatial alignment, and modality-aware embeddings, supporting diverse mesh types and arbitrary skeletal structures. Trained on a rigorously curated dataset of over 160,000 rigged animation sequences, AnimaX attains state-of-the-art results on generalization, motion fidelity, and efficiency as benchmarked by VBench, and is publicly available for further research and development (2506.19851).

1. Framework Architecture and Methodology

AnimaX operates through a two-stage pipeline:

  1. Joint Multi-View Video–Pose Diffusion: The pipeline takes as input template renderings of the target 3D mesh: standard multi-view RGB images and 2D pose maps in which each skeletal joint is depicted by a uniquely colored circular marker. Conditioned on these renderings and a textual motion prompt describing the intended motion, a diffusion model is trained to jointly generate synchronized multi-view RGB video sequences and corresponding pose maps. This design ensures the diffusion model learns not only the appearance transformation but also the structural dynamics of the skeleton.
  2. 3D Joint Position Reconstruction and Mesh Animation: The output 2D pose maps from multiple camera viewpoints are triangulated into consistent 3D joint positions by minimizing the multi-view reprojection error, with an additional loss term for bone-length preservation. The resulting 3D positions are then converted via an inverse kinematics (IK) step into bone rotations that animate the 3D mesh in a physically plausible fashion.

This differs from prior approaches that either optimize directly over high-dimensional mesh deformation fields, which is computationally costly, or are restricted to fixed, pre-defined skeleton types, which hinders generalization.
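
The stage-2 conversion from recovered joint positions to rotations can be illustrated with a deliberately simplified sketch. The snippet below is not the authors' IK solver: it ignores the joint hierarchy and bone twist, and only shows, per bone, the shortest rotation aligning the rest-pose direction with the triangulated target direction.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def bone_rotations(rest_joints, target_joints, bones):
    """Per-bone rotations aligning rest-pose bone directions to target directions.

    rest_joints, target_joints : (J, 3) float arrays of joint positions
    bones                      : list of (parent, child) joint index pairs
    """
    rotations = []
    for parent, child in bones:
        rest_dir = rest_joints[child] - rest_joints[parent]
        target_dir = target_joints[child] - target_joints[parent]
        rest_dir = rest_dir / np.linalg.norm(rest_dir)
        target_dir = target_dir / np.linalg.norm(target_dir)
        # Shortest rotation mapping rest_dir onto target_dir (axis-angle form).
        axis = np.cross(rest_dir, target_dir)
        sin_a = np.linalg.norm(axis)
        cos_a = np.dot(rest_dir, target_dir)
        if sin_a < 1e-8:
            # Parallel directions: no rotation (the anti-parallel edge case
            # is ignored in this sketch).
            rotations.append(Rotation.identity())
            continue
        angle = np.arctan2(sin_a, cos_a)
        rotations.append(Rotation.from_rotvec(angle * axis / sin_a))
    return rotations

# Toy usage: a single bone rotated 90 degrees about the z-axis.
rest = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
target = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(bone_rotations(rest, target, [(0, 1)])[0].as_rotvec())   # ~ [0, 0, pi/2]
```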

2. Technical Innovations and Alignment Mechanisms

Multiple architectural components enable AnimaX to bridge the modality gap between 2D diffused motion and 3D animation:

  • Joint Video–Pose Diffusion:

Rather than independently predicting sparse pose trajectories (which can lead to loss of motion detail or collapse), the network is trained to output both RGB video and pose map sequences in a unified token space. This leverages pre-trained video diffusion backbones for temporal coherence and motion richness.

  • Shared Positional Encodings:

To facilitate correct alignment between the modalities, spatial and temporal positional encodings are shared across corresponding tokens in the RGB and pose streams. For a token at index $(i, j, k)$ in the RGB branch,

$$PE^{(i,j,k)} = PE^{(i + (f+2),\, j,\, k)} = R(i, j, k),$$

where $R(i, j, k)$ is a RoPE-format rotation matrix. This guarantees spatial-temporal token correspondence for both visual and pose representations.
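
As a minimal numerical sketch of this sharing, assume a standard 1D RoPE over the temporal axis and let f be the number of video frames per view: the pose token corresponding to RGB frame i is assigned the rotation for index i rather than for its raw sequence position i + (f + 2). The helper functions below are illustrative, not the paper's implementation.

```python
import numpy as np

def rope_angles(pos: int, dim: int, base: float = 10000.0) -> np.ndarray:
    """RoPE rotation angles for a 1D position (one angle per channel pair)."""
    half = dim // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))
    return pos * freqs

def apply_rope(x: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Rotate consecutive channel pairs of x by the given angles."""
    x1, x2 = x[0::2], x[1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

f, dim = 16, 8
frame_idx = 3                                   # temporal index i of an RGB token
rgb_angles = rope_angles(frame_idx, dim)
# Shared encoding: the matching pose token reuses the RGB frame's rotation
# instead of rope_angles(frame_idx + f + 2, dim).
pose_angles = rope_angles(frame_idx, dim)

token = np.random.randn(dim)
assert np.allclose(apply_rope(token, rgb_angles), apply_rope(token, pose_angles))
```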

  • Modality-Aware Embeddings:

The addition of constant identifiers (passed through frequency encoding and linear networks) distinguishes between RGB and pose tokens, improving modality alignment and making the cross-attention layers aware of the token sources.
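
A possible form of such an embedding, sketched below under the assumption of a sinusoidal frequency encoding of the constant identifier followed by a small two-layer projection (layer sizes and activation are illustrative, not the paper's exact design). The resulting vector is added to every token of its stream so attention layers can distinguish RGB tokens from pose tokens.

```python
import torch
import torch.nn as nn

class ModalityEmbedding(nn.Module):
    """Map a constant modality id (0 = RGB, 1 = pose) to an additive embedding."""

    def __init__(self, num_freqs: int = 8, embed_dim: int = 256):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs) * torch.pi)
        self.proj = nn.Sequential(
            nn.Linear(2 * num_freqs, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, modality_id: torch.Tensor) -> torch.Tensor:
        angles = modality_id.float().unsqueeze(-1) * self.freqs
        enc = torch.cat([angles.sin(), angles.cos()], dim=-1)   # frequency encoding
        return self.proj(enc)

embed = ModalityEmbedding()
rgb_tokens = torch.randn(2, 77, 256) + embed(torch.tensor(0))    # broadcast over tokens
pose_tokens = torch.randn(2, 77, 256) + embed(torch.tensor(1))
```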

  • Camera Conditioning and Multi-View Self-Attention:

Plücker ray maps encode the relative camera poses of all rendered views. Multi-view consistency is enforced via self-attention across view-specific token representations, ensuring generated motions are coherent from all perspectives.
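
Plücker ray maps follow a standard construction: each pixel ray is encoded by its unit direction d and its moment o × d, where o is the camera center in world coordinates. A minimal sketch for a pinhole camera with intrinsics K and world-to-camera extrinsics (R, t):

```python
import numpy as np

def plucker_ray_map(K: np.ndarray, R: np.ndarray, t: np.ndarray,
                    height: int, width: int) -> np.ndarray:
    """Build an (H, W, 6) Plücker ray map for a pinhole camera."""
    cam_center = -R.T @ t                               # camera origin in world space
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    dirs_cam = pixels @ np.linalg.inv(K).T              # back-project pixels to camera rays
    dirs_world = dirs_cam @ R                           # apply R.T row-wise: rays in world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    moments = np.cross(np.broadcast_to(cam_center, dirs_world.shape), dirs_world)
    return np.concatenate([dirs_world, moments], axis=-1)   # per-pixel (d, o x d)
```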

3. 3D Motion Representation via Multi-View 2D Pose Projections

AnimaX models 3D motion by projecting each mesh joint onto the image planes of multiple virtual cameras, placing a color-coded circular marker at the projection location per frame. This sparse yet expressive representation enables efficient and robust estimation of 3D trajectories via triangulation. The process solves a non-linear least-squares optimization:

  • Objective: Minimize joint reprojection error across all cameras while preserving bone lengths.
  • Result: Accurate and temporally smooth 3D joint positions, which drive the downstream mesh animation with near-kinematic fidelity.

This representation exploits the strengths of both low-level skeletal animation (control and modularity) and high-level diffused motion patterns (expressiveness and diversity).
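
A compact sketch of this objective using a generic non-linear least-squares solver is given below; the residual weighting, initialization, and per-frame independence are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np
from scipy.optimize import least_squares

def triangulate_skeleton(joints_2d, projections, bones, rest_lengths,
                         lambda_bone: float = 1.0):
    """Recover 3D joint positions for one frame from multi-view 2D detections.

    joints_2d    : (V, J, 2) detected joint centers per view
    projections  : (V, 3, 4) camera projection matrices P = K [R | t]
    bones        : list of (parent, child) joint index pairs
    rest_lengths : (B,) bone lengths of the rest-pose skeleton
    """
    V, J, _ = joints_2d.shape

    def residuals(x):
        X = x.reshape(J, 3)
        Xh = np.concatenate([X, np.ones((J, 1))], axis=1)       # homogeneous coordinates
        res = []
        for v in range(V):
            proj = Xh @ projections[v].T                        # (J, 3)
            uv = proj[:, :2] / proj[:, 2:3]
            res.append((uv - joints_2d[v]).ravel())             # reprojection error
        lengths = np.linalg.norm([X[c] - X[p] for p, c in bones], axis=1)
        res.append(lambda_bone * (lengths - rest_lengths))      # bone-length preservation
        return np.concatenate(res)

    # Zero initialization for brevity; a linear (DLT) triangulation would
    # give a better starting point in practice.
    sol = least_squares(residuals, np.zeros(J * 3))
    return sol.x.reshape(J, 3)
```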

4. Dataset Curation and Training Procedure

AnimaX's learning process is founded on a newly curated dataset of 161,023 rigged animation samples sourced from datasets such as Objaverse, Mixamo, and VRoid.

  • Diversity: The dataset covers a broad taxonomy, including humanoid, animal, anthropomorphic, and inanimate (e.g., furniture) meshes.
  • Rendering Protocol: For each animation sample, multi-view videos and 2D pose maps are rendered under several camera configurations and paired with descriptive captions produced by a vision-language model.
  • Training Strategy:
    • First, a single-view video–pose diffusion model is fine-tuned via LoRA adaptation.
    • Next, with backbone weights frozen, additional modules (camera embeddings and multi-view attention) are trained to extend the architecture to robustly handle multiple viewpoints.

This large-scale, structurally balanced dataset is fundamental to the model’s ability to generalize across arbitrary mesh identities and skeletons.
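
The second stage can be sketched as a standard parameter-freezing setup in which the diffusion backbone receives no gradients and only the newly added multi-view modules are optimized. The module-name substrings below (camera_embed, multiview_attn) are hypothetical placeholders, not the authors' identifiers.

```python
import torch

def configure_stage2(model: torch.nn.Module, lr: float = 1e-4):
    """Freeze the diffusion backbone and optimize only the multi-view modules."""
    trainable = []
    for name, param in model.named_parameters():
        if "camera_embed" in name or "multiview_attn" in name:
            param.requires_grad = True          # newly added multi-view modules
            trainable.append(param)
        else:
            param.requires_grad = False         # frozen video-pose diffusion backbone
    return torch.optim.AdamW(trainable, lr=lr)
```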

5. Performance Evaluation and Benchmarking

AnimaX is rigorously evaluated on the VBench suite, which measures several aspects of animation quality:

  • I2V Subject: Assessing fidelity to the input image and prompt.
  • Motion Smoothness: Evaluating temporal coherence and absence of artefacts.
  • Dynamic Degree and Failure Rate: Quantifying the richness of motion generation and frequency of near-static (failure) outputs.
  • Aesthetic Quality: Judging the appeal and visual plausibility of the rendered animations.

On all metrics, AnimaX demonstrates state-of-the-art performance relative to contemporary methods. An average inference time of roughly six minutes per animation indicates practical scalability for modern production pipelines. User studies further validate its superiority in motion–text congruence, shape preservation, and overall visual quality.

6. Applications, Generalization, and Scalability

The category-agnostic design of AnimaX enables direct application across domains:

  • Video Game Development: Automated rigged character animation for stylized and realistic avatars.
  • Film and Broadcast: Flexible tool for generating 3D animated sequences from mesh templates and textual prompts, bypassing lengthy manual rigging.
  • Virtual and Augmented Reality: Supports real-time or near-real-time 3D avatar animation in immersive settings.
  • Digital Content Production: Efficient pipeline for broadcast graphics, advertising, or educational content requiring lifelike or novel mesh animations.

Feed-forward inference allows scalable batch generation, and the system adapts to new asset types via template conditioning and minimal dataset-specific engineering.

7. Limitations and Future Work

Several frontier challenges are identified by the authors for subsequent investigation:

  • Dynamic Camera Control: The current method assumes a static, fixed set of rendered views. Extension to moving cameras or larger viewpoint diversity could improve realism for action-intensive or cinematographic sequences.
  • Long-Form Animation: While effective for short-to-medium sequences, generating temporally consistent long-form animations remains constrained by the underlying diffusion backbone. Proposed solutions include test-time adaptation and autoregressive denoising.
  • Expressiveness and Robustness: Enhanced conditioning inputs (detailed text, environmental cues), further development in modality alignment strategies, and training on yet more diverse datasets may increase both the diversity and physical plausibility of resulting movements.

These limitations guide ongoing research, with the ultimate aim of achieving robust, open-category, high-fidelity 3D animation via minimal supervision and maximal controllability.


AnimaX represents a significant advancement in neural 3D animation, bridging video-based motion priors and the skeletal domain with an efficient, modular diffusion-based approach (2506.19851). Its scalability and adaptability make it a compelling foundation for next-generation content creation tools across entertainment, gaming, and digital reality applications.

References (1)
  • AnimaX (arXiv:2506.19851)