
AnimaX: Animating the Inanimate in 3D with Joint Video-Pose Diffusion Models (2506.19851v1)

Published 24 Jun 2025 in cs.CV

Abstract: We present AnimaX, a feed-forward 3D animation framework that bridges the motion priors of video diffusion models with the controllable structure of skeleton-based animation. Traditional motion synthesis methods are either restricted to fixed skeletal topologies or require costly optimization in high-dimensional deformation spaces. In contrast, AnimaX effectively transfers video-based motion knowledge to the 3D domain, supporting diverse articulated meshes with arbitrary skeletons. Our method represents 3D motion as multi-view, multi-frame 2D pose maps, and enables joint video-pose diffusion conditioned on template renderings and a textual motion prompt. We introduce shared positional encodings and modality-aware embeddings to ensure spatial-temporal alignment between video and pose sequences, effectively transferring video priors to the motion generation task. The resulting multi-view pose sequences are triangulated into 3D joint positions and converted into mesh animation via inverse kinematics. Trained on a newly curated dataset of 160,000 rigged sequences, AnimaX achieves state-of-the-art results on VBench in generalization, motion fidelity, and efficiency, offering a scalable solution for category-agnostic 3D animation. Project page: https://anima-x.github.io/.

Summary

  • The paper introduces a novel feed-forward framework for animating arbitrary 3D meshes using joint video and pose diffusion, bypassing expensive high-dimensional optimization.
  • It employs multi-view, multi-frame 2D pose maps with modality-aware embeddings to enforce spatial-temporal alignment and enable robust 3D motion synthesis.
  • Empirical results demonstrate superior generalization and efficiency, with animations completing in roughly 6 minutes versus up to 25 hours for optimization-based techniques.

AnimaX: Feed-Forward 3D Animation via Joint Video-Pose Diffusion

AnimaX introduces a feed-forward framework for animating arbitrary 3D articulated meshes by leveraging the motion priors of large-scale video diffusion models and the controllability of skeleton-based animation. The method addresses the limitations of prior approaches, which are either restricted to fixed skeletal topologies or require computationally expensive optimization in high-dimensional deformation spaces. AnimaX achieves efficient, category-agnostic 3D animation by representing motion as multi-view, multi-frame 2D pose maps and employing a joint video-pose diffusion model conditioned on template renderings and textual motion prompts.

Methodological Overview

The core innovation of AnimaX is the joint modeling of RGB video and pose sequences within a unified diffusion framework. The system operates in two main stages:

  1. Joint Video-Pose Generation:
    • Given an articulated mesh and a textual motion description, AnimaX renders multi-view template images and pose maps.
    • A joint video-pose diffusion model, initialized from a pre-trained video latent diffusion backbone, is conditioned on these templates and the text prompt.
    • The model simultaneously generates multi-view RGB videos and corresponding pose sequences, ensuring spatial-temporal alignment via shared positional encodings and modality-aware embeddings.
  2. 3D Motion Reconstruction:
    • 2D joint positions are extracted from the generated pose maps using color clustering.
    • Multi-view triangulation recovers 3D joint positions, followed by inverse kinematics to estimate joint angles and animate the mesh (see the triangulation sketch below).

This approach enables efficient, feed-forward animation synthesis, circumventing the need for iterative optimization or reliance on fixed skeletons.
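
To make stage 2 concrete, the sketch below shows the standard direct linear transform (DLT) for triangulating one joint from its 2D detections in multiple calibrated views. The paper does not publish its exact solver, so the function name and interface here are illustrative; the color-clustering and inverse-kinematics steps are omitted.

```python
import numpy as np

def triangulate_joint(proj_mats, uv_list):
    """DLT triangulation of a single joint from multiple views.

    proj_mats : list of 3x4 camera projection matrices, one per view.
    uv_list   : list of (u, v) 2D joint positions in the same views.
    Returns the 3D joint position as a length-3 array.
    """
    rows = []
    for P, (u, v) in zip(proj_mats, uv_list):
        # Each view contributes two linear constraints on the homogeneous point X:
        #   u * (P[2] @ X) = P[0] @ X   and   v * (P[2] @ X) = P[1] @ X
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # Least-squares solution: right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

Repeating this per joint and per frame yields the 3D joint trajectories that the inverse-kinematics step then converts into joint angles.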

Architectural Details

  • Pose Representation:

3D motion is encoded as multi-view, multi-frame 2D pose maps, where each joint is projected onto the image plane with a unique color encoding. This facilitates accurate localization and subsequent 3D reconstruction.
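
As an illustration, the sketch below rasterizes a set of 3D joints into a single-view pose map. The hue-per-joint palette and disk rasterization are assumptions for this example; the paper only specifies that each joint receives a unique color.

```python
import colorsys
import numpy as np

def render_pose_map(joints_3d, K, w2c, height=512, width=512, radius=4):
    """Project 3D joints into a color-coded 2D pose map (illustrative).

    joints_3d : (J, 3) joint positions in world coordinates.
    K         : (3, 3) camera intrinsics.
    w2c       : (3, 4) world-to-camera extrinsics [R | t].
    """
    num_joints = len(joints_3d)
    pose_map = np.zeros((height, width, 3), dtype=np.float32)
    # World -> camera -> pixel coordinates (pinhole projection).
    pts_h = np.concatenate([joints_3d, np.ones((num_joints, 1))], axis=1)
    cam = pts_h @ w2c.T
    uv = (cam @ K.T)[:, :2] / cam[:, 2:3]
    for j, (u, v) in enumerate(uv):
        # One distinct hue per joint, so 2D positions can later be
        # recovered by color clustering.
        color = colorsys.hsv_to_rgb(j / num_joints, 1.0, 1.0)
        u, v = int(round(u)), int(round(v))
        if 0 <= u < width and 0 <= v < height:
            pose_map[max(v - radius, 0):v + radius,
                     max(u - radius, 0):u + radius] = color
    return pose_map
```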

  • Joint Diffusion Model:

The model concatenates RGB and pose latent tokens along the temporal axis, allowing 3D self-attention layers to operate jointly across modalities. Modality-specific embeddings and shared positional encodings enforce alignment between video and pose streams, enabling effective transfer of motion priors.
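
A minimal sketch of this token layout, assuming per-frame latents of shape (B, T, N, C); the function and tensor names are hypothetical and only illustrate how a shared positional encoding plus modality embeddings precede the temporal concatenation:

```python
import torch

def build_joint_tokens(video_lat, pose_lat, pos_emb, modality_emb):
    """Assemble the joint video-pose token sequence (illustrative sketch).

    video_lat, pose_lat : (B, T, N, C) latent tokens (T frames, N spatial tokens).
    pos_emb             : (T, N, C) spatial-temporal positional encoding, shared by
                          both modalities so pose frame t aligns with video frame t.
    modality_emb        : (2, C) learned embeddings marking a token's modality.
    """
    video_tok = video_lat + pos_emb + modality_emb[0]
    pose_tok = pose_lat + pos_emb + modality_emb[1]
    # Concatenate along the temporal axis so the backbone's 3D self-attention
    # operates jointly across RGB and pose tokens.
    tokens = torch.cat([video_tok, pose_tok], dim=1)  # (B, 2T, N, C)
    return tokens.flatten(1, 2)                       # (B, 2T*N, C)
```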

  • Multi-View Consistency:

Camera poses are encoded using Plücker ray maps, concatenated with latent representations. Multi-view attention layers aggregate information across views, promoting spatial consistency in the generated outputs.
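
The Plücker parameterization itself is standard: each pixel's ray is stored as its direction d together with the moment o × d, where o is the camera origin. A sketch under a pinhole-camera assumption (function name hypothetical):

```python
import torch

def plucker_ray_map(K, c2w, height, width):
    """Per-pixel Plücker ray map for one camera; returns a (6, H, W) tensor."""
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32),
        indexing="ij",
    )
    # Pixel centers -> camera-space ray directions (pinhole model).
    dirs_cam = torch.stack(
        [(xs + 0.5 - K[0, 2]) / K[0, 0],
         (ys + 0.5 - K[1, 2]) / K[1, 1],
         torch.ones_like(xs)],
        dim=-1,
    )
    # Rotate into world space and normalize.
    dirs = dirs_cam @ c2w[:3, :3].T
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand(height, width, 3)
    moment = torch.cross(origin, dirs, dim=-1)  # o x d
    return torch.cat([dirs, moment], dim=-1).permute(2, 0, 1)
```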

  • Training Regime:

The model is trained on a curated dataset of 161,023 rigged animation clips spanning diverse categories (humanoids, animals, articulated objects). A two-stage training strategy is employed: initial single-view fine-tuning with LoRA, followed by multi-view extension with frozen backbone weights.
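
For the first stage, LoRA keeps the pretrained video backbone frozen and learns only low-rank residual updates. A generic sketch of such an adapter (rank and scaling are illustrative, not the paper's reported hyperparameters):

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank residual."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # backbone weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)          # residual starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```

In the second stage, the backbone would stay frozen while the newly added multi-view components are trained, matching the two-stage strategy described above.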

Empirical Results

AnimaX demonstrates strong performance on the VBench benchmark, outperforming prior methods such as Animate3D and MotionDreamer in generalization, motion fidelity, and efficiency. Key quantitative results include:

| Method | I2V Subject | Smoothness | Dynamic Degree | Quality |
|---|---|---|---|---|
| Animate3D | 0.943 | 0.986 | 0.446 | 0.481 |
| MotionDreamer | 0.817 | 0.977 | 0.827 | 0.439 |
| AnimaX | 0.962 | 0.990 | 0.661 | 0.517 |

User studies further corroborate the superiority of AnimaX in motion-text alignment, shape consistency, and overall motion quality, with preference rates exceeding 70% across all metrics.

Efficiency and Scalability

A notable advantage of AnimaX is its runtime efficiency. The entire animation pipeline, including multi-view video-pose generation and 3D reconstruction, completes in approximately 6 minutes per sequence. This is a substantial improvement over optimization-based methods such as AKD, which require up to 25 hours per animation. The feed-forward design enables practical deployment in interactive or large-scale content creation scenarios.

Ablation and Design Analysis

Ablation studies highlight the importance of joint video-pose modeling and shared positional encodings. Models trained to generate pose sequences alone, or without shared positional encodings, exhibit degraded spatial alignment and motion quality. The full joint model achieves the best quantitative and qualitative results, confirming the efficacy of the architectural choices.

Limitations and Future Directions

Current limitations include:

  • Fixed Camera Viewpoints: The model is trained and evaluated on a fixed set of camera views, limiting its ability to handle large spatial motions or dynamic camera trajectories. Extending the dataset and architecture to support arbitrary camera paths is a promising direction.
  • Video Length Constraints: The maximum sequence length is inherited from the video diffusion backbone, restricting the generation of long-form animations. Incorporating autoregressive denoising or test-time training could address this limitation.

Implications and Prospects

AnimaX represents a significant step toward scalable, category-agnostic 3D animation driven by text and visual priors. By bridging the gap between video-based motion knowledge and structured skeleton-based animation, it enables efficient synthesis of diverse, high-fidelity animations for arbitrary articulated meshes. The curated dataset and architectural innovations provide a foundation for further research in controllable 3D content generation, with potential applications in virtual production, gaming, and digital avatars.

Future work may explore:

  • Integration with dynamic camera control and scene interaction.
  • Extension to longer and more complex animation sequences.
  • Incorporation of additional modalities (e.g., audio-driven animation).
  • Real-time or on-device deployment for interactive applications.

AnimaX establishes a practical paradigm for leveraging large-scale generative models in 3D animation, offering both theoretical insights and concrete tools for the community.