
UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer (2504.11289v1)

Published 15 Apr 2025 in cs.CV

Abstract: This report presents UniAnimate-DiT, an advanced project that leverages the cutting-edge and powerful capabilities of the open-source Wan2.1 model for consistent human image animation. Specifically, to preserve the robust generative capabilities of the original Wan2.1 model, we implement the Low-Rank Adaptation (LoRA) technique to fine-tune a minimal set of parameters, significantly reducing training memory overhead. A lightweight pose encoder consisting of multiple stacked 3D convolutional layers is designed to encode the motion information of driving poses. Furthermore, we adopt a simple concatenation operation to integrate the reference appearance into the model and incorporate the pose information of the reference image for enhanced pose alignment. Experimental results show that our approach achieves visually appealing and temporally consistent high-fidelity animations. Trained on 480p (832x480) videos, UniAnimate-DiT demonstrates strong generalization capabilities, seamlessly upscaling to 720p (1280x720) during inference. The training and inference code is publicly available at https://github.com/ali-vilab/UniAnimate-DiT.

Overview of UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer

The paper introduces UniAnimate-DiT, a system for human image animation built on the open-source Wan2.1 video Diffusion Transformer (DiT). The goal is to improve the temporal coherence and visual quality of videos generated from a static human image and a sequence of driving poses, a task that has advanced rapidly with the adoption of diffusion models.

Methodology

UniAnimate-DiT adapts the Wan2.1 model with Low-Rank Adaptation (LoRA): the pretrained weights are kept frozen and only a small set of added low-rank matrices is trained. This preserves the original generative capabilities of Wan2.1 while substantially reducing training memory overhead, allowing the model to be fine-tuned efficiently without compromising its performance.
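
The report does not spell out how LoRA is wired into the DiT blocks; as a rough, hedged illustration of the general technique (the rank, scaling factor, and choice of which layers to wrap are assumptions, not values from the paper), a minimal PyTorch LoRA wrapper might look like this:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update: W x + (alpha/r) * B(A(x))."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # down-projection A
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # up-projection B
        nn.init.zeros_(self.lora_b.weight)        # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


# Example: adapt a single attention projection; only the LoRA factors receive gradients.
proj = LoRALinear(nn.Linear(1024, 1024), rank=16)
out = proj(torch.randn(2, 77, 1024))
```

In practice such wrappers would be applied to selected projection layers of the DiT blocks, so the number of trainable parameters stays a small fraction of the full model.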

The framework comprises several core components:

  1. Video Diffusion Transformer (Wan2.1): Serves as the foundation for robust video generation.
  2. LoRA Fine-tuning: Reduces training cost and memory by freezing the pretrained weights and training only a small set of added low-rank parameters.
  3. Pose Encoder: A lightweight module built from stacked 3D convolutional layers that encodes the motion information of the driving poses.
  4. Reference-Pose Encoder: Integrates the reference appearance through a simple concatenation operation and incorporates the reference image's pose for improved pose alignment (a minimal sketch follows this list).
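
The pose encoder is described only at this level of detail; the following is a minimal sketch under those assumptions, with channel counts, strides, the number of blocks, and the way pose features are injected into the DiT chosen purely for illustration rather than taken from the released code:

```python
import torch
import torch.nn as nn

class PoseEncoder(nn.Module):
    """Stacked 3D convolutions mapping driving-pose frames to a latent-sized feature volume."""

    def __init__(self, in_channels: int = 3, hidden: int = 64, latent_channels: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            # each block halves the spatial resolution; temporal length is preserved (illustrative)
            nn.Conv3d(in_channels, hidden, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, hidden, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, latent_channels, kernel_size=3, stride=(1, 2, 2), padding=1),
        )

    def forward(self, poses: torch.Tensor) -> torch.Tensor:
        # poses: (batch, channels, frames, height, width)
        return self.net(poses)


pose_enc = PoseEncoder()
driving_poses = torch.randn(1, 3, 16, 480, 832)       # 16 driving-pose frames at 480p
ref_pose = torch.randn(1, 3, 1, 480, 832)             # pose of the reference image
pose_feat = pose_enc(torch.cat([ref_pose, driving_poses], dim=2))  # prepend the reference pose
video_latents = torch.randn(1, 16, 17, 60, 104)       # placeholder latents matching pose_feat;
dit_input = video_latents + pose_feat                 # additive injection is one simple option;
                                                      # the actual fusion in Wan2.1/UniAnimate-DiT may differ
```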

The approach maintains visual appearance and temporal consistency, producing high-fidelity animations despite being trained only on 480p (832x480) videos. Notably, the model generalizes to 720p (1280x720) at inference time without additional training, highlighting its adaptability and scalability in practical applications.

Experimental Results

UniAnimate-DiT was trained on approximately 10K human dance videos spanning diverse movements and lighting conditions. Qualitative evaluations underscore the efficacy of the system, showing lifelike, continuous animations with high temporal and spatial fidelity.

These results have several implications. The ability to train at 480p and infer at 720p without retraining supports applications that require higher-resolution outputs. More broadly, pairing a large pretrained video generative model with a lightweight adaptation strategy shows how a reference image and driving poses can be turned into complex, coherent video at modest training cost.

Discussion and Future Directions

The research sets a benchmark for future explorations in image-to-video synthesis, particularly in enhancing animation tasks involving human figures. The focus on temporal consistency and memory-efficient training embodies a significant advancement in the diffusion model landscape.

Potential areas for future exploration include extending the model to support longer video synthesis through strategies such as the overlapped sliding window approach. Moreover, further refinement of pose encoding and integration techniques could yield even more robust results, pushing the boundaries of seamless character animation.
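
As a hedged illustration of the overlapped sliding-window idea (not code from the released repository), long videos could be produced in fixed-length chunks whose overlapping frames are blended to keep transitions smooth; the window length, overlap, and blending rule below are assumptions chosen for clarity:

```python
import numpy as np

def sliding_window_generate(generate, total_frames: int, window: int = 16, overlap: int = 4):
    """Generate a long video chunk by chunk, blending frames in the overlapped region."""
    frames = [None] * total_frames
    start = 0
    while start < total_frames:
        end = min(start + window, total_frames)
        chunk = generate(start, end)                 # e.g. run the diffusion model on this span
        for i, frame in enumerate(chunk):
            idx = start + i
            if frames[idx] is None:
                frames[idx] = frame
            else:                                    # overlapped frame: blend previous and new
                w = (i + 1) / (overlap + 1)          # weight ramps toward the newer chunk
                frames[idx] = (1 - w) * frames[idx] + w * frame
        if end == total_frames:
            break
        start = end - overlap                        # next window re-generates the overlap
    return frames


# Toy usage with a dummy "generator" that returns constant frames per chunk.
dummy = lambda s, e: [np.full((4, 4, 3), float(s), dtype=np.float32) for _ in range(e - s)]
video = sliding_window_generate(dummy, total_frames=40)
```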

Overall, UniAnimate-DiT represents a valuable contribution to the field of generative video modeling, providing a versatile framework applicable across various domains in AI-driven media synthesis. As the field progresses, integrating such sophisticated techniques with broader AI applications promises to unlock new potentials in computer vision, entertainment, and beyond.

Authors (7)
  1. Xiang Wang (279 papers)
  2. Shiwei Zhang (179 papers)
  3. Longxiang Tang (22 papers)
  4. Yingya Zhang (43 papers)
  5. Changxin Gao (76 papers)
  6. Yuehuan Wang (7 papers)
  7. Nong Sang (86 papers)