JoyVASA: Diffusion Audio-Driven Animation
- JoyVASA is a diffusion-based system that decouples static facial identity from dynamic, audio-driven motion, enabling cross-domain portrait animation.
- It utilizes a two-stage pipeline that first extracts static and motion features using LivePortrait and then maps audio features to motion sequences via a diffusion transformer.
- Evaluations show high temporal coherence, competitive image and video quality (IQA/VQA), and strong lip-sync accuracy, with future work aimed at addressing challenges in extreme poses and cross-identity retargeting.
JoyVASA is a diffusion-based system for audio-driven animation of facial dynamics and head motion, designed to support both human portraits and animal images. It employs a two-stage pipeline that disentangles static facial identity from dynamic motion, enabling identity-agnostic, temporally coherent generation of expressive talking-head videos. By integrating robust representations with diffusion transformers, JoyVASA advances the state of the art in high-fidelity, cross-domain portrait animation while addressing challenges of video length, inter-frame continuity, and generalization to non-human faces (Cao et al., 2024).
1. Architecture and Methodology
JoyVASA utilizes a two-stage pipeline comprising decoupled facial representation and audio-driven motion generation:
- Stage I – Decoupled Facial Representation: Leveraging the LivePortrait framework, a talking-face video is decomposed into:
- A static 3D identity embedding that encodes appearance features.
- Dynamic motion features representing head pose and facial expressions. This separation permits reuse of any static portrait with synthesized or real motion sequences, supports long-duration outputs, and enables cross-identity retargeting.
- Stage II – Audio-Driven Motion Generation: A diffusion transformer is trained to map audio features (from a frozen wav2vec2 encoder) to canonical-space motion sequences. Motion generation is strictly identity-agnostic, enabling seamless animation across humans and animals; a minimal sketch of the two-stage flow follows this list.
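The sketch below summarizes the two-stage flow structurally. It is a minimal illustration only: the component names (`extract_appearance`, `sample_motion`, `render_frame`, etc.) are placeholders for the frozen LivePortrait encoders, the wav2vec2 audio encoder, the diffusion sampler, and the warping/generator renderer, not the released JoyVASA API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class TwoStagePipeline:
    """Sketch of the decoupled JoyVASA-style flow; all callables are placeholders."""
    extract_appearance: Callable[[Any], Any]       # Stage I: static 3D identity embedding
    extract_motion: Callable[[Any], Any]           # Stage I: pose/expression of the reference frame
    encode_audio: Callable[[Any], Any]             # frozen wav2vec2 features
    sample_motion: Callable[[Any], Sequence[Any]]  # Stage II: diffusion transformer sampler
    render_frame: Callable[[Any, Any, Any], Any]   # warp appearance and decode one frame

    def animate(self, reference_image: Any, audio_waveform: Any) -> list:
        appearance = self.extract_appearance(reference_image)
        source_motion = self.extract_motion(reference_image)
        audio_features = self.encode_audio(audio_waveform)
        # Identity-agnostic motion in canonical space, driven purely by audio.
        motion_sequence = self.sample_motion(audio_features)
        return [self.render_frame(appearance, source_motion, m) for m in motion_sequence]
```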
2. Mathematical Foundations
The decoupled approach is formalized as follows:
Given a canonical 3D keypoint set $x_c$ and per-frame motion parameters (scale $s$, rotation $R$, expression deformation $\delta$, and translation $t$), the deformation operator defines the transformation:

$$x_d = s \,(x_c R + \delta) + t.$$
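As a concrete check of the deformation, a minimal NumPy implementation under the notation above (row-vector keypoints, right-multiplied rotation) could look like this:

```python
import numpy as np

def deform_keypoints(x_c, s, R, delta, t):
    """Apply x_d = s * (x_c @ R + delta) + t to canonical keypoints.

    x_c   : (K, 3) canonical 3D keypoints
    s     : scalar scale
    R     : (3, 3) rotation matrix (head pose)
    delta : (K, 3) expression deformation
    t     : (3,)   translation
    """
    return s * (x_c @ R + delta) + t

# Identity rotation and zero expression simply rescale and shift the canonical face.
x_c = np.random.randn(21, 3)
x_d = deform_keypoints(x_c, s=1.1, R=np.eye(3), delta=np.zeros_like(x_c), t=np.array([0.0, 0.0, 0.1]))
```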
The appearance and motion encoders extract features as:

$$f_{\text{app}} = E_{\text{app}}(I_{\text{ref}}), \qquad m_i = E_{\text{mot}}(I_i),$$

where $I_{\text{ref}}$ is the reference portrait and $I_i$ is the $i$-th frame of the driving video.
Motion generation adopts a Denoising Diffusion Probabilistic Model (DDPM) with noise schedule $\{\beta_t\}_{t=1}^{T}$ (a minimal noising sketch follows the two items below):
- Forward diffusion: $q(X_t \mid X_{t-1}) = \mathcal{N}\!\big(X_t;\ \sqrt{1-\beta_t}\,X_{t-1},\ \beta_t I\big)$
- Reverse process: $p_\theta(X_{t-1} \mid X_t) = \mathcal{N}\!\big(X_{t-1};\ \mu_\theta(X_t, t),\ \Sigma_\theta(X_t, t)\big)$, where $X_0$ denotes the clean motion sequence.
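For concreteness, the forward-noising step can be written with the standard closed-form marginal $q(X_t \mid X_0)$; the sketch below uses generic DDPM conventions and assumed schedule endpoints, not JoyVASA's exact code.

```python
import torch

def make_linear_schedule(T: int, beta_start: float = 1e-4, beta_end: float = 0.02):
    """Linear beta schedule; the endpoint values are common DDPM defaults, assumed here."""
    betas = torch.linspace(beta_start, beta_end, T)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    return betas, alphas_cumprod

def forward_diffuse(x0: torch.Tensor, t: torch.Tensor, alphas_cumprod: torch.Tensor):
    """Sample X_t ~ q(X_t | X_0) = N(sqrt(a_bar_t) X_0, (1 - a_bar_t) I) in closed form."""
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise, noise
```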
3. Diffusion Transformer Design and Losses
The diffusion transformer backbone is a 6-layer Transformer decoder with 8 attention heads. At each diffusion step $t$, its inputs are the current noisy motion window $X_t$, the preceding clean motion window (temporal context), and the aligned audio features, combined with sinusoidal positional encoding for temporal localization.
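A minimal PyTorch skeleton of such a decoder is given below; the model dimension, the way the time step and past-motion context are injected, and the omission of sinusoidal positional encoding are simplifications for illustration, not the released configuration.

```python
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """6-layer Transformer decoder: motion tokens (past context + current noisy window)
    attend to audio features; the clean motion for the current window is predicted."""

    def __init__(self, motion_dim: int, audio_dim: int, d_model: int = 512,
                 n_heads: int = 8, n_layers: int = 6):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, d_model)
        self.audio_in = nn.Linear(audio_dim, d_model)
        self.time_embed = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.motion_out = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, past_motion, audio_feats, t):
        # Motion tokens: past clean context followed by the current noisy window.
        tokens = self.motion_in(torch.cat([past_motion, noisy_motion], dim=1))
        tokens = tokens + self.time_embed(t.float().view(-1, 1, 1))   # diffusion-step embedding
        memory = self.audio_in(audio_feats)                           # audio as cross-attention memory
        hidden = self.decoder(tgt=tokens, memory=memory)
        return self.motion_out(hidden[:, past_motion.shape[1]:])      # clean motion for current window
```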
The denoising network $f_\theta$ predicts the clean motion sequence directly:

$$\hat{X}_0 = f_\theta(X_t, t, a, X_{\text{prev}}),$$

where $a$ denotes the audio features and $X_{\text{prev}}$ the preceding clean motion window.
Classifier-Free Guidance (CFG) is employed at sampling time:

$$\hat{X}_0 = f_\theta(X_t, t, \varnothing, X_{\text{prev}}) + w\,\big[f_\theta(X_t, t, a, X_{\text{prev}}) - f_\theta(X_t, t, \varnothing, X_{\text{prev}})\big],$$

where $\varnothing$ denotes the dropped audio condition (applied with 10% probability during training) and $w$ is the guidance scale.
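At sampling time the guidance amounts to blending the conditional and audio-dropped predictions. The sketch below assumes the dropped condition is realized by zeroing the audio features and uses a placeholder guidance scale:

```python
import torch

def cfg_denoise(model, noisy_motion, past_motion, audio_feats, t, guidance_scale: float = 2.0):
    """Classifier-free guidance: uncond + w * (cond - uncond).
    Zeroing the audio mirrors the 10% audio-dropout used during training (an assumption)."""
    cond = model(noisy_motion, past_motion, audio_feats, t)
    uncond = model(noisy_motion, past_motion, torch.zeros_like(audio_feats), t)
    return uncond + guidance_scale * (cond - uncond)
```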
The aggregate loss function is:

$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_{\text{vel}}\,\mathcal{L}_{\text{vel}} + \lambda_{\text{smooth}}\,\mathcal{L}_{\text{smooth}} + \lambda_{\text{mouth}}\,\mathcal{L}_{\text{mouth}},$$

with weighting coefficients $\lambda_{\text{vel}}$, $\lambda_{\text{smooth}}$, and $\lambda_{\text{mouth}}$. The individual terms, illustrated in the sketch after this list, are:
- $\mathcal{L}_{\text{rec}}$ (reconstruction): error between the predicted and ground-truth motion sequences.
- $\mathcal{L}_{\text{vel}}$ (velocity): error between the frame-to-frame differences of predicted and ground-truth motion.
- $\mathcal{L}_{\text{smooth}}$ (smoothness): penalizes abrupt changes across adjacent predicted frames.
- $\mathcal{L}_{\text{mouth}}$ (mouth expression): additional emphasis on mouth-related expression components to sharpen lip synchronization.
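The sketch below shows how such terms could be assembled on a (batch, frames, motion-dim) tensor; the exact formulas, weights, and mouth-dimension indices are not reproduced from the paper and should be read as placeholders.

```python
import torch
import torch.nn.functional as F

def motion_losses(pred, target, mouth_idx, w_vel=1.0, w_smooth=1.0, w_mouth=1.0):
    """Illustrative combination of the four loss terms; weights and mouth_idx are placeholders."""
    rec = F.mse_loss(pred, target)                                            # reconstruction
    vel = F.mse_loss(pred[:, 1:] - pred[:, :-1],
                     target[:, 1:] - target[:, :-1])                          # velocity (frame differences)
    smooth = (pred[:, 2:] - 2 * pred[:, 1:-1] + pred[:, :-2]).pow(2).mean()   # discourage jitter
    mouth = F.mse_loss(pred[..., mouth_idx], target[..., mouth_idx])          # mouth-expression emphasis
    return rec + w_vel * vel + w_smooth * smooth + w_mouth * mouth
```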
4. Rendering Process
After sampling the motion sequence, target keypoints are computed via the deformation defined in Section 2. The appearance feature $f_{\text{app}}$ is warped from the source keypoint configuration $x_s$ to each new target configuration $x_d$, and a U-Net–style generator $G$ synthesizes the output frames:

$$\hat{I} = G\big(\mathcal{W}(f_{\text{app}};\ x_s \rightarrow x_d)\big),$$

where $\mathcal{W}$ denotes the warping operation.
While optional pixel-space reconstruction and perceptual losses are available, the principal training signal derives from feature-space reconstruction via the LivePortrait decoder.
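Structurally, rendering reduces to warping the static appearance feature to each generated keypoint configuration and decoding a frame; the sketch below treats the warping operator and generator as opaque LivePortrait components with placeholder signatures.

```python
from typing import Any, Callable, Sequence

def render_sequence(appearance: Any, source_kp: Any, target_kps: Sequence[Any],
                    warp: Callable, generator: Callable) -> list:
    """Warp f_app from the source configuration to each target configuration, then decode."""
    frames = []
    for kp in target_kps:
        warped = warp(appearance, source_kp, kp)  # feature-space motion transfer
        frames.append(generator(warped))          # U-Net-style frame synthesis
    return frames
```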
5. Data, Training Regime, and Preprocessing
JoyVASA is trained on a hybrid dataset of 5,578 video clips (ranging from 8 seconds to several minutes each), drawn from:
- HDTF (public)
- CelebV-HQ (public)
- JD Health proprietary Chinese data
Preprocessing involves QAlign for video quality filtering, SyncNet for lip-sync alignment, and oversampling to balance dataset contributions. Audio features are extracted with a frozen wav2vec2 model, while motion features are obtained via a frozen LivePortrait encoder.
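Extracting audio features with a frozen wav2vec2 encoder can be sketched with the Hugging Face transformers API; the checkpoint name below is illustrative, and the exact wav2vec2 variant used by JoyVASA may differ.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

NAME = "facebook/wav2vec2-base-960h"  # illustrative checkpoint, not necessarily the paper's
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(NAME)
encoder = Wav2Vec2Model.from_pretrained(NAME).eval()

def extract_audio_features(waveform_16k):
    """waveform_16k: 1-D float array sampled at 16 kHz; returns a (T, C) feature tensor."""
    inputs = feature_extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():  # the audio encoder stays frozen
        hidden = encoder(inputs.input_values).last_hidden_state
    return hidden.squeeze(0)
```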
Optimization employs Adam with a batch size of 16 over 20,000 total training steps, and the diffusion noise schedule $\beta_t$ follows a linear progression. Motion is generated over fixed-length windows of current and preceding frames.
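Putting the pieces together, a single training step could look like the sketch below, reusing the MotionDenoiser, make_linear_schedule, forward_diffuse, and motion_losses sketches above; the learning rate, dimensions, and number of diffusion steps are placeholders.

```python
import torch

model = MotionDenoiser(motion_dim=63, audio_dim=768)        # dimensions are illustrative
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # placeholder learning rate
T = 1000                                                    # assumed number of diffusion steps
_, alphas_cumprod = make_linear_schedule(T)

def train_step(clean_motion, past_motion, audio_feats, mouth_idx):
    """One denoising-training step on a (B, N, D) motion window."""
    t = torch.randint(0, T, (clean_motion.shape[0],))
    noisy, _ = forward_diffuse(clean_motion, t, alphas_cumprod)
    pred = model(noisy, past_motion, audio_feats, t)         # predict the clean motion
    loss = motion_losses(pred, clean_motion, mouth_idx)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```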
6. Evaluation and Comparative Results
Performance is assessed on the CelebV-HQ test set (50 subjects, 5–15 seconds each) using the following metrics:
| Metric | JoyVASA | AniPortrait | Notes |
|---|---|---|---|
| IQA (%) ↑ | 68.97 | 74.85 | Image Quality |
| VQA (%) ↑ | 72.42 | 78.00 | Video Quality |
| Sync-C ↑ | 4.85 | 1.98 | Lip-Sync Confidence |
| Sync-D ↓ | 13.53 | 13.28 | Lip-Sync Distance |
| FVD-25 ↓ | 459.04 (best) | — | Fréchet Video Distance |
| Smooth (%) ↑ | 99.60 (2nd best) | — | Motion Smoothness |
On an open-set benchmark of 50 arbitrary image/audio pairs, JoyVASA attains IQA 71.45, VQA 77.78, Sync-C 5.72, Sync-D 14.01, and Smooth 99.48.
Qualitative evaluation demonstrates strong temporal coherence and expressive head motion; the model animates humans, cartoons, artwork, and animal faces without retraining.
7. Limitations and Prospects
Current limitations stem from Stage I: the LivePortrait representation may be suboptimal under large pose variations, and the system lacks a cross-identity retargeting module for audio-only inputs. Future research aims to incorporate more robust disentangled facial models (e.g., EMOPortrait), improve real-time inference speed via model pruning or distillation, and provide finer-grained expression control through emotion codes or user-editable parameters. These directions are anticipated to broaden JoyVASA's applicability across diverse animation scenarios (Cao et al., 2024).