JoyVASA: Diffusion-Based Audio-Driven Animation

Updated 30 January 2026
  • JoyVASA is a diffusion-based system that decouples static facial identity from dynamic, audio-driven motion, enabling cross-domain portrait animation.
  • It utilizes a two-stage pipeline that first extracts static and motion features using LivePortrait and then maps audio features to motion sequences via a diffusion transformer.
  • Evaluations show high temporal coherence, competitive IQA and VQA scores, and markedly better lip-sync accuracy, with future work aimed at addressing challenges in extreme poses and cross-identity retargeting.

JoyVASA is a diffusion-based system for audio-driven animation of facial dynamics and head motion, designed to support both human portraits and animal images. It employs a two-stage pipeline that disentangles static facial identity from dynamic motion, enabling identity-agnostic, temporally coherent generation of expressive talking-head videos. By integrating robust representations with diffusion transformers, JoyVASA advances the state of the art in high-fidelity, cross-domain portrait animation while addressing challenges of video length, inter-frame continuity, and generalization to non-human faces (Cao et al., 2024).

1. Architecture and Methodology

JoyVASA utilizes a two-stage pipeline comprising decoupled facial representation and audio-driven motion generation:

  1. Stage I – Decoupled Facial Representation: Leveraging the LivePortrait framework, a talking-face video is decomposed into:
    • A static 3D identity embedding $f_{\text{face}}$ that encodes appearance features.
    • Dynamic motion features $X = \{R, t, \delta, s\}$, representing head pose and facial expressions. This separation permits re-use of any static portrait with synthesized or real motion sequences, supports long-duration outputs, and enables cross-identity retargeting.
  2. Stage II – Audio-Driven Motion Generation: A diffusion transformer is trained to map audio features $\{f_{\text{audio}}^i\}_{i=1}^N$ (from a frozen wav2vec2 encoder) to canonical-space motion sequences $\{X_i\}_{i=1}^N$. Motion generation is strictly identity-agnostic, enabling seamless animation across humans and animals; a structural sketch of the full pipeline follows below.
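
The division of labor between the two stages can be made concrete with a short structural sketch. The following Python outline is illustrative only: names such as `MotionFrame`, `stage1_decompose`, and `stage2_generate_motion` are placeholders and not part of the released JoyVASA code.

```python
# Minimal structural sketch of the two-stage pipeline; class and function
# names are illustrative placeholders, not the released JoyVASA API.
from dataclasses import dataclass
import numpy as np


@dataclass
class MotionFrame:
    """Per-frame motion parameters produced by Stage I."""
    s: float           # scale
    R: np.ndarray      # (3, 3) head rotation
    t: np.ndarray      # (3,) translation
    delta: np.ndarray  # (K, 3) expression deformation of canonical keypoints


def stage1_decompose(frame_image, appearance_encoder, motion_encoder):
    """Stage I: split a frame into a static identity embedding and dynamic motion."""
    f_face = appearance_encoder(frame_image)       # static 3D appearance feature
    s, R, t, delta = motion_encoder(frame_image)   # head pose + expression parameters
    return f_face, MotionFrame(s, R, t, delta)


def stage2_generate_motion(audio_features, past_motion, motion_diffusion):
    """Stage II: map wav2vec2 audio features to canonical-space motion, identity-free."""
    return motion_diffusion.sample(audio_features, past_motion)
```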

2. Mathematical Foundations

The decoupled approach is formalized as follows:

Given a canonical 3D keypoint set $x_c \in \mathbb{R}^{K \times 3}$ and per-frame motion parameters $s \in \mathbb{R}$, $R \in SO(3)$, $t \in \mathbb{R}^3$, and $\delta \in \mathbb{R}^{K \times 3}$, the deformation operator $T$ defines the transformation:

$$x_s = T(x_c, s_s, R_s, t_s, \delta_s) = s_s \bigl( x_c R_s + \delta_s \bigr) + t_s$$
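
In code, the deformation operator is a one-line transform over the canonical keypoints. The sketch below assumes NumPy arrays with the shapes given above; the keypoint count `K = 21` is used only as an example size.

```python
import numpy as np

def deform_keypoints(x_c: np.ndarray, s: float, R: np.ndarray,
                     t: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """x_s = s * (x_c @ R + delta) + t for canonical keypoints x_c of shape (K, 3)."""
    return s * (x_c @ R + delta) + t

# Example: unit scale, identity rotation, small expression offset.
K = 21
x_c = np.random.randn(K, 3)
x_s = deform_keypoints(x_c, s=1.0, R=np.eye(3),
                       t=np.zeros(3), delta=0.01 * np.random.randn(K, 3))
```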

The appearance and motion encoders extract features as:

$$f_{\text{face}} = E_{\text{app}}(I_s), \qquad [s_i, R_i, t_i, \delta_i] = E_{\text{mnt}}(I_i)$$

Motion generation adopts a Denoising Diffusion Probabilistic Model (DDPM) with noise schedule $\{\beta_t\}_{t=1}^T$:

  • Forward diffusion:

$$q(X^t \mid X^{t-1}) = \mathcal{N}\bigl(X^t;\ \sqrt{1-\beta_t}\, X^{t-1},\ \beta_t I\bigr)$$

  • Reverse process:

$$p_\theta(X^{t-1} \mid X^t, C) = \mathcal{N}\bigl(X^{t-1};\ \mu_\theta(X^t, C, t),\ \sigma_t^2 I\bigr)$$
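
These two processes reduce to a few lines of tensor code. The following PyTorch sketch uses the linear schedule endpoints reported in Section 5; the number of steps $T$ and the fixed reverse-step variance are assumptions, and `mu_theta` and `C` stand in for the conditioned diffusion transformer described next.

```python
import torch

# Linear beta schedule with the endpoints reported in Section 5 (1e-4 to 2e-2);
# the number of steps T here is illustrative.
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def q_sample(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Forward diffusion in closed form: noisy X^t given clean X^0, with t of shape (B,)."""
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1)          # broadcast over (B, frames, dims)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

def p_sample(x_t: torch.Tensor, step: int, mu_theta, C) -> torch.Tensor:
    """One reverse step: X^{t-1} ~ N(mu_theta(X^t, C, t), sigma_t^2 I)."""
    sigma = betas[step].sqrt()                    # a common fixed-variance choice
    noise = torch.randn_like(x_t) if step > 0 else torch.zeros_like(x_t)
    return mu_theta(x_t, C, step) + sigma * noise
```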

3. Diffusion Transformer Design and Losses

The diffusion transformer backbone is a 6-layer Transformer decoder with model dimension $d = 512$ and 8 attention heads. At each diffusion step $t$, inputs include the current noisy motion $X^t_{0:W_{\text{cur}}}$, the preceding clean motion $X^0_{-W_{\text{pre}}:0}$, and audio features $A_{-W_{\text{pre}}:W_{\text{cur}}}$, with sinusoidal positional encoding for temporal localization.

The denoising network DD predicts the clean motion sequence:

$$\hat{X}^0_{-W_{\text{pre}}:W_{\text{cur}}} = D\bigl(X^t_{0:W_{\text{cur}}},\ X^0_{-W_{\text{pre}}:0},\ A_{-W_{\text{pre}}:W_{\text{cur}}},\ t\bigr)$$
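
A simplified rendition of this denoiser interface is given below. The layer count, model width, and head count follow the text; the exact token fusion (past clean and current noisy motion concatenated along time, audio attended to as cross-attention memory) and the diffusion-step embedding are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """Illustrative x0-prediction denoiser: 6 decoder layers, d = 512, 8 heads.
    Sinusoidal positional encodings are omitted for brevity."""
    def __init__(self, motion_dim: int, audio_dim: int, d_model: int = 512, num_steps: int = 1000):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, d_model)
        self.audio_in = nn.Linear(audio_dim, d_model)
        self.step_emb = nn.Embedding(num_steps, d_model)   # diffusion-step embedding (assumed)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.motion_out = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_cur, past_clean, audio, t):
        # Motion tokens: past clean window followed by the current noisy window.
        motion = self.motion_in(torch.cat([past_clean, noisy_cur], dim=1))
        motion = motion + self.step_emb(t)[:, None, :]     # inject the diffusion step
        memory = self.audio_in(audio)                      # audio features as cross-attention memory
        hidden = self.decoder(tgt=motion, memory=memory)
        return self.motion_out(hidden)                     # predicted clean motion for the full window
```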

Classifier-Free Guidance (CFG) is employed:

$$\hat{X}^0 = D(\dots, \varnothing, t) + \lambda_c \bigl[ D(\dots, A, t) - D(\dots, \varnothing, t) \bigr]$$

where $\varnothing$ denotes a dropped audio condition (applied with 10% probability during training) and $\lambda_c$ is the guidance scale.
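
At sampling time, the guided prediction combines one conditional and one unconditional forward pass. A minimal sketch, assuming the dropped condition $\varnothing$ is represented by zeroed audio features:

```python
import torch

def guided_x0(denoiser, noisy_cur, past_clean, audio, t, guidance_scale: float):
    """Classifier-free guidance on the predicted clean motion sequence."""
    null_audio = torch.zeros_like(audio)    # stands in for the dropped condition (an assumption)
    x0_uncond = denoiser(noisy_cur, past_clean, null_audio, t)
    x0_cond = denoiser(noisy_cur, past_clean, audio, t)
    return x0_uncond + guidance_scale * (x0_cond - x0_uncond)
```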

The aggregate loss function is:

$$L_{\text{total}} = L_{\text{sim}} + \lambda_{\text{vel}} L_{\text{vel}} + \lambda_{\text{smooth}} L_{\text{smooth}} + \lambda_{\text{exp}} L_{\text{exp}}$$

with $\lambda_{\text{vel}} = 5.0$, $\lambda_{\text{smooth}} = 0.5$, and $\lambda_{\text{exp}} = 0.1$. The specific terms, sketched in code after the list, are:

  1. $L_{\text{sim}} = \| X^0 - \hat{X}^0 \|_2^2$ (reconstruction)
  2. $L_{\text{vel}} = \| \Delta X^0 - \Delta \hat{X}^0 \|_2^2$ (velocity)
  3. $L_{\text{smooth}} = \| \hat{X}^0_{t+2} - 2\hat{X}^0_{t+1} + \hat{X}^0_{t} \|_2^2$ (smoothness)
  4. $L_{\text{exp}} = \| \delta^0 - \hat{\delta}^0 \|_2^2$ (mouth expression)
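
A compact PyTorch rendering of the four terms, assuming motion tensors of shape (batch, frames, motion_dim); the slice used to pick out the expression ($\delta$) channels is illustrative.

```python
import torch
import torch.nn.functional as F

def joyvasa_losses(x0: torch.Tensor, x0_hat: torch.Tensor,
                   exp_slice: slice = slice(0, 63),
                   lam_vel: float = 5.0, lam_smooth: float = 0.5, lam_exp: float = 0.1):
    """Aggregate training loss over motion tensors of shape (B, frames, motion_dim)."""
    l_sim = F.mse_loss(x0_hat, x0)                                   # reconstruction
    l_vel = F.mse_loss(x0_hat[:, 1:] - x0_hat[:, :-1],
                       x0[:, 1:] - x0[:, :-1])                       # frame-to-frame velocity
    accel = x0_hat[:, 2:] - 2 * x0_hat[:, 1:-1] + x0_hat[:, :-2]
    l_smooth = accel.pow(2).mean()                                   # second-order smoothness
    l_exp = F.mse_loss(x0_hat[..., exp_slice], x0[..., exp_slice])   # expression (delta) channels
    return l_sim + lam_vel * l_vel + lam_smooth * l_smooth + lam_exp * l_exp
```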

4. Rendering Process

After sampling the motion sequence $\hat{X}^0$, target keypoints $x_d = T(x_c, \hat{X}^0)$ are computed. The appearance feature $f_{\text{face}}$ is transferred from the source configuration $x_s$ to the new configuration $x_d$. A U-Net–style generator $G$ synthesizes output frames:

$$\hat{I}_t = G(f_{\text{face}}, x_s, x_d)$$

While optional pixel-space losses such as $\mathcal{L}_1$ or perceptual losses ($\mathcal{L}_{\text{perc}}$) are available, the principal training signal derives from feature-space reconstruction via the LivePortrait decoder.
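
Per-frame rendering then amounts to recomputing driven keypoints from each sampled motion vector and passing them, together with the source appearance feature, to the frozen generator. The sketch reuses the `deform_keypoints` helper and `MotionFrame` container from the earlier sketches; `generator` is a placeholder for the frozen LivePortrait generator $G$.

```python
def render_video(f_face, x_c, source_motion, generated_motion, generator):
    """Render one frame per sampled motion vector with a frozen generator G."""
    x_s = deform_keypoints(x_c, source_motion.s, source_motion.R,
                           source_motion.t, source_motion.delta)   # source keypoint configuration
    frames = []
    for m in generated_motion:                                     # one MotionFrame per output frame
        x_d = deform_keypoints(x_c, m.s, m.R, m.t, m.delta)        # driven keypoint configuration
        frames.append(generator(f_face, x_s, x_d))                 # \hat{I}_t = G(f_face, x_s, x_d)
    return frames
```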

5. Data, Training Regime, and Preprocessing

JoyVASA is trained on a hybrid dataset of 5,578 video clips (ranging from 8 seconds to several minutes each), drawn from:

  • HDTF (public)
  • CelebV-HQ (public)
  • JD Health proprietary Chinese data

Preprocessing involves QAlign for video-quality filtering, SyncNet for lip-sync alignment, and oversampling to balance dataset contributions. Audio features are extracted with a frozen wav2vec2 model, while motion features are obtained via a frozen LivePortrait encoder.
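
As a concrete example of the audio branch, a frozen wav2vec2 encoder can be queried as follows. The Hugging Face checkpoint name is an assumption; the text states only that wav2vec2 is used and kept frozen.

```python
import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

# Assumed checkpoint; the paper does not name the exact wav2vec2 variant here.
ckpt = "facebook/wav2vec2-base-960h"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
wav2vec2 = Wav2Vec2Model.from_pretrained(ckpt).eval()
for p in wav2vec2.parameters():
    p.requires_grad_(False)                      # frozen audio encoder

def extract_audio_features(waveform_16khz: torch.Tensor) -> torch.Tensor:
    """Return per-frame audio features for a mono 16 kHz waveform of shape (num_samples,)."""
    inputs = feature_extractor(waveform_16khz.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        return wav2vec2(inputs.input_values).last_hidden_state   # (1, frames, hidden_dim)
```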

Optimization employs Adam with a learning rate of $1 \times 10^{-4}$, batch size 16, and 20,000 total steps. The diffusion schedule uses a linear $\beta$ progression from $10^{-4}$ to $2 \times 10^{-2}$. Window lengths are set to $W_{\text{cur}} = 100$ and $W_{\text{pre}} = 25$.
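
Tying the reported hyperparameters together, one training step might look like the sketch below, which reuses `MotionDenoiser`, `q_sample`, `T`, and `joyvasa_losses` from the earlier sketches. Only the optimizer, learning rate, batch size, schedule endpoints, and window lengths come from the text; the dataloader layout and feature dimensions are assumptions.

```python
import torch

W_CUR, W_PRE = 100, 25                    # current and preceding motion windows
BATCH_SIZE, LR = 16, 1e-4                 # reported batch size and learning rate

denoiser = MotionDenoiser(motion_dim=70, audio_dim=768)   # dimensions are illustrative
optimizer = torch.optim.Adam(denoiser.parameters(), lr=LR)

def train_step(batch):
    """One optimization step; `batch` yields aligned motion and audio windows (assumed layout)."""
    x0_pre = batch["motion"][:, :W_PRE]                    # (B, 25, D) past clean context
    x0_cur = batch["motion"][:, W_PRE:W_PRE + W_CUR]       # (B, 100, D) clean current window
    audio = batch["audio"]                                 # (B, 125, A) aligned audio features
    t = torch.randint(0, T, (x0_cur.size(0),))             # T and q_sample from the DDPM sketch
    x_t = q_sample(x0_cur, t)                              # noised current window
    x0_full = torch.cat([x0_pre, x0_cur], dim=1)           # clean target over the full window
    x0_hat = denoiser(x_t, x0_pre, audio, t)               # predicted clean full window
    loss = joyvasa_losses(x0_full, x0_hat)                 # loss sketch from Section 3
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```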

6. Evaluation and Comparative Results

Performance is assessed on the CelebV-HQ test set (50 subjects, 5–15 seconds each) using the following metrics:

| Metric | JoyVASA | AniPortrait | Notes |
|---|---|---|---|
| IQA (%) ↑ | 68.97 | 74.85 | Image quality |
| VQA (%) ↑ | 72.42 | 78.00 | Video quality |
| Sync-C ↑ | 4.85 | 1.98 | Lip-sync confidence |
| Sync-D ↓ | 13.53 | 13.28 | Lip-sync distance |
| FVD-25 ↓ | 459.04 (best) | — | Fréchet Video Distance |
| Smooth (%) ↑ | 99.60 (2nd best) | — | Motion smoothness |

On an open set of 50 arbitrary image/audio pairs, JoyVASA achieves IQA 71.45, VQA 77.78, Sync-C 5.72, Sync-D 14.01, and Smooth 99.48.

Qualitative evaluation demonstrates strong temporal coherence and expressive head motion; the model animates humans, cartoons, artwork, and animal faces without retraining.

7. Limitations and Prospects

Current limitations stem from Stage I, as LivePortrait representations may be suboptimal for large pose variations and lack a cross-identity retargeting module for audio-only inputs. Future research aims to incorporate more robust disentangled facial models (e.g., EMOPortrait), improve real-time inference speed via model pruning or distillation, and provide finer-grained expression control through emotion codes or user-editable parameters. These directions are anticipated to broaden JoyVASA's applicability across diverse animation scenarios (Cao et al., 2024).
