
KlingAvatar 2.0: Audio-Driven Avatar Synthesis

Updated 20 December 2025
  • KlingAvatar 2.0 is a spatio-temporal cascade framework for audio-driven avatar video synthesis that produces long-duration, high-resolution videos.
  • It employs a two-stage cascade with low-res blueprint keyframe generation and high-res refinement, enhanced by modality-specific LLM experts for coherent multimodal instructions.
  • Its design supports precise multi-character control and robust identity preservation, achieving superior visual fidelity, lip sync, and temporal coherence compared to earlier models.

KlingAvatar 2.0 is a spatio-temporal cascade framework for audio-driven avatar video synthesis that targets efficient generation of long-duration, high-resolution videos with robust multimodal instruction following and fine-grained multi-character control. The system integrates a multi-stage diffusion architecture, a Co-Reasoning Director composed of modality-specific LLM experts, a Negative Director for negative prompt refinement, and innovations in identity preservation and temporal coherence. KlingAvatar 2.0 enables the synthesis of avatar videos up to five minutes long at resolutions up to 512×512 (and higher via super-resolution), allowing for realistic lip–teeth rendering, vivid expressions, shot-level multimodal control, and support for scenes with multiple speaker identities (Team et al., 15 Dec 2025).

1. Spatio-Temporal Cascade Modeling

The KlingAvatar 2.0 architecture employs a two-stage spatio-temporal cascade strategy to address efficiency and quality degradation in long-form video generation:

  • Low-Resolution Blueprint Keyframe Generation:

The system first takes a reference image $I$, a full audio track $\{a_t\}_{t=1}^{T}$, positive instructions $P^+$, and negative prompts $P^-$. A DiT-based diffusion model $\mathcal{D}_\ell$ creates a temporally compressed, low-resolution video $V^L \in \mathbb{R}^{T_L \times H_L \times W_L \times C}$ that captures global semantics and motion. Each denoising step is formalized as

$$z_{\tau-1} = z_{\tau} - \alpha_{\tau}\,\epsilon_\theta(z_{\tau}, \tau, P^+, P^-) + \sigma_{\tau}\,\xi, \qquad \xi \sim \mathcal{N}(0, I).$$

  • High-Resolution Sub-clip Refinement:

Sparse keyframes $\{t_k\}$ are extracted, and each low-resolution keyframe $v^L_{t_k}$ is upscaled:

$$v^H_{t_k} = \mathrm{Upsample}_{\mathrm{spatial}}(v^L_{t_k}, s), \qquad (H, W) = s \cdot (H_L, W_L).$$

Sub-clips between keyframes $t_k$ and $t_{k+1}$ are generated with first–last frame conditioning:

$$v^L_t = \mathcal{G}(v^H_{t_k}, v^H_{t_{k+1}}, a_t, P^+, P^-), \qquad t_k < t < t_{k+1}.$$

Audio-aware temporal interpolation inserts intermediate frames $\tilde{v}_t$ via

$$\tilde{v}_t = \alpha_t\, v^L_{t-1} + (1 - \alpha_t)\, v^L_{t+1}$$

with learned weights $\alpha_t \in [0, 1]$. Finally, a high-resolution DiT $\mathcal{D}_h$ upsamples the sub-clips to full video resolution.

This cascade method addresses difficulties in prior work such as temporal drifting, resolution bottlenecks, and quality falloff as video duration increases (Team et al., 15 Dec 2025).
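
The control flow of the cascade can be sketched compactly. The following is a minimal illustration, assuming hypothetical `blueprint` and `refine` callables for the two diffusion stages and a nearest-neighbor stand-in for the spatial upsampler; the paper releases no code, so every interface here is an assumption:

```python
import numpy as np

def spatial_upsample(frame: np.ndarray, s: int) -> np.ndarray:
    # Nearest-neighbor stand-in for Upsample_spatial(v, s).
    return frame.repeat(s, axis=0).repeat(s, axis=1)

def cascade_generate(ref_image, audio, pos, neg, blueprint, refine,
                     alphas, stride: int = 16, s: int = 2) -> np.ndarray:
    # Stage 1: temporally compressed low-res blueprint V^L of shape
    # (T_L, H_L, W_L, C) capturing global semantics and motion.
    v_lr = blueprint(ref_image, audio, pos, neg)

    # Extract sparse keyframes {t_k} and upscale each to (s*H_L, s*W_L).
    ks = list(range(0, len(v_lr), stride))
    hr = {t: spatial_upsample(v_lr[t], s) for t in ks}

    # Stage 2: regenerate each sub-clip with first-last frame conditioning
    # on the high-res keyframes, per v_t = G(v_{t_k}, v_{t_{k+1}}, a_t, ...).
    clips = [refine(hr[t0], hr[t1], audio[t0:t1], pos, neg)
             for t0, t1 in zip(ks[:-1], ks[1:])]
    v = np.concatenate(clips, axis=0)

    # Audio-aware temporal interpolation with learned weights alpha_t:
    # tilde v_t = alpha_t * v_{t-1} + (1 - alpha_t) * v_{t+1}.
    out = [v[0]]
    for t in range(1, len(v) - 1):
        out.append(alphas[t] * v[t - 1] + (1 - alphas[t]) * v[t + 1])
    out.append(v[-1])
    return np.stack(out)
```

The final high-resolution DiT pass $\mathcal{D}_h$ that upsamples the assembled sub-clips to full output resolution is omitted from this sketch.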

2. Co-Reasoning Director and Negative Prompting

KlingAvatar 2.0 introduces a Co-Reasoning Director comprising three LLM experts, each specialized by modality, for comprehensive instruction interpretation across audio, vision, and text:

  • Audio Expert ($E_a$): Transcribes the audio input, extracting paralinguistic features such as emotion and prosody.
  • Visual Expert ($E_v$): Summarizes the reference image and infers scene layout.
  • Textual Expert ($E_t$): Parses detailed instructions and reconciles prior context.

The director conducts a multi-turn, chain-of-thought dialogue among the experts. At each iteration $k$:

  1. $h_a^{(k)} = E_a(a_{1:T},\, h_v^{(k-1)},\, h_t^{(k-1)})$
  2. $h_v^{(k)} = E_v(I,\, h_a^{(k)},\, h_t^{(k-1)})$
  3. $h_t^{(k)} = E_t(P^+,\, h_a^{(k)},\, h_v^{(k)})$

Ultimately, the output is a structured plan $S = \{(t_k, P^+_k, P^-_k, c_k)\}_{k=1}^{M}$ with timing, prompt segments, negative cues, and shot-level parameters $c_k$ (camera/motion).
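
A minimal sketch of this loop follows, assuming hypothetical `E_a`, `E_v`, `E_t` callables that wrap the three LLM experts and return updated per-modality summaries:

```python
def co_reason(audio, image, pos_prompt, E_a, E_v, E_t, rounds: int = 3):
    # Multi-turn chain-of-thought dialogue among the modality experts;
    # each expert conditions on the others' latest hidden summaries.
    h_a = h_v = h_t = None  # no peer context before the first round
    for _ in range(rounds):
        h_a = E_a(audio, h_v, h_t)       # transcript + emotion/prosody cues
        h_v = E_v(image, h_a, h_t)       # reference summary + scene layout
        h_t = E_t(pos_prompt, h_a, h_v)  # reconciled textual directives
    # Downstream (not shown), the final states are compiled into the
    # structured shot plan S = {(t_k, P+_k, P-_k, c_k)}.
    return h_a, h_v, h_t
```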

A complementary Negative Director is responsible for generating negative prompts $P^-$ to suppress unwanted artifacts and enforce instruction alignment. Its training loss encourages contrastive prediction between positive and negative guidance:

$$\mathcal{L}_{\mathrm{neg}} = \mathbb{E}_{z,\tau}\left[\left\|\epsilon_\theta(z, \tau, P^+, P^-) - \epsilon_\theta(z, \tau, P^+, \varnothing)\right\|_2^2\right].$$

This promotes stable video generation and minimizes spurious artifacts (Team et al., 15 Dec 2025).
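
A PyTorch sketch of this objective, written directly from the definition above; `eps_theta` and the prompt-embedding arguments are assumed interfaces, and treating the empty-negative branch as a fixed reference (no gradient) is an assumption rather than a stated detail:

```python
import torch

def negative_director_loss(eps_theta, z, tau, pos_emb, neg_emb, null_emb):
    # L_neg = E[ || eps(z, tau, P+, P-) - eps(z, tau, P+, empty) ||_2^2 ]
    eps_neg = eps_theta(z, tau, pos_emb, neg_emb)       # negatively guided branch
    with torch.no_grad():                               # reference branch (assumption)
        eps_ref = eps_theta(z, tau, pos_emb, null_emb)  # empty negative prompt
    return ((eps_neg - eps_ref) ** 2).flatten(1).sum(-1).mean()
```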

3. ID-Specific Multi-Character Control

KlingAvatar 2.0 extends to scenarios with multiple speaker identities and audio streams through the following mechanisms:

  • Mask-Prediction Head: Deep DiT features $F$ are processed with a mask head to predict an identity mask $M^i$ per character:

$$M^i = \sigma\left(W_m\, \mathrm{CrossAttn}(F, E_{\mathrm{id}^i})\right), \qquad \sum_i M^i \leq 1,$$

where $E_{\mathrm{id}^i}$ is the embedding from the identity reference crop.

  • Audio-Injection Gating: During denoising, latents are updated by:

$$z_{\tau-1} = z_\tau - \sum_{i=1}^{C} M^i \odot \alpha_\tau^i\, \epsilon_\theta(F, a^i, P^+, P^-) - \Bigl(1 - \sum_i M^i\Bigr) \odot \alpha_\tau^0\, \epsilon_\theta(\cdots),$$

ensuring regional fidelity to each character’s designated audio.

This design supports scenes with multiple speakers, maintaining strict regional and identity boundaries, and preventing cross-speaker audio or appearance "bleeding" (Team et al., 15 Dec 2025).
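
Both mechanisms can be illustrated in a short PyTorch sketch. The attention module, the projection standing in for $W_m$, and all shapes are assumptions; note also that independent sigmoids do not by themselves enforce $\sum_i M^i \leq 1$, which would require an explicit normalization such as a softmax over identities plus a background slot:

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    # Predicts a soft identity mask M^i = sigmoid(W_m CrossAttn(F, E_id^i)).
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, 1)  # plays the role of W_m

    def forward(self, feats: torch.Tensor, id_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) DiT tokens; id_emb: (B, 1, dim) identity-crop embedding.
        attended, _ = self.attn(feats, id_emb, id_emb)  # CrossAttn(F, E_id^i)
        return torch.sigmoid(self.proj(attended))       # (B, N, 1) mask in [0, 1]

def gated_step(z, eps_chars, eps_bg, masks, alphas, alpha_bg):
    # One denoising update with per-character audio-injection gating:
    # character i's noise prediction (conditioned on its own audio a^i)
    # is applied only inside its mask M^i; the residual region uses eps_bg.
    m_sum = sum(masks)
    z_next = z - (1 - m_sum) * alpha_bg * eps_bg
    for m, a, eps in zip(masks, alphas, eps_chars):
        z_next = z_next - m * a * eps
    return z_next
```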

4. Training Data, Evaluation, and Metrics

The KlingAvatar 2.0 system is trained and evaluated on a cinematic-level, multi-character video corpus, including data from podcasts, interviews, and television series—with video lengths up to 5 minutes and blueprint resolutions at 256×256, refined to 512×512 (and optionally 768×768). Key evaluation metrics are:

  • GSB Pairwise Preference: $\mathrm{GSB}_{\mathrm{overall}} = \frac{G + S}{G + S + B}$
  • Face–Lip Sync Score: $1 - \frac{1}{T}\sum_{t=1}^{T} \lVert f(v_t) - \phi(a_t) \rVert_1$
  • FID (Visual Fidelity): $\lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\bigl(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\bigr)$
  • Identity Preservation: Mean cosine similarity in a face-recognition embedding space
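
For concreteness, the FID and identity-preservation entries can be computed directly from their definitions. A minimal NumPy/SciPy sketch, with the feature and face-embedding extractors assumed to come from external networks:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    # ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}),
    # with feature matrices of shape (N, D).
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    s_r = np.cov(feats_real, rowvar=False)
    s_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(s_r @ s_g).real  # drop negligible imaginary parts
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(s_r + s_g - 2 * covmean))

def identity_preservation(ref_emb: np.ndarray, frame_embs: np.ndarray) -> float:
    # Mean cosine similarity between a reference face embedding (D,)
    # and per-frame embeddings (T, D) in a face-recognition space.
    ref = ref_emb / np.linalg.norm(ref_emb)
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return float((frames @ ref).mean())
```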

KlingAvatar 2.0 demonstrates quantitative improvements over baseline models such as HeyGen, KlingAvatar 1.0, and OmniHuman-1.5. For example, the overall GSB preference ratios are 1.26 (vs. HeyGen), 1.73 (vs. KlingAvatar 1.0), and 1.94 (vs. OmniHuman-1.5). Substantial gains are also evident in visual clarity, lip synchronization, and text prompt relevance (Team et al., 15 Dec 2025).

5. Qualitative Properties and System Analysis

The cascade and director modules yield several qualitative advances:

  • Visual Fidelity: Superior detail in skin, hair, and teeth, attributable to cascade super-resolution.
  • Lip–Teeth Dynamics: Synchronized, fine-grained lip and teeth movements reflecting phonetic content.
  • Identity Consistency: Sustained facial geometry and appearance over long segments, minimizing drift.
  • Temporal Coherence: Suppression of frame flicker and motion drift, enabling smooth transitions and camera work.
  • Multimodal Prompt Realization: Accurate execution of camera moves, gestures, and emotional tone, strictly respecting multimodal prompt guidance.
  • Multi-Character Segregation: Precise speaker separation, ensuring identity and audio fidelity per region.

These properties collectively address longstanding challenges in extended avatar video generation, such as temporal instability, instruction misalignment, and identity drift across frames (Team et al., 15 Dec 2025).

6. Limitations and Projected Directions

KlingAvatar 2.0 exhibits several challenges:

  • Scalability: Ultra-high resolutions (e.g., 1080p and above) impose significant computational load.
  • Extremely Long Horizons: Subtle temporal drift may accumulate for videos exceeding 5 minutes; integration of memory-augmented world models or periodic re-anchoring could enhance stability.
  • Negative Director Heuristics: The current implementation of negative prompt generation remains heuristic, with the potential for future work on differentiable, end-to-end negative prompt learning pipelines.
  • 3D Awareness: The super-resolution module operates in 2D, and future extensions to 3D-aware diffusion models may facilitate free-viewpoint rendering and head-pose control.
  • Interactive Editing: Absence of real-time, incremental generation and feedback limits on-the-fly script or identity adjustments.

These limitations point towards research avenues in more efficient diffusion techniques, advanced negative-prompt training, model scalability, and interactive video synthesis workflows (Team et al., 15 Dec 2025).
