OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation (2506.18866v1)

Published 23 Jun 2025 in cs.CV, cs.AI, and cs.MM

Abstract: Significant progress has been made in audio-driven human animation, while most existing methods focus mainly on facial movements, limiting their ability to create full-body animations with natural synchronization and fluidity. They also struggle with precise prompt control for fine-grained generation. To tackle these challenges, we introduce OmniAvatar, an innovative audio-driven full-body video generation model that enhances human animation with improved lip-sync accuracy and natural movements. OmniAvatar introduces a pixel-wise multi-hierarchical audio embedding strategy to better capture audio features in the latent space, enhancing lip-syncing across diverse scenes. To preserve the capability for prompt-driven control of foundation models while effectively incorporating audio features, we employ a LoRA-based training approach. Extensive experiments show that OmniAvatar surpasses existing models in both facial and semi-body video generation, offering precise text-based control for creating videos in various domains, such as podcasts, human interactions, dynamic scenes, and singing. Our project page is https://omni-avatar.github.io/.

Summary

  • The paper introduces OmniAvatar, a framework for efficient audio-driven avatar video generation achieving state-of-the-art performance using diffusion models.
  • OmniAvatar employs pixel-wise multi-hierarchical audio embedding for fine-grained lip-sync and holistic body motion adaptation, improving upon previous cross-attention methods.
  • By using LoRA on a pre-trained diffusion transformer and a latent overlapping strategy, OmniAvatar preserves identity, ensures temporal consistency in long videos, and enables prompt-driven control.

OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation

OmniAvatar presents a unified framework for audio-driven full-body avatar video generation, addressing persistent challenges in synchronizing lip movements, generating natural body animation, and enabling precise prompt-based control. The model leverages a combination of pixel-wise multi-hierarchical audio embedding and LoRA-based adaptation atop a large-scale diffusion transformer (DiT) video foundation model, specifically Wan2.1-T2V-14B. This approach enables high-fidelity, temporally consistent, and controllable avatar video synthesis from a single reference image, audio input, and textual prompt.
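
At a high level, generation takes a single reference image, a driving audio clip, and a text prompt, and returns a video. The sketch below is purely illustrative: the class, method, and argument names are hypothetical rather than the released interface, and it assumes a diffusion-transformer backbone with the inference settings reported later in this summary.

```python
# Illustrative only: class, method, and argument names are hypothetical,
# not the authors' released interface.
from dataclasses import dataclass

import numpy as np


@dataclass
class AvatarRequest:
    reference_image: np.ndarray    # single identity frame, H x W x 3
    audio_waveform: np.ndarray     # mono waveform driving lips and body motion
    prompt: str                    # text controlling gestures, emotion, background
    num_steps: int = 25            # denoising steps reported for inference
    cfg_scale: float = 4.5         # guidance value reported as a good trade-off


def generate_avatar_video(model, req: AvatarRequest) -> np.ndarray:
    """Hypothetical wrapper: encode the three conditions, run the DiT
    denoising loop, and decode latents back to RGB frames (T x H x W x 3)."""
    conditions = model.encode_conditions(req.reference_image,
                                         req.audio_waveform,
                                         req.prompt)
    latents = model.denoise(conditions, steps=req.num_steps, cfg_scale=req.cfg_scale)
    return model.decode(latents)
```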

Technical Contributions

OmniAvatar introduces several key innovations:

  • Pixel-wise Multi-Hierarchical Audio Embedding: Unlike prior methods that rely on cross-attention to inject audio features, which adds computational overhead and limits spatial influence, OmniAvatar directly embeds audio features at the pixel level within the latent space. This is achieved by compressing audio features (extracted via Wav2Vec2) to match the temporal resolution of the video latents, then projecting and fusing them into the video latent representations at multiple layers of the DiT. This design ensures both fine-grained lip-sync and holistic body motion adaptation to audio (a minimal sketch of this fusion follows the list).
  • LoRA-based Model Adaptation: To efficiently adapt the large pre-trained video diffusion model to the audio-driven task without catastrophic forgetting or overfitting, OmniAvatar applies Low-Rank Adaptation (LoRA) to the attention and feed-forward layers of the DiT. This allows the model to learn audio-conditioned behaviors while preserving the generative capacity and prompt controllability of the base model (see the LoRA sketch below).
  • Long Video Generation with Identity and Temporal Consistency: The inference pipeline incorporates reference image embedding for identity preservation and a latent overlapping strategy for temporal continuity. By overlapping frames between generated segments and anchoring identity via repeated reference latents, the model produces long, coherent videos with smooth transitions (see the segment-stitching sketch below).
  • Prompt-Driven Control: By retaining the prompt-conditioning capabilities of the foundation model, OmniAvatar supports nuanced text-based control over gestures, emotions, backgrounds, and scene context, enabling applications beyond simple talking heads.
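
A minimal sketch of the pixel-wise, multi-hierarchical audio fusion, assuming Wav2Vec2 features are temporally pooled to the latent frame rate and then added to the video latents through a separate lightweight projection at each fusion depth; module names and tensor layouts are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn


class MultiHierarchicalAudioFusion(nn.Module):
    """Sketch: project compressed audio features and add them to the video
    latents at several DiT blocks (pixel-wise, i.e., per latent token)."""

    def __init__(self, audio_dim: int, latent_dim: int, num_fusion_layers: int):
        super().__init__()
        # One lightweight projection per fusion depth (multi-hierarchical).
        self.projections = nn.ModuleList(
            [nn.Linear(audio_dim, latent_dim) for _ in range(num_fusion_layers)]
        )

    def compress_audio(self, audio_feats: torch.Tensor, num_latent_frames: int) -> torch.Tensor:
        # audio_feats: (B, T_audio, audio_dim) from Wav2Vec2. Average-pool along
        # time so the audio sequence aligns with the latent frame count.
        pooled = nn.functional.adaptive_avg_pool1d(
            audio_feats.transpose(1, 2), num_latent_frames
        )
        return pooled.transpose(1, 2)  # (B, T_latent, audio_dim)

    def fuse(self, video_latents: torch.Tensor, audio_feats: torch.Tensor, layer_idx: int) -> torch.Tensor:
        # video_latents: (B, T_latent, S, latent_dim); broadcast the projected
        # audio over the S spatial tokens so every latent "pixel" receives it.
        audio_proj = self.projections[layer_idx](audio_feats)   # (B, T_latent, latent_dim)
        return video_latents + audio_proj.unsqueeze(2)
```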
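
The LoRA adaptation follows the standard low-rank adapter pattern around frozen linear layers, here standing in for the DiT's attention and feed-forward projections; this is a generic sketch, and the rank and scaling values are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Sketch: frozen base projection plus a trainable low-rank update,
    as applied to the DiT attention and feed-forward layers."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # keep foundation weights frozen
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)      # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_up(self.lora_down(x))
```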
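
The long-video strategy can be pictured as a sliding-window loop in which each segment re-uses the trailing latent frames of the previous segment, together with the reference latent, as anchors; the helper `denoise_segment`, the segment length, and the overlap size below are assumptions for illustration.

```python
import torch


@torch.no_grad()
def generate_long_video(model, ref_latent, audio_chunks, segment_frames=32, overlap=4):
    """Sketch: generate one segment per audio chunk, anchoring each segment on
    the identity latent and the overlapping tail of the previous segment so
    identity and motion stay consistent across boundaries."""
    all_latents = []
    prev_tail = None
    for chunk in audio_chunks:
        if prev_tail is None:
            anchors = ref_latent
        else:
            anchors = torch.cat([ref_latent, prev_tail], dim=1)  # dim 1 = time
        segment = model.denoise_segment(chunk, anchors, num_frames=segment_frames)
        # Drop the overlapped prefix when stitching so frames are not duplicated.
        start = 0 if prev_tail is None else overlap
        all_latents.append(segment[:, start:])
        prev_tail = segment[:, -overlap:]
    return torch.cat(all_latents, dim=1)
```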

Experimental Results

OmniAvatar is evaluated on the AVSpeech and HDTF datasets, with both facial and semi-body test sets. The model demonstrates strong quantitative and qualitative performance:

  • Lip-Sync Accuracy: Achieves leading Sync-C and Sync-D scores, indicating precise alignment between audio and lip movements.
  • Visual Quality: Outperforms or matches state-of-the-art methods in FID, FVD, and IQA metrics, reflecting high image and video fidelity.
  • Body Animation: Generates fluid, natural upper-body movements, surpassing prior works that often produce stiff or unnatural poses.
  • Prompt Sensitivity: Enables fine-grained control over gestures, emotions, and backgrounds, as evidenced by diverse qualitative results.

Key Numerical Results (AVSpeech Semi-Body Test Set)

| Method | FID ↓ | FVD ↓ | Sync-C ↑ | Sync-D ↓ | IQA ↑ | ASE ↑ |
|---|---|---|---|---|---|---|
| Hallo3 | 104 | 1078 | 5.23 | 9.54 | 3.41 | 2.00 |
| FantasyTalking | 78.9 | 780 | 3.14 | 11.2 | 3.33 | 1.96 |
| HunyuanAvatar | 77.7 | 887 | 6.71 | 8.35 | 3.61 | 2.16 |
| MultiTalk | 74.7 | 787 | 4.76 | 9.99 | 3.67 | 2.22 |
| OmniAvatar | 67.6 | 664 | 7.12 | 8.05 | 3.75 | 2.25 |

OmniAvatar consistently achieves the best or near-best results across all metrics, particularly excelling in lip-sync and overall video quality.

Ablation Studies

Ablation experiments confirm the effectiveness of the proposed components:

  • LoRA vs. Full Training: Full fine-tuning of the DiT leads to overfitting and degraded visual quality, while LoRA preserves generative fidelity and enables effective audio adaptation.
  • Multi-Hierarchical vs. Single-Layer Audio Embedding: Multi-hierarchical embedding yields superior audio-visual alignment and synchronization.
  • Classifier-Free Guidance (CFG): Moderate CFG values (e.g., 4.5) optimize the trade-off between synchronization and naturalness; excessive CFG causes exaggerated or unnatural expressions (a sketch of the guidance combination follows this list).
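
The guidance combination is the standard classifier-free extrapolation between unconditional and conditional noise predictions; the sketch below is generic, uses the 4.5 scale from the ablation, assumes a generic denoiser call signature, and does not reflect how the paper splits text and audio conditioning.

```python
import torch


def classifier_free_guidance(model, latents, timestep, cond, uncond, scale=4.5):
    """Sketch of standard CFG: push the conditional noise prediction away from
    the unconditional one by `scale`. Around 4.5 balanced lip-sync and
    naturalness in the ablation; much larger values exaggerate expressions."""
    noise_uncond = model(latents, timestep, uncond)
    noise_cond = model(latents, timestep, cond)
    return noise_uncond + scale * (noise_cond - noise_uncond)
```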

Implementation Considerations

  • Computational Requirements: Training is conducted on 64 A100 80GB GPUs, reflecting the high resource demands of large-scale video diffusion models.
  • Data Filtering: High-quality training data is curated using SyncNet and Q-Align to ensure accurate lip-sync and visual fidelity (a filtering sketch follows this list).
  • Inference Efficiency: The diffusion-based approach requires multiple denoising steps (e.g., 25 steps per video), which, while tractable for offline generation, may limit real-time applications.
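
The curation step can be sketched as thresholding clips on a lip-sync confidence and a perceptual quality score; `sync_scorer`, `quality_scorer`, and the thresholds below are placeholders standing in for SyncNet- and Q-Align-style models, not the authors' pipeline.

```python
def filter_training_clips(clips, sync_scorer, quality_scorer,
                          min_sync_conf=3.0, min_quality=3.5):
    """Sketch: keep only clips whose audio-visual sync confidence and visual
    quality exceed chosen thresholds. `sync_scorer` stands in for a
    SyncNet-style model, `quality_scorer` for a Q-Align-style assessor;
    the threshold values are illustrative."""
    kept = []
    for clip in clips:
        if sync_scorer(clip) >= min_sync_conf and quality_scorer(clip) >= min_quality:
            kept.append(clip)
    return kept
```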

Limitations

  • Inherited Model Weaknesses: Issues such as color shifts and error accumulation in long videos persist, inherited from the Wan base model.
  • Multi-Character and Fine-Grained Control: Handling multi-character interactions and complex conversational scenarios remains challenging.
  • Inference Latency: The diffusion process is computationally intensive, impeding real-time deployment.

Implications and Future Directions

OmniAvatar advances the state of audio-driven avatar video generation by unifying high-fidelity synthesis, prompt-based control, and efficient adaptation. Practically, this enables applications in virtual assistants, digital content creation, remote communication, and entertainment, where lifelike, controllable avatars are essential.

Theoretically, the work demonstrates the efficacy of direct pixel-wise audio embedding and LoRA-based adaptation for multimodal generative tasks. Future research may focus on:

  • Reducing inference latency via distillation or hybrid architectures.
  • Extending prompt control to multi-character and interactive scenarios.
  • Addressing error propagation and visual artifacts in long-duration synthesis.
  • Exploring more efficient or lightweight foundation models for broader accessibility.

OmniAvatar establishes a robust foundation for controllable, high-quality, audio-driven avatar video generation, with clear pathways for further technical and practical advancements.