- The paper introduces OmniAvatar, a framework for efficient audio-driven avatar video generation achieving state-of-the-art performance using diffusion models.
- OmniAvatar employs pixel-wise multi-hierarchical audio embedding for fine-grained lip-sync and holistic body motion adaptation, improving upon previous cross-attention methods.
- By using LoRA on a pre-trained diffusion transformer and a latent overlapping strategy, OmniAvatar preserves identity, ensures temporal consistency in long videos, and enables prompt-driven control.
OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation
OmniAvatar presents a unified framework for audio-driven full-body avatar video generation, addressing persistent challenges in synchronizing lip movements, generating natural body animation, and enabling precise prompt-based control. The model leverages a combination of pixel-wise multi-hierarchical audio embedding and LoRA-based adaptation atop a large-scale diffusion transformer (DiT) video foundation model, specifically Wan2.1-T2V-14B. This approach enables high-fidelity, temporally consistent, and controllable avatar video synthesis from a single reference image, audio input, and textual prompt.
Technical Contributions
OmniAvatar introduces several key innovations:
- Pixel-wise Multi-Hierarchical Audio Embedding: Unlike prior methods that inject audio features through cross-attention, which adds computational overhead and limits spatial influence, OmniAvatar embeds audio features directly at the pixel level of the latent space. Audio features extracted with Wav2Vec2 are compressed to match the temporal resolution of the video latents, then projected and fused into the video latent representations at multiple layers of the DiT. This design yields both fine-grained lip-sync and holistic, audio-adaptive body motion (see the embedding sketch after this list).
- LoRA-based Model Adaptation: To adapt the large pre-trained video diffusion model to the audio-driven task without catastrophic forgetting or overfitting, OmniAvatar applies Low-Rank Adaptation (LoRA) to the attention and feed-forward layers of the DiT. The model thus learns audio-conditioned behavior while preserving the generative capacity and prompt controllability of the base model (see the LoRA sketch after this list).
- Long Video Generation with Identity and Temporal Consistency: The inference pipeline combines reference-image embedding for identity preservation with a latent overlapping strategy for temporal continuity. By overlapping latent frames between generated segments and anchoring identity via repeated reference latents, the model produces long, coherent videos with smooth transitions (see the chunked-generation sketch after this list).
- Prompt-Driven Control: By retaining the prompt-conditioning capabilities of the foundation model, OmniAvatar supports nuanced text-based control over gestures, emotions, backgrounds, and scene context, enabling applications beyond simple talking heads.
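The audio-conditioning path can be pictured as a small module that pools Wav2Vec2 features down to the latent frame rate and projects them once per injected DiT layer. The sketch below is illustrative only: the module name, the pooling-based temporal compression, and the additive fusion are assumptions based on the description above, not the released implementation.

```python
# Minimal sketch of pixel-wise multi-hierarchical audio embedding (illustrative only).
import torch
import torch.nn as nn


class AudioEmbedder(nn.Module):
    def __init__(self, audio_dim: int, latent_dim: int, num_inject_layers: int, stride: int):
        super().__init__()
        # Compress Wav2Vec2 features to the temporal resolution of the video latents.
        self.pool = nn.AvgPool1d(kernel_size=stride, stride=stride)
        # One projection per DiT block that receives audio ("multi-hierarchical").
        self.projs = nn.ModuleList(
            [nn.Linear(audio_dim, latent_dim) for _ in range(num_inject_layers)]
        )

    def forward(self, audio_feats: torch.Tensor):
        # audio_feats: (B, T_audio, audio_dim) from Wav2Vec2.
        x = self.pool(audio_feats.transpose(1, 2)).transpose(1, 2)  # (B, T_latent, audio_dim)
        # One embedding per injection layer; each is later broadcast over the spatial
        # positions of the corresponding latent frame and added element-wise.
        return [proj(x) for proj in self.projs]
```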
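For the LoRA adaptation, the standard recipe is to freeze each pre-trained linear projection and learn a low-rank residual on top of it. The wrapper below is a generic sketch of that recipe; the rank, scaling factor, and the choice to wrap every linear layer are placeholders, not OmniAvatar's training code.

```python
# Generic LoRA wrapper: a frozen base projection plus a trainable low-rank update.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base.requires_grad_(False)   # keep the pre-trained weight frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)       # the low-rank update starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def add_lora(module: nn.Module, rank: int = 16) -> None:
    """Recursively wrap every nn.Linear (attention and MLP projections) with LoRA."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, rank=rank))
        else:
            add_lora(child, rank=rank)
```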
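The long-video loop can be expressed as chunked sampling in which each segment is conditioned on the reference latent and on the last few latent frames of the previous segment. The `model.sample` call, the overlap length, and the `(B, C, T, H, W)` tensor layout below are hypothetical placeholders for whatever sampler the DiT exposes.

```python
# Chunked generation with latent overlap (sketch under the assumptions stated above).
import torch


@torch.no_grad()
def generate_long_video(model, ref_latent, audio_chunks, overlap: int = 4):
    segments, prev_tail = [], None
    for audio in audio_chunks:
        # Each segment sees the identity (reference) latent plus the tail of the
        # previous segment, which anchors appearance and keeps motion continuous.
        seg = model.sample(ref_latent=ref_latent, audio=audio, prefix_latents=prev_tail)
        # Drop the re-generated overlap frames on every segment after the first.
        segments.append(seg if prev_tail is None else seg[:, :, overlap:])
        prev_tail = seg[:, :, -overlap:]
    return torch.cat(segments, dim=2)  # concatenate along the temporal axis
```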
Experimental Results
OmniAvatar is evaluated on the AVSpeech and HDTF datasets, with both facial and semi-body test sets. The model demonstrates strong quantitative and qualitative performance:
- Lip-Sync Accuracy: Achieves leading Sync-C and Sync-D scores, indicating precise alignment between audio and lip movements.
- Visual Quality: Outperforms or matches state-of-the-art methods in FID, FVD, and IQA metrics, reflecting high image and video fidelity.
- Body Animation: Generates fluid, natural upper-body movements, surpassing prior works that often produce stiff or unnatural poses.
- Prompt Sensitivity: Enables fine-grained control over gestures, emotions, and backgrounds, as evidenced by diverse qualitative results.
Key Numerical Results (AVSpeech Semi-Body Test Set)
| Method | FID ↓ | FVD ↓ | Sync-C ↑ | Sync-D ↓ | IQA ↑ | ASE ↑ |
|---|---|---|---|---|---|---|
| Hallo3 | 104 | 1078 | 5.23 | 9.54 | 3.41 | 2.00 |
| FantasyTalking | 78.9 | 780 | 3.14 | 11.2 | 3.33 | 1.96 |
| HunyuanAvatar | 77.7 | 887 | 6.71 | 8.35 | 3.61 | 2.16 |
| MultiTalk | 74.7 | 787 | 4.76 | 9.99 | 3.67 | 2.22 |
| OmniAvatar | 67.6 | 664 | 7.12 | 8.05 | 3.75 | 2.25 |
OmniAvatar consistently achieves the best or near-best results across all metrics, particularly excelling in lip-sync and overall video quality.
Ablation Studies
Ablation experiments confirm the effectiveness of the proposed components:
- LoRA vs. Full Training: Full fine-tuning of the DiT leads to overfitting and degraded visual quality, while LoRA preserves generative fidelity and enables effective audio adaptation.
- Multi-Hierarchical vs. Single-Layer Audio Embedding: Multi-hierarchical embedding yields superior audio-visual alignment and synchronization.
- Classifier-Free Guidance (CFG): Moderate CFG values (e.g., 4.5) optimize the trade-off between synchronization and naturalness; excessive CFG can cause exaggerated or unnatural expressions.
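The guidance ablation follows the usual classifier-free guidance formulation, sketched below. Whether OmniAvatar uses a single joint scale for audio and text or separate scales is not restated here, so treat the single-scale version as an assumption.

```python
# Standard classifier-free guidance step; a scale near 4.5 matches the ablation's
# sweet spot. Treating audio and text as one joint condition is an assumption.
import torch


def cfg_noise_pred(model, latents, t, cond, null_cond, guidance_scale: float = 4.5):
    eps_cond = model(latents, t, **cond)         # audio/text-conditioned prediction
    eps_uncond = model(latents, t, **null_cond)  # prediction with conditioning dropped
    # Larger scales push harder toward the condition; too large a scale over-amplifies
    # it and produces the exaggerated expressions noted in the ablation.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```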
Implementation Considerations
- Computational Requirements: Training is conducted on 64 A100 80GB GPUs, reflecting the high resource demands of large-scale video diffusion models.
- Data Filtering: High-quality training data is curated using SyncNet and Q-Align to ensure accurate lip-sync and visual fidelity (see the filtering sketch after this list).
- Inference Efficiency: The diffusion-based approach requires multiple denoising steps (e.g., 25 steps per video), which, while tractable for offline generation, may limit real-time applications.
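As a concrete picture of the filtering step, a threshold-based pass over candidate clips might look like the following. The scorer interfaces, attribute names, and threshold values are hypothetical; the paper only names SyncNet and Q-Align as the filtering tools.

```python
# Hypothetical data-filtering pass: keep a clip only if its SyncNet lip-sync confidence
# and its Q-Align quality score clear chosen thresholds (placeholder values).
def filter_clips(clips, sync_scorer, quality_scorer, sync_thresh=3.0, iqa_thresh=3.5):
    kept = []
    for clip in clips:
        sync_conf = sync_scorer(clip.video, clip.audio)  # SyncNet-style confidence
        iqa_score = quality_scorer(clip.video)           # Q-Align-style quality score
        if sync_conf >= sync_thresh and iqa_score >= iqa_thresh:
            kept.append(clip)
    return kept
```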
Limitations
- Inherited Model Weaknesses: Issues such as color shifts and error accumulation in long videos persist, inherited from the Wan base model.
- Multi-Character and Fine-Grained Control: Handling multi-character interactions and complex conversational scenarios remains challenging.
- Inference Latency: The diffusion process is computationally intensive, impeding real-time deployment.
Implications and Future Directions
OmniAvatar advances the state of audio-driven avatar video generation by unifying high-fidelity synthesis, prompt-based control, and efficient adaptation. Practically, this enables applications in virtual assistants, digital content creation, remote communication, and entertainment, where lifelike, controllable avatars are essential.
Theoretically, the work demonstrates the efficacy of direct pixel-wise audio embedding and LoRA-based adaptation for multimodal generative tasks. Future research may focus on:
- Reducing inference latency via distillation or hybrid architectures.
- Extending prompt control to multi-character and interactive scenarios.
- Addressing error propagation and visual artifacts in long-duration synthesis.
- Exploring more efficient or lightweight foundation models for broader accessibility.
OmniAvatar establishes a robust foundation for controllable, high-quality, audio-driven avatar video generation, with clear pathways for further technical and practical advancements.