CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention (2409.01876v3)
Abstract: Diffusion-based video generation technology has advanced significantly, catalyzing a proliferation of research in human animation. However, the majority of these studies are confined to same-modality driving settings, with cross-modality human body animation remaining relatively underexplored. In this paper, we introduce CyberHost, an end-to-end audio-driven human animation framework that ensures hand integrity, identity consistency, and natural motion. The key design of CyberHost is the Region Codebook Attention mechanism, which improves the generation quality of facial and hand animations by integrating fine-grained local features with learned motion pattern priors. Furthermore, we have developed a suite of human-prior-guided training strategies, including body movement map, hand clarity score, pose-aligned reference feature, and local enhancement supervision, to improve synthesis results. To our knowledge, CyberHost is the first end-to-end audio-driven human diffusion model capable of facilitating zero-shot video generation within the scope of the human body. Extensive experiments demonstrate that CyberHost surpasses previous works in both quantitative and qualitative aspects.
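The abstract does not spell out how Region Codebook Attention is wired, so the following is only a minimal PyTorch sketch of one plausible reading: a bank of learned latent codes (motion-pattern priors for a region such as the face or hands) cross-attends to fine-grained features cropped from that region, and the fused tokens are returned for injection into the denoising backbone. The class name `RegionCodebookAttention`, the shapes, and the residual fusion are my own assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn


class RegionCodebookAttention(nn.Module):
    """Hypothetical sketch of a region codebook attention block (assumed design)."""

    def __init__(self, dim: int = 320, num_codes: int = 64, num_heads: int = 8):
        super().__init__()
        # Learned motion-pattern priors for one body region, shared across identities.
        self.codebook = nn.Parameter(torch.randn(num_codes, dim) * 0.02)
        # Codes act as queries over the fine-grained local region features.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (B, N, dim) patch tokens from a cropped face/hand region.
        b = region_feats.shape[0]
        codes = self.codebook.unsqueeze(0).expand(b, -1, -1)
        # Priors (queries) attend to local features (keys/values) and are fused residually.
        fused, _ = self.attn(codes, region_feats, region_feats)
        return self.norm(fused + codes)  # (B, num_codes, dim)


if __name__ == "__main__":
    block = RegionCodebookAttention()
    hand_tokens = torch.randn(2, 196, 320)  # e.g. 14x14 patch tokens of a hand crop
    print(block(hand_tokens).shape)  # torch.Size([2, 64, 320])
```

In this reading, the codebook supplies identity-agnostic priors for hard-to-synthesize regions, while the cross-attention grounds those priors in the specific local appearance of the current frame; how the fused tokens are injected back into the diffusion UNet is left unspecified here.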