GAIA: Zero-shot Talking Avatar Generation (2311.15230v2)
Abstract: Zero-shot talking avatar generation aims to synthesize natural talking videos from speech and a single portrait image. Previous methods have relied on domain-specific heuristics such as warping-based motion representations and 3D Morphable Models, which limit the naturalness and diversity of the generated avatars. In this work, we introduce GAIA (Generative AI for Avatar), which eliminates the domain priors in talking avatar generation. Observing that speech drives only the avatar's motion, while the avatar's appearance and the background typically remain constant throughout the video, we divide our approach into two stages: 1) disentangling each frame into motion and appearance representations; 2) generating motion sequences conditioned on the speech and the reference portrait image. We collect a large-scale, high-quality talking avatar dataset and train models of different scales on it (up to 2B parameters). Experimental results verify the superiority, scalability, and flexibility of GAIA: 1) the resulting model surpasses previous baselines in naturalness, diversity, lip-sync quality, and visual quality; 2) the framework is scalable, since larger models yield better results; 3) it is general and enables applications such as controllable talking avatar generation and text-instructed avatar generation.
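The abstract's two-stage design can be made concrete with a short sketch: stage 1 splits each frame into a motion code and an appearance code, and stage 2 predicts a motion-code sequence from speech conditioned on the reference portrait's appearance code. This is a minimal illustration under stated assumptions, not the paper's implementation: GAIA itself trains a variational autoencoder for stage 1 and a diffusion model for stage 2, whereas here a plain linear autoencoder and a small transformer stand in, and every module name, tensor size, and hyperparameter below is an illustrative assumption.

```python
import torch
import torch.nn as nn

class DisentanglingAutoencoder(nn.Module):
    """Stage 1 (sketch): encode a frame into a motion code and a reference
    frame into an appearance code, then reconstruct from the pair."""
    def __init__(self, frame_dim=3 * 64 * 64, motion_dim=64, app_dim=256):
        super().__init__()
        self.motion_enc = nn.Sequential(nn.Flatten(), nn.Linear(frame_dim, motion_dim))
        self.app_enc = nn.Sequential(nn.Flatten(), nn.Linear(frame_dim, app_dim))
        self.dec = nn.Linear(motion_dim + app_dim, frame_dim)

    def forward(self, frame, ref_frame):
        # Motion comes from the current frame; appearance comes from a
        # reference frame of the same identity, which pushes identity and
        # background information out of the motion code.
        m = self.motion_enc(frame)
        a = self.app_enc(ref_frame)
        recon = self.dec(torch.cat([m, a], dim=-1))
        return recon.view_as(frame), m, a

class SpeechToMotion(nn.Module):
    """Stage 2 (sketch): map speech features to a motion-code sequence,
    conditioned on the appearance code. A plain transformer encoder stands
    in for the paper's diffusion-based motion generator."""
    def __init__(self, speech_dim=80, motion_dim=64, app_dim=256, width=128):
        super().__init__()
        self.in_proj = nn.Linear(speech_dim + app_dim, width)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=width, nhead=4, batch_first=True),
            num_layers=2)
        self.out_proj = nn.Linear(width, motion_dim)

    def forward(self, speech_feats, app_code):
        # Broadcast the appearance code over time and fuse it with speech.
        T = speech_feats.shape[1]
        cond = app_code.unsqueeze(1).expand(-1, T, -1)
        h = self.in_proj(torch.cat([speech_feats, cond], dim=-1))
        return self.out_proj(self.backbone(h))

# Toy usage: one 64x64 reference portrait plus 50 frames of speech features.
ae, s2m = DisentanglingAutoencoder(), SpeechToMotion()
ref = torch.randn(1, 3, 64, 64)
speech = torch.randn(1, 50, 80)
_, _, app = ae(ref, ref)                  # appearance code from the portrait
motion_seq = s2m(speech, app)             # (1, 50, 64) motion codes
frames = ae.dec(torch.cat(
    [motion_seq, app.unsqueeze(1).expand(-1, 50, -1)], dim=-1))
print(frames.shape)                       # (1, 50, 12288) flattened frames
```

The property the sketch mirrors is the one the abstract motivates: the appearance code is extracted once from the reference portrait and reused for every generated frame, so speech influences only the motion codes.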
Authors: Tianyu He, Junliang Guo, Runyi Yu, Yuchi Wang, Jialiang Zhu, Kaikai An, Leyi Li, Xu Tan, Chunyu Wang, Han Hu, HsiangTao Wu, Sheng Zhao, Jiang Bian