DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation (2312.13578v1)
Abstract: Generating an emotional talking face from a single portrait image remains a significant challenge. Achieving expressive emotional talking and accurate lip-sync at the same time is particularly difficult, because expressiveness is often sacrificed for lip-sync accuracy. The LSTM networks widely adopted in prior work often fail to capture the subtleties and variations of emotional expressions. To address these challenges, we introduce DREAM-Talk, a two-stage diffusion-based audio-driven framework tailored to generating diverse expressions and accurate lip-sync concurrently. In the first stage, we propose EmoDiff, a novel diffusion module that generates diverse, highly dynamic emotional expressions and head poses conditioned on the audio and a referenced emotion style. Given the strong correlation between lip motion and audio, we then refine the dynamics for improved lip-sync accuracy using audio features and the emotion style. Finally, a video-to-video rendering module transfers the expressions and lip motions from our proxy 3D avatar to an arbitrary portrait. Both quantitatively and qualitatively, DREAM-Talk outperforms state-of-the-art methods in expressiveness, lip-sync accuracy, and perceptual quality.
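
To make the first stage concrete, the sketch below shows one way an EmoDiff-style module could be realized: a DDPM-style reverse process that denoises a sequence of expression and head-pose parameters conditioned on per-frame audio features and an emotion-style embedding. This is a minimal illustration, not the authors' implementation; the network architecture, feature dimensions (`motion_dim`, `audio_dim`, `emo_dim`), and the linear noise schedule are all assumptions made for clarity.

```python
# Minimal sketch (not the authors' code) of a conditional diffusion sampler that
# denoises a sequence of expression + head-pose parameters, conditioned on audio
# features and an emotion-style embedding. Dimensions and schedule are assumed.
import torch
import torch.nn as nn


class EmoDenoiser(nn.Module):
    """Predicts the noise added to a motion sequence, given audio and emotion cues."""

    def __init__(self, motion_dim=70, audio_dim=128, emo_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + audio_dim + emo_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, x_t, t, audio, emo):
        # x_t: (B, T, motion_dim) noisy motion; audio: (B, T, audio_dim); emo: (B, emo_dim)
        t_feat = t.float().view(-1, 1, 1).expand(-1, x_t.shape[1], 1)
        emo_feat = emo.unsqueeze(1).expand(-1, x_t.shape[1], -1)
        return self.net(torch.cat([x_t, audio, emo_feat, t_feat], dim=-1))


@torch.no_grad()
def sample_motion(model, audio, emo, steps=50, motion_dim=70):
    """DDPM-style reverse process: start from Gaussian noise and iteratively denoise."""
    B, T = audio.shape[:2]
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(B, T, motion_dim)
    for i in reversed(range(steps)):
        t = torch.full((B,), i, dtype=torch.long)
        eps = model(x, t, audio, emo)
        # Posterior mean of x_{t-1} given the predicted noise.
        coef = betas[i] / torch.sqrt(1.0 - alpha_bar[i])
        mean = (x - coef * eps) / torch.sqrt(alphas[i])
        noise = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[i]) * noise
    return x  # (B, T, motion_dim) expression + pose sequence


# Usage example with placeholder tensors: one clip of 100 frames.
model = EmoDenoiser()
audio_feats = torch.randn(1, 100, 128)   # e.g. per-frame audio embeddings
emotion_style = torch.randn(1, 64)       # e.g. embedding of a reference emotion clip
motion = sample_motion(model, audio_feats, emotion_style)
```

The sampled motion sequence would then be refined for lip-sync accuracy and rendered onto the target portrait, as described in the abstract.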