
Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style (2403.06365v2)

Published 11 Mar 2024 in cs.CV

Abstract: Although automatically animating audio-driven talking heads has recently received growing interest, previous efforts have mainly concentrated on achieving lip synchronization with the audio, neglecting two crucial elements for generating expressive videos: emotion style and art style. In this paper, we present an innovative audio-driven talking face generation method called Style2Talker. It involves two stylized stages, namely Style-E and Style-A, which integrate text-controlled emotion style and picture-controlled art style into the final output. In order to prepare the scarce emotional text descriptions corresponding to the videos, we propose a labor-free paradigm that employs large-scale pretrained models to automatically annotate emotional text labels for existing audiovisual datasets. Incorporating the synthetic emotion texts, the Style-E stage utilizes a large-scale CLIP model to extract emotion representations, which are combined with the audio, serving as the condition for an efficient latent diffusion model designed to produce emotional motion coefficients of a 3DMM model. Moving on to the Style-A stage, we develop a coefficient-driven motion generator and an art-specific style path embedded in the well-known StyleGAN. This allows us to synthesize high-resolution artistically stylized talking head videos using the generated emotional motion coefficients and an art style source picture. Moreover, to better preserve image details and avoid artifacts, we provide StyleGAN with the multi-scale content features extracted from the identity image and refine its intermediate feature maps by the designed content encoder and refinement network, respectively. Extensive experimental results demonstrate our method outperforms existing state-of-the-art methods in terms of audio-lip synchronization and performance of both emotion style and art style.
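The abstract describes a two-stage pipeline: Style-E conditions a latent diffusion model on CLIP-encoded emotion text plus audio features to produce 3DMM motion coefficients, and Style-A renders those coefficients into stylized video via a StyleGAN-based generator. A minimal sketch of that data flow is below; all dimensions, function names, and the stubbed encoder/renderer are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): CLIP text embedding,
# per-frame audio feature, and 3DMM expression-coefficient sizes.
EMO_DIM, AUDIO_DIM, COEFF_DIM = 512, 128, 64

def encode_emotion_text(text: str) -> np.ndarray:
    """Stand-in for the CLIP text encoder used in the Style-E stage."""
    seed = abs(hash(text)) % (2**32)
    return np.random.default_rng(seed).standard_normal(EMO_DIM)

def style_e(audio_feats: np.ndarray, emotion_text: str) -> np.ndarray:
    """Style-E sketch: fuse the emotion embedding with per-frame audio
    features and map the joint condition to 3DMM motion coefficients.
    (A real latent diffusion model would run a denoising loop here.)"""
    emo = encode_emotion_text(emotion_text)
    n_frames = audio_feats.shape[0]
    cond = np.concatenate([np.tile(emo, (n_frames, 1)), audio_feats], axis=1)
    proj = rng.standard_normal((cond.shape[1], COEFF_DIM)) * 0.01
    return cond @ proj

def style_a(coeffs: np.ndarray, identity_img, art_style_img) -> np.ndarray:
    """Style-A sketch: a coefficient-driven renderer standing in for the
    StyleGAN generator with its art-style path and content encoder."""
    n_frames = coeffs.shape[0]
    return np.zeros((n_frames, 256, 256, 3))  # placeholder video frames

audio = rng.standard_normal((25, AUDIO_DIM))  # ~1 s of audio features
coeffs = style_e(audio, "a joyful, excited expression")
video = style_a(coeffs, identity_img=None, art_style_img=None)
print(coeffs.shape, video.shape)  # (25, 64) (25, 256, 256, 3)
```

The key structural point the sketch captures is that emotion style enters only through the Style-E condition, while art style and identity detail enter only at the Style-A rendering stage, so the motion coefficients act as the interface between the two.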

Authors (3)
  1. Shuai Tan (14 papers)
  2. Bin Ji (28 papers)
  3. Ye Pan (15 papers)
Citations (13)

