Say Anything with Any Style (2403.06363v2)

Published 11 Mar 2024 in cs.CV

Abstract: Generating stylized talking heads with diverse head motions is crucial for achieving natural-looking videos, but it remains challenging. Previous works either adopt a regressive method to capture the speaking style, resulting in a coarse style that is averaged across all training data, or employ a universal network to synthesize videos with different styles, which causes suboptimal performance. To address these issues, we propose a novel dynamic-weight method, namely Say Anything with Any Style (SAAS), which queries the discrete style representation via a generative model with a learned style codebook. Specifically, we develop a multi-task VQ-VAE that incorporates three closely related tasks to learn a style codebook as a prior for style extraction. This discrete prior, together with the generative model, enhances the precision and robustness of extracting the speaking style from a given style clip. Using the extracted style, a residual architecture comprising a canonical branch and a style-specific branch predicts mouth shapes conditioned on any driving audio while transferring the speaking style from the source to any desired one. To adapt to different speaking styles, we avoid a universal network and instead design an elaborate HyperStyle module that produces style-specific weight offsets for the style branch. Furthermore, we construct a pose generator and a pose codebook to store quantized pose representations, allowing us to sample diverse head motions aligned with the audio and the extracted style. Experiments demonstrate that our approach surpasses state-of-the-art methods in terms of both lip synchronization and stylized expression. In addition, we extend SAAS to video-driven style editing and achieve satisfactory performance.
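
The learned style codebook at the core of SAAS follows the standard vector-quantization step of a VQ-VAE: continuous style features are snapped to their nearest codebook entry, yielding a discrete style prior. The PyTorch sketch below illustrates only that quantization step; the codebook size, feature dimension, and class name are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class StyleCodebook(nn.Module):
    """Minimal VQ-VAE-style quantizer (sizes are assumptions for illustration)."""

    def __init__(self, num_codes: int = 256, dim: int = 128):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)
        self.codes.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z: torch.Tensor):
        # z: (batch, dim) continuous style features from a style encoder.
        dists = torch.cdist(z, self.codes.weight)   # (batch, num_codes)
        idx = dists.argmin(dim=1)                   # discrete style indices
        z_q = self.codes(idx)                       # quantized style vectors
        # Straight-through estimator so gradients flow back to the encoder;
        # training would also add the usual codebook and commitment losses.
        z_q = z + (z_q - z).detach()
        return z_q, idx
```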
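
The abstract's dynamic-weight idea can be sketched the same way: instead of one universal network, a small hypernetwork (in the spirit of the paper's HyperStyle) maps the extracted style code to a weight offset for the style-specific branch, whose output is added to a shared canonical branch. All shapes, layer sizes, and names below are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class StyleHyperBranch(nn.Module):
    """Residual two-branch sketch: canonical branch plus hyper-offset style branch."""

    def __init__(self, feat_dim: int = 128, style_dim: int = 128):
        super().__init__()
        self.canonical = nn.Linear(feat_dim, feat_dim)       # shared across styles
        self.base_weight = nn.Parameter(torch.zeros(feat_dim, feat_dim))
        # Hypernetwork: style code -> flattened weight offset for the style branch.
        self.hyper = nn.Sequential(
            nn.Linear(style_dim, 256),
            nn.ReLU(),
            nn.Linear(256, feat_dim * feat_dim),
        )

    def forward(self, audio_feat: torch.Tensor, style_code: torch.Tensor):
        # audio_feat: (batch, feat_dim); style_code: (batch, style_dim)
        d = audio_feat.size(1)
        offset = self.hyper(style_code).view(-1, d, d)       # per-sample offsets
        w = self.base_weight + offset                        # style-specific weights
        style_out = torch.bmm(audio_feat.unsqueeze(1), w.transpose(1, 2)).squeeze(1)
        return self.canonical(audio_feat) + style_out        # residual combination
```

Here one set of shared weights serves every style while the hypernetwork specializes the style branch per clip, which is how a dynamic-weight design sidesteps the averaging effect of a single universal network.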
