EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars

Published 29 Apr 2024 in cs.CV (arXiv:2404.19110v1)

Abstract: Head avatars animated by visual signals have gained popularity, particularly in cross-driving synthesis, where the driver differs from the animated character, a challenging but highly practical setting. The recently presented MegaPortraits model has demonstrated state-of-the-art results in this domain. We conduct a deep examination and evaluation of this model, with a particular focus on its latent space for facial expression descriptors, and uncover several limitations in its ability to express intense face motions. To address these limitations, we propose substantial changes to both the training pipeline and the model architecture and introduce our EMOPortraits model, in which we:

  1. Enhance the model's capability to faithfully support intense, asymmetric facial expressions, setting a new state-of-the-art result in the emotion transfer task and surpassing previous methods in both metrics and quality.
  2. Incorporate a speech-driven mode into the model, achieving top-tier performance in audio-driven facial animation and making it possible to drive the source identity through diverse modalities: a visual signal, audio, or a blend of both.
  3. Propose a novel multi-view video dataset featuring a wide range of intense and asymmetric facial expressions, filling the gap left by the absence of such data in existing datasets.
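The abstract states that the source identity can be driven by a visual signal, by audio, or by a blend of both. The sketch below is a minimal illustration of what such a multimodal driving interface could look like, not the authors' architecture or released API: every module name, latent size, and tensor shape here (OneShotAvatarDriver, dim_expr, the mel-spectrogram window) is an assumption made purely for exposition.

```python
# Hypothetical sketch of multimodal (visual / audio / blended) avatar driving.
# All module names and dimensions are illustrative assumptions, not EMOPortraits code.
import torch
import torch.nn as nn


class OneShotAvatarDriver(nn.Module):
    """Drives a single source image with visual, audio, or blended expression latents."""

    def __init__(self, dim_expr: int = 128):
        super().__init__()
        # Stand-ins for the identity/appearance encoder, the two expression
        # encoders (one per driving modality), and the image generator.
        self.identity_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(512))
        self.visual_expr_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim_expr))
        self.audio_expr_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim_expr))
        self.generator = nn.LazyLinear(3 * 64 * 64)  # placeholder for a real decoder

    def forward(self, source_img, driver_img=None, driver_audio=None, alpha=0.5):
        id_code = self.identity_enc(source_img)

        # The expression latent can come from either modality, or a convex blend of both.
        expr = None
        if driver_img is not None:
            expr = self.visual_expr_enc(driver_img)
        if driver_audio is not None:
            audio_expr = self.audio_expr_enc(driver_audio)
            expr = audio_expr if expr is None else alpha * expr + (1 - alpha) * audio_expr
        if expr is None:
            raise ValueError("Provide a driving image, driving audio, or both.")

        out = self.generator(torch.cat([id_code, expr], dim=-1))
        return out.view(-1, 3, 64, 64)


if __name__ == "__main__":
    model = OneShotAvatarDriver()
    src = torch.randn(1, 3, 64, 64)   # one-shot source identity frame
    drv = torch.randn(1, 3, 64, 64)   # cross-driving video frame (different person)
    aud = torch.randn(1, 80, 16)      # e.g. a short mel-spectrogram window
    frame = model(src, driver_img=drv, driver_audio=aud, alpha=0.7)
    print(frame.shape)  # torch.Size([1, 3, 64, 64])
```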

References (49)
  1. Hyperreenact: One-shot reenactment via jointly learning to refine and retarget faces, 2023.
  2. Neural head reenactment with latent pose descriptors. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020.
  3. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE transactions on affective computing, 5(4):377–390, 2014.
  4. Vggface2: A dataset for recognising faces across pose and age, 2018.
  5. Out of time: automated lip sync in the wild. In Computer Vision–ACCV 2016 Workshops: ACCV 2016 International Workshops, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part II 13, pages 251–263. Springer, 2017.
  6. Voxceleb2: Deep speaker recognition. In INTERSPEECH, 2018.
  7. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  8. Headgan: One-shot neural head synthesis and editing, 2021.
  9. Megaportraits: One-shot megapixel neural head avatars, 2023.
  10. Rt-gene: Real-time eye gaze estimation in natural environments. In ECCV, 2018.
  11. Dynamic neural radiance fields for monocular 4d facial avatar reconstruction, 2020.
  12. Neural head avatars from monocular rgb videos, 2022.
  13. Improved training of wasserstein gans, 2017.
  14. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, 2017.
  15. Image-to-image translation with conditional adversarial networks, 2018.
  16. Surrey audio-visual expressed emotion (savee) database. University of Surrey: Guildford, UK, 2014.
  17. Realistic one-shot mesh-based head avatars, 2022.
  18. Understanding collapse in non-contrastive siamese representation learning, 2022.
  19. Generalizable one-shot neural head avatar, 2023.
  20. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english. PloS one, 13(5):e0196391, 2018.
  21. Decoupled weight decay regularization. In ICLR, 2019.
  22. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In 2010 ieee computer society conference on computer vision and pattern recognition-workshops, pages 94–101. IEEE, 2010.
  23. Web-based database for facial expression analysis. In 2005 IEEE International Conference on Multimedia and Expo, 5 pp. IEEE, 2005.
  24. Deep face recognition. In BMVC, 2015.
  25. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM international conference on multimedia, pages 484–492, 2020.
  26. Robust speech recognition via large-scale weak supervision, 2022.
  27. First order motion model for image animation. ArXiv, abs/2003.00196, 2019.
  28. Unsupervised volumetric animation. arXiv preprint arXiv:2301.11326, 2023.
  29. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2015.
  30. Diffused heads: Diffusion models beat gans on talking-face generation, 2023.
  31. Are 3d face shapes expressive enough for recognising continuous emotions and action unit intensities?, 2023.
  32. Cosface: Large margin cosine loss for deep face recognition, 2018.
  33. Mead: A large-scale audio-visual dataset for emotional talking-face generation. In European Conference on Computer Vision, pages 700–717. Springer, 2020.
  34. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  35. One-shot free-view neural talking-head synthesis for video conferencing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
  36. AvatarMAV: Fast 3d head avatar reconstruction using motion-aware neural voxels. In ACM SIGGRAPH 2023 Conference Proceedings. ACM, 2023.
  37. Styleheat: One-shot high-resolution editable talking face generation via pre-trained stylegan, 2022.
  38. Bisenet: Bilateral segmentation network for real-time semantic segmentation, 2018.
  39. Nofa: Nerf-based one-shot facial avatar reconstruction, 2023.
  40. Few-shot adversarial learning of realistic neural talking head models, 2019.
  41. Fast bi-layer neural synthesis of one-shot realistic head avatars, 2020.
  42. Metaportrait: Identity-preserving talking head generation with fast personalized adaptation, 2023.
  43. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation, 2023.
  44. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3661–3670, 2021.
  45. I m avatar: Implicit morphable head avatars from videos, 2022.
  46. Pose-controllable talking face generation by implicitly modularized audio-visual representation, 2021.
  47. On the continuity of rotation representations in neural networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5738–5746, 2019.
  48. MakeItTalk: Speaker-aware talking-head animation. ACM Transactions on Graphics, 39(6):1–15, 2020.
  49. Unpaired image-to-image translation using cycle-consistent adversarial networks. 2017 IEEE International Conference on Computer Vision (ICCV), pages 2242–2251, 2017.