Dubbing for Everyone: Data-Efficient Visual Dubbing using Neural Rendering Priors (2401.06126v1)

Published 11 Jan 2024 in cs.CV and cs.GR

Abstract: Visual dubbing is the process of generating lip motions of an actor in a video to synchronise with given audio. Recent advances have made progress towards this goal but have not produced an approach suitable for mass adoption. Existing methods are split into person-generic and person-specific models. Person-specific models produce results almost indistinguishable from reality, but rely on long training times and large single-person datasets. Person-generic works allow visual dubbing of any video to any audio without further training, but they fail to capture person-specific nuances and often suffer from visual artefacts. Our method, based on data-efficient neural rendering priors, overcomes the limitations of existing approaches. Our pipeline consists of learning a deferred neural rendering prior network and actor-specific adaptation using neural textures. This allows for high-quality visual dubbing with just a few seconds of data, enabling video dubbing for any actor, from A-list celebrities to background actors. We show that we achieve state-of-the-art results in visual quality and recognisability, both quantitatively and qualitatively through two user studies. Our prior learning and adaptation method generalises better to limited data and is more scalable than existing person-specific models. Our experiments on real-world, limited-data scenarios find that our model is preferred over all others. The project page may be found at https://dubbingforeveryone.github.io/
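The two-stage pipeline described in the abstract (a shared deferred neural rendering prior plus per-actor adaptation via neural textures) can be pictured with a minimal PyTorch-style sketch. Everything below, including the `DeferredRenderer` module, the `sample_texture` helper, the tensor shapes, and the choice to freeze the prior network during adaptation, is a hypothetical illustration under stated assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of prior learning + actor-specific
# adaptation with neural textures. Shapes, names, and the photometric-only
# loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeferredRenderer(nn.Module):
    """Shared prior network: maps a sampled neural texture to an RGB frame."""
    def __init__(self, tex_channels=16):
        super().__init__()
        # Stand-in for the U-Net-style renderer typically used in deferred
        # neural rendering.
        self.net = nn.Sequential(
            nn.Conv2d(tex_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, sampled_texture):
        return self.net(sampled_texture)

def sample_texture(texture, uv):
    """Sample a per-actor neural texture at UV coordinates rasterised from a
    tracked 3D face model (the UVs are assumed to be given here)."""
    # texture: (1, C, H_tex, W_tex); uv: (1, H_img, W_img, 2) in [-1, 1]
    return F.grid_sample(texture, uv, align_corners=True)

# Stage 1 (prior): train the renderer jointly with many actors' neural
# textures on a multi-person corpus (omitted here).
renderer = DeferredRenderer()

# Stage 2 (adaptation, assumption): freeze the shared prior and fit only a
# new actor's texture from a few seconds of video.
for p in renderer.parameters():
    p.requires_grad_(False)

new_actor_texture = nn.Parameter(torch.randn(1, 16, 256, 256) * 0.01)
opt = torch.optim.Adam([new_actor_texture], lr=1e-3)

def adaptation_step(uv, target_frame):
    """One optimisation step on a single (UV map, ground-truth frame) pair."""
    opt.zero_grad()
    pred = renderer(sample_texture(new_actor_texture, uv))
    loss = F.l1_loss(pred, target_frame)  # photometric loss only, for brevity
    loss.backward()
    opt.step()
    return loss.item()
```

In this sketch the few-second adaptation only optimises the small per-actor texture, which is one plausible reading of "actor-specific adaptation using neural textures"; whether the paper also fine-tunes parts of the renderer is not specified in the abstract.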
