DiT-Head: High-Resolution Talking Head Synthesis using Diffusion Transformers (2312.06400v1)
Abstract: We propose a novel talking head synthesis pipeline, "DiT-Head", based on diffusion transformers, in which audio conditions the denoising process of a diffusion model. Our method is scalable and generalises to multiple identities while producing high-quality results. We train and evaluate our approach, compare it against existing talking head synthesis methods, and show that it is competitive in both visual quality and lip-sync accuracy. These results highlight the potential of DiT-Head for a wide range of applications, including virtual assistants, entertainment, and education. For a video demonstration of the results and our user study, please refer to our supplementary material.
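The abstract describes driving the denoising process of a diffusion transformer with audio. As a minimal sketch of what such conditioning can look like, the block below lets noisy latent patch tokens attend to audio feature tokens (e.g. wav2vec-style embeddings) via cross-attention. The module name `AudioConditionedDiTBlock`, all dimensions, and the choice of cross-attention are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of one audio-conditioned DiT block; not the authors' code.
import torch
import torch.nn as nn

class AudioConditionedDiTBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Cross-attention: latent tokens query the audio tokens, so the
        # audio signal steers each denoising step (assumed conditioning path).
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # x:     (B, N, dim) noisy latent patch tokens
        # audio: (B, M, dim) audio conditioning tokens, pre-projected to dim
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, audio, audio, need_weights=False)[0]
        x = x + self.mlp(self.norm3(x))
        return x

if __name__ == "__main__":
    block = AudioConditionedDiTBlock()
    latents = torch.randn(2, 64, 512)   # e.g. an 8x8 grid of latent patches
    audio = torch.randn(2, 16, 512)     # projected audio embeddings
    print(block(latents, audio).shape)  # torch.Size([2, 64, 512])
```

Cross-attention is only one option from the DiT design space; conditioning could equally be injected through adaptive layer norm, and which mechanism DiT-Head actually uses is not stated in the abstract.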