4D Facial Expression Diffusion Model (2303.16611v2)
Abstract: Facial expression generation is one of the most challenging and long-sought aspects of character animation, with many interesting applications. The task, which has traditionally relied heavily on digital craftspersons, remains largely unexplored. In this paper, we introduce a generative framework for generating 3D facial expression sequences (i.e., 4D faces) that can be conditioned on different inputs to animate an arbitrary 3D face mesh. It is composed of two tasks: (1) learning a generative model over a set of 3D landmark sequences, and (2) generating 3D mesh sequences of an input facial mesh driven by the generated landmark sequences. The generative model is based on a Denoising Diffusion Probabilistic Model (DDPM), which has achieved remarkable success in generative tasks in other domains. While it can be trained unconditionally, its reverse process can still be conditioned by various signals. This allows us to efficiently develop several downstream tasks involving various forms of conditional generation, using expression labels, text, partial sequences, or simply a facial geometry. To obtain the full mesh deformation, we then develop a landmark-guided encoder-decoder that applies the geometric deformation embedded in the landmarks to a given facial mesh. Experiments show that our model learns to generate realistic, high-quality expressions solely from a dataset of relatively small size, improving over state-of-the-art methods. Videos and qualitative comparisons with other methods can be found at \url{https://github.com/ZOUKaifeng/4DFM}.
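The abstract's core mechanism — a DDPM whose forward process noises clean landmark sequences and whose learned reverse process denoises them — can be sketched in a few lines. The sketch below is a minimal, generic DDPM step in NumPy; the shapes (30 frames of 68 3D landmarks), the linear noise schedule, and the use of the true noise in place of a trained denoiser are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
# Minimal DDPM forward/reverse step sketch (NumPy only).
# The denoiser is stubbed with the true noise; a real model would
# predict eps from (xt, t) and any conditioning signal.
import numpy as np

T = 100                                   # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample(x0, t, rng):
    """Forward process: noise clean landmarks x0 to step t in closed form."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

def p_sample(xt, t, eps_pred, rng):
    """One reverse (denoising) step given a noise prediction eps_pred."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (xt - coef * eps_pred) / np.sqrt(alphas[t])
    if t > 0:  # no noise is added at the final step
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)
    return mean

rng = np.random.default_rng(0)
x0 = rng.standard_normal((30, 68, 3))     # toy sequence: 30 frames x 68 3D landmarks
xt, eps = q_sample(x0, T - 1, rng)
x_prev = p_sample(xt, T - 1, eps, rng)    # one step of the reverse chain
```

Conditioning (labels, text, partial sequences, or geometry, as listed in the abstract) would enter only through the noise predictor's inputs, which is why a single unconditionally trained model can serve several downstream tasks.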
- Juan Miguel Lopez Alcaraz and Nils Strodthoff. 2022. Diffusion-based time series imputation and forecasting with structured state space models. arXiv preprint arXiv:2208.09399 (2022).
- Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems 34 (2021), 17981–17993.
- CVAE-GAN: fine-grained image generation through asymmetric training. In Proceedings of the IEEE international conference on computer vision. 2745–2754.
- Label-efficient semantic segmentation with diffusion models. arXiv preprint arXiv:2112.03126 (2021).
- High-Quality Passive Facial Performance Capture using Anchor Frames. ACM Trans. Graph. 30 (07 2011), 75. https://doi.org/10.1145/2010324.1964970
- Reanimating Faces in Images and Video. Computer Graphics Forum 22, 3 (2003), 641–650. https://doi.org/10.1111/1467-8659.t01-1-00712
- Neural 3d morphable models: Spiral convolutional networks for 3d shape representation learning and generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7213–7222.
- Hamza Bouzid and Lahoucine Ballihi. 2022. Facial Expression Video Generation Based-On Spatio-temporal Convolutional GAN: FEV-GAN. Intelligent Systems with Applications (2022), 200139.
- Denoising Pretraining for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4175–4186.
- Dan Casas and Miguel A Otaduy. 2018. Learning nonlinear soft-tissue dynamics for interactive avatars. Proceedings of the ACM on Computer Graphics and Interactive Techniques 1, 1 (2018), 1–15.
- Analog bits: Generating discrete data using diffusion models with self-conditioning. arXiv preprint arXiv:2208.04202 (2022).
- 4dfab: A large scale 4d database for facial expression analysis and biometric applications. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5117–5126.
- On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In SSST@EMNLP.
- Editing in Style: Uncovering the Local Semantics of GANs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- A FACS valid 3D dynamic action unit database with applications to 3D dynamic morphable facial modeling. In 2011 International Conference on Computer Vision. 2296–2303. https://doi.org/10.1109/ICCV.2011.6126510
- D. DeCarlo and D. Metaxas. 1996. The Integration of Optical Flow and Deformable Models with Applications to Human Face Shape and Motion Estimation. In Proceedings CVPR '96, 1996 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, Los Alamitos, CA, USA, 231. https://doi.org/10.1109/CVPR.1996.517079
- Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
- Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
- Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems 34 (2021), 8780–8794.
- Continuous diffusion for categorical data. arXiv preprint arXiv:2211.15089 (2022).
- Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516 (2014).
- Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research 12, 7 (2011).
- A 3-d audio-visual corpus of affective communication. IEEE Transactions on Multimedia 12, 6 (2010), 591–598.
- Joint 3d face reconstruction and dense alignment with position map regression network. In Proceedings of the European conference on computer vision (ECCV). 534–551.
- A dictionary learning-based 3D morphable shape model. IEEE Transactions on Multimedia 19, 12 (2017), 2666–2679.
- Diffuseq: Sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933 (2022).
- Generative adversarial nets. In Advances in neural information processing systems. 2672–2680.
- Diffusion models as plug-and-play priors. arXiv preprint arXiv:2206.09012 (2022).
- Action2motion: Conditioned generation of 3d human motions. In Proc. ACM Multimedia. 2021–2029.
- Towards fast, accurate and stable 3d dense face alignment. In European Conference on Computer Vision. Springer, 152–168.
- Flexible diffusion modeling of long videos. arXiv preprint arXiv:2205.11495 (2022).
- Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022).
- Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017).
- Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022).
- Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851.
- Cascaded Diffusion Models for High Fidelity Image Generation. J. Mach. Learn. Res. 23, 47 (2022), 1–33.
- Video diffusion models. arXiv preprint arXiv:2204.03458 (2022).
- Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (11 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735 arXiv:https://direct.mit.edu/neco/article-pdf/9/8/1735/813796/neco.1997.9.8.1735.pdf
- Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence 36, 7 (2013), 1325–1339.
- Disentangled representation learning for 3d face shape. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11957–11966.
- Deep Video Portraits. ACM Transactions on Graphics (TOG) 37, 4 (2018), 163.
- FLAME: free-form language-based motion synthesis & editing. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence (AAAI’23/IAAI’23/EAAI’23). AAAI Press, Article 927, 9 pages. https://doi.org/10.1609/aaai.v37i7.25996
- Semi-supervised learning with deep generative models. In Advances in neural information processing systems. 3581–3589.
- Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In International Conference on Learning Representations. http://arxiv.org/abs/1312.6114
- Diffwave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761 (2020).
- Danceformer: Music conditioned 3d dance generation with parametric motion transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 1272–1279.
- Srdiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing 479 (2022), 47–59.
- Ai choreographer: Music conditioned 3d dance generation with aist++. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 13401–13412.
- Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph. 36, 6 (2017), Article 194.
- Diffusion-LM Improves Controllable Text Generation. arXiv preprint arXiv:2205.14217 (2022).
- SMPL: A skinned multi-person linear model. ACM transactions on graphics (TOG) 34, 6 (2015), 1–16.
- RePaint: Inpainting using Denoising Diffusion Probabilistic Models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 11451–11461.
- Shitong Luo and Wei Hu. 2021. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2837–2845.
- A conditional point diffusion-refinement paradigm for 3d point cloud completion. arXiv preprint arXiv:2112.03530 (2021).
- AMASS: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF international conference on computer vision. 5442–5451.
- Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014).
- Plug & play generative networks: Conditional iterative generation of images in latent space. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4467–4477.
- Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning. PMLR, 8162–8171.
- Jun-Yong Noh and Douglas Fidaleo. 2000. Animated Deformations with Radial Basis Functions. In ACM Virtual Reality Software and Technology (VRST). 166–174.
- Dynamic facial expression generation on hilbert hypersphere with conditional wasserstein generative adversarial nets. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
- Sparse to Dense Dynamic 3D Facial Expression Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 20385–20394.
- Diffusevae: Efficient, controllable and high-fidelity generation from low-dimensional latents. arXiv preprint arXiv:2201.00308 (2022).
- Action-conditioned 3d human motion synthesis with transformer vae. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10985–10995.
- Resynthesizing facial animation through 3D model-based tracking. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Vol. 1. 143–150. https://doi.org/10.1109/ICCV.1999.791210
- The KIT motion-language dataset. Big data 4, 4 (2016), 236–252.
- Learning to Generate Customized Dynamic 3D Facial Expressions. In Computer Vision – ECCV 2020: 16th European Conference (Glasgow, United Kingdom). Springer-Verlag, 278–294. https://doi.org/10.1007/978-3-030-58526-6_17
- Diffusion autoencoders: Toward a meaningful and decodable representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10619–10629.
- Ganimation: Anatomically-aware facial animation from a single image. In Proc. European conference on computer vision. 818–833.
- BABEL: Bodies, action and behavior with english labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 722–731.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
- Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022).
- Generating 3D faces using convolutional mesh autoencoders. In Proceedings of the European conference on computer vision (ECCV). 704–720.
- Danilo Rezende and Shakir Mohamed. 2015. Variational inference with normalizing flows. In International conference on machine learning. PMLR, 1530–1538.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10684–10695.
- U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention. Springer, 234–241.
- Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings. 1–10.
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv preprint arXiv:2205.11487 (2022).
- Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
- Hyewon Seo and Guoliang Luo. 2021. Generating 3D Facial Expressions with Recurrent Neural Networks. In Intelligent Scene Modeling and Human-Computer Interaction. Springer International Publishing, 181–196. https://doi.org/10.1007/978-3-030-71002-6_11
- Interpreting the Latent Space of GANs for Semantic Face Editing. In CVPR.
- Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning. PMLR, 2256–2265.
- Learning structured output representation using deep conditional generative models. Advances in neural information processing systems 28 (2015), 3483–3491.
- Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020).
- Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020).
- MotionCLIP: Exposing Human Motion Generation to CLIP Space. arXiv preprint arXiv:2203.08063 (2022).
- Human Motion Diffusion Model. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=SJ1kSyO2jwu
- 3D face reconstruction from a single image assisted by 2D face images in the wild. IEEE Transactions on Multimedia 23 (2020), 1160–1172.
- MoCoGAN: Decomposing Motion and Content for Video Generation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1526–1535. https://doi.org/10.1109/CVPR.2018.00165
- Attention is all you need. In Advances in neural information processing systems. 5998–6008.
- Face Transfer with Multilinear Models. ACM Transactions on Graphics 24, 3 (2005). https://doi.org/10.1145/1185657.1185864
- EDICT: Exact Diffusion Inversion via Coupled Transformations. arXiv preprint arXiv:2211.12446 (2022).
- Every smile is unique: Landmark-guided diverse smile generation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 7083–7092.
- G3AN: Disentangling Appearance and Motion for Video Generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Imaginator: Conditional spatio-temporal gan for video generation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1160–1169.
- Diffusion-gan: Training gans with diffusion. arXiv preprint arXiv:2206.02262 (2022).
- Joint Deep Learning of Facial Expression Synthesis and Recognition. IEEE Transactions on Multimedia 22 (2020), 2792–2807. https://api.semanticscholar.org/CorpusID:211044056
- Diffusion probabilistic modeling for video generation. arXiv preprint arXiv:2203.09481 (2022).
- Facial Expression Retargeting From Human to Avatar Made Easy. IEEE Transactions on Visualization and Computer Graphics 28 (2020), 1274–1287. https://api.semanticscholar.org/CorpusID:221103938
- MotionDiffuse: Text-Driven Human Motion Generation With Diffusion Model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024), 1–15. https://doi.org/10.1109/TPAMI.2024.3355414
- A high-resolution spontaneous 3d dynamic facial expression database. In IEEE workshops on automatic face and gesture recognition. 1–6.
- 3d shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5826–5835.