3D Facial Expressions through Analysis-by-Neural-Synthesis (2404.04104v1)
Abstract: While existing methods for 3D face reconstruction from in-the-wild images excel at recovering the overall face shape, they commonly miss subtle, extreme, asymmetric, or rarely observed expressions. We improve upon these methods with SMIRK (Spatial Modeling for Image-based Reconstruction of Kinesics), which faithfully reconstructs expressive 3D faces from images. We identify two key limitations in existing methods: shortcomings in their self-supervised training formulation, and a lack of expression diversity in the training images. For training, most methods employ differentiable rendering to compare a predicted face mesh with the input image, along with a plethora of additional loss functions. This differentiable rendering loss not only has to provide supervision to optimize for 3D face geometry, camera, albedo, and lighting, which is an ill-posed optimization problem, but the domain gap between rendering and input image further hinders the learning process. Instead, SMIRK replaces the differentiable rendering with a neural rendering module that, given the rendered predicted mesh geometry and sparsely sampled pixels of the input image, generates a face image. As the neural rendering gets color information from sampled image pixels, supervising with a neural rendering-based reconstruction loss can focus solely on the geometry. Further, it enables us to generate images of the input identity with varying expressions while training. These are then utilized as input to the reconstruction model and used as supervision with ground truth geometry. This effectively augments the training data and enhances the generalization for diverse expressions. Our qualitative, quantitative and particularly our perceptual evaluations demonstrate that SMIRK achieves new state-of-the-art performance on accurate expression reconstruction. Project webpage: https://georgeretsi.github.io/smirk/.
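As a rough illustration of the neural rendering input described in the abstract, the sketch below stacks a rendered-geometry map with a sparse sample of input-image pixels. It is not taken from the SMIRK code: the function name `build_renderer_input`, the uniform random sampling, the `keep_fraction` value, and the 4-channel layout are all assumptions made for illustration, and the abstract does not specify how pixels are actually sampled.

```python
import numpy as np

def build_renderer_input(image, rendered_geometry, keep_fraction=0.01, seed=0):
    """Stack a rendered-geometry map with a sparse sample of input pixels.

    image:             (H, W, 3) float array, the input photo
    rendered_geometry: (H, W) float array, a shading of the predicted mesh
    keep_fraction:     hypothetical share of pixels whose color is kept
    """
    rng = np.random.default_rng(seed)
    h, w, _ = image.shape
    # Boolean mask selecting roughly keep_fraction of all pixel locations
    # (uniform sampling is an assumption, not the paper's stated scheme).
    keep = rng.random((h, w)) < keep_fraction
    # Keep color only at the sampled locations; everything else is zeroed.
    sparse_pixels = image * keep[..., None]
    # Channels: 3 sparse color channels followed by 1 geometry channel.
    stacked = np.concatenate(
        [sparse_pixels, rendered_geometry[..., None]], axis=-1
    )
    return stacked, keep

# Toy usage with random data standing in for a photo and a mesh render.
image = np.random.default_rng(1).random((64, 64, 3))
geometry = np.zeros((64, 64))
stacked, keep = build_renderer_input(image, geometry, keep_fraction=0.05)
print(stacked.shape, keep.sum())
```

Because color reaches the generator only through the few sampled pixels, a reconstruction loss on its output can only be reduced by getting the geometry channel right, which is the intuition the abstract gives for why supervision can focus on geometry.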