A Generative Framework for Self-Supervised Facial Representation Learning (2309.08273v4)
Abstract: Self-supervised representation learning has gained increasing attention for its strong generalization ability without relying on paired datasets. However, it has not been sufficiently explored for facial representation. Self-supervised facial representation learning remains unsolved due to the coupling of facial identities, expressions, and external factors such as pose and lighting. Prior methods primarily focus on contrastive learning and pixel-level consistency, leading to limited interpretability and suboptimal performance. In this paper, we propose LatentFace, a novel generative framework for self-supervised facial representations. We suggest that the disentangling problem can also be formulated as generative objectives in space and time, and propose a solution using a 3D-aware latent diffusion model. First, we introduce a 3D-aware autoencoder that encodes face images into 3D latent embeddings. Second, we propose a novel representation diffusion model that disentangles the 3D latent into facial identity and expression. As a result, our method achieves state-of-the-art performance in facial expression recognition (FER) and face verification among self-supervised facial representation learning models, with a 3.75% advantage in FER accuracy on RAF-DB and 3.35% on AffectNet over SOTA methods.
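The abstract's second stage applies a diffusion model to latent embeddings rather than to pixels. A minimal NumPy sketch of that idea is shown below: a standard DDPM forward (noising) process applied to a latent vector, plus a toy identity/expression split. All dimensions, the half-and-half split, and the stand-in latent are assumptions for illustration; in the paper the disentanglement is learned by the representation diffusion model, not a fixed slice, and only the beta-schedule/noising formula follows the standard DDPM formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper does not specify these here.
LATENT_DIM = 64
T = 1000  # number of diffusion steps

# Linear beta schedule and cumulative alpha products, as in standard DDPM.
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(z0, t, eps):
    """Forward diffusion on a latent: z_t = sqrt(a_bar_t) * z_0 + sqrt(1 - a_bar_t) * eps."""
    return np.sqrt(alphas_bar[t]) * z0 + np.sqrt(1.0 - alphas_bar[t]) * eps

def split_latent(z):
    """Toy disentanglement: first half as 'identity', second half as 'expression'.
    In LatentFace this factorization is learned, not a fixed slice."""
    half = z.shape[-1] // 2
    return z[..., :half], z[..., half:]

# A random vector stands in for the 3D latent produced by the autoencoder.
z0 = rng.standard_normal(LATENT_DIM)
eps = rng.standard_normal(LATENT_DIM)
zt = q_sample(z0, T - 1, eps)          # at the final step the latent is almost pure noise
identity, expression = split_latent(z0)
```

Operating in latent space keeps the diffusion process low-dimensional compared with pixel-space diffusion, which is the usual motivation for latent diffusion models.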