Contrastive Learning of Person-independent Representations for Facial Action Unit Detection (2403.03400v1)
Abstract: Facial action unit (AU) detection, which aims to classify the AUs present in a facial image, has long suffered from insufficient AU annotations. In this paper, we aim to mitigate this data scarcity issue by learning AU representations from a large number of unlabelled facial videos in a contrastive learning paradigm. We formulate the self-supervised AU representation learning signals as two-fold: (1) the AU representation should be frame-wisely discriminative within a short video clip; (2) facial frames sampled from different identities but showing analogous facial AUs should have consistent AU representations. To achieve these goals, we propose to contrastively learn the AU representation within a video clip and devise a cross-identity reconstruction mechanism to learn person-independent representations. Specifically, we adopt a margin-based temporal contrastive learning paradigm to perceive the temporal AU coherence and evolution characteristics within a clip that consists of consecutive input facial frames. Moreover, the cross-identity reconstruction mechanism facilitates pushing faces from different identities that show analogous AUs close together in the latent embedding space. Experimental results on three public AU datasets demonstrate that the learned AU representation is discriminative for AU detection. Our method outperforms other contrastive learning methods and significantly closes the performance gap between self-supervised and supervised AU detection approaches.
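The abstract does not spell out the loss formulation, but a minimal sketch of what a margin-based temporal contrastive objective over a clip of consecutive frames could look like is given below. The function name, the use of cosine similarity, and the triplet-style hinge are illustrative assumptions, not the paper's actual definition: the idea is only that, for each anchor frame, its temporally adjacent frame should be closer in embedding space than more distant frames by at least a fixed margin.

```python
import torch
import torch.nn.functional as F

def temporal_margin_contrastive_loss(clip_embeddings: torch.Tensor,
                                     margin: float = 0.2) -> torch.Tensor:
    """Hypothetical margin-based temporal contrastive loss (sketch, not the paper's exact loss).

    clip_embeddings: (T, D) AU embeddings of T >= 3 consecutive frames from one clip.
    Encourages each frame's embedding to be more similar to its temporally adjacent
    frame than to temporally distant frames, by at least `margin`.
    """
    z = F.normalize(clip_embeddings, dim=-1)   # compare frames in cosine-similarity space
    sim = z @ z.t()                            # (T, T) pairwise similarities
    T = z.size(0)
    losses = []
    for anchor in range(T - 2):
        near = sim[anchor, anchor + 1]         # temporally adjacent frame: positive
        far = sim[anchor, anchor + 2:]         # temporally distant frames: negatives
        # hinge: positive similarity should exceed every negative similarity by `margin`
        losses.append(F.relu(margin - (near - far)).mean())
    return torch.stack(losses).mean()
```

Under this reading, the cross-identity reconstruction term would be a separate objective that reconstructs a frame's AU features from frames of other identities, so that identity-specific cues cannot be relied on; its concrete form is not given in the abstract and is not sketched here.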
Authors: Yong Li, Shiguang Shan