Self-Supervised Facial Representation Learning with Facial Region Awareness (2403.02138v1)
Abstract: Self-supervised pre-training has proven effective for learning transferable representations that benefit various visual tasks. This paper asks: can self-supervised pre-training learn general facial representations for various facial analysis tasks? Recent efforts toward this goal treat each face image as a whole, i.e., they learn consistent facial representations at the image level, which overlooks the consistency of local facial representations (i.e., facial regions such as the eyes and nose). In this work, we make a first attempt at a self-supervised facial representation learning framework, Facial Region Awareness (FRA), that learns consistent global and local facial representations. Specifically, we explicitly enforce the consistency of facial regions by matching local facial representations across views; these representations are extracted with learned heatmaps that highlight the facial regions. Inspired by mask prediction in supervised semantic segmentation, we obtain the heatmaps via cosine similarity between the per-pixel projection of the feature maps and facial mask embeddings. The mask embeddings are computed from learnable positional embeddings, which use the attention mechanism to search the whole facial image for facial regions. To learn such heatmaps, we formulate the learning of the facial mask embeddings as a deep clustering problem, assigning the pixel features from the feature maps to them. Transfer learning results on facial classification and regression tasks show that FRA outperforms previous pre-trained models. More importantly, using ResNet as the unified backbone for all tasks, FRA achieves comparable or even better performance than state-of-the-art (SOTA) methods in facial analysis tasks.
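The heatmap step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cosine`, `region_heatmaps`, `pool_region`), the toy 2x2 grid, and the choice to clip negative similarities before pooling are all assumptions made for illustration. The sketch computes one heatmap per mask embedding as the cosine similarity with each pixel's projected feature, then uses a heatmap as weights to average pixel features into a single local representation.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def region_heatmaps(pixel_proj, mask_emb):
    """pixel_proj: H x W x D nested lists; mask_emb: K x D.
    Returns K heatmaps of size H x W with values in [-1, 1]."""
    return [[[cosine(px, m) for px in row] for row in pixel_proj]
            for m in mask_emb]

def pool_region(features, heatmap, eps=1e-8):
    """Weighted average of H x W x C features under one heatmap.
    Assumption: negative similarities are clipped to zero weight."""
    weights = [[max(h, 0.0) for h in row] for row in heatmap]
    total = sum(sum(row) for row in weights) + eps
    channels = len(features[0][0])
    pooled = [0.0] * channels
    for feat_row, weight_row in zip(features, weights):
        for feat, w in zip(feat_row, weight_row):
            for c in range(channels):
                pooled[c] += w * feat[c]
    return [x / total for x in pooled]

# Toy input: a 2x2 "feature map" with 2-dimensional projected pixel features
# and two hypothetical facial mask embeddings.
pixel_proj = [[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 1.0], [-1.0, 0.0]]]
mask_emb = [[1.0, 0.0], [0.0, 1.0]]

heatmaps = region_heatmaps(pixel_proj, mask_emb)
local_repr = pool_region(pixel_proj, heatmaps[0])
```

In the framework described above, local representations of this kind would be matched across augmented views of the same face. The sketch covers only the similarity and pooling arithmetic, not the attention-based computation of the mask embeddings or the clustering objective.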