Self-Supervised Facial Representation Learning with Facial Region Awareness (2403.02138v1)

Published 4 Mar 2024 in cs.CV

Abstract: Self-supervised pre-training has proven effective for learning transferable representations that benefit various visual tasks. This paper asks: can self-supervised pre-training learn general facial representations for various facial analysis tasks? Recent efforts toward this goal treat each face image as a whole, i.e., they learn consistent facial representations at the image level, overlooking the consistency of local facial representations (i.e., facial regions such as the eyes, nose, etc.). In this work, we propose Facial Region Awareness (FRA), a novel self-supervised facial representation learning framework that learns consistent global and local facial representations. Specifically, we explicitly enforce the consistency of facial regions by matching local facial representations across views, which are extracted with learned heatmaps highlighting the facial regions. Inspired by mask prediction in supervised semantic segmentation, we obtain the heatmaps via cosine similarity between the per-pixel projection of feature maps and facial mask embeddings computed from learnable positional embeddings, which leverage the attention mechanism to look up the facial image globally for facial regions. To learn such heatmaps, we formulate the learning of facial mask embeddings as a deep clustering problem by assigning the pixel features from the feature maps to them. Transfer learning results on facial classification and regression tasks show that our FRA outperforms previous pre-trained models; more importantly, using ResNet as the unified backbone for various tasks, FRA achieves comparable or even better performance than state-of-the-art (SOTA) methods in facial analysis tasks.


Summary

  • The paper presents FRA, a novel self-supervised method that integrates global and local facial representations using heatmap generation and deep clustering.
  • It achieves nearly 1% higher accuracy on AffectNet for facial expression recognition and outperforms traditional methods in facial attribute recognition.
  • FRA reduces reliance on large annotated datasets by enforcing semantic consistency and offers a robust framework adaptable to advanced architectures for facial analysis.

An Examination of Self-Supervised Facial Representation Learning with Facial Region Awareness

Introduction

In computer vision, understanding human faces is paramount yet presents significant challenges. Traditional supervised learning approaches, while effective, require large-scale, meticulously annotated datasets that are costly to produce. An emerging strategy to circumvent these limitations is self-supervised learning, which leverages unlabeled data to pre-train models and improve their performance on downstream tasks. This paper addresses whether self-supervised pre-training can learn general facial representations that support various facial analysis tasks, focusing on the consistency of both global and local facial features.

Proposed Method: Facial Region Awareness (FRA)

This paper introduces a novel self-supervised facial representation learning framework called Facial Region Awareness (FRA), which integrates the concept of both global and local facial representations. By considering the consistency of facial regions (e.g., eyes, nose), FRA aims to learn more generalizable and transferable facial features.

The key components of FRA are as follows:

  1. Heatmap Generation:
    • The framework utilizes learnable positional embeddings, in conjunction with a Transformer decoder, to generate heatmaps highlighting facial regions.
    • These heatmaps are obtained via cosine similarity between pixel-level projections of the feature maps and the learned facial mask embeddings; the positional embeddings act as queries that attend globally over the image to locate facial regions.
  2. Facial Mask Embeddings:
    • Heatmaps are learned through a deep clustering approach where pixel features are dynamically assigned to facial mask embeddings, serving as facial region clusters.
  3. Semantic Relations and Consistency:
    • The framework enforces semantic consistency by aligning global and local facial representations across different views.
    • This alignment is reinforced through a semantic relation loss that matches the soft pixel-to-region assignments produced by the online and momentum networks.
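The heatmap-generation step above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the array shapes, the temperature value, and the function name are assumptions, and a plain softmax over regions stands in for the paper's full mask-prediction formulation.

```python
import numpy as np

def generate_heatmaps(feature_map, mask_embeddings, temperature=0.1):
    """Sketch of FRA-style heatmap generation (shapes assumed).

    feature_map:     (H, W, D) per-pixel projections of the backbone feature map
    mask_embeddings: (K, D) facial mask embeddings, one per facial region
    Returns (K, H, W) heatmaps: a softmax over regions of the cosine
    similarity between each pixel projection and each mask embedding.
    """
    H, W, D = feature_map.shape
    pixels = feature_map.reshape(-1, D)                       # (H*W, D)
    # L2-normalize both sides so the dot product equals cosine similarity
    pixels = pixels / (np.linalg.norm(pixels, axis=1, keepdims=True) + 1e-8)
    masks = mask_embeddings / (
        np.linalg.norm(mask_embeddings, axis=1, keepdims=True) + 1e-8)
    sim = pixels @ masks.T                                    # (H*W, K)
    # softmax over the K regions turns similarities into soft assignments
    logits = sim / temperature
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.T.reshape(-1, H, W)                          # (K, H, W)
```

Each pixel thus receives a soft assignment over the K facial regions, which is also the quantity the deep clustering objective operates on.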

The integration of these components allows FRA to capture both holistic and fine-grained features of facial images, enhancing the robustness and transferability of the learned representations.
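To make the pooling-and-matching idea concrete, the following sketch (again with assumed shapes and names, not the paper's code) extracts one local representation per region by heatmap-weighted pooling of the feature map, then scores cross-view consistency with a simple cosine loss:

```python
import numpy as np

def local_representations(feature_map, heatmaps):
    """Pool one local representation per facial region by weighting the
    (H, W, D) feature map with that region's normalized (K, H, W) heatmap."""
    K = heatmaps.shape[0]
    H, W, D = feature_map.shape
    feats = feature_map.reshape(-1, D)             # (H*W, D)
    w = heatmaps.reshape(K, -1)                    # (K, H*W)
    w = w / (w.sum(axis=1, keepdims=True) + 1e-8)  # normalize spatial weights
    return w @ feats                               # (K, D)

def local_consistency_loss(locals_a, locals_b):
    """Mean (1 - cosine similarity) between matched region representations
    extracted from two augmented views of the same face."""
    a = locals_a / (np.linalg.norm(locals_a, axis=1, keepdims=True) + 1e-8)
    b = locals_b / (np.linalg.norm(locals_b, axis=1, keepdims=True) + 1e-8)
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))
```

Identical views yield a loss of zero, and the loss grows as the matched region representations drift apart across views, which is the sense in which local consistency is enforced.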

Experimental Results

The efficacy of FRA was demonstrated on multiple downstream facial analysis tasks, including facial expression recognition (FER), facial attribute recognition (FAR), and face alignment (FA). Key findings include:

  • Facial Expression Recognition:
    • FRA achieves superior performance compared with both self-supervised pre-training methods tailored for visual images and those specifically designed for facial images.
    • On the AffectNet dataset, FRA surpassed state-of-the-art supervised methods by almost 1% in accuracy.
  • Facial Attribute Recognition:
    • On the CelebA dataset, FRA outperformed existing self-supervised and supervised approaches, underscoring its capability to extract robust facial features pertinent to multiple attributes.
  • Face Alignment:
    • Despite using ResNet, which is less specialized for landmark regression than architectures such as the stacked Hourglass network, FRA achieved results comparable with state-of-the-art face alignment methods.

Implications and Future Directions

FRA's ability to leverage self-supervised learning for both global and local facial representation learning has significant implications. Practically, this approach reduces the dependency on large, annotated datasets, making it more feasible to deploy in real-world applications where data labeling is a bottleneck. Theoretically, it highlights the importance of capturing local consistencies within images, which are often overlooked in self-supervised learning paradigms focusing solely on global features.

Future research could delve into optimizing the balance between local and global consistency to further enhance performance. Additionally, exploring the integration of FRA with more advanced backbone architectures, such as Vision Transformers (ViTs), could potentially push the boundaries of facial analysis tasks even further.

Conclusion

FRA represents a significant step forward in the domain of self-supervised facial representation learning. By emphasizing both global and local consistency in facial features, it sets a new benchmark for robustness and generalization across varied facial analysis tasks. This work not only provides a substantial contribution to the field but also opens new avenues for future research and application in AI-driven facial analysis.