Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition (2404.10904v2)

Published 16 Apr 2024 in cs.CV

Abstract: Human communication is multi-modal; e.g., face-to-face interaction involves auditory signals (speech) and visual signals (face movements and hand gestures). Hence, it is essential to exploit multiple modalities when designing machine learning-based facial expression recognition systems. In addition, given the ever-growing quantities of video data that capture human facial expressions, such systems should utilize raw unlabeled videos without requiring expensive annotations. Therefore, in this work, we employ a multi-task multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data. Our model combines three self-supervised objective functions: First, a multi-modal contrastive loss that pulls diverse data modalities of the same video together in the representation space. Second, a multi-modal clustering loss that preserves the semantic structure of input data in the representation space. Finally, a multi-modal data reconstruction loss. We conduct a comprehensive study on this multi-modal multi-task self-supervised learning method on three facial expression recognition benchmarks. To that end, we examine the performance of learning through different combinations of self-supervised tasks on the facial expression recognition downstream task. Our model ConCluGen outperforms several multi-modal self-supervised and fully supervised baselines on the CMU-MOSEI dataset. Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks such as facial expression recognition, while also reducing the amount of manual annotations required. We release our pre-trained models as well as source code publicly.

ConCluGen: Advancing Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition

Introduction

Facial Expression Recognition (FER) is a cornerstone of human-computer interaction, allowing systems to respond with human-like understanding. Despite significant advances driven by deep learning, the task becomes considerably harder when models must interpret expressions 'in the wild', where video data is abundant but unlabeled. To address these challenges, the paper introduces ConCluGen, a model built on a multi-task, multi-modal self-supervised learning framework. The method combines a multi-modal contrastive loss, a multi-modal clustering loss, and a reconstruction loss to learn from video, audio, and text data without manual annotations.

Methodology

The ConCluGen framework employs separate encoders for the video, text, and audio modalities, projecting each input into a shared latent space where information from the different modalities can be fused. The methodology can be broken down as follows:

  • Feature Extraction and Representation: Initial features are extracted using state-of-the-art models (2D and 3D ResNet for video, DAVENet for audio, and DistilBERT for text) and are then processed to a uniform temporal resolution.
  • Multi-Task Learning Objectives:
    • Multi-Modal Contrastive Loss: Minimizes the distance between representations of the same instance across modalities while maximizing the distance between representations of different instances (a code sketch follows this list).
    • Multi-Modal Clustering Loss: Clusters embeddings of the same instance across modalities, enhancing intra-class compactness and inter-class separability.
    • Reconstruction Loss: Aims to reconstruct the original input from its embedded representation, acting as a regularizer and helping the model capture a more generalized feature set.
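
To make the contrastive objective concrete, the following is a minimal sketch rather than the authors' implementation: it shows an InfoNCE-style loss between two modalities, assuming the encoders have already projected each instance into the shared latent space. The function name, the temperature value, and the equal weighting of the pairwise terms are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(z_a, z_b, temperature=0.07):
    """Illustrative InfoNCE-style loss between two modalities.

    z_a, z_b: (batch, dim) embeddings of the same instances in two
    modalities, already projected into the shared latent space.
    The temperature value is an assumption, not taken from the paper.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature  # pairwise cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Matching pairs (same video, different modality) are positives;
    # all other pairs in the batch act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# For video (v), audio (a), and text (t) embeddings, the pairwise terms can be
# summed; the clustering and reconstruction losses would be added to this total
# in the multi-task objective (the weighting scheme is a design choice, not
# taken from the paper):
# loss = (multimodal_contrastive_loss(v, a)
#         + multimodal_contrastive_loss(v, t)
#         + multimodal_contrastive_loss(a, t))
```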

Experiments and Analysis

The ConCluGen model was evaluated on three FER benchmarks and compared against existing multi-modal self-supervised and fully supervised baselines. Highlights include:

  • Datasets Used: The large-scale VoxCeleb2 dataset for pre-training, and CMU-MOSEI, CAER, and MELD for fine-tuning and evaluation.
  • Performance Metrics: Weighted Accuracy, F1 Score, Precision, and Recall, chosen to account for the class imbalance present in these real-world datasets (a short snippet illustrating the weighted metrics follows this list).
  • Comparative Analysis: ConCluGen not only outperformed other self-supervised models but also achieved competitive or superior results relative to fully supervised methods, particularly on the CMU-MOSEI dataset.
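
As a concrete illustration of the imbalance-aware metrics (not the authors' evaluation code), the snippet below uses scikit-learn's support-weighted averaging. Note that "Weighted Accuracy" has benchmark-specific definitions in the CMU-MOSEI literature, so it is omitted here.

```python
# Minimal sketch of imbalance-aware metrics with scikit-learn; illustrative only.
from sklearn.metrics import f1_score, precision_score, recall_score

def weighted_scores(y_true, y_pred):
    """Per-class scores averaged by class support, so each expression
    class contributes in proportion to how often it actually occurs."""
    return {
        "f1": f1_score(y_true, y_pred, average="weighted"),
        "precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="weighted"),
    }

# Example with dummy labels (0 = happy, 1 = sad, 2 = angry):
print(weighted_scores([0, 0, 1, 2, 2, 2], [0, 1, 1, 2, 2, 0]))
```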

Implications and Future Directions

The integration of multi-task learning with multi-modal self-supervision, as proposed in ConCluGen, is a significant step toward using unlabeled data effectively for complex tasks like FER. The model's ability to exploit the correlations inherent in multi-modal data, without requiring explicit annotations, is particularly valuable in scenarios where acquiring labeled data is costly or impractical.

The paper suggests several avenues for future research:

  • Expansion to Additional Modalities: Incorporating other data types, such as facial landmarks, could enhance the model's understanding and interpretation of expressions.
  • Application to Other Tasks: Exploring the effectiveness of the ConCluGen framework on related tasks like action unit detection and sentiment analysis could broaden the model's utility.

Conclusion

This work successfully demonstrates the potential of multi-task multi-modal self-supervised learning in handling the complexities of facial expression recognition in uncontrolled environments. With ConCluGen, the research paves the way for more sophisticated, accurate, and practical FER systems, fostering advancements in how machines understand and interact with human emotions.

The full implementation of this model, along with the pre-trained weights, is made openly accessible for ongoing research and development, encouraging further exploration and adaptation of the proposed methods within the scientific community.

Authors (6)
  1. Marah Halawa (5 papers)
  2. Florian Blume (2 papers)
  3. Pia Bideau (10 papers)
  4. Martin Maier (11 papers)
  5. Rasha Abdel Rahman (2 papers)
  6. Olaf Hellwich (16 papers)
Citations (1)