MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild (2404.09010v1)
Abstract: Dynamic Facial Expression Recognition (DFER) has received significant interest in recent years, driven by its pivotal role in enabling empathic and human-compatible technologies. Robustness to in-the-wild data is particularly important for real-world DFER applications. One direction for improving such models is multimodal emotion recognition based on audio and video data. Multimodal learning in DFER increases model capabilities by leveraging richer, complementary data representations. Within the field of multimodal DFER, recent methods have focused on exploiting advances in self-supervised learning (SSL) to pre-train strong multimodal encoders. Another line of research has focused on adapting pre-trained static models for DFER. In this work, we propose a different perspective on the problem and investigate advancing multimodal DFER performance by adapting SSL-pre-trained, disjoint unimodal encoders. We identify the main challenges associated with this task, namely intra-modality adaptation, cross-modal alignment, and temporal adaptation, and propose solutions to each of them. As a result, we demonstrate improvements over the current state-of-the-art on two popular DFER benchmarks, namely DFEW and MAFW.
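The general recipe the abstract describes can be illustrated in a few lines of PyTorch. The sketch below is a hedged illustration, not the authors' MMA-DFER implementation: it freezes two stand-in unimodal encoders, adds bottleneck adapters for intra-modality adaptation, aligns video tokens to audio tokens with cross-attention for cross-modal alignment, and mean-pools over frames for temporal adaptation. All module names, dimensions, and design choices are assumptions for illustration (7 output classes follow the DFEW label set).

```python
# Minimal sketch of adapting frozen, SSL-pre-trained unimodal encoders for
# multimodal DFER. Illustrative only; not the MMA-DFER architecture.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter for intra-modality adaptation of a frozen encoder."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(),
                                 nn.Linear(bottleneck, dim))
    def forward(self, x):
        return x + self.net(x)  # residual keeps frozen features intact

class MultimodalDFER(nn.Module):
    def __init__(self, vis_encoder, aud_encoder, dim=256, n_classes=7):
        super().__init__()
        # Frozen unimodal encoders (stand-ins for, e.g., an MAE-style image
        # encoder and a wav2vec 2.0-style audio encoder).
        self.vis_encoder, self.aud_encoder = vis_encoder, aud_encoder
        for p in list(vis_encoder.parameters()) + list(aud_encoder.parameters()):
            p.requires_grad = False
        self.vis_adapter = Adapter(dim)
        self.aud_adapter = Adapter(dim)
        # Cross-modal alignment: video tokens attend to audio tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, frames, audio):
        # frames: (B, T, C, H, W) video clip; audio: (B, S, F) audio features.
        B, T = frames.shape[:2]
        v = self.vis_encoder(frames.flatten(0, 1))      # (B*T, D), per frame
        v = self.vis_adapter(v).view(B, T, -1)          # (B, T, D)
        a = self.aud_adapter(self.aud_encoder(audio))   # (B, S, D)
        v, _ = self.cross_attn(v, a, a)                 # align video to audio
        return self.head(v.mean(dim=1))                 # temporal mean pooling

# Toy stand-ins for pretrained encoders, just to make the sketch runnable.
vis = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 256))
aud = nn.Linear(128, 256)
model = MultimodalDFER(vis, aud)
logits = model(torch.randn(2, 8, 3, 16, 16), torch.randn(2, 20, 128))
print(logits.shape)  # torch.Size([2, 7])
```

Only the adapters, cross-attention, and classification head are trained here; the unimodal backbones stay frozen, which is the parameter-efficient adaptation setting the abstract contrasts with full multimodal SSL pre-training.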
- A systematic survey on multimodal emotion recognition using learning algorithms. Intelligent Systems with Applications, 17:200171, 2023.
- Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
- Emotion recognition from multimodal physiological signals for emotion aware healthcare systems. Journal of Medical and Biological Engineering, 40:149–157, 2020.
- Rezero is all you need: Fast convergence at large depth. In Uncertainty in Artificial Intelligence, pages 1352–1361. PMLR, 2021.
- wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460, 2020.
- Impact of deep learning approaches on facial expression recognition in healthcare industries. IEEE Transactions on Industrial Informatics, 18(8):5619–5627, 2022.
- Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022.
- From static to dynamic: Adapting landmark-aware image models for facial expression recognition in videos. arXiv preprint arXiv:2312.05447, 2023.
- Deep learning-based facial emotion recognition for human–computer interaction applications. Neural Computing and Applications, 35(32):23311–23328, 2023.
- Self-attention fusion for audiovisual emotion recognition with incomplete data. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 2822–2828. IEEE, 2022.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 776–780. IEEE, 2017.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.
- Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
- Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
- Masked autoencoders that listen. Advances in Neural Information Processing Systems, 35:28708–28720, 2022.
- Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022.
- Dfew: A large-scale database for recognizing dynamic facial expressions in the wild. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2881–2889, 2020.
- Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19113–19122, 2023.
- Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Abaw: Valence-arousal estimation, expression recognition, action unit detection and emotional reaction intensity estimation challenges. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5888–5897, 2023.
- Nr-dfernet: Noise-robust network for dynamic facial expression recognition. arXiv preprint arXiv:2206.04975, 2022.
- Intensity-aware loss for dynamic facial expression recognition in the wild. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 67–75, 2023a.
- Cliper: A unified vision-language framework for in-the-wild facial expression recognition. arXiv preprint arXiv:2303.00193, 2023b.
- Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
- Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023a.
- P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602, 2021.
- Mafw: A large-scale, multi-modal, compound affective database for dynamic facial expression recognition in the wild. In Proceedings of the 30th ACM International Conference on Multimedia, 2022.
- Clip-aware expressive feature learning for video-based facial expression recognition. Information Sciences, 598:182–195, 2022.
- Expression snippet transformer for robust video-based facial expression recognition. Pattern Recognition, 138:109368, 2023b.
- A facial expression emotion recognition based human-robot interaction system. IEEE/CAA Journal of Automatica Sinica, 4(4):668–676, 2017.
- A unified approach to facial affect analysis: the mae-face visual representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5923–5932, 2023.
- Spatio-temporal transformer for dynamic facial expression recognition in the wild. arXiv preprint arXiv:2205.04749, 2022.
- Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning. Image and Vision Computing, 133:104676, 2023.
- Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1):18–31, 2017.
- A review of multimodal emotion recognition from datasets, preprocessing, features, and fusion methods. Neurocomputing, page 126866, 2023.
- Mae-dfer: Efficient masked autoencoder for self-supervised dynamic facial expression recognition. In Proceedings of the 31st ACM International Conference on Multimedia, pages 6110–6121, 2023a.
- Svfap: Self-supervised video facial affect perceiver. arXiv preprint arXiv:2401.00416, 2023b.
- Hicmae: Hierarchical contrastive masked autoencoder for self-supervised audio-visual emotion recognition. arXiv preprint arXiv:2401.05698, 2024.
- Cold fusion: Calibrated and ordinal latent distribution fusion for uncertainty-aware multimodal emotion recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- Emotion-driven analysis and control of human-robot interactions in collaborative applications. Sensors, 21(14):4626, 2021.
- Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
- A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
- Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6558–6569, 2019.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Rethinking the learning paradigm for dynamic facial expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17958–17968, 2023.
- Attentive modality hopping mechanism for speech emotion recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3362–3366. IEEE, 2020.
- Driver emotion recognition for intelligent vehicles: A survey. ACM Computing Surveys (CSUR), 53(3):1–30, 2020.
- Joint face detection and alignment using multitask cascaded convolutional networks. IEEE signal processing letters, 23(10):1499–1503, 2016.
- Transformer-based multimodal emotional perception for dynamic facial expression recognition in the wild. IEEE Transactions on Circuits and Systems for Video Technology, 2023.
- Former-dfer: Dynamic facial expression recognition transformer. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1553–1561, 2021.
- Prompting visual-language models for dynamic facial expression recognition. arXiv preprint arXiv:2308.13382, 2023.
- Kateryna Chumachenko
- Alexandros Iosifidis
- Moncef Gabbouj