EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition (2310.16640v2)
Abstract: Facial Expression Recognition (FER) is a crucial task in affective computing, but its conventional focus on the seven basic emotions limits its applicability to the complex and expanding emotional spectrum. To address the issue of new and unseen emotions present in dynamic in-the-wild FER, we propose a novel vision-language model that utilises sample-level text descriptions (i.e. captions of the context, expressions or emotional cues) as natural language supervision, aiming to enhance the learning of rich latent representations for zero-shot classification. To test this, we evaluate the model trained on sample-level descriptions using zero-shot classification on four popular dynamic FER datasets. Our findings show that this approach yields significant improvements over baseline methods. Specifically, for zero-shot video FER, we outperform CLIP by over 10% in Weighted Average Recall and 5% in Unweighted Average Recall on several datasets. Furthermore, we evaluate the representations obtained from the network trained on sample-level descriptions on the downstream task of mental health symptom estimation, achieving performance comparable or superior to state-of-the-art methods and strong agreement with human experts. Namely, we achieve a Pearson's Correlation Coefficient of up to 0.85 on schizophrenia symptom severity estimation, which is comparable to human experts' agreement. The code is publicly available at: https://github.com/NickyFot/EmoCLIP.
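To make the zero-shot pipeline concrete, the sketch below shows CLIP-style zero-shot classification of a video clip against natural-language emotion descriptions. It is a minimal illustration under stated assumptions, not the authors' released implementation (see the repository above): it uses OpenAI's open-source `clip` package, the `class_descriptions` strings are hypothetical, and mean pooling of frame embeddings stands in for the paper's temporal aggregation over the frame sequence.

```python
import torch
import clip  # OpenAI's CLIP package: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical natural-language class descriptions; EmoCLIP trains with richer
# sample-level captions, but at test time each class is described in free text.
class_descriptions = [
    "an expression of happiness, with a smile and raised cheeks",
    "an expression of sadness, with downturned lips and lowered brows",
    "an expression of anger, with furrowed brows and tightened lips",
]

@torch.no_grad()
def classify_clip(frames):
    """Zero-shot emotion prediction for a list of PIL frames from one video."""
    # Encode each sampled frame and mean-pool into a single video embedding;
    # this is a simple stand-in for the paper's temporal aggregation.
    images = torch.stack([preprocess(f) for f in frames]).to(device)
    video_emb = model.encode_image(images).mean(dim=0, keepdim=True)  # (1, D)
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)

    # Encode the class descriptions and L2-normalise.
    tokens = clip.tokenize(class_descriptions).to(device)
    text_emb = model.encode_text(tokens)  # (C, D)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    # Cosine similarity between the video and each description;
    # the best-matching description gives the predicted class.
    similarity = (video_emb @ text_emb.T).squeeze(0)  # (C,)
    return similarity.argmax().item()
```

On the reported metrics: Weighted Average Recall (WAR) weights per-class recall by class frequency and so equals overall accuracy, while Unweighted Average Recall (UAR) is the unweighted mean of per-class recalls, which penalises models that neglect rare classes.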
- Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022.
- Test of Time: Instilling Video-Language Models with a Sense of Time. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Apr. 2023.
- A CLIP-Hitchhiker’s Guide to Long Video Retrieval, May 2022.
- Learning Unseen Emotions from Gestures via Semantically-Conditioned Zero-Shot Perception with Adversarial Autoencoders. Proceedings of the AAAI Conference on Artificial Intelligence, 36(1):3–10, June 2022.
- SchiNet: Automatic estimation of symptoms of schizophrenia from facial behaviour analysis. IEEE Transactions on Affective Computing, 12(4):949–961, 2019.
- Deep learning-based facial emotion recognition for human–computer interaction applications. Neural Computing and Applications, pages 1–18, 2021.
- A. S. Cowen and D. Keltner. Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proceedings of the National Academy of Sciences, 114(38):E7900–E7909, 2017.
- Collecting large, richly annotated facial-expression databases from movies. IEEE MultiMedia, 19(3):34, 2012.
- P. Ekman and W. V. Friesen. Facial action coding system: Investigator’s guide. Consulting Psychologists Press, 1978.
- Initial development and preliminary validation of a new negative symptom measure: The Clinical Assessment Interview for Negative Symptoms (CAINS). Schizophrenia Research, 124(1-3):36–42, Dec. 2010.
- N. M. Foteinopoulou and I. Patras. Learning from Label Relationships in Human Affect. In Proceedings of the 30th ACM International Conference on Multimedia, pages 80–89, Lisboa, Portugal, Oct. 2022. ACM.
- Estimating continuous affect with label uncertainty. In 2021 9th International Conference on Affective Computing and Intelligent Interaction (ACII), pages 1–8, Nara, Japan, Sept. 2021. IEEE.
- Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
- DFEW: A large-scale database for recognizing dynamic facial expressions in the wild. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2881–2889, 2020.
- The Positive and Negative Syndrome Scale (PANSS) for Schizophrenia. Schizophrenia Bulletin, 13(2):261–276, Jan. 1987.
- CLIPER: A Unified Vision-Language Framework for In-the-Wild Facial Expression Recognition, Feb. 2023. arXiv:2303.00193 [cs].
- NR-DFERNet: Noise-robust network for dynamic facial expression recognition. arXiv preprint arXiv:2206.04975, 2022.
- FER-former: Multi-modal Transformer for Facial Expression Recognition, Mar. 2023. arXiv:2303.12997 [cs].
- Frozen CLIP Models are Efficient Video Learners. In S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, editors, Computer Vision – ECCV 2022, Lecture Notes in Computer Science, pages 388–404, Cham, 2022. Springer Nature Switzerland.
- MAFW: A Large-scale, Multi-modal, Compound Affective Database for Dynamic Facial Expression Recognition in the Wild. In Proceedings of the 30th ACM International Conference on Multimedia, MM ’22, pages 24–32, New York, NY, USA, Oct. 2022. Association for Computing Machinery.
- CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. Neurocomputing, 508:293–304, Oct. 2022.
- X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval. In Proceedings of the 30th ACM International Conference on Multimedia, MM ’22, pages 638–647. Association for Computing Machinery, Oct. 2022.
- S. Menon and C. Vondrick. Visual classification via description from large language models. ICLR, 2023.
- Black Box Few-Shot Adaptation for Vision-Language Models, Apr. 2023.
- Learning to Name Classes for Vision and Language Models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Apr. 2023.
- Effectiveness of group body psychotherapy for negative symptoms of schizophrenia: multicentre randomised controlled trial. The British Journal of Psychiatry, 209(1):54–61, 2016.
- Zero-shot Video Emotion Recognition via Multimodal Protagonist-aware Transformer Network. In Proceedings of the 29th ACM International Conference on Multimedia, MM ’21, pages 1074–1083, New York, NY, USA, Oct. 2021. Association for Computing Machinery.
- Learning transferable visual models from natural language supervision. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 18–24 Jul 2021.
- J. A. Russell. A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161–1178, 1980.
- LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs, Nov. 2021. arXiv:2111.02114 [cs].
- CLIP4Caption: CLIP for Video Caption. In Proceedings of the 29th ACM International Conference on Multimedia, MM ’21, pages 4858–4862, New York, NY, USA, Oct. 2021. Association for Computing Machinery.
- Automated facial expressions analysis in schizophrenia: A continuous dynamic approach. In International Symposium on Pervasive Computing Paradigms for Mental Health, pages 72–81. Springer, 2015.
- Facial expressions and flat affect in schizophrenia, automatic analysis from depth camera data. In 2016 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), pages 220–223. IEEE, 2016.
- ActionCLIP: A New Paradigm for Video Action Recognition, Sept. 2021. arXiv:2109.08472 [cs].
- FERV39k: A large-scale multi-scene dataset for facial expression recognition in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20922–20931, 2022.
- Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9):2251–2265, 2018.
- Exploring Zero-Shot Emotion Recognition in Speech Using Semantic-Embedding Prototypes. IEEE Transactions on Multimedia, 24, 2022.
- Zero-shot speech emotion recognition using generative learning with reconstructed prototypes. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment, Sept. 2022. arXiv:2209.06430 [cs].
- Affective computing in education: A systematic review and future research. Computers & Education, 142:103649, 2019.
- Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning, Feb. 2023. arXiv:2302.14115 [cs].
- When and why Vision-Language Models behave like Bags-of-Words, and what to do about it? In International Conference on Learning Representations, 2023.
- AutoLabel: CLIP-based framework for Open-set Video Domain Adaptation, Apr. 2023. arXiv:2304.01110 [cs].
- Zero-shot emotion recognition via affective structural embedding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1151–1160, 2019.
- General Facial Representation Learning in a Visual-Linguistic Manner. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18676–18688, June 2022.
- Conditional prompt learning for vision-language models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- Learning to prompt for vision-language models. International Journal of Computer Vision (IJCV), 2022.
Authors: Niki Maria Foteinopoulou and Ioannis Patras