FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (2407.02157v2)
Abstract: Dynamic Facial Expression Recognition (DFER) is crucial for understanding human behavior. However, current methods exhibit limited performance, mainly due to the scarcity of high-quality data, insufficient utilization of facial dynamics, and the ambiguity of expression semantics. To this end, we propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (FineCLIPER), which incorporates the following novel designs: 1) To better distinguish between similar facial expressions, we extend the class labels to textual descriptions from both positive and negative aspects, and obtain supervision by computing cross-modal similarity with the CLIP model; 2) FineCLIPER mines useful cues from dynamic facial expression videos in a hierarchical manner: besides directly embedding video frames as input (low semantic level), we extract face segmentation masks and landmarks from each frame (middle semantic level) and use a Multi-modal Large Language Model (MLLM) with designed prompts to generate detailed descriptions of facial changes across frames (high semantic level). Additionally, we adopt Parameter-Efficient Fine-Tuning (PEFT) to efficiently adapt large pre-trained models (i.e., CLIP) to this task. FineCLIPER achieves state-of-the-art performance on the DFEW, FERV39k, and MAFW datasets in both supervised and zero-shot settings with only a small number of tunable parameters. Project Page: https://haroldchen19.github.io/FineCLIPER-Page/
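To make design point 1) concrete, below is a minimal sketch of CLIP-based supervision from positive and negative textual descriptions, written against the Hugging Face `transformers` CLIP API. The description templates, the mean-pooling of frame embeddings into a video embedding, and the positive-minus-negative score are illustrative assumptions for a single class; they are not the paper's exact descriptions or loss.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical positive/negative descriptions for one class ("happiness");
# FineCLIPER's actual generated descriptions are not reproduced here.
pos_text = "a face showing happiness: raised lip corners and narrowed eyes"
neg_text = "a face not showing happiness: a neutral mouth and relaxed eyes"

# Stand-in video: a few blank frames; in practice these are sampled video frames.
frames = [Image.new("RGB", (224, 224)) for _ in range(4)]

with torch.no_grad():
    inputs = processor(text=[pos_text, neg_text], images=frames,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    video = img.mean(dim=0, keepdim=True)            # naive temporal pooling
    video = video / video.norm(dim=-1, keepdim=True)
    sims = video @ txt.T                             # [1, 2]: pos vs. neg similarity
    # Supervision signal: prefer the positive description over the negative one.
    score = sims[0, 0] - sims[0, 1]

print(score)
```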
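The PEFT idea can likewise be sketched with bottleneck adapters inserted into a frozen CLIP vision encoder, so that only the adapter weights are trained. The adapter placement (after every encoder layer), the bottleneck width, and the zero-initialized up-projection are assumptions of this sketch; FineCLIPER's actual adapter design may differ.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)  # starts as identity, so training is stable
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class AdaptedLayer(nn.Module):
    """Wraps a frozen CLIP encoder layer and adapts its hidden-state output."""
    def __init__(self, layer: nn.Module, dim: int):
        super().__init__()
        self.layer = layer
        self.adapter = Adapter(dim)

    def forward(self, hidden_states, *args, **kwargs):
        outputs = self.layer(hidden_states, *args, **kwargs)
        return (self.adapter(outputs[0]),) + outputs[1:]

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
for p in vision.parameters():  # freeze the pre-trained backbone
    p.requires_grad = False

dim = vision.config.hidden_size
vision.vision_model.encoder.layers = nn.ModuleList(
    [AdaptedLayer(layer, dim) for layer in vision.vision_model.encoder.layers]
)

trainable = sum(p.numel() for p in vision.parameters() if p.requires_grad)
total = sum(p.numel() for p in vision.parameters())
print(f"trainable parameters: {trainable} / {total}")
```

Only the adapters receive gradients, which is what keeps the tunable-parameter count small relative to the frozen CLIP backbone.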
Authors:
- Haodong Chen
- Haojian Huang
- Junhao Dong
- Mingzhe Zheng
- Dian Shao