Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting (2404.17100v2)
Abstract: In Video-based Facial Expression Recognition (V-FER), models are typically trained on closed-set datasets with a fixed number of known classes. However, these models struggle with the unknown classes that are common in real-world scenarios. In this paper, we introduce a challenging Open-set Video-based Facial Expression Recognition (OV-FER) task, aiming to identify both known and new, unseen facial expressions. While existing approaches use large-scale vision-language models like CLIP to identify unseen classes, we argue that these methods may not adequately capture the subtle human expressions needed for OV-FER. To address this limitation, we propose a novel Human Expression-Sensitive Prompting (HESP) mechanism that significantly enhances CLIP's ability to model video-based facial expression details. Our proposed HESP comprises three components: 1) a textual prompting module with learnable prompts to enhance CLIP's textual representation of both known and unknown emotions, 2) a visual prompting module that encodes temporal emotional information from video frames using expression-sensitive attention, equipping CLIP with a new visual modeling ability to extract emotion-rich information, and 3) an open-set multi-task learning scheme that promotes interaction between the textual and visual modules, improving the understanding of novel human emotions in video sequences. Extensive experiments conducted on four OV-FER task settings demonstrate that HESP significantly boosts CLIP's performance (a relative improvement of 17.93% on AUROC and 106.18% on OSCR) and outperforms other state-of-the-art open-set video understanding methods by a large margin. Code is available at https://github.com/cosinehuang/HESP.
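To make the open-set pipeline the abstract describes more concrete, the following is a minimal NumPy sketch (not the authors' implementation): per-frame visual features are pooled with an attention weighting driven by a learnable query (a hypothetical stand-in for the expression-sensitive visual prompting module), the pooled video feature is compared to each known class's text embedding by cosine similarity, and a sample whose best similarity falls below a threshold is rejected as an unknown expression. All function names, shapes, and the threshold value are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(frames, query):
    # frames: (T, D) per-frame visual features from a video.
    # query: (D,) learnable query vector (hypothetical stand-in for
    # HESP's expression-sensitive attention).
    scores = frames @ query / np.sqrt(frames.shape[-1])  # (T,)
    weights = softmax(scores)                            # attention over frames
    return weights @ frames                              # (D,) pooled video feature

def open_set_predict(video_feat, text_embeds, threshold=0.6):
    # text_embeds: (K, D) one text embedding per known expression class.
    # Returns (class_index, similarity); class_index == -1 means the
    # sample is rejected as an unknown expression.
    v = video_feat / np.linalg.norm(video_feat)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = t @ v                                         # cosine similarities
    best = int(np.argmax(sims))
    if sims[best] < threshold:
        return -1, float(sims[best])
    return best, float(sims[best])
```

In the real method, the frame features and text embeddings would come from CLIP's visual and textual encoders with the learned prompts attached; here toy vectors are enough to show the accept/reject logic.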
Authors: Yuanyuan Liu, Yuxuan Huang, Shuyang Liu, Yibing Zhan, Zijing Chen, Zhe Chen