
Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting (2404.17100v2)

Published 26 Apr 2024 in cs.CV

Abstract: In Video-based Facial Expression Recognition (V-FER), models are typically trained on closed-set datasets with a fixed number of known classes. However, these models struggle with unknown classes common in real-world scenarios. In this paper, we introduce a challenging Open-set Video-based Facial Expression Recognition (OV-FER) task, aiming to identify both known and new, unseen facial expressions. While existing approaches use large-scale vision-language models like CLIP to identify unseen classes, we argue that these methods may not adequately capture the subtle human expressions needed for OV-FER. To address this limitation, we propose a novel Human Expression-Sensitive Prompting (HESP) mechanism to significantly enhance CLIP's ability to model video-based facial expression details effectively. Our proposed HESP comprises three components: 1) a textual prompting module with learnable prompts to enhance CLIP's textual representation of both known and unknown emotions, 2) a visual prompting module that encodes temporal emotional information from video frames using expression-sensitive attention, equipping CLIP with a new visual modeling ability to extract emotion-rich information, and 3) an open-set multi-task learning scheme that promotes interaction between the textual and visual modules, improving the understanding of novel human emotions in video sequences. Extensive experiments conducted on four OV-FER task settings demonstrate that HESP can significantly boost CLIP's performance (a relative improvement of 17.93% on AUROC and 106.18% on OSCR) and outperform other state-of-the-art open-set video understanding methods by a large margin. Code is available at https://github.com/cosinehuang/HESP.
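The abstract describes CLIP-based open-set recognition: a video embedding is compared against text prompts for the known emotion classes, and a sample is flagged as unknown when no known class scores confidently enough. Below is a minimal, hedged sketch of that scoring step in pure Python; the embeddings, labels, temperature, and threshold are illustrative assumptions, not the paper's actual HESP implementation (which additionally learns textual and visual prompts).

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def classify_open_set(video_emb, text_embs, known_labels, threshold=0.5):
    """Return a known label, or 'unknown' when max confidence is low.

    Scales similarities by a CLIP-style temperature (0.07 is an
    assumption) before softmax, then rejects low-confidence samples.
    """
    sims = [cosine(video_emb, t) for t in text_embs]
    probs = softmax([s / 0.07 for s in sims])
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < threshold:
        return "unknown"
    return known_labels[best]
```

For example, with two toy text embeddings for "happy" and "sad", a video embedding close to the "happy" prompt is accepted, while one equidistant from both falls below the confidence threshold and is rejected as unknown.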

Authors (6)
  1. Yuanyuan Liu (75 papers)
  2. Yuxuan Huang (16 papers)
  3. Shuyang Liu (6 papers)
  4. Yibing Zhan (73 papers)
  5. Zijing Chen (6 papers)
  6. Zhe Chen (237 papers)
Citations (1)
