Enhancing Zero-Shot Facial Expression Recognition by LLM Knowledge Transfer (2405.19100v3)

Published 29 May 2024 in cs.CV

Abstract: Current facial expression recognition (FER) models are typically trained in a supervised manner and are therefore constrained by the lack of large-scale facial expression images with high-quality annotations. Consequently, these models often fail to generalize well, performing poorly on unseen images at inference time. Vision-language-based zero-shot models show promising potential for addressing such challenges. However, these models lack task-specific knowledge and are therefore not optimized for the nuances of recognizing facial expressions. To bridge this gap, this work proposes a novel method, Exp-CLIP, to enhance zero-shot FER by transferring task knowledge from large language models (LLMs). Specifically, building on pre-trained vision-language encoders, we incorporate a projection head designed to map the initial joint vision-language space into a space that captures representations of facial actions. To train this projection head for subsequent zero-shot prediction, we propose to align the projected visual representations with task-specific semantic meanings derived from the LLM encoder, and a text instruction-based strategy is employed to customize the LLM knowledge. Given unlabelled facial data and efficient training of the projection head, Exp-CLIP achieves zero-shot results superior to those of CLIP models and several other large vision-language models (LVLMs) on seven in-the-wild FER datasets. The code and pre-trained models are available at https://github.com/zengqunzhao/Exp-CLIP.
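
The core mechanism described in the abstract, a lightweight trainable projection head on top of a frozen CLIP image encoder, trained so that projected image features align with targets produced by a frozen LLM text encoder, can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation (the repository above contains the actual code): the two-layer head, the symmetric InfoNCE alignment loss, the embedding dimensions, and names such as `llm_text_targets` and `zero_shot_predict` are placeholders, and how the instruction-conditioned text targets are obtained from the LLM encoder for unlabelled images is left outside the sketch.

```python
# Minimal sketch (not the authors' code): a trainable projection head over a frozen
# CLIP image encoder, aligned to embeddings from a frozen LLM text encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # OpenAI CLIP (https://github.com/openai/CLIP)

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
for p in clip_model.parameters():      # the vision-language backbone stays frozen
    p.requires_grad_(False)

class ProjectionHead(nn.Module):
    """Maps CLIP's joint embedding space into a space aligned with LLM semantics."""
    def __init__(self, in_dim=512, out_dim=768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

head = ProjectionHead().to(device)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

def alignment_step(images, llm_text_targets, temperature=0.07):
    """One training step: symmetric InfoNCE between projected image features and
    precomputed LLM-encoder embeddings for the same (unlabelled) batch."""
    with torch.no_grad():
        img_feat = clip_model.encode_image(images).float()   # frozen CLIP features
    z_img = head(img_feat)                                   # (B, D), trainable path
    z_txt = F.normalize(llm_text_targets.float(), dim=-1)    # (B, D), frozen targets
    logits = z_img @ z_txt.t() / temperature
    labels = torch.arange(images.shape[0], device=device)
    loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def zero_shot_predict(images, class_prompts):
    """Zero-shot FER: project image and class-prompt features through the trained
    head, then take the nearest class (one plausible inference setup)."""
    text_tokens = clip.tokenize(class_prompts).to(device)
    z_txt = head(clip_model.encode_text(text_tokens).float())
    z_img = head(clip_model.encode_image(images).float())
    return (z_img @ z_txt.t()).argmax(dim=-1)
```

In this sketch, `images` are tensors produced by CLIP's `preprocess` transform and `class_prompts` would be short expression descriptions (for example, "a photo of a happy face"); only the projection head is updated, which is what makes the training efficient relative to fine-tuning the full backbone.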

Authors (4)
  1. Zengqun Zhao (4 papers)
  2. Yu Cao (129 papers)
  3. Shaogang Gong (94 papers)
  4. Ioannis Patras (73 papers)
Citations (3)