Enhancing Zero-Shot Facial Expression Recognition by LLM Knowledge Transfer (2405.19100v3)
Abstract: Current facial expression recognition (FER) models are typically trained in a supervised manner and are therefore constrained by the lack of large-scale facial expression images with high-quality annotations. Consequently, these models often generalize poorly, performing badly on unseen images at inference time. Vision-language-based zero-shot models offer a promising way to address this challenge; however, they lack task-specific knowledge and are therefore not optimized for the nuances of recognizing facial expressions. To bridge this gap, this work proposes a novel method, Exp-CLIP, which enhances zero-shot FER by transferring task knowledge from large language models (LLMs). Specifically, on top of the pre-trained vision-language encoders, we incorporate a projection head designed to map the initial joint vision-language space into a space that captures representations of facial actions. To train this projection head for subsequent zero-shot prediction, we propose to align the projected visual representations with task-specific semantic meanings derived from the LLM encoder, where a text-instruction-based strategy is employed to customize the LLM knowledge. With only unlabelled facial data and efficient training of the projection head, Exp-CLIP achieves zero-shot results superior to those of CLIP models and several other large vision-language models (LVLMs) on seven in-the-wild FER datasets. The code and pre-trained models are available at https://github.com/zengqunzhao/Exp-CLIP.
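Below is a minimal PyTorch sketch, not the authors' released implementation, of the training idea described in the abstract: a lightweight projection head sits on top of frozen CLIP visual features and is trained on unlabelled faces to align with instruction-conditioned text embeddings from an LLM encoder. The choice of Flan-T5 as the LLM encoder, the symmetric contrastive loss, and all dimensions and hyperparameters here are assumptions for illustration; consult the linked repository for the exact procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Maps frozen CLIP features into a space aligned with LLM text semantics."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=-1)


def alignment_loss(v: torch.Tensor, t: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss: matched (image, LLM-text) pairs lie on the diagonal."""
    logits = v @ t.t() / temperature                       # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Placeholder tensors: in practice these would come from the frozen encoders
# (e.g., CLIP ViT image features, and Flan-T5 encoder embeddings of captions
# prefixed with a task instruction such as "focus on the facial expression").
batch, clip_dim, llm_dim = 32, 512, 768
clip_image_feats = torch.randn(batch, clip_dim)            # frozen features, no gradients flow into CLIP
llm_text_feats = F.normalize(torch.randn(batch, llm_dim), dim=-1)

head = ProjectionHead(clip_dim, llm_dim)                    # the only trainable component
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

for step in range(3):                                       # toy loop; real training iterates over unlabelled faces
    loss = alignment_loss(head(clip_image_feats), llm_text_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss = {loss.item():.4f}")
```

A plausible reading of the abstract is that, at inference, the same trained head is applied to CLIP image features and to CLIP text features of expression prompts (e.g., "a photo of a happy face"), with the predicted class chosen by highest cosine similarity in the projected space.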
Authors: Zengqun Zhao, Yu Cao, Shaogang Gong, Ioannis Patras