
GPT-4V with Emotion: A Zero-shot Benchmark for Generalized Emotion Recognition (2312.04293v3)

Published 7 Dec 2023 in cs.CV and cs.MM

Abstract: Recently, GPT-4 with Vision (GPT-4V) has demonstrated remarkable visual capabilities across various tasks, but its performance in emotion recognition has not been fully evaluated. To bridge this gap, we present the quantitative evaluation results of GPT-4V on 21 benchmark datasets covering 6 tasks: visual sentiment analysis, tweet sentiment analysis, micro-expression recognition, facial emotion recognition, dynamic facial emotion recognition, and multimodal emotion recognition. This paper collectively refers to these tasks as "Generalized Emotion Recognition (GER)". Through experimental analysis, we observe that GPT-4V exhibits strong visual understanding capabilities in GER tasks. Meanwhile, GPT-4V shows the ability to integrate multimodal clues and exploit temporal information, which is also critical for emotion recognition. However, it's worth noting that GPT-4V is primarily designed for general domains and cannot recognize micro-expressions that require specialized knowledge. To the best of our knowledge, this paper provides the first quantitative assessment of GPT-4V for GER tasks. We have open-sourced the code and encourage subsequent researchers to broaden the evaluation scope by including more tasks and datasets. Our code and evaluation results are available at: https://github.com/zeroQiaoba/gpt4v-emotion.

Evaluation of GPT-4V with Emotion in Generalized Emotion Recognition Tasks

The academic paper "GPT-4V with Emotion: A Zero-shot Benchmark for Generalized Emotion Recognition" provides an in-depth evaluation of GPT-4V's applicability to Generalized Emotion Recognition (GER). The work is timely given the growing attention on multimodal LLMs and their potential for emotion recognition, particularly in light of GPT-4V's enhanced visual capabilities.

Evaluation Overview

The paper presents the first quantitative assessment of GPT-4V's capabilities in GER. The evaluation is grounded in 21 benchmark datasets spanning six tasks: visual sentiment analysis, tweet sentiment analysis, micro-expression recognition, facial emotion recognition, dynamic facial emotion recognition, and multimodal emotion recognition. Grouping these tasks under the umbrella of GER provides a comprehensive view of GPT-4V's performance across different emotion recognition challenges.
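Concretely, the protocol amounts to zero-shot prompting: each image (or frame sequence) is sent to GPT-4V along with the dataset's label space, and the returned label is scored against the ground truth. Below is a minimal sketch of such a query using the OpenAI Python client; the prompt wording, model name, and EMOTION_LABELS list are illustrative assumptions rather than the authors' released code (which is available in the linked repository).

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical label space; each benchmark dataset defines its own.
EMOTION_LABELS = ["happy", "sad", "angry", "fearful",
                  "surprised", "disgusted", "neutral"]

def classify_emotion(image_path: str) -> str:
    """Ask GPT-4V for a single emotion label, zero-shot."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # the vision-capable model at the time
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Classify the emotion shown in this image. "
                         f"Answer with exactly one of: {', '.join(EMOTION_LABELS)}."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=10,
    )
    return response.choices[0].message.content.strip().lower()
```

Scoring then reduces to comparing the returned string against the dataset's ground-truth label for each sample.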

Key Findings

The detailed empirical analysis yields several insights:

  1. Performance on General Tasks: GPT-4V demonstrates significant proficiency in general-purpose emotion recognition tasks such as visual sentiment analysis and facial emotion recognition, notably surpassing heuristic baselines such as random guessing and majority voting (both are computed in the sketch after this list).
  2. Limitations in Specialized Domains: On micro-expression recognition, which requires specialized knowledge and the detection of subtle emotional nuances, GPT-4V performs worse than traditional supervised systems, indicating a limitation when the model is applied to domains that demand domain-specific expertise.
  3. Multimodal Integration and Temporal Modeling: The model's ability to synthesize information from multiple modalities and to model temporal dependencies is substantiated by its performance on dynamic facial emotion recognition and multimodal emotion recognition tasks. This capability broadens GPT-4V's potential in scenarios where emotions are expressed and perceived through combined cues over time.
  4. Robustness and Stability: The paper notes run-to-run variation in predictions and examines the effect of different modalities and input formats. The analysis further shows that GPT-4V is robust to changes in color space and to prompt template variations, which contributes to its adaptability across experimental settings.
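The heuristic baselines in finding 1 are straightforward to compute, and a common way to tame the run-to-run variation noted in finding 4 is to query each sample several times and aggregate by majority vote. The sketch below illustrates both; the majority-vote aggregation is a standard technique, not necessarily the authors' exact protocol.

```python
from collections import Counter

def random_baseline(labels: list[str]) -> float:
    """Expected accuracy of guessing uniformly at random over the K observed classes."""
    return 1.0 / len(set(labels))

def majority_baseline(labels: list[str]) -> float:
    """Accuracy of always predicting the dataset's most frequent label."""
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)

def majority_vote(predictions: list[str]) -> str:
    """Aggregate repeated queries of the same sample to reduce prediction variance."""
    return Counter(predictions).most_common(1)[0][0]
```

For a seven-class dataset in which 40% of samples share one label, the random baseline is about 14.3% and the majority baseline is 40%; these are the floors a useful model must clear.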

Implications and Future Work

The implications of this work extend to both practical applications and theoretical directions. Practically, the paper suggests applications in social media analysis, education technology, and customer interaction platforms, where understanding emotions plays a crucial role. Theoretically, it invites further work on broadening modality support, notably the integration of audio, to better capture the multifaceted nature of human emotions.

The observed limitations, including unstable predictions and issues arising from GPT-4V's security checks, point to avenues for refining training techniques and architectures so that models like GPT-4V become more effective at emotion recognition. The paper also highlights few-shot learning as a way to improve the model's handling of domain-specific tasks such as micro-expression recognition; a sketch of that prompting strategy follows.
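In that few-shot direction, the idea is to prepend a handful of labeled exemplar images to the query so that GPT-4V sees the domain's label conventions in context. A minimal sketch, assuming the same message format as the zero-shot example above; the function name and exemplar-selection strategy are hypothetical, not taken from the paper:

```python
def build_few_shot_messages(exemplars: list[tuple[str, str]],
                            query_b64: str,
                            labels: list[str]) -> list[dict]:
    """Assemble a chat request with labeled exemplar images before the query image.

    `exemplars` holds (base64_image, label) pairs; all names here are
    illustrative rather than taken from the paper's released code.
    """
    content = [{"type": "text",
                "text": "Here are labeled examples of the target expressions, "
                        "followed by a new image to classify."}]
    for b64, label in exemplars:
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
        content.append({"type": "text", "text": f"Label: {label}"})
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{query_b64}"}})
    content.append({"type": "text",
                    "text": f"Answer with exactly one of: {', '.join(labels)}."})
    return [{"role": "user", "content": content}]
```

How to pick the exemplars (random, class-balanced, or nearest-neighbor) is an open design choice that the paper's discussion leaves to future work.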

Conclusion

This paper places GPT-4V at the confluence of advanced visual processing capabilities and emotion recognition tasks, encapsulating its potential and current limitations. By establishing a benchmark and offering detailed evaluations, the authors set a foundation for ongoing research aimed at deepening the integration of multimodal systems for enhanced machine emotional intelligence. This exploration signifies a crucial step forward in refining how AI systems perceive and interpret human-like emotions, contributing to the evolution of empathetic and contextually aware computational models.

Authors (8)
  1. Zheng Lian
  2. Licai Sun
  3. Haiyang Sun
  4. Kang Chen
  5. Zhuofan Wen
  6. Hao Gu
  7. Bin Liu
  8. Jianhua Tao