
VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning (2404.07078v1)

Published 10 Apr 2024 in cs.CV and cs.HC

Abstract: Recognising emotions in context involves identifying the apparent emotions of an individual, taking into account contextual cues from the surrounding scene. Previous approaches to this task have involved the design of explicit scene-encoding architectures or the incorporation of external scene-related information, such as captions. However, these methods often utilise limited contextual information or rely on intricate training pipelines. In this work, we leverage the groundbreaking capabilities of Vision-and-Large-Language Models (VLLMs) to enhance in-context emotion classification in a two-stage approach, without introducing complexity to the training process. In the first stage, we propose prompting VLLMs to generate natural-language descriptions of the subject's apparent emotion relative to the visual context. In the second stage, the descriptions serve as contextual information and, together with the image input, are used to train a transformer-based architecture that fuses text and visual features before the final classification task. Our experimental results show that the text and image features carry complementary information, and our fused architecture significantly outperforms the individual modalities without any complex training methods. We evaluate our approach on three different datasets, namely EMOTIC, CAER-S, and BoLD, and achieve state-of-the-art or comparable accuracy across all datasets and metrics compared to much more complex approaches. The code will be made publicly available on GitHub: https://github.com/NickyFot/EmoCommonSense.git
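The abstract describes a two-stage pipeline: a VLLM first generates a natural-language description of the subject's apparent emotion given the surrounding scene, and a transformer then fuses the description's text features with the image features for classification. The PyTorch sketch below illustrates one plausible shape of that second stage; the `FusionClassifier` name, prompt wording, dimensions, and layer counts are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of stage 2: fusing VLLM-generated emotion
# descriptions with image features before classification.
# Names, dimensions, and layer counts are assumptions for illustration.
import torch
import torch.nn as nn


class FusionClassifier(nn.Module):
    """Projects image and text tokens into a shared space, runs a small
    transformer encoder over the joint sequence, and classifies
    (e.g., over the 26 EMOTIC emotion categories)."""

    def __init__(self, img_dim=768, txt_dim=768, d_model=512,
                 n_layers=2, n_heads=8, n_classes=26):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d_model)   # e.g., ViT patch tokens
        self.txt_proj = nn.Linear(txt_dim, d_model)   # e.g., RoBERTa tokens
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_img, img_dim); txt_feats: (B, N_txt, txt_dim)
        tokens = torch.cat(
            [self.img_proj(img_feats), self.txt_proj(txt_feats)], dim=1)
        fused = self.fusion(tokens)           # joint attention across modalities
        return self.head(fused.mean(dim=1))   # mean-pool, then class logits


# Stage 1 would be a prompt to a VLLM such as LLaVA; an illustrative prompt:
# "Describe how the person in the image appears to feel, considering the
# surrounding scene." The generated description is then encoded (e.g., with
# RoBERTa) to produce txt_feats for the classifier above.
```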

Authors (5)
  1. Alexandros Xenos
  2. Niki Maria Foteinopoulou
  3. Ioanna Ntinou
  4. Ioannis Patras
  5. Georgios Tzimiropoulos
Citations (8)