Contextual Emotion Estimation from Image Captions (2309.13136v1)
Abstract: Emotion estimation in images is a challenging task, typically approached with computer vision methods that estimate people's emotions directly from face, body pose, and contextual cues. In this paper, we explore whether LLMs can support contextual emotion estimation by first captioning images and then using an LLM for inference. First, we must understand how well LLMs perceive human emotions, and which parts of the information enable them to determine emotions. One initial challenge is to construct a caption that describes a person within a scene with information relevant for emotion perception. Towards this goal, we propose a set of natural language descriptors for faces, bodies, interactions, and environments. We use them to manually generate captions and emotion annotations for a subset of 331 images from the EMOTIC dataset. These captions offer an interpretable representation for emotion estimation, towards understanding how elements of a scene affect emotion perception in LLMs and beyond. Second, we test the capability of an LLM to infer an emotion from the resulting image captions. We find that GPT-3.5, specifically the text-davinci-003 model, provides surprisingly reasonable emotion predictions consistent with human annotations, but accuracy can depend on the emotion concept. Overall, the results suggest promise in the image-captioning-plus-LLM approach.
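The caption-then-LLM pipeline described above can be sketched as a simple prompting step. The snippet below is an illustrative sketch only: the caption text, prompt wording, and API parameters are assumptions and not the authors' exact setup. It uses the legacy openai Python SDK (<1.0) Completions interface and the text-davinci-003 model named in the abstract, which OpenAI has since deprecated.

```python
# Minimal sketch of querying an LLM with a structured image caption
# (face, body, interactions, environment) to elicit emotion labels.
# The caption and prompt format are hypothetical examples.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

caption = (
    "A man with raised eyebrows and a wide smile stands with open arms "
    "next to a birthday cake, surrounded by friends in a decorated room."
)

prompt = (
    f"Description: {caption}\n"
    "Question: Which emotions is this person most likely feeling? "
    "Answer with a short list of emotion words.\n"
    "Answer:"
)

response = openai.Completion.create(
    model="text-davinci-003",  # model reported in the paper; now deprecated
    prompt=prompt,
    max_tokens=32,
    temperature=0.0,  # low temperature for more repeatable predictions
)
print(response["choices"][0]["text"].strip())
```

In practice, the predicted emotion words would then be compared against human annotations (e.g., the EMOTIC label set) to assess agreement per emotion concept.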
Authors: Vera Yang, Archita Srivastava, Yasaman Etesam, Chuxuan Zhang, Angelica Lim