
Enhancing the Prediction of Emotional Experience in Movies using Deep Neural Networks: The Significance of Audio and Language (2306.10397v1)

Published 17 Jun 2023 in cs.CV and cs.AI

Abstract: Our paper focuses on using deep neural network models to accurately predict the range of human emotions experienced while watching movies. In this setting, there are three distinct input modalities that considerably influence the experienced emotions: visual cues derived from RGB video frames, auditory components encompassing sounds, speech, and music, and linguistic elements comprising the actors' dialogues. Emotions are commonly described using a two-factor model consisting of valence (ranging from happy to sad) and arousal (indicating the intensity of the emotion). A plethora of works have presented models aiming to predict valence and arousal from video content. However, none of these models incorporates all three modalities, with language consistently omitted. In this study, we comprehensively combine all modalities and analyze the importance of each in predicting valence and arousal. We represent each input modality using pre-trained neural networks. To process visual input, we employ pre-trained convolutional neural networks that recognize scenes [1], objects [2], and actions [3, 4]. For audio processing, we utilize SoundNet [5], a neural network designed for sound-related tasks. Finally, Bidirectional Encoder Representations from Transformers (BERT) models are used to extract linguistic features [6]. We report results on the COGNIMUSE dataset [7], where our proposed model outperforms the current state-of-the-art approaches. Surprisingly, our findings reveal that language significantly influences the experienced arousal, while sound emerges as the primary determinant for predicting valence. In contrast, the visual modality has the least impact among all modalities in predicting emotions.
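
The abstract describes fusing pre-trained visual, audio (SoundNet), and language (BERT) representations to regress valence and arousal. The paper does not spell out the fusion architecture here, so the following is only a minimal late-fusion sketch in PyTorch: feature dimensions, layer sizes, and the Tanh output range are illustrative assumptions, not the authors' actual model.

```python
import torch
import torch.nn as nn


class MultimodalAffectRegressor(nn.Module):
    """Sketch of a late-fusion regressor over per-clip visual, audio, and
    language embeddings, predicting valence and arousal.
    All dimensions below are assumptions for illustration."""

    def __init__(self, visual_dim=2048, audio_dim=1024, text_dim=768, hidden_dim=256):
        super().__init__()
        # One small projection head per modality (e.g. CNN scene/object/action
        # features, SoundNet activations, BERT sentence embeddings).
        self.visual_proj = nn.Sequential(nn.Linear(visual_dim, hidden_dim), nn.ReLU())
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.text_proj = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        # Fusion MLP maps the concatenated modality representations to two
        # continuous outputs: valence and arousal.
        self.fusion = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),
            nn.Tanh(),  # assumed target range of [-1, 1] for both dimensions
        )

    def forward(self, visual_feats, audio_feats, text_feats):
        fused = torch.cat(
            [
                self.visual_proj(visual_feats),
                self.audio_proj(audio_feats),
                self.text_proj(text_feats),
            ],
            dim=-1,
        )
        return self.fusion(fused)  # shape: (batch, 2) -> (valence, arousal)


if __name__ == "__main__":
    model = MultimodalAffectRegressor()
    # Dummy batch of 4 movie clips with precomputed per-modality features.
    v = torch.randn(4, 2048)
    a = torch.randn(4, 1024)
    t = torch.randn(4, 768)
    print(model(v, a, t).shape)  # torch.Size([4, 2])
```

Dropping one of the three projection branches from this kind of model is also how a per-modality ablation (the importance analysis the abstract mentions) could be run.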

References (26)
  1. “Places: An image database for deep scene understanding,” arXiv preprint arXiv:1610.02055, 2016.
  2. “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  3. “Quo vadis, action recognition? a new model and the kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
  4. “Plug-and-play cnn for crowd motion analysis: An application in abnormal event detection,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 1689–1698.
  5. “Soundnet: Learning sound representations from unlabeled video,” in Advances in neural information processing systems, 2016, pp. 892–900.
  6. “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  7. Athanasia Zlatintsi et al., “Cognimuse: a multimodal video database annotated with saliency, events, semantics and emotion with application to summarization,” EURASIP Journal on Image and Video Processing, 2017.
  8. “Frontal brain electrical activity (eeg) distinguishes valence and intensity of musical emotions,” Cognition & Emotion, vol. 15, no. 4, pp. 487–500, 2001.
  9. “Music and emotion: electrophysiological correlates of the processing of pleasant and unpleasant music,” Psychophysiology, vol. 44, no. 2, pp. 293–304, 2007.
  10. “Emotion-based crowd representation for abnormality detection,” 07 2016.
  11. Stefan Koelsch, “Towards a neural basis of music-evoked emotions,” Trends in cognitive sciences, vol. 14, no. 3, pp. 131–137, 2010.
  12. “Emotions evoked by the sound of music: characterization, classification, and measurement.,” Emotion, vol. 8, no. 4, pp. 494, 2008.
  13. “Speech emotion recognition using deep neural network considering verbal and nonverbal speech sounds,” in ICASSP 2019. IEEE, 2019, pp. 5866–5870.
  14. “Attention-based convolutional neural network and long short-term memory for short-term detection of mood disorders based on elicited speech responses,” 2019.
  15. “Learning supervised scoring ensemble for emotion recognition in the wild,” in Proceedings of the 19th ACM international conference on multimodal interaction, 2017.
  16. “Techniques and applications of emotion recognition in speech,” in 2016 MIPRO. IEEE, 2016, pp. 1278–1283.
  17. Lianzhang Zhu et al., “Emotion recognition from chinese speech for smart affective services using a combination of svm and dbn,” 2017.
  18. Jeffrey F Cohn and Fernando De la Torre, “Automated face analysis for affective computing,” in The Oxford Handbook of Affective Computing, p. 131, 2014.
  19. “Multimodal human behavior analysis: learning correlation and interaction across modalities,” in ACM, 2012.
  20. Malcolm Slaney, “Semantic-audio retrieval,” in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 2002, vol. 4, pp. IV–4108.
  21. “Bottom-up and top-down attention for image captioning and visual question answering,” in CVPR, 2018, pp. 6077–6086.
  22. “Multimodal deep models for predicting affective responses evoked by movies,” arXiv preprint arXiv:1909.06957, 2019.
  23. “A supervised approach to movie emotion tracking,” in ICASSP 2011. IEEE, 2011, pp. 2376–2379.
  24. “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  25. “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013.
  26. “Smoothing and differentiation of data by simplified least squares procedures.,” Analytical chemistry, vol. 36, no. 8, pp. 1627–1639, 1964.