EMERSK -- Explainable Multimodal Emotion Recognition with Situational Knowledge (2306.08657v1)

Published 14 Jun 2023 in cs.CV, cs.LG, and cs.MM

Abstract: Automatic emotion recognition has recently gained significant attention due to the growing popularity of deep learning algorithms. One of the primary challenges in emotion recognition is effectively utilizing the various cues (modalities) available in the data. Another challenge is providing a proper explanation of the outcome of the learning. To address these challenges, we present Explainable Multimodal Emotion Recognition with Situational Knowledge (EMERSK), a generalized and modular system for human emotion recognition and explanation using visual information. Our system can handle multiple modalities, including facial expressions, posture, and gait, in a flexible and modular manner. The network consists of different modules that can be added or removed depending on the available data. We utilize a two-stream network architecture with convolutional neural networks (CNNs) and encoder-decoder style attention mechanisms to extract deep features from face images. Similarly, CNNs and recurrent neural networks (RNNs) with Long Short-term Memory (LSTM) are employed to extract features from posture and gait data. We also incorporate deep features from the background as contextual information for the learning process. The deep features from each module are fused using an early fusion network. Furthermore, we leverage situational knowledge derived from the location type and adjective-noun pair (ANP) extracted from the scene, as well as the spatio-temporal average distribution of emotions, to generate explanations. Ablation studies demonstrate that each sub-network can independently perform emotion recognition, and combining them in a multimodal approach significantly improves overall recognition performance. Extensive experiments conducted on various benchmark datasets, including GroupWalk, validate the superior performance of our approach compared to other state-of-the-art methods.
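
To make the modular early-fusion design concrete, below is a minimal PyTorch sketch of the kind of architecture the abstract describes: a CNN stream for face crops, an LSTM stream for pose/gait sequences, and a fusion head over the concatenated features. All module names, layer sizes, the 33-keypoint pose input, and the seven-class output are illustrative assumptions, not the authors' actual EMERSK implementation (which additionally uses two-stream attention, background context features, and situational knowledge for generating explanations).

```python
# Illustrative sketch only: module names, dimensions, and class count are
# assumptions, not the authors' EMERSK code.
import torch
import torch.nn as nn


class FaceStream(nn.Module):
    """Small CNN feature extractor for face crops (stands in for the paper's
    two-stream CNN with encoder-decoder style attention)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, x):              # x: (B, 3, H, W) face crop
        return self.cnn(x)             # (B, feat_dim)


class GaitStream(nn.Module):
    """LSTM over per-frame 2-D pose keypoints for posture/gait cues."""
    def __init__(self, n_joints=33, feat_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_joints * 2, hidden_size=feat_dim,
                            batch_first=True)

    def forward(self, x):              # x: (B, T, n_joints * 2) keypoint sequence
        _, (h, _) = self.lstm(x)
        return h[-1]                   # (B, feat_dim) final hidden state


class EarlyFusionClassifier(nn.Module):
    """Concatenates features from whichever modality streams are present and
    classifies; streams can be added or removed, mirroring the modular design."""
    def __init__(self, streams, n_classes=7, feat_dim=128):
        super().__init__()
        self.streams = nn.ModuleDict(streams)
        self.head = nn.Sequential(
            nn.Linear(feat_dim * len(streams), 256), nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, inputs):
        feats = [self.streams[name](inputs[name]) for name in self.streams]
        return self.head(torch.cat(feats, dim=1))


if __name__ == "__main__":
    model = EarlyFusionClassifier({"face": FaceStream(), "gait": GaitStream()})
    logits = model({
        "face": torch.randn(2, 3, 96, 96),     # batch of face crops
        "gait": torch.randn(2, 30, 33 * 2),    # 30 frames of 33 (x, y) keypoints
    })
    print(logits.shape)                        # torch.Size([2, 7])
```

The ModuleDict makes each sub-network usable on its own or in combination, which is the property the abstract's ablation studies examine: each stream can perform emotion recognition independently, and fusing them improves overall performance.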
