Self-supervised Gait-based Emotion Representation Learning from Selective Strongly Augmented Skeleton Sequences (2405.04900v1)
Abstract: Emotion recognition is an important part of affective computing. Extracting emotional cues from human gait offers benefits such as natural interaction, nonintrusive sensing, and remote detection. Recently, self-supervised learning techniques have provided a practical remedy for the scarcity of labeled data in gait-based emotion recognition. However, owing to the limited diversity of gaits and the incompleteness of skeleton feature representations, existing contrastive learning methods are often inefficient at capturing gait-conveyed emotions. In this paper, we propose a contrastive learning framework with selective strong augmentation (SSA) for self-supervised gait-based emotion representation, which aims to derive effective representations from limited labeled gait data. First, we propose an SSA method tailored to the gait emotion recognition task, comprising upper-body jitter and random spatiotemporal masking. SSA generates more diverse and targeted positive samples and prompts the model to learn more distinctive and robust feature representations. Then, we design a complementary feature fusion network (CFFN) that integrates cross-domain information to acquire topological structural and global adaptive features. Finally, we employ a distributional divergence minimization loss to supervise the representation learning of the generally and strongly augmented queries. Our approach is validated on the Emotion-Gait (E-Gait) and Emilya datasets and outperforms state-of-the-art methods under different evaluation protocols.
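The two strong augmentations and the divergence-based supervision described above can be illustrated with a minimal sketch. This is not the authors' implementation: the joint indices, masking ratios, temperature, and function names are assumptions introduced only to make the ideas concrete, namely Gaussian jitter on upper-body joints, random zeroing of frames and joints, and a KL term that pulls the strongly augmented query's similarity distribution toward that of the generally augmented query.

```python
import torch
import torch.nn.functional as F

# Hypothetical upper-body joint indices for a 16-joint skeleton; the actual
# layout depends on the dataset (e.g., E-Gait).
UPPER_BODY_JOINTS = [8, 9, 10, 11, 12, 13, 14, 15]

def upper_body_jitter(seq: torch.Tensor, sigma: float = 0.05) -> torch.Tensor:
    """Add Gaussian noise to upper-body joints of a (T, J, C) skeleton sequence."""
    out = seq.clone()
    out[:, UPPER_BODY_JOINTS, :] += sigma * torch.randn_like(out[:, UPPER_BODY_JOINTS, :])
    return out

def random_spatiotemporal_mask(seq: torch.Tensor, frame_ratio: float = 0.1,
                               joint_ratio: float = 0.1) -> torch.Tensor:
    """Zero out a random subset of frames and a random subset of joints."""
    out = seq.clone()
    T, J, _ = seq.shape
    frames = torch.randperm(T)[: max(1, int(frame_ratio * T))]
    joints = torch.randperm(J)[: max(1, int(joint_ratio * J))]
    out[frames, :, :] = 0.0
    out[:, joints, :] = 0.0
    return out

def selective_strong_augmentation(seq: torch.Tensor) -> torch.Tensor:
    """Compose both strong augmentations to build a strongly augmented query."""
    return random_spatiotemporal_mask(upper_body_jitter(seq))

def distribution_divergence_loss(q_general: torch.Tensor, q_strong: torch.Tensor,
                                 negatives: torch.Tensor, t: float = 0.07) -> torch.Tensor:
    """KL divergence between the similarity distributions of the generally and
    strongly augmented queries over a shared set of negatives (illustrative names)."""
    p_general = F.softmax(q_general @ negatives.t() / t, dim=1).detach()
    log_p_strong = F.log_softmax(q_strong @ negatives.t() / t, dim=1)
    return F.kl_div(log_p_strong, p_general, reduction='batchmean')
```

In a MoCo-style setup, `negatives` would typically come from a momentum-updated memory bank, and a divergence term of this form would be added alongside the usual contrastive loss on the generally augmented query.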