Enhancing Micro Gesture Recognition for Emotion Understanding via Context-aware Visual-Text Contrastive Learning (2405.01885v1)
Abstract: Psychological studies have shown that micro gestures (MGs) are closely linked to human emotions. MG-based emotion understanding has attracted considerable attention because it infers emotion from nonverbal body gestures without relying on identity information (e.g., facial or electrocardiogram data). Effective MG recognition is therefore essential for advanced emotion understanding. However, existing micro gesture recognition (MGR) methods use only a single modality (e.g., RGB or skeleton) and overlook valuable textual information. In this letter, we propose a simple but effective visual-text contrastive learning solution that exploits textual information for MGR. Moreover, instead of relying on handcrafted prompts for visual-text contrastive learning, we introduce a novel module, adaptive prompting, that generates context-aware prompts. Experimental results show that the proposed method achieves state-of-the-art performance on two public datasets. Furthermore, in an empirical study that feeds MGR results into emotion understanding, we demonstrate that using the textual results of MGR improves performance by more than 6% compared to directly using video as input.
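To make the pipeline the abstract describes more concrete, here is a minimal PyTorch sketch of CLIP-style visual-text contrastive learning with a prompt module conditioned on the video feature. It is an illustration under stated assumptions, not the paper's implementation: the `AdaptivePrompt` module, the feature dimension, the temperature, and the encoder choices (e.g., a video backbone such as R(2+1)D and a text encoder such as DistilBERT) are all hypothetical stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptivePrompt(nn.Module):
    """Hypothetical stand-in for the paper's adaptive prompting module:
    conditions learnable prompt tokens on the video feature so the
    resulting prompts are context-aware."""
    def __init__(self, dim=512, n_tokens=4):
        super().__init__()
        self.tokens = nn.Parameter(0.02 * torch.randn(n_tokens, dim))
        self.ctx_proj = nn.Linear(dim, dim)

    def forward(self, video_feat):                      # video_feat: (B, dim)
        ctx = self.ctx_proj(video_feat).unsqueeze(1)    # (B, 1, dim)
        return self.tokens.unsqueeze(0) + ctx           # (B, n_tokens, dim)

def clip_style_loss(video_feat, text_feat, temperature=0.07):
    """Symmetric InfoNCE over matched video/text pairs in a batch;
    the i-th video and i-th text embedding form the positive pair."""
    v = F.normalize(video_feat, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    logits = v @ t.T / temperature                      # (B, B) similarities
    labels = torch.arange(v.size(0), device=v.device)   # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

# Toy forward pass: random tensors stand in for encoder outputs.
B, D = 8, 512
video_feat = torch.randn(B, D)                          # from a video encoder
prompt_tokens = AdaptivePrompt(dim=D)(video_feat)       # (B, 4, D)
text_feat = torch.randn(B, D) + prompt_tokens.mean(dim=1)  # prompt-conditioned text
print(clip_style_loss(video_feat, text_feat).item())
```

In this sketch, pooled prompt tokens simply shift the text features; the actual method would pass the generated prompts through the text encoder alongside the gesture-class text.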