Benchmarking Micro-action Recognition: Dataset, Methods, and Applications (2403.05234v2)

Published 8 Mar 2024 in cs.CV

Abstract: Micro-action is an imperceptible non-verbal behaviour characterised by low-intensity movement. It offers insights into the feelings and intentions of individuals and is important for human-oriented applications such as emotion recognition and psychological assessment. However, the identification, differentiation, and understanding of micro-actions pose challenges due to the imperceptible and inaccessible nature of these subtle human behaviours in everyday life. In this study, we collect a new micro-action dataset designated Micro-action-52 (MA-52) and propose a benchmark named micro-action network (MANet) for the micro-action recognition (MAR) task. Uniquely, MA-52 provides a whole-body perspective including gestures and upper- and lower-limb movements, aiming to reveal comprehensive micro-action cues. In detail, MA-52 contains 52 micro-action categories along with seven body-part labels and encompasses a full array of realistic and natural micro-actions, comprising 205 participants and 22,422 video instances collected from psychological interviews. Based on the proposed dataset, we assess MANet and nine other prevalent action recognition methods. MANet incorporates squeeze-and-excitation (SE) and temporal shift module (TSM) components into the ResNet architecture to model the spatiotemporal characteristics of micro-actions. A joint-embedding loss is then designed for semantic matching between videos and action labels, helping to better distinguish visually similar yet distinct micro-action categories. An extended application to emotion recognition demonstrates one important value of the proposed dataset and method. Future work will explore human behaviour, emotion, and psychological assessment in greater depth. The dataset and source code are released at https://github.com/VUT-HFUT/Micro-Action.
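The abstract names three concrete ingredients of MANet: squeeze-and-excitation blocks, a temporal shift module on a ResNet backbone, and a joint-embedding loss that matches videos to action-label semantics. The released code at the linked repository is the authoritative implementation; the snippet below is only a minimal PyTorch sketch of those three pieces, with the layer sizes, shift fraction, temperature, and the contrastive formulation of the loss chosen for illustration rather than taken from the paper.

```python
# Minimal sketch of the mechanisms described in the abstract. All sizes and
# hyperparameters here are illustrative assumptions, not MANet's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SEBlock(nn.Module):
    """Squeeze-and-excitation: re-weight channels via global pooling + small MLP."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N*T, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))                   # (N*T, C) channel weights
        return x * w[:, :, None, None]


def temporal_shift(x: torch.Tensor, n_frames: int, fold_div: int = 8) -> torch.Tensor:
    """TSM-style shift: move a fraction of channels one frame along time."""
    nt, c, h, w = x.shape
    x = x.view(nt // n_frames, n_frames, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                       # channels that see the next frame
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]       # channels that see the previous frame
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]                  # remaining channels unchanged
    return out.view(nt, c, h, w)


def joint_embedding_loss(video_emb: torch.Tensor,
                         label_emb: torch.Tensor,
                         targets: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Match video embeddings against the embeddings of all action labels;
    the ground-truth label acts as the positive class (contrastive-style,
    an assumed formulation of the paper's joint-embedding objective)."""
    video_emb = F.normalize(video_emb, dim=-1)     # (N, D)
    label_emb = F.normalize(label_emb, dim=-1)     # (num_classes, D)
    logits = video_emb @ label_emb.t() / temperature
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    n_videos, n_frames, channels = 2, 8, 64
    feats = torch.randn(n_videos * n_frames, channels, 14, 14)  # stand-in ResNet features
    feats = temporal_shift(feats, n_frames)
    feats = SEBlock(channels)(feats)
    # Pool over space and time to get one embedding per video.
    video_emb = feats.mean(dim=(2, 3)).view(n_videos, n_frames, channels).mean(dim=1)
    label_emb = torch.randn(52, channels)          # random stand-in for 52 label embeddings
    loss = joint_embedding_loss(video_emb, label_emb, torch.tensor([3, 17]))
    print(loss.item())
```

Running the script with random tensors prints a scalar loss. In MANet the shift and SE operations sit inside the ResNet residual blocks rather than being applied once to pooled features, and the label embeddings come from actual text representations of the 52 categories, so treat this only as an outline of the mechanics.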
