SpotFormer: Multi-Scale Spatio-Temporal Transformer for Facial Expression Spotting (2407.20799v1)
Abstract: Facial expression spotting, the task of identifying periods in which facial expressions occur in a video, is a significant yet challenging problem in facial expression analysis. Irrelevant facial movements and the difficulty of detecting the subtle motions of micro-expressions remain unresolved issues that hinder accurate expression spotting. In this paper, we propose an efficient framework for facial expression spotting. First, we propose a Sliding Window-based Multi-Resolution Optical flow (SW-MRO) feature, which computes multi-resolution optical flow of the input image sequence within compact sliding windows. The window length is tailored to capture complete micro-expressions and to distinguish between general macro- and micro-expressions. SW-MRO effectively reveals subtle motions while avoiding the problems caused by severe head movements. Second, we propose SpotFormer, a multi-scale spatio-temporal Transformer that simultaneously encodes the spatio-temporal relationships of the SW-MRO features for accurate frame-level probability estimation. In SpotFormer, our proposed Facial Local Graph Pooling (FLGP) and convolutional layers are applied for multi-scale spatio-temporal feature extraction. We validate SpotFormer's architecture by comparing it with several model variants. Third, we introduce supervised contrastive learning into SpotFormer to enhance the discriminability between different types of expressions. Extensive experiments on SAMM-LV and CAS(ME)2 show that our method outperforms state-of-the-art models, particularly in micro-expression spotting.
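To make the SW-MRO idea concrete, below is a minimal Python sketch using OpenCV's Farnebäck dense optical flow. The abstract does not specify the exact window anchoring, flow algorithm, or hyperparameters, so the reference-frame choice, window length, and scale set here are illustrative assumptions rather than the paper's configuration; the sketch also assumes grayscale uint8 frames of identical size.

```python
import cv2
import numpy as np

def sw_mro(frames, window_len=9, scales=(1.0, 0.5, 0.25)):
    """Sketch of a Sliding Window-based Multi-Resolution Optical flow feature.

    For each frame, dense optical flow is computed from the first frame of a
    compact sliding window to the current frame, at several image resolutions,
    and the per-scale flow maps are stacked channel-wise.

    frames: list of grayscale uint8 images of identical shape (H, W).
    window_len and scales are illustrative placeholders, not the paper's values.
    """
    h, w = frames[0].shape
    features = []
    for i, cur in enumerate(frames):
        ref = frames[max(0, i - window_len + 1)]  # window start as reference frame
        flows = []
        for s in scales:
            # Downsample the reference and current frames to scale s.
            ref_s = cv2.resize(ref, None, fx=s, fy=s)
            cur_s = cv2.resize(cur, None, fx=s, fy=s)
            # Farneback parameters (positional): pyr_scale, levels, winsize,
            # iterations, poly_n, poly_sigma, flags.
            flow = cv2.calcOpticalFlowFarneback(
                ref_s, cur_s, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            # Upsample back to (W, H) and rescale displacements to
            # full-resolution pixel units.
            flow = cv2.resize(flow, (w, h)) / s
            flows.append(flow)
        # (H, W, 2 * len(scales)): u/v flow channels per scale.
        features.append(np.concatenate(flows, axis=-1))
    return np.stack(features)  # (T, H, W, 2 * len(scales))
```

In the paper, frame-level features of this kind are then fed to SpotFormer for per-frame probability estimation; this sketch covers only the feature-extraction side.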
Authors: Yicheng Deng, Hideaki Hayashi, Hajime Nagahara