SpotFormer: Multi-Scale Spatio-Temporal Transformer for Facial Expression Spotting (2407.20799v1)

Published 30 Jul 2024 in cs.CV

Abstract: Facial expression spotting, identifying periods where facial expressions occur in a video, is a significant yet challenging task in facial expression analysis. The issues of irrelevant facial movements and the challenge of detecting subtle motions in micro-expressions remain unresolved, hindering accurate expression spotting. In this paper, we propose an efficient framework for facial expression spotting. First, we propose a Sliding Window-based Multi-Resolution Optical flow (SW-MRO) feature, which calculates multi-resolution optical flow of the input image sequence within compact sliding windows. The window length is tailored to perceive complete micro-expressions and distinguish between general macro- and micro-expressions. SW-MRO can effectively reveal subtle motions while avoiding severe head movement problems. Second, we propose SpotFormer, a multi-scale spatio-temporal Transformer that simultaneously encodes spatio-temporal relationships of the SW-MRO features for accurate frame-level probability estimation. In SpotFormer, our proposed Facial Local Graph Pooling (FLGP) and convolutional layers are applied for multi-scale spatio-temporal feature extraction. We show the validity of the architecture of SpotFormer by comparing it with several model variants. Third, we introduce supervised contrastive learning into SpotFormer to enhance the discriminability between different types of expressions. Extensive experiments on SAMM-LV and CAS(ME)2 show that our method outperforms state-of-the-art models, particularly in micro-expression spotting.


Summary

  • The paper introduces a novel SW-MRO feature that efficiently captures subtle micro-expressions while mitigating noise from head movements.
  • The SpotFormer architecture employs multi-scale spatio-temporal transformers with Facial Local Graph Pooling to extract robust facial features.
  • Supervised contrastive learning enhances expression differentiation, leading to a 49.4% F1-score improvement on micro-expression spotting.

Analysis of "SpotFormer: Multi-Scale Spatio-Temporal Transformer for Facial Expression Spotting"

This paper presents a novel framework for facial expression spotting, a critical task in facial expression analysis involving the identification of temporal segments in video data where specific facial expressions occur. The authors introduce several innovations aimed at addressing the persistent challenges of spotting subtle micro-expressions (MEs) and distinguishing these from more pronounced macro-expressions (MaEs) despite the noise caused by irrelevant facial movements such as head motion.

Key Contributions

  1. SW-MRO Features: The Sliding Window-based Multi-Resolution Optical flow (SW-MRO) feature is designed to capture subtle facial motions efficiently. Unlike optical flow computed between adjacent frames, the SW-MRO feature operates within compact sliding windows whose length is tailored so that a complete micro-expression fits inside a window, revealing subtle motions while limiting the influence of head movement. This windowing also helps distinguish the transient MEs from the longer MaEs by balancing sensitivity and robustness (a simplified sketch of such a feature is given after this list).
  2. SpotFormer Architecture: SpotFormer, a multi-scale spatio-temporal Transformer, is central to the proposed framework. It jointly encodes the spatial and temporal relationships of the SW-MRO features, with the proposed Facial Local Graph Pooling (FLGP) and convolutional layers providing feature extraction across multiple spatial and temporal scales for frame-level probability estimation (a pooling sketch also follows this list). The design reflects the broader trend of using Transformers to model dynamics in both spatial and temporal domains.
  3. Supervised Contrastive Learning: Incorporating supervised contrastive learning sharpens the discriminability between different expression types. This additional training objective contributes substantially to the improved performance, particularly in ME spotting (a minimal loss sketch is given after this list).

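To make the SW-MRO idea concrete, the following is a minimal sketch of a sliding-window, multi-resolution optical-flow feature. It assumes OpenCV's Farneback dense optical flow as the estimator and computes flow from the first frame of each window to every later frame; the window length, scale factors, and flow parameters are illustrative choices, not the paper's exact settings.

```python
# Minimal sketch of a sliding-window multi-resolution optical-flow feature.
# Window length, scale factors, and flow parameters are illustrative assumptions,
# not the paper's exact settings.
import cv2
import numpy as np

def sw_mro(gray_frames, window_len=12, scales=(1.0, 0.5, 0.25)):
    """gray_frames: list of grayscale frames (H, W) as uint8 arrays.
    For each sliding window, compute dense optical flow from the window's first
    frame to every later frame, at several spatial resolutions."""
    features = []
    for start in range(len(gray_frames) - window_len + 1):
        window = gray_frames[start:start + window_len]
        ref = window[0]
        per_scale = []
        for s in scales:
            size = (int(ref.shape[1] * s), int(ref.shape[0] * s))
            ref_s = cv2.resize(ref, size)
            flows = []
            for frame in window[1:]:
                cur_s = cv2.resize(frame, size)
                # Farneback dense flow between the reference and current frame.
                flow = cv2.calcOpticalFlowFarneback(
                    ref_s, cur_s, None, 0.5, 3, 15, 3, 5, 1.2, 0)
                flows.append(flow)
            per_scale.append(np.stack(flows))  # (window_len - 1, H_s, W_s, 2)
        features.append(per_scale)
    return features
```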
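
For the multi-scale spatial side, a rough sketch of pooling landmark-node features over predefined facial regions is shown below. The grouping and tensor shapes are hypothetical; the paper's actual FLGP operates on its own landmark graph and pooling hierarchy.

```python
# Hypothetical sketch of pooling landmark-node features into coarser facial
# regions; the groups below are illustrative, not the paper's FLGP partition.
import torch

def facial_local_graph_pool(node_feats, groups):
    """node_feats: (B, T, N, D) per-landmark features; groups: list of index lists.
    Max-pools each group of landmarks into a single coarser node."""
    pooled = [node_feats[:, :, idx, :].max(dim=2).values for idx in groups]
    return torch.stack(pooled, dim=2)  # (B, T, len(groups), D)

# Example: pool 12 hypothetical landmarks into 4 facial regions.
groups = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
x = torch.randn(2, 16, 12, 64)
print(facial_local_graph_pool(x, groups).shape)  # torch.Size([2, 16, 4, 64])
```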
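
The supervised contrastive objective can be sketched along the lines of the standard SupCon loss; the temperature and batch construction here are assumptions, and the paper may combine this loss with its classification objective differently.

```python
# Minimal supervised contrastive (SupCon) loss sketch; temperature and batch
# construction are illustrative assumptions.
import torch
import torch.nn.functional as F

def supcon_loss(features, labels, temperature=0.07):
    """features: (N, D) embeddings; labels: (N,) expression-type labels."""
    features = F.normalize(features, dim=1)
    logits = features @ features.T / temperature  # pairwise similarities
    n = features.size(0)
    not_self = ~torch.eye(n, dtype=torch.bool, device=features.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & not_self
    # Log-probability of each pair, with the anchor itself excluded from the sum.
    exp_logits = torch.exp(logits) * not_self.float()
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True))
    # Average over the positives of each anchor; skip anchors with no positive.
    pos_counts = pos_mask.sum(dim=1)
    loss = -(log_prob * pos_mask.float()).sum(dim=1) / pos_counts.clamp(min=1)
    return loss[pos_counts > 0].mean()
```
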
Empirical Results

The proposed framework is evaluated on the SAMM-LV and CAS(ME)2 datasets, where it outperforms state-of-the-art models. The gains are most pronounced in micro-expression spotting, supporting the efficacy of the SW-MRO features and the SpotFormer architecture; this is quantitatively illustrated by a 49.4% increase in F1-score on the SAMM-LV dataset compared to existing methods. A sketch of the interval-level F1 metric commonly used for spotting follows below.
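
For context on how spotting performance is typically scored, the sketch below computes interval-level F1 under the commonly used protocol in which a predicted interval counts as a true positive when its IoU with a ground-truth interval is at least 0.5. The matching policy and threshold are the conventional choices, assumed here rather than taken from this paper.

```python
# Sketch of interval-level F1 scoring as commonly used in expression-spotting
# benchmarks; the IoU threshold and greedy matching policy are assumptions.
def interval_iou(a, b):
    """a, b: (onset, offset) frame indices, inclusive."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def spotting_f1(predictions, ground_truth, iou_thresh=0.5):
    matched = set()
    tp = 0
    for pred in predictions:
        for i, gt in enumerate(ground_truth):
            if i not in matched and interval_iou(pred, gt) >= iou_thresh:
                matched.add(i)
                tp += 1
                break
    fp = len(predictions) - tp
    fn = len(ground_truth) - tp
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

# Example: two predicted intervals vs. two annotated ones -> F1 = 0.5.
print(spotting_f1([(10, 40), (100, 120)], [(12, 45), (200, 230)]))
```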

Theoretical and Practical Implications

This work has both theoretical and practical implications. Theoretically, it offers insights into the potential of Transformers to model complex, multi-scale temporal dynamics in video data beyond their traditional application in natural language processing. Practically, these advancements can strengthen systems that rely on reliable facial expression analysis, with applications spanning security, mental health diagnostics, and human-computer interaction.

Speculation on Future Developments

Looking forward, more computationally efficient ways of extracting motion features remain an open avenue for research. Balancing sensitivity in spotting expressions against computational load leaves room for optimization, potentially through end-to-end learning frameworks that subsume the current optical-flow pre-processing steps.

The paper makes important contributions to the field of facial expression analysis, particularly in addressing the nuanced challenge of micro-expression detection. The innovations proposed have the potential to set new benchmarks for future research focused on improving both accuracy and computational efficiency in facial expression analysis systems.
