PESFormer: Boosting Macro- and Micro-expression Spotting with Direct Timestamp Encoding (2410.18695v1)
Abstract: The task of macro- and micro-expression spotting aims to precisely localize and categorize temporal expression instances within untrimmed videos. Given the sparse distribution and varying durations of expressions, existing anchor-based methods often represent instances by encoding their deviations from predefined anchors. In addition, these methods typically slice the untrimmed videos into fixed-length sliding windows. However, anchor-based encoding often fails to capture all training intervals, and slicing the original video into sliding windows can discard valuable training intervals. To overcome these limitations, we introduce PESFormer, a simple yet effective model based on the vision transformer architecture that achieves point-to-interval expression spotting. PESFormer employs a direct timestamp encoding (DTE) approach to replace anchors, enabling binary classification of each timestamp instead of optimizing entire ground-truth intervals; thus, all training intervals are retained in the form of discrete timestamps. To maximize the utilization of training intervals, we also improve preprocessing: rather than producing short videos through the sliding-window method, we zero-pad the untrimmed training videos into uniform, longer videos of a predetermined duration. This operation preserves the original training intervals and eliminates the need for video-slice augmentation. Extensive qualitative and quantitative evaluations on three datasets -- CAS(ME)2, CAS(ME)3 and SAMM-LV -- demonstrate that our PESFormer outperforms existing techniques, achieving the best performance.
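The sketch below illustrates, under assumptions and not as the authors' implementation, the two preprocessing ideas the abstract describes: zero-padding each untrimmed video (or its feature sequence) to a predetermined uniform length instead of slicing it into sliding windows, and a direct timestamp encoding that converts every ground-truth interval into binary per-timestamp labels. Function names such as `pad_to_fixed_length` and `encode_timestamps`, the feature dimensions, and the 2048-timestamp target length are illustrative choices, not taken from the paper.

```python
# Minimal sketch (not the authors' code) of fixed-length zero-padding and
# direct timestamp encoding (DTE) as described in the PESFormer abstract.
import numpy as np


def pad_to_fixed_length(features: np.ndarray, target_len: int):
    """Zero-pad a (T, C) feature sequence to (target_len, C).

    Returns the padded features plus a boolean mask marking real timestamps,
    so padded positions can be ignored by the loss or attention.
    """
    t, c = features.shape
    assert t <= target_len, "untrimmed video exceeds the chosen uniform duration"
    padded = np.zeros((target_len, c), dtype=features.dtype)
    padded[:t] = features
    mask = np.zeros(target_len, dtype=bool)
    mask[:t] = True
    return padded, mask


def encode_timestamps(intervals, num_classes: int, target_len: int) -> np.ndarray:
    """Direct timestamp encoding: one binary label per (timestamp, class).

    `intervals` is a list of (start, end, class_id) tuples in frame indices.
    Every timestamp inside a ground-truth interval becomes a positive for
    that class; all other positions (including padding) stay zero.
    """
    labels = np.zeros((target_len, num_classes), dtype=np.float32)
    for start, end, cls in intervals:
        labels[start:end + 1, cls] = 1.0
    return labels


# Example: a 500-frame video padded to 2048 timestamps, with one
# macro-expression (class 0) and one micro-expression (class 1).
feats = np.random.randn(500, 1024).astype(np.float32)
padded_feats, valid_mask = pad_to_fixed_length(feats, target_len=2048)
dte_labels = encode_timestamps([(120, 180, 0), (300, 310, 1)],
                               num_classes=2, target_len=2048)
```

Because every timestamp carries its own label, no ground-truth interval is dropped for lacking a matching anchor, and the valid-timestamp mask keeps the zero-padded tail from contributing positives during training.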
- P. Ekman and W. V. Friesen, “Nonverbal leakage and clues to deception,” Psychiatry, vol. 32, no. 1, pp. 88–106, 1969.
- P. Ekman, “Darwin, deception, and facial expression,” Ann. New York Acad. Sci., vol. 1000, no. 1, pp. 205–221, 2003.
- S. Wang, Y. He, J. Li, and X. Fu, “MESNet: A convolutional neural network for spotting multi-scale micro-expression intervals in long videos,” IEEE Trans. Image Process., vol. 30, pp. 3956–3969, 2021.
- M. Frank, M. Herbasz, K. Sinuk, A. Keller, and C. Nolan, “I see how you feel: Training laypeople and professionals to recognize fleeting emotions,” in Proc. Annu. Meet. Int. Commun. Assoc., New York City, 2009, pp. 1–35.
- P. Ekman, “Lie catching and microexpressions,” The philosophy of deception, vol. 1, no. 2, p. 5, 2009.
- W. Xie, L. Shen, and J. Duan, “Adaptive weighting of handcrafted feature losses for facial expression recognition,” IEEE Trans. Cybern., vol. 51, no. 5, pp. 2787–2800, 2021.
- S. Du, Y. Tao, and A. M. Martínez, “Compound facial expressions of emotion,” Proc. Natl. Acad. Sci. USA, vol. 111, no. 15, pp. E1454–E1462, 2014.
- S. Li, W. Deng, and J. Du, “Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild,” in CVPR. IEEE Computer Society, 2017, pp. 2584–2593.
- L. Liang, C. Lang, Y. Li, S. Feng, and J. Zhao, “Fine-grained facial expression recognition in the wild,” IEEE Trans. Inf. Forensics Secur., vol. 16, pp. 482–494, 2021.
- P. Ekman, “Facial expression and emotion,” American Psychologist, vol. 48, pp. 384–392, 1993.
- A. Esposito, “The amount of information on emotional states conveyed by the verbal and nonverbal channels: Some perceptual data,” in Progress in nonlinear speech processing. Springer, 2007, pp. 249–268.
- F. Qu, S. Wang, W. Yan, H. Li, S. Wu, and X. Fu, “CAS(ME)²: A database for spontaneous macro-expression and micro-expression spotting and recognition,” IEEE Trans. Affect. Comput., vol. 9, no. 4, pp. 424–436, 2018.
- C. H. Yap, C. Kendrick, and M. H. Yap, “SAMM long videos: A spontaneous facial micro- and macro-expressions dataset,” in FG. IEEE, 2020, pp. 771–776.
- X. Ben, Y. Ren, J. Zhang, S. Wang, K. Kpalma, W. Meng, and Y. Liu, “Video-based facial micro-expression analysis: A survey of datasets, features and algorithms,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 9, pp. 5826–5846, 2022.
- J. Li, Z. Dong, S. Lu, S.-J. Wang, W.-J. Yan, Y. Ma, Y. Liu, C. Huang, and X. Fu, “CAS(ME)3: A third generation facial spontaneous micro-expression database with depth information and high ecological validity,” IEEE Trans. Pattern Anal. Mach. Intell., 2022.
- W. Yu, J. Jiang, and Y. Li, “LSSNet: A two-stream convolutional neural network for spotting macro- and micro-expression in long videos,” in ACM Multimedia. ACM, 2021, pp. 4745–4749.
- X. Guo, X. Zhang, L. Li, and Z. Xia, “Micro-expression spotting with multi-scale local transformer in long videos,” Pattern Recognit. Lett., vol. 168, pp. 146–152, 2023.
- W. Yu, J. Jiang, K. Yang, H. Yan, and Y. Li, “LGSNet: A two-stream network for micro- and macro-expression spotting with background modeling,” IEEE Trans. Affect. Comput., vol. 15, no. 1, pp. 223–240, 2024.
- B. Sun, S. Cao, J. He, and L. Yu, “Two-stream attention-aware network for spontaneous micro-expression movement spotting,” in ICSESS. IEEE, 2019, pp. 702–705.
- T. Tran, Q. Vo, X. Hong, and G. Zhao, “Dense prediction for micro-expression spotting based on deep sequence model,” Electronic Imaging, vol. 2019, no. 8, pp. 401–1, 2019.
- M. Verburg and V. Menkovski, “Micro-expression detection in long videos using optical flow and recurrent neural networks,” in FG. IEEE, 2019, pp. 1–6.
- G. Liong, J. See, and L. Wong, “Shallow optical flow three-stream CNN for macro- and micro-expression spotting from long videos,” in ICIP. IEEE, 2021, pp. 2643–2647.
- H. Pan, L. Xie, and Z. Wang, “Local bilinear convolutional neural network for spotting macro- and micro-expression intervals in long video sequences,” in FG. IEEE, 2020, pp. 749–753.
- Z. Zhang, T. Chen, H. Meng, G. Liu, and X. Fu, “SMEConvNet: A convolutional neural network for spotting spontaneous facial micro-expression from long videos,” IEEE Access, vol. 6, pp. 71143–71151, 2018.
- C. H. Yap, M. H. Yap, A. K. Davison, C. Kendrick, J. Li, S. Wang, and R. Cunningham, “3D-CNN for facial micro- and macro-expression spotting on long video sequences using temporal oriented reference frame,” pp. 7016–7020, 2022.
- C. Zhang, J. Wu, and Y. Li, “Actionformer: Localizing moments of actions with transformers,” in ECCV (4), ser. Lecture Notes in Computer Science, vol. 13664. Springer, 2022, pp. 492–510.
- M. Xu, M. Soldan, J. Gao, S. Liu, J. Pérez-Rúa, and B. Ghanem, “Boundary-denoising for video activity localization,” CoRR, vol. abs/2304.02934, 2023.
- T. Lin, X. Zhao, and Z. Shou, “Single shot temporal action detection,” in ACM Multimedia. ACM, 2017, pp. 988–996.
- Y. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar, “Rethinking the faster R-CNN architecture for temporal action localization,” in CVPR. Computer Vision Foundation / IEEE Computer Society, 2018, pp. 1130–1139.
- H. Xu, A. Das, and K. Saenko, “R-C3D: region convolutional 3d network for temporal activity detection,” in ICCV. IEEE Computer Society, 2017, pp. 5794–5803.
- S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2017.
- L. Yang, H. Peng, D. Zhang, J. Fu, and J. Han, “Revisiting anchor mechanisms for temporal action localization,” IEEE Trans. Image Process., vol. 29, pp. 8535–8548, 2020.
- Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin, “Temporal action detection with structured segment networks,” in ICCV. IEEE Computer Society, 2017, pp. 2933–2942.
- T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang, “BSN: boundary sensitive network for temporal action proposal generation,” in ECCV (4), ser. Lecture Notes in Computer Science, vol. 11208. Springer, 2018, pp. 3–21.
- T. Lin, X. Liu, X. Li, E. Ding, and S. Wen, “BMN: boundary-matching network for temporal action proposal generation,” in ICCV. IEEE, 2019, pp. 3888–3897.
- H. Su, W. Gan, W. Wu, Y. Qiao, and J. Yan, “BSN++: complementary boundary regressor with scale-balanced relation modeling for temporal action proposal generation,” in AAAI. AAAI Press, 2021, pp. 2602–2610.
- Z. Qing, H. Su, W. Gan, D. Wang, W. Wu, X. Wang, Y. Qiao, J. Yan, C. Gao, and N. Sang, “Temporal context aggregation network for temporal action proposal refinement,” in CVPR. Computer Vision Foundation / IEEE, 2021, pp. 485–494.
- C. Lin, C. Xu, D. Luo, Y. Wang, Y. Tai, C. Wang, J. Li, F. Huang, and Y. Fu, “Learning salient boundary feature for anchor-free temporal action localization,” in CVPR. Computer Vision Foundation / IEEE, 2021, pp. 3320–3329.
- J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the kinetics dataset,” in CVPR. IEEE Computer Society, 2017, pp. 4724–4733.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017, pp. 5998–6008.
- J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” CoRR, vol. abs/1810.04805, 2018.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR. OpenReview.net, 2021.
- Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in ICCV. IEEE, 2021, pp. 9992–10002.
- Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, F. Wei, and B. Guo, “Swin transformer V2: scaling up capacity and resolution,” in CVPR. IEEE, 2022, pp. 11999–12009.
- H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in ICML, ser. Proceedings of Machine Learning Research, vol. 139. PMLR, 2021, pp. 10347–10357.
- W. Wang, E. Xie, X. Li, D. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in ICCV. IEEE, 2021, pp. 548–558.
- N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in ECCV (1), ser. Lecture Notes in Computer Science, vol. 12346. Springer, 2020, pp. 213–229.
- X. Dai, Y. Chen, J. Yang, P. Zhang, L. Yuan, and L. Zhang, “Dynamic DETR: end-to-end object detection with dynamic attention,” in ICCV. IEEE, 2021, pp. 2968–2977.
- Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia, “End-to-end video instance segmentation with transformers,” in CVPR. Computer Vision Foundation / IEEE, 2021, pp. 8741–8750.
- B. Cheng, A. G. Schwing, and A. Kirillov, “Per-pixel classification is not all you need for semantic segmentation,” in NeurIPS, 2021, pp. 17864–17875.
- A. Yang, A. Miech, J. Sivic, I. Laptev, and C. Schmid, “TubeDETR: Spatio-temporal video grounding with transformers,” in CVPR. IEEE, 2022, pp. 16421–16432.
- W. Yu, K. Yang, H. Yan, and Y. Li, “Weakly-supervised micro- and macro-expression spotting based on multi-level consistency,” CoRR, vol. abs/2305.02734, 2023.
- W. Yu, X. Zhang, F. Luo, Y. Cao, K. Yang, H. Yan, and Y. Li, “Weak supervision with arbitrary single frame for micro- and macro-expression spotting,” CoRR, vol. abs/2403.14240, 2024.
- A. Wedel, T. Pock, C. Zach, H. Bischof, and D. Cremers, “An improved algorithm for TV-L¹ optical flow,” in Statistical and Geometrical Approaches to Visual Motion Analysis, ser. Lecture Notes in Computer Science. Springer, 2008, vol. 5604, pp. 23–45.
- T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollár, and R. B. Girshick, “Early convolutions help transformers see better,” in NeurIPS, 2021, pp. 30392–30400.
- S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y. Wang, and X. Yan, “Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting,” in NeurIPS, 2019, pp. 5244–5254.
- Q. Wang, B. Li, T. Xiao, J. Zhu, C. Li, D. F. Wong, and L. S. Chao, “Learning deep transformer models for machine translation,” in ACL (1). Association for Computational Linguistics, 2019, pp. 1810–1822.
- A. Baevski and M. Auli, “Adaptive input representations for neural language modeling,” in ICLR (Poster). OpenReview.net, 2019.
- K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlós, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, D. B. Belanger, L. J. Colwell, and A. Weller, “Rethinking attention with performers,” in ICLR. OpenReview.net, 2021.
- Y. Zhong, J. Wang, J. Peng, and L. Zhang, “Anchor box optimization for object detection,” in WACV. IEEE, 2020, pp. 1286–1294.
- F. Cheng and G. Bertasius, “TALLFormer: Temporal action localization with a long-memory transformer,” in ECCV, ser. Lecture Notes in Computer Science, vol. 13694. Springer, 2022, pp. 503–521.
- T. Yang, X. Zhang, Z. Li, W. Zhang, and J. Sun, “MetaAnchor: Learning to detect objects with customized anchors,” in NIPS, 2018.
- B. Xu, Y. Fu, Y. Jiang, B. Li, and L. Sigal, “Heterogeneous knowledge transfer in video emotion recognition, attribution and summarization,” IEEE Trans. Affect. Comput., vol. 9, no. 2, pp. 255–270, 2018.
- T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in ICCV. IEEE, 2017, pp. 2980–2988.
- F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 3DV. IEEE, 2016, pp. 565–571.
- P. Lee and H. Byun, “Learning action completeness from points for weakly-supervised temporal action localization,” in ICCV. IEEE, 2021, pp. 13628–13637.
- S. Paul, S. Roy, and A. K. Roy-Chowdhury, “W-TALC: weakly-supervised temporal activity localization and classification,” in ECCV (4), ser. Lecture Notes in Computer Science, vol. 11208. Springer, 2018, pp. 588–607.
- P. Ekman, “Emotions revealed: recognizing faces and feelings to improve communication and emotional life,” NY: OWL Books, 2007.
- J. See, M. H. Yap, J. Li, X. Hong, and S. Wang, “MEGC 2019 - the second facial micro-expressions grand challenge,” in FG. IEEE, 2019, pp. 1–5.
- J. Li, M. H. Yap, W. Cheng, J. See, X. Hong, X. Li, S. Wang, A. K. Davison, Y. Li, and Z. Dong, “MEGC2022: ACM multimedia 2022 micro-expression grand challenge,” in ACM Multimedia. ACM, 2022, pp. 7170–7174.
- D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR (Poster), 2015.
- Y. He, S. Wang, J. Li, and M. H. Yap, “Spotting macro-and micro-expression intervals in long video sequences,” in FG. IEEE, 2020, pp. 742–748.
- L. Zhang, J. Li, S. Wang, X. Duan, W. Yan, H. Xie, and S. Huang, “Spatio-temporal fusion for macro- and micro-expression spotting in long video sequences,” in FG. IEEE, 2020, pp. 734–741.
- Y. He, “Research on micro-expression spotting method based on optical flow features,” in ACM Multimedia. ACM, 2021, pp. 4803–4807.
- G. B. Liong, S. Liong, J. See, and C. Chee-Seng, “MTSN: A multi-temporal stream network for spotting facial macro- and micro-expression with hard and soft pseudo-labels,” in Proceedings of the 2nd Workshop on Facial Micro-Expression: Advanced Techniques for Multi-Modal Facial Expression Analysis, 2022, pp. 3–10.
- Y. Zhao, X. Tong, Z. Zhu, J. Sheng, L. Dai, L. Xu, X. Xia, Y. Jiang, and J. Li, “Rethinking optical flow methods for micro-expression spotting,” in ACM Multimedia. ACM, 2022, pp. 7175–7179.