MSSTNet: A Multi-Scale Spatio-Temporal CNN-Transformer Network for Dynamic Facial Expression Recognition (2404.08433v1)

Published 12 Apr 2024 in cs.CV

Abstract: Unlike typical video action recognition, Dynamic Facial Expression Recognition (DFER) does not involve distinct moving targets but relies on localized changes in facial muscles. Addressing this distinctive attribute, we propose a Multi-Scale Spatio-temporal CNN-Transformer network (MSSTNet). Our approach takes spatial features of different scales extracted by CNN and feeds them into a Multi-scale Embedding Layer (MELayer). The MELayer extracts multi-scale spatial information and encodes these features before sending them into a Temporal Transformer (T-Former). The T-Former simultaneously extracts temporal information while continually integrating multi-scale spatial information. This process culminates in the generation of multi-scale spatio-temporal features that are utilized for the final classification. Our method achieves state-of-the-art results on two in-the-wild datasets. Furthermore, a series of ablation experiments and visualizations provide further validation of our approach's proficiency in leveraging spatio-temporal information within DFER.
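The abstract's pipeline (multi-scale CNN features → MELayer embedding → temporal Transformer → classification) can be illustrated with a minimal NumPy shape-flow sketch. This is not the authors' implementation: the pooling, the random projection standing in for learned embedding weights, and the single-head attention are all simplifying assumptions made only to show how per-scale spatial features become multi-scale spatio-temporal tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

def multi_scale_embed(feats, d_model, rng):
    """MELayer sketch (hypothetical): pool each spatial feature map to a
    vector and project every scale into a shared embedding dimension."""
    tokens = []
    for f in feats:                            # f: (T, C, H, W) for one scale
        pooled = f.mean(axis=(2, 3))           # global average pool -> (T, C)
        W = rng.standard_normal((f.shape[1], d_model)) * 0.02  # stand-in for learned weights
        tokens.append(pooled @ W)              # (T, d_model)
    return np.stack(tokens, axis=1)            # (T, num_scales, d_model)

def temporal_attention(x):
    """T-Former sketch (hypothetical): single-head self-attention along the
    time axis, applied independently per scale; projections omitted."""
    T, S, d = x.shape
    out = np.empty_like(x)
    for s in range(S):
        q = k = v = x[:, s, :]                 # (T, d) tokens for one scale
        scores = q @ k.T / np.sqrt(d)          # scaled dot-product attention
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)
        out[:, s, :] = attn @ v
    return out

# Toy inputs standing in for CNN feature maps at two spatial scales
T, d_model = 8, 16
feats = [rng.standard_normal((T, 64, 28, 28)),
         rng.standard_normal((T, 128, 14, 14))]

emb = multi_scale_embed(feats, d_model, rng)   # (8, 2, 16)
fused = temporal_attention(emb)                # temporal mixing per scale
pooled = fused.mean(axis=(0, 1))               # spatio-temporal feature for a classifier head
print(emb.shape, fused.shape, pooled.shape)
```

The key point the sketch captures is that each video frame contributes one token per spatial scale, so the temporal attention integrates multi-scale spatial information while modelling dynamics across frames.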

Authors (5)
  1. Linhuang Wang (1 paper)
  2. Xin Kang (30 papers)
  3. Fei Ding (72 papers)
  4. Satoshi Nakagawa (5 papers)
  5. Fuji Ren (18 papers)
