MSSTNet: A Multi-Scale Spatio-Temporal CNN-Transformer Network for Dynamic Facial Expression Recognition (2404.08433v1)
Abstract: Unlike typical video action recognition, Dynamic Facial Expression Recognition (DFER) involves no distinct moving targets; it instead relies on localized changes in facial muscles. To address this distinctive attribute, we propose a Multi-Scale Spatio-Temporal CNN-Transformer network (MSSTNet). Our approach feeds spatial features of different scales, extracted by a CNN, into a Multi-scale Embedding Layer (MELayer). The MELayer extracts multi-scale spatial information and encodes these features before passing them to a Temporal Transformer (T-Former). The T-Former extracts temporal information while continually integrating the multi-scale spatial information, producing the multi-scale spatio-temporal features used for the final classification. Our method achieves state-of-the-art results on two in-the-wild datasets, and a series of ablation experiments and visualizations further validates its ability to leverage spatio-temporal information for DFER.
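The data flow described in the abstract can be sketched end to end. This is a minimal NumPy illustration under stated assumptions: the CNN is stood in for by average pooling at several spatial scales, the MELayer by per-scale linear projections to a shared token dimension, and the T-Former by single-head self-attention over the time axis. All dimensions, module names, and the pooling/attention details here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cnn_multiscale(frame, scales=(8, 4, 2)):
    """Stand-in for CNN stages: average-pool one frame to several
    spatial resolutions, mimicking feature maps of different scales."""
    feats = []
    for s in scales:
        h = frame.shape[0] // s
        pooled = frame[:h * s, :h * s].reshape(h, s, h, s).mean(axis=(1, 3))
        feats.append(pooled)
    return feats  # list of (h_i, h_i) maps, one per scale

def me_layer(feats, Ws):
    """Multi-scale Embedding Layer stand-in: flatten each scale's map
    and project it to a shared d_model-dimensional token."""
    return np.stack([f.reshape(-1) @ W for f, W in zip(feats, Ws)])  # (n_scales, d_model)

def t_former(tokens):
    """Temporal Transformer stand-in: single-head self-attention over
    the time axis, applied independently per spatial scale, so temporal
    modeling keeps integrating the multi-scale spatial tokens."""
    T, S, d = tokens.shape
    out = np.empty_like(tokens)
    for s in range(S):
        x = tokens[:, s]                      # (T, d) one scale over time
        attn = softmax(x @ x.T / np.sqrt(d))  # (T, T) temporal attention
        out[:, s] = attn @ x
    return out

# Toy clip: 16 frames of 32x32 grayscale (random stand-in data).
clip = rng.standard_normal((16, 32, 32))
scales, d_model, n_classes = (8, 4, 2), 32, 7
Ws = [rng.standard_normal(((32 // s) ** 2, d_model)) * 0.01 for s in scales]
W_cls = rng.standard_normal((d_model, n_classes)) * 0.01

tokens = np.stack([me_layer(cnn_multiscale(f, scales), Ws) for f in clip])
fused = t_former(tokens)            # (16, 3, 32) multi-scale spatio-temporal features
logits = fused.mean(axis=(0, 1)) @ W_cls  # pool over time and scales, then classify
print(logits.shape)                 # one score per expression class
```

The key structural point the sketch preserves is that spatial tokens from every scale enter the temporal attention together, so the temporal features remain multi-scale up to the final pooled classification.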
- Linhuang Wang
- Xin Kang
- Fei Ding
- Satoshi Nakagawa
- Fuji Ren