Continuous Sign Language Recognition Based on Motor Attention Mechanism and Frame-Level Self-Distillation (2402.19118v1)
Abstract: Changes in facial expression, head movement, body movement, and gesture are salient cues in sign language recognition, yet most current continuous sign language recognition (CSLR) methods focus on static images in video sequences at the frame-level feature extraction stage and ignore the dynamic changes between frames. In this paper, we propose a novel motor attention mechanism that captures the distorted changes in local motion regions during sign language expression and yields a dynamic representation of image change. We also apply, for the first time, self-distillation to frame-level feature extraction for continuous sign language: features from adjacent stages are distilled, with higher-stage features serving as teachers to guide lower-stage features, improving feature expression without increasing computational cost. Combining the two yields our holistic model, CSLR based on a Motor Attention Mechanism and Frame-level Self-Distillation (MAM-FSD), which improves the inference ability and robustness of the model. Experiments on three publicly available datasets show that the proposed method effectively extracts sign language motion information from videos, improves CSLR accuracy, and reaches the state of the art.
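The abstract describes the motor attention mechanism and the frame-level self-distillation only at a high level, so the two sketches below are illustrative assumptions rather than the authors' implementation. The first shows one plausible way to realize motion-gated attention in PyTorch: differences between adjacent frame features highlight locally moving regions, and a hypothetical 1x1-convolution gate turns them into an attention map over the static features.

```python
import torch
import torch.nn as nn

class MotorAttention(nn.Module):
    """Minimal sketch: gate static frame features by inter-frame motion.

    The frame-difference branch and residual gating are assumptions; the
    paper's exact formulation may differ.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Hypothetical gating head: map the frame-difference signal to a
        # per-pixel, per-channel attention weight in [0, 1].
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width) frame-level features.
        b, t, c, h, w = x.shape
        # Adjacent-frame differences emphasize locally moving regions;
        # the final frame has no successor, so its motion is left at zero.
        diff = torch.zeros_like(x)
        diff[:, :-1] = x[:, 1:] - x[:, :-1]
        attn = self.gate(diff.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # Residual gating: keep appearance features, amplify motion regions.
        return x + x * attn
```

The second sketch illustrates the frame-level self-distillation idea: a deeper backbone stage acts as the teacher for the adjacent shallower stage, and the teacher is detached so only the student receives the distillation gradient. The 1x1-conv projection head and the MSE matching loss are assumed choices, not the paper's stated losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def self_distillation_loss(student_feat: torch.Tensor,
                           teacher_feat: torch.Tensor,
                           proj: nn.Module) -> torch.Tensor:
    """Distill a deeper-stage (teacher) feature into an adjacent shallower one.

    student_feat: (N, C_s, H_s, W_s) features from the earlier stage.
    teacher_feat: (N, C_t, H_t, W_t) features from the later stage.
    proj: hypothetical 1x1-conv head mapping C_s channels to C_t.
    """
    # Project the student to the teacher's channel count, then pool to the
    # teacher's spatial resolution so the two tensors are comparable.
    s = F.adaptive_avg_pool2d(proj(student_feat), teacher_feat.shape[-2:])
    # Stop-gradient on the teacher: higher-order features guide lower-order
    # ones without being pulled toward them. The distillation losses are
    # used only at training time, so inference cost is unchanged.
    return F.mse_loss(s, teacher_feat.detach())


# Example: distill stage-3 features of a ResNet-style backbone into stage-2
# (channel counts and shapes are assumed for illustration).
proj = nn.Conv2d(128, 256, kernel_size=1)   # C_s=128 -> C_t=256
stage2 = torch.randn(8, 128, 28, 28)        # student features
stage3 = torch.randn(8, 256, 14, 14)        # teacher features
loss = self_distillation_loss(stage2, stage3, proj)
```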