CorrNet+: Sign Language Recognition and Translation via Spatial-Temporal Correlation (2404.11111v1)

Published 17 Apr 2024 in cs.CV

Abstract: In sign language, human body trajectories are conveyed predominantly through the coordinated movements of the hands and facial expressions across successive frames. Despite recent advances, sign language understanding methods often focus solely on individual frames, overlooking the inter-frame correlations that are essential for modeling human body trajectories. To address this limitation, this paper introduces a spatial-temporal correlation network, CorrNet+, which explicitly identifies body trajectories across multiple frames. Specifically, CorrNet+ employs a correlation module and an identification module to build human body trajectories; a temporal attention module then adaptively evaluates the contribution of each frame. The resulting features offer a holistic perspective on human body movements, enabling a deeper understanding of sign language. As a unified model, CorrNet+ achieves new state-of-the-art performance on two sign language understanding tasks: continuous sign language recognition (CSLR) and sign language translation (SLT). Notably, CorrNet+ surpasses previous methods that rely on resource-intensive pose-estimation networks or pre-extracted heatmaps for hand and facial feature extraction. Compared with CorrNet, CorrNet+ delivers a significant performance boost across all benchmarks while halving the computational overhead. A comprehensive comparison with previous spatial-temporal reasoning methods verifies the superiority of CorrNet+. Code is available at https://github.com/hulianyuyy/CorrNet_Plus.
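To make the three-stage pipeline named in the abstract concrete, below is a minimal PyTorch sketch of correlation, identification, and temporal attention. The class name `SpatialTemporalCorrelation`, the tensor shapes, and the exact form of the correlation (a global normalized dot product between adjacent frames) are illustrative assumptions, not the authors' implementation; the real code is in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTemporalCorrelation(nn.Module):
    """Illustrative sketch of the correlation -> identification ->
    temporal-attention pipeline described in the abstract. Names, shapes,
    and the correlation form are assumptions, not the authors' code."""

    def __init__(self, channels: int):
        super().__init__()
        # Identification module (assumed form): a 1x1 conv producing a
        # per-pixel saliency mask that highlights informative regions.
        self.identify = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        # Temporal attention (assumed form): a per-frame score computed
        # from spatially pooled features.
        self.frame_score = nn.Linear(channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) frame-wise features from a 2D CNN backbone.
        B, T, C, H, W = x.shape

        # Correlation module: cosine similarity between every pixel of a
        # frame and every pixel of its successor; the max response marks
        # pixels whose features persist across frames (i.e. trajectories).
        cur = F.normalize(x[:, :-1].reshape(-1, C, H * W), dim=1)
        nxt = F.normalize(x[:, 1:].reshape(-1, C, H * W), dim=1)
        corr = torch.einsum('bcm,bcn->bmn', cur, nxt)         # (B*(T-1), HW, HW)
        corr_map = corr.max(dim=-1).values.view(B, T - 1, 1, H, W)
        corr_map = F.pad(corr_map, (0, 0, 0, 0, 0, 0, 0, 1))  # zero-pad last frame

        # Identification module: spatial saliency mask per frame.
        mask = self.identify(x.reshape(-1, C, H, W)).view(B, T, 1, H, W)

        # Residual emphasis of trajectory regions in the original features.
        feats = x * (1 + corr_map * mask)

        # Temporal attention module: adaptively weigh each frame's contribution.
        pooled = feats.mean(dim=(3, 4))                       # (B, T, C)
        w = torch.softmax(self.frame_score(pooled), dim=1)    # (B, T, 1)
        return (pooled * w).sum(dim=1)                        # (B, C) clip feature
```

As a usage sketch, `SpatialTemporalCorrelation(512)(torch.randn(2, 16, 512, 7, 7))` yields a (2, 512) clip-level feature. Note that the paper's actual modules are more refined (for instance, correlation computed within local spatial neighborhoods rather than globally, which is part of how the computational overhead stays low), so treat this block only as a reading aid for the abstract's terminology.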
