SkelCap: Automated Generation of Descriptive Text from Skeleton Keypoint Sequences (2405.02977v1)

Published 5 May 2024 in cs.CV and cs.LG

Abstract: Numerous sign language datasets exist, yet they typically cover only a limited selection of the thousands of signs used globally. Moreover, creating diverse sign language datasets is an expensive and challenging task due to the costs associated with gathering a varied group of signers. Motivated by these challenges, we aimed to develop a solution that addresses these limitations. In this context, we focused on textually describing body movements from skeleton keypoint sequences, leading to the creation of a new dataset. We structured this dataset around AUTSL, a comprehensive isolated Turkish sign language dataset. We also developed a baseline model, SkelCap, which can generate textual descriptions of body movements. This model processes the skeleton keypoint data as a vector, applies a fully connected layer for embedding, and uses a transformer neural network for sequence-to-sequence modeling. We conducted extensive evaluations of our model, including signer-agnostic and sign-agnostic assessments. The model achieved promising results, with a ROUGE-L score of 0.98 and a BLEU-4 score of 0.94 in the signer-agnostic evaluation. The dataset we have prepared, namely AUTSL-SkelCap, will be made publicly available soon.
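
The abstract's description of the model (per-frame keypoints flattened into a vector, a fully connected embedding layer, and a transformer for sequence-to-sequence generation) can be pictured with a short sketch. The following is a minimal, hypothetical PyTorch rendering of that pipeline; the class name, layer sizes, keypoint count, and vocabulary size are illustrative assumptions and not the authors' released implementation (positional encodings are also omitted for brevity).

```python
# Hedged sketch of the pipeline the abstract describes: each frame's skeleton
# keypoints are flattened into a vector, embedded with a fully connected layer,
# and fed to a transformer encoder-decoder that generates the text description.
# All dimensions, names, and the vocabulary size are illustrative assumptions.
import torch
import torch.nn as nn


class SkelCapSketch(nn.Module):
    def __init__(self, num_keypoints=75, coord_dim=2, d_model=512,
                 vocab_size=8000, nhead=8, num_layers=4):
        super().__init__()
        # Fully connected embedding of the flattened per-frame keypoint vector.
        self.keypoint_embed = nn.Linear(num_keypoints * coord_dim, d_model)
        # Token embedding for the target description.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Encoder-decoder transformer for sequence-to-sequence modeling.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.output_proj = nn.Linear(d_model, vocab_size)

    def forward(self, keypoints, target_tokens):
        # keypoints: (batch, frames, num_keypoints * coord_dim)
        # target_tokens: (batch, target_len) token ids of the description
        src = self.keypoint_embed(keypoints)
        tgt = self.token_embed(target_tokens)
        # Causal mask so each target position only attends to earlier tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            target_tokens.size(1)
        )
        hidden = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.output_proj(hidden)  # (batch, target_len, vocab_size)


if __name__ == "__main__":
    model = SkelCapSketch()
    frames = torch.randn(2, 60, 75 * 2)        # 60 frames, 75 2-D keypoints
    tokens = torch.randint(0, 8000, (2, 20))   # 20-token target description
    logits = model(frames, tokens)
    print(logits.shape)                        # torch.Size([2, 20, 8000])
```

A greedy or beam-search decoding loop over this model, followed by ROUGE-L and BLEU-4 scoring of the generated descriptions, would mirror the evaluation protocol the abstract reports.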
