SkelCap: Automated Generation of Descriptive Text from Skeleton Keypoint Sequences (2405.02977v1)
Abstract: Numerous sign language datasets exist, yet they typically cover only a limited selection of the thousands of signs used globally. Moreover, creating diverse sign language datasets is expensive and challenging because of the cost of gathering a varied group of signers. Motivated by these limitations, we focused on textually describing body movements from skeleton keypoint sequences, leading to the creation of a new dataset. We structured this dataset around AUTSL, a comprehensive isolated Turkish sign language dataset. We also developed a baseline model, SkelCap, which generates textual descriptions of body movements. The model processes the skeleton keypoint data as vectors, applies a fully connected layer for embedding, and uses a transformer neural network for sequence-to-sequence modeling. We conducted extensive evaluations of our model, including signer-agnostic and sign-agnostic assessments. The model achieved promising results, with a ROUGE-L score of 0.98 and a BLEU-4 score of 0.94 in the signer-agnostic evaluation. The dataset we have prepared, AUTSL-SkelCap, will be made publicly available soon.
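The abstract outlines the model only at a high level: per-frame skeleton keypoints are flattened into a vector, embedded with a fully connected layer, and passed to a transformer encoder-decoder that generates the description token by token. The sketch below illustrates that pipeline in PyTorch; the keypoint count, coordinate dimension, model sizes, and vocabulary are illustrative assumptions, not the authors' reported configuration, and positional encodings are omitted for brevity.

```python
# Minimal sketch, assuming per-frame keypoints are flattened into a single
# vector, embedded with a fully connected layer, and decoded into text by a
# standard encoder-decoder transformer. All hyperparameters are hypothetical.
import torch
import torch.nn as nn


class SkelCapSketch(nn.Module):
    def __init__(self, num_keypoints=75, coords=2, d_model=512,
                 vocab_size=32000, nhead=8, num_layers=6):
        super().__init__()
        # Fully connected embedding of the flattened per-frame keypoint vector.
        self.frame_embed = nn.Linear(num_keypoints * coords, d_model)
        # Token embedding for the target textual description.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Encoder-decoder transformer for sequence-to-sequence modeling.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, keypoints, target_tokens):
        # keypoints: (batch, frames, num_keypoints * coords)
        # target_tokens: (batch, text_len) token ids of the description
        src = self.frame_embed(keypoints)
        tgt = self.token_embed(target_tokens)
        # Causal mask so each output position attends only to earlier tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            target_tokens.size(1)
        )
        hidden = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.lm_head(hidden)  # (batch, text_len, vocab_size) logits


if __name__ == "__main__":
    model = SkelCapSketch()
    frames = torch.randn(2, 60, 75 * 2)        # two clips, 60 frames each
    tokens = torch.randint(0, 32000, (2, 20))  # two tokenized descriptions
    print(model(frames, tokens).shape)         # torch.Size([2, 20, 32000])
```

At inference time such a model would decode autoregressively from a start token rather than receiving the full target sequence, but the training-style forward pass above is enough to show how keypoint vectors and text tokens meet inside the transformer.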