
MM-WLAuslan: Multi-View Multi-Modal Word-Level Australian Sign Language Recognition Dataset (2410.19488v1)

Published 25 Oct 2024 in cs.CV

Abstract: Isolated Sign Language Recognition (ISLR) focuses on identifying individual sign language glosses. Given the diversity of sign languages across geographical regions, developing region-specific ISLR datasets is crucial for supporting communication and research. Auslan, the sign language of Australia, still lacks a dedicated large-scale word-level dataset for the ISLR task. To fill this gap, we curate the first large-scale Multi-view Multi-modal Word-Level Australian Sign Language recognition dataset, dubbed MM-WLAuslan. Compared to other publicly available datasets, MM-WLAuslan offers three significant advantages: (1) the largest amount of data, (2) the most extensive vocabulary, and (3) the most diverse set of multi-modal camera views. Specifically, we record 282K+ sign videos covering 3,215 commonly used Auslan glosses presented by 73 signers in a studio environment. Our filming system includes two different types of cameras: three Kinect-V2 cameras and a RealSense camera. We position the cameras hemispherically around the front half of the model and record videos with all four cameras simultaneously. Furthermore, we benchmark state-of-the-art methods under various multi-modal ISLR settings on MM-WLAuslan, including multi-view, cross-camera, and cross-view. Experimental results indicate that MM-WLAuslan is a challenging ISLR dataset, and we hope it will contribute to the development of Auslan and the advancement of sign languages worldwide. All datasets and benchmarks are available at MM-WLAuslan.

