
A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision (2405.10266v1)

Published 16 May 2024 in cs.CV and cs.CL

Abstract: In this work, our goals are twofold: large-vocabulary continuous sign language recognition (CSLR) and sign language retrieval. To this end, we introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and outputs in a joint embedding space between signed language and spoken language text. To enable CSLR evaluation in the large-vocabulary setting, we introduce new, manually collected dataset annotations. These provide continuous sign-level annotations for six hours of test videos and will be made publicly available. We demonstrate that, with a careful choice of loss functions, training the model for both the CSLR and retrieval tasks is mutually beneficial in terms of performance: retrieval improves CSLR performance by providing context, while CSLR improves retrieval with more fine-grained supervision. We further show the benefits of leveraging weak and noisy supervision from large-vocabulary datasets such as BOBSL, namely sign-level pseudo-labels and English subtitles. Our model significantly outperforms the previous state of the art on both tasks.
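
To make the idea of a joint embedding space more concrete, here is a minimal sketch of how paired video and text embeddings could be trained and queried. The abstract does not state which loss functions the model uses, so the symmetric contrastive (InfoNCE-style) objective below is an assumption chosen for illustration, not the authors' method. All function names, the temperature value, and the toy vectors are hypothetical, and embeddings are assumed to be non-zero.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(video_embs, text_embs):
    """Cosine similarity between every video embedding and every text embedding."""
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    return [[sum(a * b for a, b in zip(vi, tj)) for tj in t] for vi in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where row i's correct class is column i (the true pair)."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Average of the video-to-text and text-to-video contrastive losses.

    Assumes video_embs[i] and text_embs[i] form a matching pair.
    The temperature is an illustrative default, not a value from the paper.
    """
    sims = similarity_matrix(video_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


def retrieve(query_emb, candidate_embs):
    """Return candidate indices ranked from most to least similar to the query."""
    sims = similarity_matrix([query_emb], candidate_embs)[0]
    return sorted(range(len(sims)), key=lambda j: -sims[j])


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
    texts_swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_contrastive_loss(videos, texts_aligned))
    print(symmetric_contrastive_loss(videos, texts_swapped))
    print(retrieve([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]))
```

In this toy setup, correctly paired embeddings give a much lower loss than swapped pairs, and retrieval reduces to ranking candidates by cosine similarity in the shared space. A multi-task model of the kind the abstract describes would add a recognition loss on top of such a retrieval objective, but how the two are combined is not specified here.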
