A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision (2405.10266v1)
Abstract: In this work, our goals are twofold: large-vocabulary continuous sign language recognition (CSLR) and sign language retrieval. To this end, we introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and outputs a representation in a joint embedding space shared by signed language and spoken language text. To enable CSLR evaluation in the large-vocabulary setting, we introduce new, manually collected dataset annotations. These provide continuous sign-level annotations for six hours of test videos and will be made publicly available. We demonstrate that, with a careful choice of loss functions, training the model for both the CSLR and retrieval tasks is mutually beneficial in terms of performance: retrieval improves CSLR performance by providing context, while CSLR improves retrieval with more fine-grained supervision. We further show the benefits of leveraging weak and noisy supervision from large-vocabulary datasets such as BOBSL, namely sign-level pseudo-labels and English subtitles. Our model significantly outperforms the previous state of the art on both tasks.
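To make the retrieval setting concrete, the following is a minimal sketch of text-to-sign-video retrieval in a joint embedding space. It is not the CSLR2 implementation: the function names, the toy 2-D embeddings, the use of cosine similarity as the matching score, and Recall@K as the metric are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two embeddings of equal dimension."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_videos(text_embedding, video_embeddings):
    """Return (video_index, score) pairs sorted from best to worst match."""
    scores = [
        (idx, cosine_similarity(text_embedding, video))
        for idx, video in enumerate(video_embeddings)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def recall_at_k(text_embeddings, video_embeddings, k):
    """Fraction of text queries whose paired video (same index) is in the top k."""
    hits = 0
    for query_idx, text in enumerate(text_embeddings):
        top_k = [idx for idx, _ in rank_videos(text, video_embeddings)[:k]]
        if query_idx in top_k:
            hits += 1
    return hits / len(text_embeddings)


# Hypothetical toy embeddings: text query i is paired with signing video i.
# In a real system these would come from the text and video encoders.
toy_video_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_text_embeddings = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]

if __name__ == "__main__":
    for k in (1, 2):
        print(f"R@{k} =", recall_at_k(toy_text_embeddings, toy_video_embeddings, k))
```

In this toy setup the third query is embedded closer to the second video than to its own, so it counts as a miss at K=1 and a hit at K=2. This is the kind of confusion that more fine-grained, sign-level supervision is intended to reduce.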