Towards Privacy-Aware Sign Language Translation at Scale (2402.09611v2)
Abstract: A major impediment to the advancement of sign language translation (SLT) is data scarcity. Much of the sign language data currently available on the web cannot be used for training supervised models due to the lack of aligned captions. Furthermore, scaling SLT using large-scale web-scraped datasets bears privacy risks due to the presence of biometric information, which the responsible development of SLT technologies should account for. In this work, we propose a two-stage framework for privacy-aware SLT at scale that addresses both of these issues. We introduce SSVP-SLT, which leverages self-supervised video pretraining on anonymized and unannotated videos, followed by supervised SLT finetuning on a curated parallel dataset. SSVP-SLT achieves state-of-the-art finetuned and zero-shot gloss-free SLT performance on the How2Sign dataset, outperforming the strongest respective baselines by over 3 BLEU-4. Based on controlled experiments, we further discuss the advantages and limitations of self-supervised pretraining and anonymization via facial obfuscation for SLT.
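To make "anonymization via facial obfuscation" concrete, here is a minimal sketch of the general idea; it is not the authors' pipeline. It assumes a face bounding box is already available for each frame (the abstract does not say how faces are located) and replaces the boxed pixels with their mean value, a crude stand-in for blurring. The names `obfuscate_region`, `obfuscate_video`, and the `(top, left, bottom, right)` box convention are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumption: a frame is a grayscale image stored as rows of pixel intensities.
Frame = List[List[float]]
# Assumption: a box is (top, left, bottom, right), with bottom/right exclusive.
Box = Tuple[int, int, int, int]


def obfuscate_region(frame: Frame, box: Box) -> Frame:
    """Return a copy of `frame` with the pixels inside `box` set to their mean."""
    top, left, bottom, right = box
    height, width = len(frame), len(frame[0])

    # Clamp the box to the frame so partially visible faces are handled.
    top, bottom = max(0, top), min(height, bottom)
    left, right = max(0, left), min(width, right)

    out = [row[:] for row in frame]  # copy, so the input frame is untouched
    if top >= bottom or left >= right:
        return out  # empty box: nothing to obfuscate

    pixels = [frame[r][c] for r in range(top, bottom) for c in range(left, right)]
    mean = sum(pixels) / len(pixels)
    for r in range(top, bottom):
        for c in range(left, right):
            out[r][c] = mean
    return out


def obfuscate_video(frames: Sequence[Frame], boxes: Sequence[Box]) -> List[Frame]:
    """Apply `obfuscate_region` frame by frame, one (hypothetical) box per frame."""
    return [obfuscate_region(f, b) for f, b in zip(frames, boxes)]
```

In the two-stage setup the abstract describes, a step of this kind would run over the unannotated videos before self-supervised pretraining. The obfuscated region also removes facial cues, which is the trade-off the abstract flags as a limitation.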
Authors:
- Phillip Rust
- Bowen Shi
- Skyler Wang
- Jean Maillard
- Necati Cihan Camgöz