DiffSLVA: Harnessing Diffusion Models for Sign Language Video Anonymization (2311.16060v1)
Abstract: Since American Sign Language (ASL) has no standard written form, Deaf signers frequently share videos in order to communicate in their native language. However, since both hands and face convey critical linguistic information in signed languages, sign language videos cannot preserve signer privacy. While signers have expressed interest, for a variety of applications, in sign language video anonymization that would effectively preserve linguistic content, attempts to develop such technology have had limited success, given the complexity of hand movements and facial expressions. Existing approaches rely predominantly on precise pose estimations of the signer in video footage and often require sign language video datasets for training. These requirements prevent them from processing videos 'in the wild,' in part because of the limited diversity present in current sign language video datasets. To address these limitations, our research introduces DiffSLVA, a novel methodology that utilizes pre-trained large-scale diffusion models for zero-shot text-guided sign language video anonymization. We incorporate ControlNet, which leverages low-level image features such as HED (Holistically-Nested Edge Detection) edges, to circumvent the need for pose estimation. Additionally, we develop a specialized module dedicated to capturing facial expressions, which are critical for conveying essential linguistic information in signed languages. We then combine the above methods to achieve anonymization that better preserves the essential linguistic content of the original signer. This innovative methodology makes possible, for the first time, sign language video anonymization that could be used for real-world applications, which would offer significant benefits to the Deaf and Hard-of-Hearing communities. We demonstrate the effectiveness of our approach with a series of signer anonymization experiments.
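The conditioning idea in the abstract, driving generation from low-level edge features rather than from estimated poses, can be illustrated with a minimal sketch. This is not the authors' implementation. `edge_map` is a simple gradient-magnitude stand-in for HED, and `generate` is a hypothetical callable standing in for an edge-conditioned, text-guided diffusion model such as one built on ControlNet. The facial expression module and any cross-frame consistency mechanism are omitted.

```python
from typing import Callable, List

import numpy as np

Frame = np.ndarray  # H x W x 3 image, uint8


def edge_map(frame: Frame) -> np.ndarray:
    """Stand-in for HED (assumption): normalized gradient magnitude.

    Returns an H x W float array in [0, 1]. A real system would run a
    learned edge detector here instead.
    """
    gray = frame.astype(np.float32).mean(axis=2)
    gy, gx = np.gradient(gray)  # derivatives along rows, then columns
    magnitude = np.hypot(gx, gy)
    peak = magnitude.max()
    return magnitude / peak if peak > 0 else magnitude


def anonymize(
    frames: List[Frame],
    prompt: str,
    generate: Callable[[np.ndarray, str], Frame],
) -> List[Frame]:
    """Re-render each frame from its edge map and a text prompt.

    `generate` (hypothetical) receives only the edge map and the prompt,
    never the original pixels, so the output appearance is determined by
    the prompt while the edge structure carries hand and face contours.
    """
    return [generate(edge_map(frame), prompt) for frame in frames]
```

In this sketch no pose estimator appears anywhere in the loop: the edge map is the only per-frame signal passed to the generator, and the text prompt describes the replacement appearance.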