
YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus (2407.11144v1)

Published 15 Jul 2024 in cs.CL

Abstract: Even for better-studied sign languages like American Sign Language (ASL), data is the bottleneck for machine learning research. The situation is worse yet for the many other sign languages used by Deaf/Hard of Hearing communities around the world. In this paper, we present YouTube-SL-25, a large-scale, open-domain multilingual corpus of sign language videos with seemingly well-aligned captions drawn from YouTube. With >3000 hours of videos across >25 sign languages, YouTube-SL-25 is a) >3x the size of YouTube-ASL, b) the largest parallel sign language dataset to date, and c) the first or largest parallel dataset for many of its component languages. We provide baselines for sign-to-text tasks using a unified multilingual multitask model based on T5 and report scores on benchmarks across 4 sign languages. The results demonstrate that multilingual transfer benefits both higher- and lower-resource sign languages within YouTube-SL-25.

Overview of "YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus"

The paper "YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus" by Garrett Tanzer and Biao Zhang introduces a novel and expansive dataset for sign languages. Sign languages, used by Deaf/Hard of Hearing communities worldwide, are visuospatial languages, making their processing with machine learning particularly challenging. Data scarcity has been a notable bottleneck, especially for sign languages other than American Sign Language (ASL). The YouTube-SL-25 dataset addresses this by offering a substantially larger and more diverse open-domain multilingual corpus, containing over 3000 hours of video across more than 25 sign languages.

Dataset Collection and Characteristics

The YouTube-SL-25 corpus comprises 3207 hours of video featuring more than 3000 unique signers, making it significantly larger than any existing parallel sign language dataset. The collection spans more than 25 sign languages and is the first or largest parallel dataset for many of them. Videos were mined in a two-step process: automatic classification of text metadata to identify candidate videos, followed by manual auditing by the first author to ensure content quality. Although this review was conducted with less language-specific expertise than earlier ASL-only efforts, it relies on nonverbal indicators of quality to identify genuine sign language content, yielding a dataset of seemingly well-aligned captions. A simplified sketch of the automatic triage stage appears below.
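
To make the two-step pipeline concrete, here is a minimal sketch of the first, automatic stage: keyword-based triage over video text metadata to select candidates for manual auditing. The keyword list, the metadata fields, and the matching rule are illustrative assumptions; the paper describes automatic classification of metadata rather than this exact heuristic.

```python
# Hedged sketch of automatic candidate selection from video text metadata.
# Keywords, metadata schema, and matching rule are illustrative assumptions.
import re

SIGN_KEYWORDS = [
    "sign language", "asl", "bsl", "dgs", "libras",
    "lengua de señas", "langue des signes", "gebärdensprache",
]

def is_candidate(video_meta: dict) -> bool:
    """Flag a video for manual auditing if its metadata mentions sign language."""
    text = " ".join([
        video_meta.get("title", ""),
        video_meta.get("description", ""),
        " ".join(video_meta.get("tags", [])),
    ]).lower()
    return any(re.search(rf"\b{re.escape(k)}\b", text) for k in SIGN_KEYWORDS)

videos = [
    {"title": "Learn ASL numbers 1-10", "description": "...", "tags": ["asl"]},
    {"title": "Cooking pasta at home", "description": "...", "tags": []},
]
candidates = [v for v in videos if is_candidate(v)]  # only the first video passes
```

Candidates flagged this way would then go to the manual audit step, where nonverbal cues (e.g., visible signing and caption alignment) determine whether a video is kept.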

Baseline Experiments and Results

The authors provide baseline results for sign-to-text tasks using a unified multilingual multitask model based on T5. They report results on benchmarks across four sign languages: American Sign Language (ASL), Swiss German Sign Language (DSGS), Swiss French Sign Language (LSF-CH), and Swiss Italian Sign Language (LIS-CH). Quantitative results for translation and language identification show that multilingual transfer improves performance for both higher- and lower-resource sign languages. Notably, pretraining on the complete YouTube-SL-25 dataset yielded significant improvements in translation quality as measured by BLEURT, demonstrating the value of such a comprehensive dataset.
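
As an illustration of how a T5-based sign-to-text model can be wired up, the following is a minimal sketch in which per-frame pose/landmark features are linearly projected into the embedding space of a pretrained T5 encoder-decoder. The feature dimension, checkpoint name, and projection layer are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch (assumptions: 255-dim per-frame landmark features, "t5-base").
# Frame features are projected to d_model and passed to T5 as inputs_embeds,
# so video frames play the role of input token embeddings.
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Tokenizer

class SignToText(nn.Module):
    def __init__(self, t5_name="t5-base", feature_dim=255):
        super().__init__()
        self.t5 = T5ForConditionalGeneration.from_pretrained(t5_name)
        self.proj = nn.Linear(feature_dim, self.t5.config.d_model)

    def forward(self, frame_feats, frame_mask, labels):
        # frame_feats: (batch, num_frames, feature_dim); frame_mask: (batch, num_frames)
        return self.t5(inputs_embeds=self.proj(frame_feats),
                       attention_mask=frame_mask,
                       labels=labels)

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = SignToText()
feats = torch.randn(2, 128, 255)                 # dummy landmark features
mask = torch.ones(2, 128, dtype=torch.long)
labels = tokenizer(["caption one", "caption two"],
                   return_tensors="pt", padding=True).input_ids
labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
loss = model(feats, mask, labels).loss
```

In a multitask setup, additional tasks such as sign language identification can reuse the same model by changing the target string; the exact task formatting used in the paper is not reproduced here.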

Numerical Results and Claims

Key numerical results from the paper include:

  • The dataset contains 3207 hours of video across 39197 videos, surpassing the previous largest parallel sign language dataset by a wide margin.
  • Pretraining on YouTube-SL-25 yielded a BLEURT score of 47.9 for ASL on the How2Sign benchmark, roughly tripling the score obtained without pretraining (see the evaluation sketch after this list).
  • Sign language identification accuracy reached 100% for high-resource languages after pretraining on the full dataset.
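
For reference, sentence-level BLEURT scores like those above can be computed with off-the-shelf tooling. Below is a small sketch using the Hugging Face `evaluate` wrapper around BLEURT; the checkpoint name and example sentences are assumptions, and the paper does not specify this exact tooling.

```python
# Sketch of sentence-level BLEURT scoring (assumed checkpoint: BLEURT-20).
# Requires `pip install evaluate` plus the BLEURT package from google-research.
import evaluate

bleurt = evaluate.load("bleurt", "BLEURT-20")
predictions = ["the weather is nice today"]   # model outputs (illustrative)
references = ["today the weather is nice"]    # reference captions (illustrative)
scores = bleurt.compute(predictions=predictions, references=references)
print(scores["scores"])  # one score per hypothesis-reference pair
```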

These results underscore the value of the dataset and the benefits of multilingual transfer learning.

Practical and Theoretical Implications

Practically, YouTube-SL-25 sets a new standard for sign language datasets, facilitating improved machine learning models for a wide range of sign languages. It enables stronger pretrained models, leading to translation and comprehension systems that could significantly aid Deaf/Hard of Hearing communities. Theoretically, the corpus may encourage research into robust filtering and preprocessing techniques for sign language data, potentially opening pathways to even larger and more richly annotated datasets.

Future Developments

Looking ahead, future work could focus on several avenues:

  • Refinement and annotation by native signers to improve dataset quality.
  • Development of robust filtering tools to manage and scale sign language data collection.
  • Extension of the benchmark evaluations to more sign languages within the YouTube-SL-25 corpus, fostering broader multilingual model assessments.

The quality and scale of the YouTube-SL-25 dataset position it as a pivotal resource for advancing the state of sign language processing in machine learning, holding promise for increasingly inclusive and effective communication technologies.

Conclusion

The YouTube-SL-25 dataset represents a significant advancement in the availability and diversity of sign language data. By leveraging large-scale, open-domain multilingual video data from YouTube, the authors have addressed substantial gaps in previous datasets. Baseline experiments validate the utility of this dataset, showing strong results in translation and identification tasks. Future work will likely build on this foundation to continue expanding and improving resources and methodologies for sign language processing.
