- The paper introduces a scalable dataset, BSL-1K, that uses mouthing cues from interpreters to enhance sign annotation.
- It details a novel methodology extracting 1,064 signs from 1,000 hours of TV broadcasts using visual keyword spotting.
- The study demonstrates cross-linguistic utility, with models pre-trained on BSL-1K excelling on MSASL and WLASL benchmarks.
Scaling Up Co-Articulated Sign Language Recognition Through Mouthing Cues: An Examination of the BSL-1K Dataset
This paper focuses on advancing automated sign language recognition, tackling two challenges inherent to the field: the complexity of annotating co-articulated signs and the scarcity of annotated data. The authors' central contribution is the BSL-1K dataset, notable for both its scale and its methodology: it leverages mouthing cues from British Sign Language (BSL) interpreters to drive the data collection process.
The paper systematically details the construction of BSL-1K, a large collection of sign language video covering a vocabulary of 1,064 BSL signs localized within 1,000 hours of subtitle-aligned British television broadcasts. This is achieved through visual keyword spotting: because interpreters frequently mouth the words they sign, mouthing serves as an alignment signal between the continuous video stream and the weakly-timed subtitles, allowing signs to be annotated automatically.
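The core spotting step can be illustrated with a minimal sketch: given per-frame posteriors from a visual keyword spotter (e.g. a lip-reading model) indicating how likely the interpreter is mouthing a subtitle keyword, the sign is localized at the confidence peak inside a padded window around the subtitle timing. The function name, padding, and threshold below are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of the mouthing-based annotation step, assuming per-frame
# keyword posteriors are already available from a visual keyword spotter.
import numpy as np

def spot_sign(keyword_probs: np.ndarray,
              subtitle_start: int,
              subtitle_end: int,
              pad: int = 100,        # frames of slack around the weakly-timed subtitle (assumed value)
              threshold: float = 0.5):
    """Return (frame_index, confidence) for one keyword, or None if not spotted.

    keyword_probs : per-frame posterior that the interpreter is mouthing the keyword.
    subtitle_start/end : frame indices of the subtitle containing the keyword.
    """
    lo = max(0, subtitle_start - pad)
    hi = min(len(keyword_probs), subtitle_end + pad)
    window = keyword_probs[lo:hi]
    if window.size == 0:
        return None
    peak = int(np.argmax(window))
    confidence = float(window[peak])
    # Keep only confident localizations; low-confidence peaks are discarded.
    if confidence < threshold:
        return None
    return lo + peak, confidence

# Toy usage: a synthetic probability track with a peak inside the subtitle window.
probs = np.zeros(1000)
probs[420:430] = np.linspace(0.2, 0.9, 10)
print(spot_sign(probs, subtitle_start=400, subtitle_end=460))  # -> (429, 0.9)
```

Each confident peak yields an automatic sign annotation centered on that frame, which is how the dataset scales without manual labeling.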
Several key results stem from this paper:
- Annotated Dataset Construction: By spotting mouthed keywords drawn from the subtitles, the paper creates a large-scale annotated dataset, BSL-1K, significantly expanding the resources available for training sign language recognition models.
- Model Training and Evaluation: Models trained on BSL-1K demonstrate strong performance in recognizing co-articulated signs and achieve superior results on the MSASL and WLASL benchmarks, underscoring the dataset's robustness and the models' utility.
- Generalization Across Datasets and Languages: Models pre-trained on BSL-1K provide effective initialization for American Sign Language datasets, demonstrating utility beyond BSL and suggesting cross-linguistic applications of the approach (see the fine-tuning sketch after this list).
- Baseline Creation for Future Work: New evaluation sets for sign recognition and spotting derived from this dataset lay the groundwork for future research, establishing baseline metrics for the continued evolution of this field.
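A minimal sketch of the cross-lingual transfer step, assuming a spatio-temporal CNN pre-trained for the 1,064-way BSL-1K vocabulary is fine-tuned on an ASL benchmark such as WLASL. Torchvision's R3D-18 stands in here for the authors' I3D backbone, and the checkpoint path and class counts are illustrative assumptions.

```python
# Hedged sketch of transfer from BSL-1K pre-training to an ASL benchmark.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

NUM_BSL1K_CLASSES = 1064   # BSL-1K vocabulary size
NUM_TARGET_CLASSES = 2000  # e.g. a WLASL split (assumed for illustration)

# 1) Build the backbone with a BSL-1K-sized head and (hypothetically) load
#    weights obtained by pre-training on BSL-1K.
model = r3d_18(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_BSL1K_CLASSES)
# model.load_state_dict(torch.load("bsl1k_pretrained.pth"))  # hypothetical checkpoint path

# 2) Swap the classification head for the target vocabulary and fine-tune end to end.
model.fc = nn.Linear(model.fc.in_features, NUM_TARGET_CLASSES)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Toy forward/backward pass on a random clip batch: (batch, channels, frames, H, W).
clips = torch.randn(2, 3, 16, 112, 112)
labels = torch.randint(0, NUM_TARGET_CLASSES, (2,))
loss = criterion(model(clips), labels)
loss.backward()
optimizer.step()
```

The design point this illustrates is that only the classification head is vocabulary-specific; the pre-trained spatio-temporal features transfer across sign languages, which is what the BSL-1K-to-ASL results exploit.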
This paper contributes significantly to both the practical and theoretical aspects of sign language recognition. Practically, it provides a scalable approach to developing annotated datasets critical for training data-intensive sign recognition models. Theoretically, it opens new pathways for using unconventional cues such as mouthing to align visual and linguistic data, challenging traditional reliance on gesture and facial expression alone.
Looking toward future developments building on this work, a primary consideration is extending these techniques to other sign languages with distinct grammatical and expressive features. Additionally, integrating such recognition systems into socially impactful applications, such as enhancing communication accessibility for deaf communities or improving educational resources for hearing learners of sign languages, aligns well with broader socially responsible AI initiatives.
In conclusion, the introduction of the BSL-1K dataset and the method of leveraging mouthing cues represent a significant leap in the automated recognition of co-articulated sign language. They offer a scalable, efficient, and less annotation-heavy pathway to creating large datasets, ultimately contributing to the broader field's goal of moving from isolated sign recognition to continuous, sentence-level translation systems.