- The paper introduces a scalable dataset, BSL-1K, that uses mouthing cues from interpreters to enhance sign annotation.
- It details a novel methodology extracting 1,064 signs from 1,000 hours of TV broadcasts using visual keyword spotting.
- The study demonstrates cross-linguistic utility, with models pre-trained on BSL-1K excelling on MSASL and WLASL benchmarks.
Scaling Up Co-Articulated Sign Language Recognition Through Mouthing Cues: An Examination of the BSL-1K Dataset
This paper focuses on advancing automated sign language recognition, tackling two challenges inherent to the field: the complexity of annotating co-articulated signs and the scarcity of annotated data. The authors' central contribution is the BSL-1K dataset, notable for both its scale and its methodology: it leverages mouthing cues from British Sign Language (BSL) interpreters to drive the data collection process.
The paper systematically details the construction of BSL-1K, a large collection of sign language video covering a vocabulary of 1,064 BSL signs localized within 1,000 hours of subtitle-aligned British television broadcasts. This is achieved through visual keyword spotting: because interpreters frequently mouth the words they sign, mouthing serves as an alignment signal between the continuous video stream and the weakly-timed subtitles, allowing signs to be annotated automatically.
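The core spotting step can be illustrated with a minimal sketch: given per-frame posteriors from a visual keyword spotter (e.g. a lip-reading model) indicating how likely the interpreter is mouthing a subtitle keyword, the sign is localized at the confidence peak inside a padded window around the subtitle timing. The function name, padding, and threshold below are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of the mouthing-based annotation step, assuming per-frame
# keyword posteriors are already available from a visual keyword spotter.
import numpy as np

def spot_sign(keyword_probs: np.ndarray,
              subtitle_start: int,
              subtitle_end: int,
              pad: int = 100,        # frames of slack around the weakly-timed subtitle (assumed value)
              threshold: float = 0.5):
    """Return (frame_index, confidence) for one keyword, or None if not spotted.

    keyword_probs : per-frame posterior that the interpreter is mouthing the keyword.
    subtitle_start/end : frame indices of the subtitle containing the keyword.
    """
    lo = max(0, subtitle_start - pad)
    hi = min(len(keyword_probs), subtitle_end + pad)
    window = keyword_probs[lo:hi]
    if window.size == 0:
        return None
    peak = int(np.argmax(window))
    confidence = float(window[peak])
    # Keep only confident localizations; low-confidence peaks are discarded.
    if confidence < threshold:
        return None
    return lo + peak, confidence

# Toy usage: a synthetic probability track with a peak inside the subtitle window.
probs = np.zeros(1000)
probs[420:430] = np.linspace(0.2, 0.9, 10)
print(spot_sign(probs, subtitle_start=400, subtitle_end=460))  # -> (429, 0.9)
```

Each confident peak yields an automatic sign annotation centered on that frame, which is how the dataset scales without manual labeling.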
Several key results stem from this paper:
- Annotated Dataset Construction: By spotting mouthed keywords drawn from the subtitles, the paper creates a large-scale annotated dataset, BSL-1K, significantly expanding the resources available for training sign language recognition models.
- Model Training and Evaluation: Models trained on BSL-1K demonstrate strong performance in recognizing co-articulated signs and achieve superior results on the MSASL and WLASL benchmarks, underscoring the dataset's robustness and the models' utility.
- Generalization Across Datasets and Languages: Models pre-trained on BSL-1K provide effective initialization for American Sign Language datasets, demonstrating utility beyond BSL and suggesting cross-linguistic applications of the approach (see the fine-tuning sketch after this list).
- Baseline Creation for Future Work: New evaluation sets for sign recognition and spotting derived from this dataset lay the groundwork for future research, establishing baseline metrics for the continued evolution of this field.
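A minimal sketch of the cross-lingual transfer step, assuming a spatio-temporal CNN pre-trained for the 1,064-way BSL-1K vocabulary is fine-tuned on an ASL benchmark such as WLASL. Torchvision's R3D-18 stands in here for the authors' I3D backbone, and the checkpoint path and class counts are illustrative assumptions.

```python
# Hedged sketch of transfer from BSL-1K pre-training to an ASL benchmark.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

NUM_BSL1K_CLASSES = 1064   # BSL-1K vocabulary size
NUM_TARGET_CLASSES = 2000  # e.g. a WLASL split (assumed for illustration)

# 1) Build the backbone with a BSL-1K-sized head and (hypothetically) load
#    weights obtained by pre-training on BSL-1K.
model = r3d_18(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_BSL1K_CLASSES)
# model.load_state_dict(torch.load("bsl1k_pretrained.pth"))  # hypothetical checkpoint path

# 2) Swap the classification head for the target vocabulary and fine-tune end to end.
model.fc = nn.Linear(model.fc.in_features, NUM_TARGET_CLASSES)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Toy forward/backward pass on a random clip batch: (batch, channels, frames, H, W).
clips = torch.randn(2, 3, 16, 112, 112)
labels = torch.randint(0, NUM_TARGET_CLASSES, (2,))
loss = criterion(model(clips), labels)
loss.backward()
optimizer.step()
```

The design point this illustrates is that only the classification head is vocabulary-specific; the pre-trained spatio-temporal features transfer across sign languages, which is what the BSL-1K-to-ASL results exploit.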
This paper contributes significantly to both the practical and theoretical aspects of sign language recognition. Practically, it provides a scalable approach to developing annotated datasets critical for training data-intensive sign recognition models. Theoretically, it opens new pathways for using unconventional cues such as mouthing to align visual and linguistic data, challenging traditional reliance on gesture and facial expression alone.
Looking toward future developments building on this work, a primary consideration is extending these techniques to other sign languages with distinct grammatical and expressive features. Additionally, integrating such recognition systems into socially impactful applications, such as enhancing communication accessibility for deaf communities or improving educational resources for hearing learners of sign languages, aligns well with broader socially responsible AI initiatives.
In conclusion, the introduction of the BSL-1K dataset and the method of leveraging mouthing cues represent a significant leap in the automated recognition of co-articulated sign language. They offer a scalable, efficient, and less annotation-heavy pathway to creating large datasets, ultimately contributing to the broader field's goal of moving from isolated sign recognition to continuous, sentence-level translation systems.