- The paper presents MS-ASL, a novel large-scale dataset that addresses the scarcity of labeled American Sign Language video available for training recognition systems.
- It outlines a multi-stage video processing pipeline leveraging face detection, temporal segmentation, and signer-specific bounding box extraction for robust data annotation.
- The evaluation demonstrates that the 3D-CNN (I3D) model achieves up to 81.08% top-five accuracy on the full 1,000-class vocabulary, highlighting its effectiveness under signer variability and real-world conditions.
MS-ASL: A Large-Scale Data Set and Benchmark for Understanding American Sign Language
The paper "MS-ASL: A Large-Scale Data Set and Benchmark for Understanding American Sign Language" by Hamid Reza Vaezi Joze and Oscar Koller addresses a critical gap in the sign language recognition (SLR) research by presenting a novel dataset designed to advance the state of the art in this domain. This research underscores the need for substantial datasets to leverage recent advancements in deep learning, which have significantly improved performance across various computer vision tasks. The scarcity of large labeled datasets has been a limiting factor for the application of modern machine learning techniques to sign language recognition.
Dataset and Methodology
The MS-ASL dataset consists of over 25,000 annotated videos covering 1,000 distinct signs performed by more than 200 signers. It stands out for its extensive vocabulary and the diversity of its participants, which enables robust evaluation of algorithms under realistic, signer-independent conditions. This is a significant advance over previous datasets, which were limited in vocabulary and did not thoroughly address the variability introduced by different signers and recording conditions.
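To make the signer-independent protocol concrete, the following is a minimal sketch of how such a split can be constructed: whole signers, not individual clips, are partitioned between train and test, so no signer appears in both. The record fields (`signer_id`, `label`, `video`) are illustrative placeholders, not the dataset's actual annotation schema.

```python
import random

# Illustrative signer-independent split: hold out entire signers so that
# nobody seen at training time appears in the test set. Field names are
# placeholders, not MS-ASL's actual schema.
def signer_independent_split(clips, test_fraction=0.2, seed=0):
    signers = sorted({c["signer_id"] for c in clips})
    random.Random(seed).shuffle(signers)
    n_test = max(1, int(len(signers) * test_fraction))
    test_signers = set(signers[:n_test])
    train = [c for c in clips if c["signer_id"] not in test_signers]
    test = [c for c in clips if c["signer_id"] in test_signers]
    return train, test

clips = [
    {"signer_id": 3, "label": "hello", "video": "a.mp4"},
    {"signer_id": 7, "label": "thanks", "video": "b.mp4"},
    {"signer_id": 3, "label": "book", "video": "c.mp4"},
]
train, test = signer_independent_split(clips)
```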
The authors employ a multi-stage video processing pipeline. Videos are first gathered from public platforms, where accompanying captions or titles aid labeling. Subsequent stages apply face detection for signer identification and temporal segmentation to isolate the segments that depict the signs. Finally, signer-specific bounding boxes are extracted and serve as inputs to the models being evaluated.
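The paper's exact pipeline is not reproduced here, but the bounding-box stage can be sketched as follows, assuming OpenCV's stock Haar-cascade face detector and an illustrative heuristic that expands the detected face box into a crop large enough to cover the signing space; the expansion factors and the input file name are assumptions.

```python
import cv2

# Detect the signer's face and expand it into a crop covering the
# signing space. Expansion factors are illustrative, not the paper's.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def signer_crop(frame, scale_w=4.0, scale_h=5.0):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, 1.1, 5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest face
    cx, cy = x + w / 2, y + h / 2
    bw, bh = w * scale_w, h * scale_h
    x0 = int(max(cx - bw / 2, 0))
    y0 = int(max(cy - bh / 4, 0))            # box extends mostly downward
    x1 = int(min(cx + bw / 2, frame.shape[1]))
    y1 = int(min(cy + 3 * bh / 4, frame.shape[0]))
    return frame[y0:y1, x0:x1]

cap = cv2.VideoCapture("sign_clip.mp4")      # hypothetical input clip
while True:
    ok, frame = cap.read()
    if not ok:
        break
    crop = signer_crop(frame)
    # ... resize `crop` and feed it to the recognition model
cap.release()
```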
Evaluation and Baseline Approaches
To benchmark the dataset, several state-of-the-art sign language recognition approaches were evaluated, including:
- 2D-CNN+LSTM: A framework combining 2D convolutional neural networks with long short-term memory networks to capture spatial and temporal features (sketched after this list).
- Body Key-Points (HCN): Employing hierarchical co-occurrence networks to leverage human body key-points, capturing crucial dynamic and spatial features for gesture recognition.
- 3D-CNN (I3D): Implementing Inflated 3D ConvNets known for their efficacy in action recognition tasks. The authors found this approach to outperform others significantly, showcasing its suitability for SLR tasks.
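For reference on the first baseline, below is a minimal PyTorch sketch of a 2D-CNN+LSTM recognizer. The ResNet-18 backbone, hidden size, and last-step classification are assumptions standing in for whatever per-frame network and pooling the authors actually used.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CNNLSTM(nn.Module):
    """Per-frame 2D CNN features fed through an LSTM over time."""
    def __init__(self, num_classes=1000, hidden=512):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()              # 512-dim frame features
        self.backbone = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):                    # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])             # classify from last step

logits = CNNLSTM()(torch.randn(2, 16, 3, 224, 224))  # -> (2, 1000)
```

This factorization handles space and time in separate stages; I3D instead fuses them in a single spatio-temporal 3D convolutional network, which proves decisive in the results below.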
The I3D architecture exhibits substantial performance improvements over the other baselines, with top-five accuracy reaching 81.08% on the full 1,000-class subset. These results affirm the architecture's robustness to signer variability and complex real-world conditions.
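For clarity on the metric, top-five accuracy counts a prediction as correct when the ground-truth class appears among the model's five highest-scoring classes; a minimal sketch:

```python
import torch

def top5_accuracy(logits, targets):
    # A clip counts as correct if its true class is among the
    # five highest-scoring predictions.
    top5 = logits.topk(5, dim=1).indices              # (N, 5)
    hits = (top5 == targets.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

logits = torch.randn(8, 1000)                         # 8 clips, 1,000 classes
targets = torch.randint(0, 1000, (8,))
print(top5_accuracy(logits, targets))
```

Top-five accuracy is a natural headline number at this vocabulary size, since chance-level top-one performance over 1,000 classes is only 0.1%.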
Implications and Future Directions
The release of the MS-ASL dataset has implications for both the practical and theoretical landscape of sign language recognition. Practically, it facilitates the development and benchmarking of more generalized SLR systems capable of handling signer-independent variation. Theoretically, it encourages further exploration into integrating multimodal cues, such as facial expressions and contextual language modeling, to enhance recognition accuracy.
The research points toward potential future advances, notably in integrating optical flow to capture dynamic information and in improving hand key-point extraction, which is crucial for differentiating subtle variations in sign execution. Additionally, the dataset establishes a foundation for pre-trained models specific to SLR tasks, which could serve as a basis for transfer learning and further improve performance in low-resource scenarios.
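A transfer-learning workflow of the kind envisioned here might look like the following sketch. It uses torchvision's r3d_18 (pretrained on Kinetics-400) as a stand-in for an SLR-pretrained I3D, replaces the classification head, and fine-tunes only that head; the model choice, the frozen backbone, and the 50-sign target task are all assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

# Stand-in for an SLR-pretrained I3D: a Kinetics-pretrained 3D ConvNet.
model = r3d_18(weights="KINETICS400_V1")
model.fc = nn.Linear(model.fc.in_features, 50)   # hypothetical 50-sign task

# Freeze the backbone and train only the new head, a common choice
# when the target sign vocabulary has few labeled clips.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("fc")

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)

clips = torch.randn(2, 3, 16, 112, 112)          # (B, C, T, H, W)
loss = nn.CrossEntropyLoss()(model(clips), torch.tensor([3, 7]))
loss.backward()
optimizer.step()
```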
In conclusion, the MS-ASL dataset represents a significant contribution to the field of sign language recognition, providing a robust platform for advancing research and developing technologies that can bridge communication barriers faced by the Deaf community. The dataset is expected to catalyze further innovations in sign language recognition and related assistive technologies.