OpenHands: Making Sign Language Recognition Accessible with Pose-based Pretrained Models across Languages (2110.05877v1)

Published 12 Oct 2021 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: AI technologies for Natural Languages have made tremendous progress recently. However, commensurate progress has not been made on Sign Languages, in particular, in recognizing signs as individual words or as complete sentences. We introduce OpenHands, a library where we take four key ideas from the NLP community for low-resource languages and apply them to sign languages for word-level recognition. First, we propose using pose extracted through pretrained models as the standard modality of data to reduce training time and enable efficient inference, and we release standardized pose datasets for 6 different sign languages - American, Argentinian, Chinese, Greek, Indian, and Turkish. Second, we train and release checkpoints of 4 pose-based isolated sign language recognition models across all 6 languages, providing baselines and ready checkpoints for deployment. Third, to address the lack of labelled data, we propose self-supervised pretraining on unlabelled data. We curate and release the largest pose-based pretraining dataset on Indian Sign Language (Indian-SL). Fourth, we compare different pretraining strategies and for the first time establish that pretraining is effective for sign language recognition by demonstrating (a) improved fine-tuning performance especially in low-resource settings, and (b) high crosslingual transfer from Indian-SL to few other sign languages. We open-source all models and datasets in OpenHands with a hope that it makes research in sign languages more accessible, available here at https://github.com/AI4Bharat/OpenHands .

Citations (49)

Summary

  • The paper proposes pose, extracted with pretrained models, as the standard input modality, analogous to pretrained encoders like BERT in NLP; this reduces training time and boosts inference efficiency.
  • It benchmarks models like LSTM, Transformer, and graph-based networks (ST-GCN, SL-GCN) across datasets for six diverse sign languages to set a new research baseline.
  • Self-supervised pretraining—especially Predictive Coding—is highlighted as a key strategy for improving crosslingual sign language recognition and enabling real-time applications.

Overview of "OpenHands: Making Sign Language Recognition Accessible with Pose-based Pretrained Models across Languages"

In recent years, advancements in AI technologies for natural languages have surged forward. However, similar progress in Automatic Sign Language Recognition (SLR), particularly for low-resource datasets and diverse sign languages, has been limited. The paper "OpenHands: Making Sign Language Recognition Accessible with Pose-based Pretrained Models across Languages" makes significant strides toward closing this gap by implementing, standardizing, and making accessible a range of pose-based models, training datasets, and pretraining strategies across multiple sign languages.

Key Contributions

  1. Pose Estimation as a Standard Modality: The paper highlights the utility of pose extracted through pretrained models as a pivotal modality for SLR, akin to using a pretrained encoder like BERT in NLP. This approach significantly reduces training time and improves inference efficiency. Datasets formatted with pose features are provided for six sign languages: American, Argentinian, Chinese, Greek, Indian, and Turkish (a minimal extraction sketch follows this list).
  2. Standardized Evaluation Across Languages: By benchmarking four models—LSTM, Transformer, ST-GCN, and SL-GCN—on seven datasets, the paper establishes baselines for further isolated sign language recognition (ISLR) research. These benchmarks enable an objective comparison across languages and model types.
  3. Self-supervised Pretraining: To address the scarcity of labeled data in sign language corpora, the authors propose self-supervised pretraining on unlabeled data. They curate and release an extensive pose-based pretraining dataset (1,129 hours) for Indian Sign Language (Indian-SL).
  4. Comparing Pretraining Strategies: The researchers evaluate multiple pretraining paradigms, including masking-based, contrastive-learning-based, and predictive-coding objectives. Notably, predictive coding emerges as a particularly effective strategy, enhancing in-language performance and enabling significant crosslingual transfer.
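
To make the first contribution concrete, here is a minimal sketch of pose extraction using MediaPipe Holistic, a common pretrained keypoint extractor for this kind of pipeline. The exact extractor and preprocessing behind the released datasets may differ, and `extract_pose_sequence` is an illustrative helper, not part of the OpenHands API.

```python
import cv2
import mediapipe as mp
import numpy as np

def _landmarks_to_xy(landmarks, n_points):
    """Return an (n_points, 2) array of normalized (x, y); zeros if not detected."""
    if landmarks is None:
        return np.zeros((n_points, 2))  # e.g., a hand may be out of frame
    return np.array([[lm.x, lm.y] for lm in landmarks.landmark])

def extract_pose_sequence(video_path):
    """Per-frame body + hand keypoints: (num_frames, 75, 2) for 33 body + 2x21 hand joints."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    with mp.solutions.holistic.Holistic(static_image_mode=False) as holistic:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB; OpenCV reads BGR.
            res = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            frames.append(np.concatenate([
                _landmarks_to_xy(res.pose_landmarks, 33),
                _landmarks_to_xy(res.left_hand_landmarks, 21),
                _landmarks_to_xy(res.right_hand_landmarks, 21),
            ]))
    cap.release()
    return np.stack(frames)
```

Working from these compact keypoint arrays rather than raw video is what makes training and inference cheap enough for low-resource settings.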

Implications

The practical implications of this work are extensive. The research underscores that graph-based models such as ST-GCN and SL-GCN provide state-of-the-art performance on SLR tasks, achieving real-time inference capabilities. This characteristic is vital for deploying interactive, real-world applications supporting sign language users, such as video conferencing tools and communication apps.
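
To illustrate why graph-based models fit pose data, the sketch below shows a single ST-GCN-style layer: a spatial graph convolution that mixes features along skeleton edges, followed by a temporal convolution over frames. This is a simplified, single-partition illustration under standard ST-GCN assumptions, not the paper's SL-GCN implementation.

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """One simplified ST-GCN-style block: spatial graph conv + temporal conv."""
    def __init__(self, in_ch, out_ch, adjacency, kernel_t=9):
        super().__init__()
        # Symmetrically normalized adjacency with self-loops, fixed as a buffer.
        A = adjacency + torch.eye(adjacency.size(0))
        d = A.sum(dim=1).pow(-0.5)
        self.register_buffer("A_norm", d.unsqueeze(1) * A * d.unsqueeze(0))
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(kernel_t, 1),
                                  padding=(kernel_t // 2, 0))
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch, channels, frames, joints)
        x = self.spatial(x)
        # Mix each joint's features with its skeleton neighbors.
        x = torch.einsum("nctv,vw->nctw", x, self.A_norm)
        return self.relu(self.temporal(x))
```

For the 75-joint skeletons above, `adjacency` would be a 75x75 binary matrix of bone connections, and the input `x` carries the (x, y) coordinates as two channels; stacking a few such layers and pooling over frames and joints yields a sign classifier.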

Theoretically, the paper establishes that pretraining strategies common in NLP can be effectively adapted to the complexity of ISLR tasks. This offers a framework for leveraging large but unlabeled sign language video datasets, promoting future crosslingual generalization across numerous less-studied sign languages.
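
As one concrete instance of such an adapted objective, the sketch below implements a generic BERT-style masked-regression loss over pose sequences: random frames are zeroed out, and the model is trained to reconstruct them. This is an approximation of the masking-based strategy the paper evaluates, not its exact formulation; `encoder` and `regressor` are hypothetical modules (e.g., a Transformer over frames and a linear projection back to keypoint space).

```python
import torch
import torch.nn as nn

def masked_frame_loss(encoder, regressor, poses, mask_prob=0.15):
    """Masked-prediction objective over pose sequences.

    poses: (batch, frames, features) flattened keypoint coordinates.
    """
    # Choose ~15% of frames to corrupt, as in BERT-style masking.
    mask = torch.rand(poses.shape[:2], device=poses.device) < mask_prob  # (B, T)
    corrupted = poses.masked_fill(mask.unsqueeze(-1), 0.0)
    hidden = encoder(corrupted)       # contextual frame representations
    predicted = regressor(hidden)     # project back to keypoint space
    # Only masked positions contribute to the loss.
    return nn.functional.mse_loss(predicted[mask], poses[mask])
```

The appeal of this family of objectives is that the supervisory signal comes from the pose stream itself, so any unlabeled signing video can be used for pretraining.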

Future Research Directions

The findings and resources from this paper provide a strong foundation for various future research directions. Future work may include expanding the datasets to incorporate more sign languages, refining pose extraction models like HRNet, enhancing crosslingual performance further, and extending the methodology to Continuous Sign Language Recognition (CSLR). Moreover, there's potential to improve the deployment utility of the models by integrating them into real-time tools, perhaps incorporating quantized inference.
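
On the quantized-inference point, PyTorch's built-in dynamic quantization is one readily available option: it stores Linear and LSTM weights as int8 and quantizes activations on the fly, so no calibration data is needed. The checkpoint paths below are hypothetical.

```python
import torch

# Load a trained pose-based recognizer (hypothetical checkpoint path).
model = torch.load("islr_lstm.pt")
model.eval()

# Convert Linear/LSTM weights to int8; activations are quantized at runtime.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear, torch.nn.LSTM}, dtype=torch.qint8
)
torch.save(quantized, "islr_lstm_int8.pt")
```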

In conclusion, by offering open access to comprehensive datasets and models through the OpenHands library, this work facilitates a step toward the democratization of ISLR research. Embracing the diverse and global tapestry of sign languages, OpenHands promotes the development of accessible, real-time SLR applications, significantly enhancing the communicative capabilities of the Deaf and hard-of-hearing communities worldwide.