
Neural Sign Language Translation based on Human Keypoint Estimation (1811.11436v2)

Published 28 Nov 2018 in cs.CV

Abstract: We propose a sign language translation system based on human keypoint estimation. It is well known that many problems in computer vision require massive amounts of data to train deep neural network models, and the situation is even worse for sign language translation, where high-quality training data is far more difficult to collect. In this paper, we introduce the KETI (Korea Electronics Technology Institute) sign language dataset, which consists of 14,672 videos of high resolution and quality. Considering that each country has a different and unique sign language, the KETI dataset can serve as a starting point for further research on Korean sign language translation. Using this dataset, we develop a neural network model for translating sign videos into natural language sentences by utilizing human keypoints extracted from the face, hands, and body. The extracted keypoint vector is normalized by the mean and standard deviation of the keypoints and used as input to our translation model, which is based on the sequence-to-sequence architecture. We show that our approach is robust even when the amount of training data is limited. Our translation model achieves 93.28% translation accuracy on the validation set and 55.28% on the test set for 105 sentences that can be used in emergency situations. We also compare several variants of our neural sign translation model, based on different attention mechanisms, using classical metrics for measuring translation performance.

Citations (190)

Summary

The research focuses on developing a sign language translation system based on human keypoint estimation. Its central contribution is the KETI sign language dataset, a Korean Sign Language corpus of 14,672 high-resolution videos recorded in controlled environments to ensure the quality and consistency needed to train deep learning models. The dataset is intended to catalyze research on translating Korean Sign Language into natural language, acknowledging that each country's sign language has its own unique characteristics.

The paper addresses the inherent challenge of sign language recognition, which demands precise interpretation of visual cues from signers, such as body movement and facial expressions. The challenge is compounded by the limited availability of the high-quality datasets needed to train neural networks.

The approach uses human keypoint estimation to translate sign videos into textual sentences suitable for emergency situations, achieving 93.28% translation accuracy on the validation set and 55.28% on an unseen test set. Keypoint features are extracted from the signer's face, hands, and body, and each keypoint vector is standardized by its mean and standard deviation to reduce variance across signers and recording environments, improving robustness when training data is limited; a minimal sketch of this step follows.
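As a rough illustration of the normalization step, the sketch below standardizes a single frame's keypoint vector by its own mean and standard deviation, as the abstract describes. It assumes keypoints arrive as a flat array of (x, y) pixel coordinates from a pose estimator; the exact feature layout and any per-part handling in the paper may differ.

```python
import numpy as np

def normalize_keypoints(keypoints: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standardize a flat keypoint vector by its own mean and standard deviation.

    keypoints: 1-D array of concatenated (x, y) coordinates for the face,
    hand, and body keypoints detected in a single frame.
    """
    mean = keypoints.mean()
    std = keypoints.std()
    return (keypoints - mean) / (std + eps)  # eps guards against zero variance

# Hypothetical example: 137 keypoints (body + face + both hands) -> 274 values.
frame = np.random.rand(274) * 1920           # raw pixel coordinates
features = normalize_keypoints(frame)        # roughly zero-mean, unit-variance input
```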

The paper compares several neural architectures: a traditional sequence-to-sequence (seq2seq) model, seq2seq models with Bahdanau and Luong attention, and the Transformer. The experiments show that while the Transformer generalizes more effectively to unseen signers, the seq2seq model with Luong attention performs best on the validation set; the two classical attention variants are sketched below.
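For reference, the two classical attention variants differ mainly in how they score the alignment between a decoder state and each encoder state. The sketch below shows both scoring functions in their generic form (illustrative PyTorch, not the paper's exact implementation; hidden sizes and parameter names are assumptions):

```python
import torch
import torch.nn as nn

class LuongScore(nn.Module):
    """Multiplicative (general) attention: score = h_dec^T W h_enc."""
    def __init__(self, hidden: int):
        super().__init__()
        self.W = nn.Linear(hidden, hidden, bias=False)

    def forward(self, dec_h, enc_h):
        # dec_h: (batch, hidden); enc_h: (batch, seq, hidden) -> (batch, seq)
        return torch.bmm(enc_h, self.W(dec_h).unsqueeze(2)).squeeze(2)

class BahdanauScore(nn.Module):
    """Additive attention: score = v^T tanh(W1 h_dec + W2 h_enc)."""
    def __init__(self, hidden: int):
        super().__init__()
        self.W1 = nn.Linear(hidden, hidden)
        self.W2 = nn.Linear(hidden, hidden)
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, dec_h, enc_h):
        # Broadcast the decoder state over the encoder time axis.
        s = torch.tanh(self.W1(dec_h).unsqueeze(1) + self.W2(enc_h))
        return self.v(s).squeeze(2)  # (batch, seq)

# Either score is turned into attention weights with a softmax over the
# encoder time axis; Luong's form is cheaper, Bahdanau's adds a nonlinearity.
```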

This research has several implications. Practically, the system could enhance communication for hearing-impaired users, particularly in emergency contexts, and it suggests pathways toward real-time applications that work efficiently with limited training data through better feature extraction and selective preprocessing.

Theoretically, this work extends machine translation methodology to constrained-data settings and highlights room for further refinement in feature extraction, notably through improved human keypoint detection algorithms. Future directions include expanding the dataset to more signers and environments, assessing alternative keypoint detection methods, and leveraging data augmentation for improved training outcomes.

Ultimately, this research underscores the complexity of sign language translation and the crucial role datasets play in advancing machine translation technology and expanding accessible communication resources for underrepresented linguistic communities.