Video-based Sign Language Recognition without Temporal Segmentation
The paper "Video-based Sign Language Recognition without Temporal Segmentation" addresses a key challenge in continuous Sign Language Recognition (SLR) by introducing a novel framework known as the Hierarchical Attention Network with Latent Space (LS-HAN). The paper responds to the pervasive difficulty of temporal segmentation in SLR tasks, aiming to improve on existing methods that traditionally utilize isolated SLR as their cornerstone, which can be impaired by costly and often erroneous temporal segmentation processes.
Framework Overview
The LS-HAN framework consists of three fundamental components: a two-stream 3D Convolutional Neural Network (CNN), a Latent Space (LS) module, and a Hierarchical Attention Network (HAN). The two-stream 3D CNN generates video feature representations that encapsulate both global and localized information, allowing the framework to capture the intricate hand gestures and upper-body movements central to accurate sign language recognition. The global stream processes the full frame to capture overall body and hand motion, while the local stream operates on cropped hand regions obtained through hand detection and tracking, focusing on fine-grained gesture details.
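As a rough illustration of this dual-stream design, the following PyTorch sketch pairs a global stream over full frames with a local stream over hand crops and concatenates their features. The layer sizes, module names, and the simplified C3D-style blocks are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Simple3DStream(nn.Module):
    """One C3D-style stream: stacked 3D convolutions over a clip of frames."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 2, 2)),
            nn.AdaptiveAvgPool3d(1),            # collapse time and space
        )
        self.proj = nn.Linear(128, out_dim)

    def forward(self, clip):                    # clip: (B, 3, T, H, W)
        x = self.features(clip).flatten(1)      # (B, 128)
        return self.proj(x)                     # (B, out_dim)

class TwoStream3DCNN(nn.Module):
    """Global stream sees the full frame; local stream sees cropped hand
    regions. Their features are concatenated into one clip representation."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.global_stream = Simple3DStream(out_dim)
        self.local_stream = Simple3DStream(out_dim)

    def forward(self, full_frames, hand_crops):
        g = self.global_stream(full_frames)     # overall upper-body motion
        l = self.local_stream(hand_crops)       # fine-grained hand gestures
        return torch.cat([g, l], dim=1)         # (B, 2*out_dim) per clip
```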
The Latent Space module plays a vital role in bridging the semantic gap between video and the textual descriptions of sign language, improving the robustness of the recognition model while removing the need for labor-intensive, frame-wise annotations. Within this shared space, the temporal structure of signing activity is aligned with the corresponding sentence, using techniques such as Dynamic Time Warping (DTW), so that the relevance between a video and its sentence can be measured directly in the latent space itself.
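A minimal sketch of such a joint embedding is shown below: clip-level video features and sentence word embeddings are projected into one latent space, where a simple cosine-style score stands in for the relevance measure. The dimensions, projection layers, and scoring function are illustrative assumptions and may differ from the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointLatentSpace(nn.Module):
    """Illustrative joint embedding of video clips and sentence words."""
    def __init__(self, video_dim=1024, word_dim=300, latent_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, latent_dim)
        self.text_proj = nn.Linear(word_dim, latent_dim)

    def forward(self, clip_feats, word_embs):
        # clip_feats: (num_clips, video_dim); word_embs: (num_words, word_dim)
        v = F.normalize(self.video_proj(clip_feats), dim=1)
        t = F.normalize(self.text_proj(word_embs), dim=1)
        # mean-pool each modality and score sentence-level relevance
        relevance = (v.mean(0) * t.mean(0)).sum()   # cosine-style score
        return v, t, relevance
```

In practice, a margin-based ranking objective over such a score could pull matching video/sentence pairs together and push mismatched pairs apart, which is what makes sentence-level supervision sufficient in place of frame-wise labels.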
The Hierarchical Attention Network builds upon Long Short-Term Memory (LSTM) models with an added attention mechanism to generate the corresponding sign language sentence word by word, without relying on predefined segmentation. At each decoding step, attention weights the encoded video sequence so the network focuses on the clips most relevant to the next word, allowing continuous sign language data to be processed directly.
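The following sketch shows the core idea with a single-level attention decoder over clip features; the actual HAN adds a hierarchy over frames, clips, and latent-space words, and its dimensions and layer choices here are assumed for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoder(nn.Module):
    """Sketch of an attention-based LSTM decoder: at each step it weights the
    encoder's clip-level states and emits the next word, so no temporal
    segmentation of the video is required."""
    def __init__(self, feat_dim=512, hidden=512, vocab_size=1000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.cell = nn.LSTMCell(hidden + hidden, hidden)
        self.attn = nn.Linear(hidden + hidden, 1)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, clip_feats, target_words):
        # clip_feats: (B, T, feat_dim); target_words: (B, L) token ids
        enc, _ = self.encoder(clip_feats)                  # (B, T, hidden)
        B, T, H = enc.shape
        h = enc.new_zeros(B, H); c = enc.new_zeros(B, H)
        logits = []
        for t in range(target_words.size(1)):
            # score each encoder state against the current decoder state
            scores = self.attn(torch.cat(
                [enc, h.unsqueeze(1).expand(-1, T, -1)], dim=2))
            alpha = F.softmax(scores, dim=1)               # (B, T, 1)
            context = (alpha * enc).sum(dim=1)             # (B, hidden)
            w = self.embed(target_words[:, t])             # teacher forcing
            h, c = self.cell(torch.cat([w, context], dim=1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                  # (B, L, vocab_size)
```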
Experimental Validation
The efficacy of the proposed LS-HAN framework is evaluated on two large-scale datasets: a Chinese Sign Language (CSL) dataset and the RWTH-PHOENIX-Weather dataset for German sign language. Compared with established methods such as LSTM, S2VT, and several Conditional Random Fields-based models, LS-HAN demonstrates a noteworthy improvement. The sentence accuracy achieved, particularly with the paper's latent-space alignment variant (c), marks a substantial advance in continuous SLR by avoiding segmentation-related inaccuracies.
Implications and Future Directions
The success of LS-HAN in SLR tasks has broad implications for practical applications in real-time sign language translation systems, particularly in scenarios where quick and robust interpretation of sign languages is required. From a theoretical standpoint, the novel use of hierarchical attention within a latent space offers promising avenues for further exploration in other temporal sequence recognition tasks. Potential developments could include scaling the approach to handle more complex sentential structures and integrating more diverse linguistic datasets to extend its applicability across sign languages.
In conclusion, this paper presents a forward-thinking approach to continuous sign language recognition that combines hierarchical attention with a shared latent space, achieving improved accuracy and efficiency by freeing models from the constraints of temporal segmentation. These advances mark a significant step toward deploying SLR technologies in real-world, multimodal communication environments.