Real-time Hand Gesture Detection and Classification Using Convolutional Neural Networks
The paper proposes an approach to real-time hand gesture recognition that addresses the challenges inherent in continuous video streams. Recognizing dynamic hand gestures is difficult because gesture boundaries are not explicit, each gesture should trigger only a single activation, and memory and power budgets are tight. This research introduces a hierarchical architecture utilizing Convolutional Neural Networks (CNNs) to effectively tackle these challenges, emphasizing both efficiency and high performance.
The architecture consists of two integral components: a gesture detector and a gesture classifier. The detector employs a lightweight CNN model designed to operate with high efficiency, ensuring minimal resource consumption. It acts as a switch, activating the deeper and more resource-intensive classifier only when a gesture is detected in the video stream. This layered setup allows the architecture to remain resource-efficient while still providing robust gesture recognition capability.
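The detector-as-switch idea can be sketched in a few lines of Python. The sliding-window length, threshold, and the stub `detector`/`classifier` callables below are illustrative placeholders, not the paper's actual models or settings:

```python
from collections import deque

def run_pipeline(frames, detector, classifier, window=8, threshold=0.5):
    """Feed a sliding window of frames to the cheap detector; invoke the
    expensive classifier only while the detector signals a gesture."""
    buffer = deque(maxlen=window)
    predictions = []
    for frame in frames:
        buffer.append(frame)
        if len(buffer) < window:
            continue  # wait until the first full clip is available
        clip = list(buffer)
        if detector(clip) >= threshold:        # lightweight gate
            predictions.append(classifier(clip))  # heavy model runs only here
    return predictions

# Stub models standing in for the two CNNs.
detector = lambda clip: 1.0 if any(f == "gesture" for f in clip) else 0.0
classifier = lambda clip: "swipe_left"

frames = ["idle"] * 10 + ["gesture"] * 8 + ["idle"] * 10
preds = run_pipeline(frames, detector, classifier)
```

The design choice this illustrates is that the classifier's cost is only paid on the (typically small) fraction of frames where a gesture is actually present.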
One of the salient aspects of the paper is the use of Levenshtein distance as an evaluation metric. This metric provides a nuanced approach to measurement, simultaneously accounting for misclassifications, multiple detections, and missed detections—a significant improvement over traditional accuracy metrics which often overlook the practical challenges of real-time applications.
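As a concrete illustration, the metric is the standard dynamic-programming edit distance applied to sequences of gesture labels rather than characters: insertions capture spurious or multiple detections, deletions capture misses, and substitutions capture misclassifications. The gesture names below are made up for the example:

```python
def levenshtein(pred, truth):
    """Edit distance between a predicted gesture sequence and the
    ground-truth sequence, using a single rolling DP row."""
    m, n = len(pred), len(truth)
    dp = list(range(n + 1))  # distance from empty prediction
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # deletion (spurious detection)
                        dp[j - 1] + 1,  # insertion (missed gesture)
                        prev + (pred[i - 1] != truth[j - 1]))  # substitution
            prev = cur
    return dp[n]

truth = ["swipe_left", "zoom_in", "swipe_right"]
pred  = ["swipe_left", "swipe_left", "zoom_in", "swipe_right"]  # one double detection
print(levenshtein(pred, truth))  # → 1
```

A normalized variant (distance divided by ground-truth length) turns this into an accuracy-style score in which a single duplicate detection above costs 1/3.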
The experiments conducted using the EgoGesture and NVIDIA Dynamic Hand Gesture Datasets demonstrate the architecture’s proficiency in terms of both detection and classification. The proposed system achieves state-of-the-art performance with 94.04% and 83.82% classification accuracy for the depth modality on the respective datasets using ResNeXt-101 as the classifier. These results are particularly noteworthy given the constraints of real-time operation.
The architecture also addresses early gesture detection, a critical requirement for interactive systems, by integrating a confidence measure that lets the system predict a gesture as early as its nucleus phase (the core movement, after the preparatory motion). By applying weighted averaging to class scores, the system suppresses the ambiguity typical of a gesture's initial frames, enabling a single, timely activation per gesture.
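One way to sketch this single-time-activation logic: accumulate per-clip class scores with weights that increase over time (so early, ambiguous clips count less), and fire once the weighted average of the top class clears a confidence threshold. The linear weight ramp and threshold below are illustrative choices; the paper's exact weighting function differs:

```python
def early_decision(score_stream, n_classes, threshold=0.85):
    """Return (class_index, clip_at_which_we_fired) for a single early
    activation, or (None, n_clips) if confidence never clears the bar."""
    weighted = [0.0] * n_classes
    total = 0.0
    for t, scores in enumerate(score_stream, start=1):
        w = float(t)  # later clips get more weight (illustrative ramp)
        for c in range(n_classes):
            weighted[c] += w * scores[c]
        total += w
        mean = [s / total for s in weighted]
        best = max(range(n_classes), key=mean.__getitem__)
        if mean[best] >= threshold:
            return best, t  # fire once, before the gesture finishes
    return None, len(score_stream)

# Class 1 gradually dominates as the gesture unfolds.
stream = [[0.5, 0.5], [0.4, 0.6], [0.1, 0.9], [0.05, 0.95], [0.02, 0.98]]
cls, fired_at = early_decision(stream, n_classes=2)
```

Because the running average must clear the threshold only once, the mechanism naturally yields the single activation per gesture that the paper requires.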
From a theoretical perspective, this research contributes significantly to the field of real-time gesture recognition, particularly in the context of human-machine interaction. The two-model hierarchical system exemplifies a scalable solution that can be adapted to various recognition tasks. Practically, it holds potential for numerous applications where gesture recognition is pivotal, such as augmented reality and gesture-based control interfaces.
Future developments in this domain might explore further optimization of the detector and classifier, focusing on reducing latency without compromising accuracy. Additionally, extending this model to accommodate other modalities, such as audio-visual signals, could broaden its applicability. The research opens avenues for incorporating advanced methods like attention mechanisms to further enhance the robustness of gesture recognition systems.
The work is notable for its grounded, empirical methodology, avoiding hyperbolic claims and providing a pragmatic solution to real-time gesture recognition challenges. The availability of the implementation code further facilitates replication and follow-up work by researchers in the field. As AI continues to evolve, this hierarchical approach offers a blueprint for future developments in efficient real-time hand gesture recognition systems.