
MediaPipe Hands: On-device Real-time Hand Tracking (2006.10214v1)

Published 18 Jun 2020 in cs.CV

Abstract: We present a real-time on-device hand tracking pipeline that predicts hand skeleton from single RGB camera for AR/VR applications. The pipeline consists of two models: 1) a palm detector, 2) a hand landmark model. It's implemented via MediaPipe, a framework for building cross-platform ML solutions. The proposed model and pipeline architecture demonstrate real-time inference speed on mobile GPUs and high prediction quality. MediaPipe Hands is open sourced at https://mediapipe.dev.

Citations (657)

Summary

  • The paper presents a two-stage model combining a palm detector and hand landmark regressor for efficient real-time tracking on mobile devices.
  • It trains on a mix of real-world, synthetic, and in-house gesture datasets; the combined data yields the lowest error, 13.4% MSE normalized by palm size, and reduces frame-to-frame jitter.
  • The architecture supports AR/VR integration with GPU acceleration and modular design, enabling dynamic gesture recognition and augmented reality effects.

MediaPipe Hands: On-Device Real-Time Hand Tracking

This paper presents MediaPipe Hands, a pipeline for real-time hand tracking from a single RGB camera, primarily targeting AR/VR applications. The ML-based architecture runs efficiently on mobile GPUs and requires no specialized hardware, making it practical across a wide range of consumer devices.

Architecture Overview

The proposed system comprises a two-part model architecture:

  1. Palm Detector: This model processes the entire image and produces a bounding box around the palm. Detecting palms rather than full hands simplifies the problem, since the palm is a smaller, more rigid target than the highly articulated hand.
  2. Hand Landmark Model: Given the detected region, this model regresses precise 2.5D hand landmarks (21 keypoints with image coordinates plus relative depth) within the bounding box, remaining robust even under occlusions or partial visibility.

Notably, the architecture optimizes real-time processing by minimizing repeated computations and only re-engaging the palm detector when necessary, specifically if tracking confidence drops.
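To make that control flow concrete, below is a minimal Python sketch of the detection-gated loop, assuming a confidence-based handoff between the two models. Here `palm_detector`, `landmark_model`, `bounding_box_from`, and the threshold value are hypothetical stand-ins for the pipeline's components, not MediaPipe's actual APIs.

    TRACKING_THRESHOLD = 0.5  # assumed confidence cutoff, not a value from the paper

    def track_hands(frames, palm_detector, landmark_model):
        roi = None  # hand region carried over from the previous frame
        for frame in frames:
            if roi is None:
                # Run the comparatively expensive palm detector only when
                # no usable region survived from the previous frame.
                roi = palm_detector.detect(frame)
                if roi is None:
                    yield None  # no hand in view
                    continue
            landmarks, confidence = landmark_model.predict(frame, roi)
            if confidence < TRACKING_THRESHOLD:
                roi = None  # tracking lost; re-detect on the next frame
                yield None
            else:
                # Derive the next frame's region from the current landmarks,
                # so the detector stays idle while tracking succeeds.
                roi = bounding_box_from(landmarks)  # hypothetical helper
                yield landmarks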

Data and Training

A combination of real-world and synthetic datasets was employed to train the pipeline:

  • In-the-Wild Dataset: Comprises diverse real-world images, providing varied hand appearances and environments.
  • Gesture Dataset: Collected in-house, focusing on a wider range of hand gestures but limited in participant diversity.
  • Synthetic Dataset: Utilizes high-quality 3D models to cover exhaustive poses and depth supervision, thus enhancing the robustness of the landmark predictions.

The mixture of datasets aids in improving prediction accuracy and minimizes frame-to-frame jitter, a common challenge in real-time tracking systems.

Results

The paper's dataset comparison shows the combined dataset performing best, reaching 13.4% MSE normalized by palm size, lower than training on real-world or synthetic data alone. Furthermore, the paper reports that the variant it calls the "Full" model strikes a balance between speed and precision, particularly on mobile hardware such as the Pixel 3 and iPhone 11.

Implementation and Applications

MediaPipe's modular architecture facilitates the development of cross-platform applications via a directed graph of components, optimized for GPU acceleration. This allows for fluid integration into existing systems, notably in applications like gesture recognition and augmented reality effects.
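As an illustration of how the open-sourced solution is typically consumed, here is a short example using the `mediapipe` Python package's legacy solution API. The API surface has evolved since the paper, so treat this as a sketch rather than the canonical interface; the confidence thresholds shown are the API defaults, not values prescribed by the paper.

    import cv2
    import mediapipe as mp

    mp_hands = mp.solutions.hands
    mp_drawing = mp.solutions.drawing_utils

    cap = cv2.VideoCapture(0)  # default webcam
    with mp_hands.Hands(
        static_image_mode=False,      # video mode: the palm detector runs only as needed
        max_num_hands=2,
        min_detection_confidence=0.5,
        min_tracking_confidence=0.5,  # below this, the palm detector re-runs
    ) as hands:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV delivers BGR.
            results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.multi_hand_landmarks:
                for hand_landmarks in results.multi_hand_landmarks:
                    mp_drawing.draw_landmarks(
                        frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)
            cv2.imshow("MediaPipe Hands", frame)
            if cv2.waitKey(1) & 0xFF == 27:  # Esc to quit
                break
    cap.release()
    cv2.destroyAllWindows()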

By estimating finger states and mapping them to predefined gestures, the system effectively identifies static gestures. Additionally, by tracking landmark sequences, dynamic gesture prediction becomes feasible. The versatility of the system is further highlighted by its ability to render AR effects, demonstrated with an example of hand skeletons styled in neon light.
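As a concrete, if simplified, illustration of finger-state estimation, the sketch below labels a finger "extended" when its tip landmark sits above its PIP joint in normalized image coordinates. The landmark indices follow MediaPipe's standard 21-point hand topology; the heuristic itself is an assumption for an upright, camera-facing hand, not the paper's exact rule, and the gesture mapping is an invented example.

    # Indices into MediaPipe's 21-point hand landmark list (thumb omitted,
    # since its extension is not well captured by a tip-above-PIP test).
    FINGER_TIPS = {"index": 8, "middle": 12, "ring": 16, "pinky": 20}
    FINGER_PIPS = {"index": 6, "middle": 10, "ring": 14, "pinky": 18}

    def finger_states(landmarks):
        """Return which non-thumb fingers look extended (tip above PIP joint).

        Assumes an upright hand: normalized y decreases toward the image top.
        """
        return {
            name: landmarks[FINGER_TIPS[name]].y < landmarks[FINGER_PIPS[name]].y
            for name in FINGER_TIPS
        }

    def classify(states):
        """Map finger states to a couple of example static gestures."""
        if all(states.values()):
            return "open palm"
        if states["index"] and states["middle"] and not states["ring"] and not states["pinky"]:
            return "victory"
        return None

Used together with the previous example, this would be called as `classify(finger_states(results.multi_hand_landmarks[0].landmark))` once a hand has been detected.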

Implications and Future Directions

MediaPipe Hands is significant for its scalable approach to hand tracking across various consumer devices without additional hardware requirements. This expands potential applications in interactive AR/VR experiences and gesture-controlled interfaces. Its open-source nature encourages further innovation and adaptation within the research community.

Moving forward, refinements in model efficiency could enhance accuracy and latency, perhaps through adaptive learning techniques or advanced neural architectures. Additionally, extending tracking capabilities to include more complex gestures or multi-hand interactions could unlock new possibilities in human-computer interaction domains.

In summary, MediaPipe Hands offers a compelling solution for real-time hand tracking, representing a notable advancement in on-device computer vision applications with broad practical implications. The paper elucidates a scalable methodology, supported by robust empirical results, which is likely to catalyze further advancements in the field.