3D Interacting Hand Pose Estimation via Hand De-occlusion and Removal
The paper introduces Hand De-occlusion and Removal (HDR), a framework for estimating the 3D poses of two interacting hands from a single RGB image. The problem matters for applications such as human-computer interaction and augmented reality, but it is difficult for two main reasons: (1) heavy occlusion during hand interactions and (2) ambiguity caused by the similar appearance of the left and right hands. Previous methods predominantly predict the 3D poses of both hands jointly, which makes occlusion and left-right ambiguity hard to handle.
HDR Framework
The authors decompose two-hand pose estimation into two single-hand problems, estimating each hand separately and thereby leveraging recent advances in single-hand pose estimation. The HDR framework comprises three components: the Hand Amodal Segmentation Module (HASM), the Hand De-occlusion and Removal Module (HDRM), and a Single Hand Pose Estimator (SHPE); a minimal sketch of the resulting pipeline follows the list below.
- HASM predicts both amodal and visible segmentation masks for each hand. The amodal mask covers a hand's full extent including occluded parts, while the visible mask covers only what is seen; together they localize the occluded regions and the distracting hand for the subsequent de-occlusion and removal steps.
- HDRM addresses occlusion and appearance confusion directly: it recovers (inpaints) the appearance of the occluded parts of the target hand and removes the distracting hand, producing a simplified single-hand image for the pose estimator.
- SHPE is an off-the-shelf single-hand pose estimator that operates on the simplified input, so it no longer has to cope with occlusion or left-right ambiguity.
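To make the data flow concrete, here is a minimal, hypothetical sketch of the HDR pipeline in NumPy. The function names (`hasm`, `hdrm`, `shpe`), the mask layout, and the mean-colour "inpainting" are illustrative assumptions, not the authors' implementation, which uses learned networks for each stage.

```python
# Minimal sketch of the HDR data flow (hypothetical interfaces, not the authors' code).
import numpy as np

def hasm(image):
    """Hypothetical amodal segmentation: amodal and visible masks per hand
    (random placeholders here; the real HASM is a learned segmentation network)."""
    h, w, _ = image.shape
    return {k: (np.random.rand(h, w) > 0.5).astype(np.float32)
            for k in ("right_amodal", "right_visible",
                      "left_amodal", "left_visible")}

def hdrm(image, masks, target="right"):
    """Hypothetical de-occlusion + removal for one target hand:
    1) fill in the occluded target-hand region (amodal minus visible),
    2) erase the other (distracting) hand so only the target remains."""
    other = "left" if target == "right" else "right"
    occluded = np.clip(masks[f"{target}_amodal"] - masks[f"{target}_visible"], 0, 1)
    distractor = masks[f"{other}_visible"]
    out = image.astype(np.float32).copy()
    # Placeholder "inpainting": fill both regions with the mean colour;
    # the real HDRM uses a learned image-recovery network.
    fill = out.reshape(-1, 3).mean(axis=0)
    out[(occluded + distractor) > 0] = fill
    return out.astype(np.uint8)

def shpe(single_hand_image):
    """Hypothetical single-hand pose estimator: returns 21 3D joints."""
    return np.zeros((21, 3), dtype=np.float32)

# End-to-end flow for one interacting-hand crop.
img = np.zeros((256, 256, 3), dtype=np.uint8)
masks = hasm(img)
right_joints = shpe(hdrm(img, masks, target="right"))
left_joints = shpe(hdrm(img, masks, target="left"))
```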
Quantitative evaluations show that HDR significantly outperforms state-of-the-art two-hand methods on the InterHand2.6M dataset, reducing the mean per-joint position error (MPJPE) by a clear margin, with the largest gains in challenging, heavily interacting scenarios.
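For context, MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joint positions, usually reported in millimetres after root-joint alignment; the exact alignment protocol varies between papers, so the snippet below is only a minimal NumPy illustration of the metric.

```python
import numpy as np

def mpjpe(pred, gt, root_idx=0):
    """Mean per-joint position error in the units of the inputs (e.g. mm),
    after subtracting the root joint from both predictions and ground truth."""
    pred = pred - pred[:, root_idx:root_idx + 1, :]   # (N, J, 3)
    gt = gt - gt[:, root_idx:root_idx + 1, :]
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Example: 8 samples, 21 hand joints, 3D coordinates in mm.
pred = np.random.randn(8, 21, 3) * 10
gt = np.random.randn(8, 21, 3) * 10
print(mpjpe(pred, gt))
```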
Creation and Utilization of Amodal InterHand Dataset (AIH)
The paper also introduces the Amodal InterHand Dataset (AIH), which is essential for training the proposed segmentation and de-occlusion modules. AIH is synthetically generated and comprises two sub-datasets: AIH_Syn and AIH_Render.
- AIH_Syn composites real hand crops via copy-and-paste, so the imagery looks realistic, although the resulting two-hand configurations can be physically implausible (see the compositing sketch below).
- AIH_Render instead renders hand meshes, which yields higher fidelity in the physical interactions but introduces an appearance gap due to the synthetic textures.
The two sub-datasets complement each other, giving broad coverage of hand poses and interaction patterns and supporting robust training of the proposed modules.
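The sketch below illustrates the copy-and-paste idea behind AIH_Syn under assumed data formats (single-hand image crops with binary masks). It is not the dataset-generation code, which additionally handles keypoint transfer, blending, and realistic placement of the pasted hand.

```python
# Illustrative copy-and-paste compositing in the spirit of AIH_Syn (assumed formats).
import numpy as np

def composite_hands(right_img, right_mask, left_img, left_mask, shift=(0, 0)):
    """Paste a left-hand crop over a right-hand crop.
    The pasted region occludes the right hand, so the right hand's visible
    mask shrinks while its amodal mask keeps the full, pre-occlusion extent."""
    canvas = right_img.copy()
    dy, dx = shift
    shifted_left = np.roll(np.roll(left_img, dy, axis=0), dx, axis=1)
    shifted_mask = np.roll(np.roll(left_mask, dy, axis=0), dx, axis=1)
    canvas[shifted_mask > 0] = shifted_left[shifted_mask > 0]
    right_amodal = right_mask                        # full extent, pre-occlusion
    right_visible = right_mask * (1 - shifted_mask)  # what remains visible
    return canvas, right_amodal, right_visible, shifted_mask

# Toy example with flat grey 256x256 crops and circular masks.
h = w = 256
yy, xx = np.mgrid[:h, :w]
right_mask = ((yy - 128) ** 2 + (xx - 100) ** 2 < 60 ** 2).astype(np.uint8)
left_mask = ((yy - 128) ** 2 + (xx - 156) ** 2 < 60 ** 2).astype(np.uint8)
img = np.full((h, w, 3), 127, dtype=np.uint8)
composite, amodal, visible, distractor = composite_hands(img, right_mask, img, left_mask, shift=(0, 20))
```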
Implications and Future Directions
The HDR framework and the AIH dataset together advance hand pose estimation under interaction and occlusion, and they demonstrate a practical use of amodal perception in 3D vision tasks. Future work could improve the quality of the recovered images and swap in stronger segmentation, de-occlusion, or pose-estimation modules as building blocks of HDR, opening the door to broader applications involving complex human gestures. The approach stands to influence human-computer interaction, robotics, and immersive AR/VR experiences. The remaining challenge is generalization: refining these models so they stay accurate and robust across diverse, unconstrained real-world environments.