Bringing Inputs to Shared Domains for 3D Interacting Hands Recovery in the Wild

Published 23 Mar 2023 in cs.CV | (2303.13652v2)

Abstract: Despite recent achievements, existing 3D interacting hands recovery methods have shown results mainly on motion capture (MoCap) environments, not on in-the-wild (ITW) ones. This is because collecting 3D interacting hands data in the wild is extremely challenging, even for the 2D data. We present InterWild, which brings MoCap and ITW samples to shared domains for robust 3D interacting hands recovery in the wild with a limited amount of ITW 2D/3D interacting hands data. 3D interacting hands recovery consists of two sub-problems: 1) 3D recovery of each hand and 2) 3D relative translation recovery between two hands. For the first sub-problem, we bring MoCap and ITW samples to a shared 2D scale space. Although ITW datasets provide a limited amount of 2D/3D interacting hands, they contain large-scale 2D single hand data. Motivated by this, we use a single hand image as an input for the first sub-problem regardless of whether two hands are interacting. Hence, interacting hands of MoCap datasets are brought to the 2D scale space of single hands of ITW datasets. For the second sub-problem, we bring MoCap and ITW samples to a shared appearance-invariant space. Unlike the first sub-problem, 2D labels of ITW datasets are not helpful for the second sub-problem due to the 3D translation's ambiguity. Hence, instead of relying on ITW samples, we amplify the generalizability of MoCap samples by taking only a geometric feature without an image as an input for the second sub-problem. As the geometric feature is invariant to appearances, MoCap and ITW samples do not suffer from a huge appearance gap between the two datasets. The code is publicly available at https://github.com/facebookresearch/InterWild.


Authors (1)

Gyeongsik Moon

Citations (23)


Summary

The paper introduces the InterWild framework, which harmonizes MoCap and in-the-wild inputs by mapping them to a shared domain for robust 3D hand recovery.
It employs a shared 2D scale space and geometric feature extraction to normalize data and reliably estimate individual 3D hand meshes and relative translations.
Experimental results show that InterWild outperforms state-of-the-art methods on the HIC and InterHand2.6M (IH2.6M) datasets, improving MPJPE, MPVPE, and MRRPE.

Overview of "Bringing Inputs to Shared Domains for 3D Interacting Hands Recovery in the Wild"

The paper "Bringing Inputs to Shared Domains for 3D Interacting Hands Recovery in the Wild" addresses the problem of estimating 3D meshes of interacting hands in real-world, unconstrained environments. Models trained on motion capture (MoCap) datasets tend to degrade when deployed in the wild (ITW) because of a large domain gap, particularly in appearance, and collecting ITW interacting-hands data is extremely difficult, even for 2D labels. The paper introduces InterWild, a framework that brings MoCap and ITW samples to shared domains, enabling robust 3D interacting hands recovery with only a limited amount of ITW 2D/3D interacting-hands data.

Key Contributions

InterWild Framework: The framework splits 3D interacting hands recovery into two sub-problems: 3D recovery of each hand, and recovery of the 3D relative translation between the two hands. For each sub-problem, InterWild brings MoCap and ITW inputs to a shared domain.
Shared 2D Scale Space: To recover each hand's 3D geometry, the method always takes a single-hand image as input, regardless of whether the two hands are interacting. Interacting hands from MoCap datasets are thereby brought to the 2D scale space of single hands in ITW datasets. This lets training exploit the large-scale 2D single-hand data that ITW datasets provide, even though their interacting-hands data is limited.
Geometric Feature Utilization: To estimate the 3D relative translation, the method takes only a geometric feature as input, without the image. Because the geometric feature is invariant to appearance, MoCap and ITW samples share an appearance-invariant space and the appearance gap between the datasets no longer matters. This sub-problem does not rely on ITW samples, since their 2D labels are of little help given the ambiguity of 3D translation.
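To make the shared 2D scale space concrete, the sketch below maps 2D keypoints of one hand into a fixed-size crop defined by that hand's bounding box. It is only an illustration under stated assumptions: the function name `to_crop_coords`, the `(x_min, y_min, width, height)` box layout, the 256-pixel crop size, and the non-uniform stretch are all hypothetical and not taken from the paper or its code.

```python
import numpy as np

def to_crop_coords(keypoints_xy, bbox, out_size=256):
    """Map image-space 2D keypoints into a fixed-size single-hand crop.

    bbox is (x_min, y_min, width, height); this layout and out_size are
    assumptions made for this sketch.
    """
    x_min, y_min, width, height = bbox
    kp = np.asarray(keypoints_xy, dtype=np.float64)
    scale = np.array([out_size / width, out_size / height])
    return (kp - np.array([x_min, y_min])) * scale

# The same hand pose seen at two different image scales and positions.
small_hand = np.array([[10.0, 10.0], [20.0, 30.0], [30.0, 50.0]])
large_hand = small_hand * 4.0 + 100.0

small_crop = to_crop_coords(small_hand, bbox=(10, 10, 20, 40))
large_crop = to_crop_coords(large_hand, bbox=(140, 140, 80, 160))
```

After cropping, both hands occupy the same coordinates in the crop, which is the sense in which a hand cut out of a MoCap interaction scene and a single hand from an ITW image end up at a common 2D scale.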

Methodology and Results

The InterWild model is composed of three components: DetectNet, SHNet, and TransNet. DetectNet identifies the hand regions, SHNet predicts the 3D mesh and 2.5D pose of each hand, and TransNet estimates the 3D relative translation between the two hands from geometric features.
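The data flow between the three components can be sketched as follows. This is a toy outline, not the released implementation: the three functions are stand-ins that return placeholder values, and the 21-joint count, the box layout, and the way the geometric feature is assembled are assumptions made for illustration.

```python
import numpy as np

NUM_JOINTS = 21  # assumption: 21 joints per hand

def detect_net(image):
    """Stand-in for DetectNet: one (x, y, w, h) box per hand (placeholder boxes)."""
    h, w = image.shape[:2]
    return {"right": (0, 0, w // 2, h), "left": (w // 2, 0, w // 2, h)}

def sh_net(hand_crop):
    """Stand-in for SHNet: 2.5D joints for one hand crop (placeholder zeros)."""
    return np.zeros((NUM_JOINTS, 3))

def trans_net(geometric_feature):
    """Stand-in for TransNet: receives geometry only, never pixels."""
    return np.zeros(3)

def recover(image):
    boxes = detect_net(image)
    joints = {}
    for side, (x, y, w, h) in boxes.items():
        crop = image[y:y + h, x:x + w]   # each hand is processed on its own
        joints[side] = sh_net(crop)
    # Hypothetical geometric feature: both hands' 2.5D joints plus their boxes.
    feature = np.concatenate([
        joints["right"].ravel(),
        joints["left"].ravel(),
        np.asarray(boxes["right"], dtype=float),
        np.asarray(boxes["left"], dtype=float),
    ])
    return joints, trans_net(feature)

joints, rel_trans = recover(np.zeros((64, 128, 3)))
```

The point of the structure is that the image reaches only the detection and single-hand stages, while the translation stage sees a vector that carries no appearance information.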

Experimental Evaluation: The authors evaluate InterWild both quantitatively and qualitatively. On the Hands in Action (HIC) dataset, the framework improves Mean Per-Joint Position Error (MPJPE), Mean Per-Vertex Position Error (MPVPE), and Mean Relative-Root Position Error (MRRPE) over previous methods that do not bring inputs to shared domains.
Quantitative Validation: On the HIC and InterHand2.6M (IH2.6M) datasets, InterWild outperforms existing state-of-the-art methods, with the improvements most visible on ITW data, which indicates better generalization beyond MoCap environments.
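For readers unfamiliar with these metrics, here is a minimal sketch of how MPJPE and MRRPE are commonly computed for a single sample. The root-joint index, the alignment convention, and the function names are assumptions; the paper's evaluation code may differ in details. MPVPE follows the same formula as MPJPE with mesh vertices in place of joints.

```python
import numpy as np

def mpjpe(pred, gt, root_index=0):
    """Mean per-joint Euclidean error after aligning both poses at the root joint."""
    pred = pred - pred[root_index]
    gt = gt - gt[root_index]
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def mrrpe(pred_right_root, pred_left_root, gt_right_root, gt_left_root):
    """Error of the left root position expressed relative to the right root."""
    pred_rel = pred_left_root - pred_right_root
    gt_rel = gt_left_root - gt_right_root
    return float(np.linalg.norm(pred_rel - gt_rel))

# Toy two-joint example (units are arbitrary, e.g. millimetres).
gt_joints = np.array([[0.0, 0.0, 0.0], [0.0, 30.0, 0.0]])
pred_joints = np.array([[5.0, 5.0, 5.0], [5.0, 35.0, 4.0]])
joint_error = mpjpe(pred_joints, gt_joints)  # 0.5

root_error = mrrpe(
    pred_right_root=np.array([0.0, 0.0, 0.0]),
    pred_left_root=np.array([103.0, 4.0, 0.0]),
    gt_right_root=np.array([0.0, 0.0, 0.0]),
    gt_left_root=np.array([100.0, 0.0, 0.0]),
)  # 5.0
```

MPJPE and MPVPE therefore measure the quality of each hand on its own, while MRRPE isolates how well the two hands are placed relative to each other, matching the two sub-problems described above.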

Practical and Theoretical Implications

Practical Impact: The work is relevant to virtual reality (VR) and augmented reality (AR) applications, where accurate modeling of hand interaction is crucial. Because the method narrows the gap between MoCap and ITW data, it can operate more reliably under the varied backgrounds and lighting conditions of real-world use.
Theoretical Insights: Bringing inputs to shared domains may inspire further research into domain adaptation for other 3D recovery tasks. Future directions could include integration with whole-body motion systems and further refinement of the geometric feature representation.

Conclusion

The presented approach is a meaningful step toward reliable 3D interacting hands recovery in unconstrained environments. By bringing inputs to a shared 2D scale space and a shared appearance-invariant space, InterWild improves robustness and accuracy under ITW conditions. The source code is publicly available at https://github.com/facebookresearch/InterWild, which supports further research on modeling complex hand interactions.
