Two Hands Are Better Than One: Resolving Hand to Hand Intersections via Occupancy Networks (2404.05414v1)
Abstract: 3D hand pose estimation from images has attracted considerable interest in the literature, with new methods steadily improving overall 3D accuracy. One open challenge is hand-to-hand interaction, where self-occlusion and finger articulation make estimation difficult. Little work has applied physical constraints that minimize the hand intersections arising from noisy estimates. This work addresses hand intersection by exploiting an occupancy network that represents the hand's volume as a continuous manifold, which allows us to model the probability that a point lies inside a hand. We design an intersection loss function that minimizes the likelihood of hand-to-point intersections. Moreover, we propose a new hand mesh parameterization that improves on the commonly used MANO model in several respects, including lower mesh complexity, underlying 3D skeleton extraction, and watertightness. On the benchmark InterHand2.6M dataset, models trained with our intersection loss outperform the state of the art, significantly decreasing the number of hand intersections while lowering the mean per-joint positional error. Additionally, we demonstrate superior performance for 3D hand uplift on the Re:InterHand and SMILE datasets and show reduced hand-to-hand intersections in complex domains such as sign-language pose estimation.