Mesh Represented Recycle Learning for 3D Hand Pose and Mesh Estimation
Abstract: Hand pose estimation generally aims to make model performance robust in real-world scenes. Robustness is difficult to improve, however, because existing datasets are captured in restricted environments so that 3D information can be annotated. Although neural networks achieve high estimation accuracy quantitatively, their outputs are often visually unsatisfactory. This discrepancy between quantitative results and visual quality remains an open issue in hand pose representation. To address it, we propose a mesh represented recycle learning strategy for 3D hand pose and mesh estimation, which reinforces the synthesized hand mesh representation during training. Specifically, a hand pose and mesh estimation model first predicts parametric 3D hand annotations (i.e., 3D keypoint positions and hand mesh vertices) from real-world hand images in the training phase. Second, synthetic hand images are generated from the self-estimated hand mesh representations. The synthetic hand images are then fed into the same model again. The proposed learning strategy thus improves quantitative results and visual quality simultaneously by reinforcing the synthetic mesh representation. To encourage consistency between the original model output and its recycled counterpart, we propose a self-correlation loss, which maximizes the accuracy and reliability of our learning strategy. Consequently, the model effectively performs self-refinement on hand pose estimation by learning the mesh representation from its own output. To demonstrate the effectiveness of our learning strategy, we provide extensive experiments on the FreiHAND dataset. Notably, our learning strategy improves performance on hand pose and mesh estimation without any extra computational burden during inference.
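The recycle loop described in the abstract (predict, render, predict again, compare) can be sketched roughly as follows. This is only an illustrative toy under stated assumptions, not the authors' implementation: `toy_model`, `toy_render`, and `consistency_loss` are hypothetical stand-ins, the joint and vertex counts are assumed, and a plain L1 distance is used in place of the self-correlation loss, whose exact form is not given in this text.

```python
import numpy as np

NUM_JOINTS = 21   # assumption: a common hand keypoint count
NUM_VERTS = 778   # assumption: a MANO-style mesh vertex count


def toy_model(image, weights):
    """Hypothetical stand-in for the estimator: image -> (joints, vertices)."""
    feat = image.reshape(-1) @ weights
    joints = feat[: NUM_JOINTS * 3].reshape(NUM_JOINTS, 3)
    verts = feat[NUM_JOINTS * 3:].reshape(NUM_VERTS, 3)
    return joints, verts


def toy_render(verts, size=16):
    """Hypothetical stand-in for a renderer: splat vertex x/y onto a small binary image."""
    img = np.zeros((size, size))
    xy = verts[:, :2]
    span = xy.max(axis=0) - xy.min(axis=0) + 1e-8
    idx = ((xy - xy.min(axis=0)) / span * (size - 1)).astype(int)
    img[idx[:, 1], idx[:, 0]] = 1.0
    return img


def consistency_loss(first, second):
    """Assumed L1 consistency between the first-pass and recycled outputs."""
    return sum(float(np.abs(a - b).mean()) for a, b in zip(first, second))


def recycle_step(image, weights):
    first = toy_model(image, weights)                      # 1) predict from the real image
    synthetic = toy_render(first[1], size=image.shape[0])  # 2) render the predicted mesh
    second = toy_model(synthetic, weights)                 # 3) feed it to the same model
    return consistency_loss(first, second), synthetic      # 4) compare both outputs


rng = np.random.default_rng(0)
image = rng.random((16, 16))
weights = rng.normal(scale=0.01, size=(16 * 16, (NUM_JOINTS + NUM_VERTS) * 3))
loss, synthetic = recycle_step(image, weights)
print(f"consistency loss: {loss:.4f}")
```

In an actual training setup this consistency term would presumably be added to the usual supervised losses. Because the second pass runs only during training, inference cost is unchanged, which matches the claim in the abstract.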