Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation (2403.04381v2)
Abstract: Accurate 3D hand pose estimation is a keystone for understanding human activity in egocentric vision. Most existing estimation methods still rely on single-view images as input, which leads to limitations such as a limited field of view and depth ambiguity. Adding another camera to better capture the shape of hands is a practical way to address these problems. However, existing multi-view hand pose estimation methods suffer from two main drawbacks: 1) they require multi-view annotations for training, which are expensive; 2) at test time, the model becomes inapplicable if the camera parameters or layout differ from those used in training. In this paper, we propose a novel Single-to-Dual-view adaptation (S2DHand) solution that adapts a pre-trained single-view estimator to dual views. Compared with existing multi-view training methods, 1) our adaptation process is unsupervised, eliminating the need for multi-view annotation, and 2) our method can handle arbitrary dual-view pairs with unknown camera parameters, making the model applicable to diverse camera settings. Specifically, S2DHand is built on two stereo constraints: pair-wise cross-view consensus and invariance of the transformation between both views. These two stereo constraints are used in a complementary manner to generate pseudo-labels, allowing reliable adaptation. Evaluation results show that S2DHand achieves significant improvements on arbitrary camera pairs under both in-dataset and cross-dataset settings and outperforms existing adaptation methods. Project page: https://github.com/MickeyLLG/S2DHand.
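To make the idea of cross-view consensus with unknown camera parameters more concrete, the following is a minimal sketch and not the S2DHand implementation. It assumes each view yields a 21-joint 3D prediction in its own camera frame, estimates the relative rotation between the two predictions with the Kabsch algorithm (a swapped-in, standard technique), and averages the aligned predictions to form pseudo-labels. The function names `kabsch_rotation` and `cross_view_pseudo_labels` and the equal-weight averaging are illustrative assumptions.

```python
import numpy as np


def kabsch_rotation(src, dst):
    """Rotation R (3x3) minimising sum_i ||R @ src_i - dst_i||^2 after centring.

    src, dst: arrays of shape (N, 3) holding corresponding 3D points.
    """
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    H = src_c.T @ dst_c                      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that det(R) = +1.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.diag([1.0, 1.0, d])
    return Vt.T @ D @ U.T


def cross_view_pseudo_labels(joints_a, joints_b):
    """Hypothetical consensus step: align both predictions, then average.

    joints_a, joints_b: (N, 3) joint predictions in camera frames A and B.
    Returns centred pseudo-labels for each view and the estimated rotation
    from frame A to frame B.
    """
    R = kabsch_rotation(joints_a, joints_b)
    a_c = joints_a - joints_a.mean(axis=0)
    b_c = joints_b - joints_b.mean(axis=0)
    a_in_b = a_c @ R.T                       # view-A prediction expressed in frame B
    b_in_a = b_c @ R                         # view-B prediction expressed in frame A
    pseudo_a = 0.5 * (a_c + b_in_a)          # assumed equal-weight consensus
    pseudo_b = 0.5 * (b_c + a_in_b)
    return pseudo_a, pseudo_b, R
```

Calling `cross_view_pseudo_labels(pred_a, pred_b)` on two per-view predictions returns targets that both views agree on without requiring calibrated extrinsics. In this sketch the translation is removed by centring, so only the relative rotation is estimated.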