Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation (2403.04381v2)

Published 7 Mar 2024 in cs.CV

Abstract: Accurate 3D hand pose estimation is key to understanding human activity in egocentric vision. Most existing estimation methods still rely on single-view images as input, which leads to limitations such as a limited field of view and depth ambiguity. Adding another camera to better capture the shape of the hands is a practical way to address these problems. However, existing multi-view hand pose estimation methods suffer from two main drawbacks: 1) they require multi-view annotations for training, which are expensive; 2) during testing, the model becomes inapplicable if the camera parameters or layout differ from those used in training. In this paper, we propose a novel Single-to-Dual-view adaptation (S2DHand) solution that adapts a pre-trained single-view estimator to dual views. Compared with existing multi-view training methods, 1) our adaptation process is unsupervised, eliminating the need for multi-view annotation; 2) our method can handle arbitrary dual-view pairs with unknown camera parameters, making the model applicable to diverse camera settings. Specifically, S2DHand is built on two stereo constraints: pair-wise cross-view consensus and invariance of the transformation between both views. These two stereo constraints are used in a complementary manner to generate pseudo-labels, allowing reliable adaptation. Evaluation results show that S2DHand achieves significant improvements on arbitrary camera pairs under both in-dataset and cross-dataset settings and outperforms existing adaptation methods. Project page: https://github.com/MickeyLLG/S2DHand.
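
The abstract does not say how the two stereo constraints are computed, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes each view's estimator outputs root-relative 3D joints (so the two views differ by a rotation), estimates that rotation with the standard Kabsch procedure, and forms a cross-view pseudo-label by averaging one view's prediction with the other view's prediction rotated into the same frame. The function names `estimate_rotation` and `consensus_pseudo_label`, and the simple averaging rule, are hypothetical choices made for illustration.

```python
import numpy as np


def estimate_rotation(src, dst):
    """Kabsch procedure: rotation R (3x3) minimizing sum ||R @ src_i - dst_i||^2.

    src, dst: (N, 3) arrays of corresponding 3D points.
    """
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    H = src_c.T @ dst_c                      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    D = np.diag([1.0, 1.0, d])
    return Vt.T @ D @ U.T


def consensus_pseudo_label(joints_a, joints_b, R_b_to_a):
    """Hypothetical consensus rule: average view A's joints with view B's
    joints rotated into view A's frame (assumes root-relative joints)."""
    b_in_a = joints_b @ R_b_to_a.T
    return 0.5 * (joints_a + b_in_a)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    joints_a = rng.normal(size=(21, 3))      # 21 joints is an assumed hand model size
    theta = 0.7
    c, s = np.cos(theta), np.sin(theta)
    R_a_to_b = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    joints_b = joints_a @ R_a_to_b.T         # same pose seen from view B
    R_b_to_a = estimate_rotation(joints_b, joints_a)
    pseudo = consensus_pseudo_label(joints_a, joints_b, R_b_to_a)
    print(np.abs(pseudo - joints_a).max())
```

In this noise-free toy case the recovered rotation is the inverse of the one used to build view B, so the pseudo-label coincides with view A's joints; with real, noisy predictions the two views would disagree and the average would sit between them.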
