Self-supervised Optimization of Hand Pose Estimation using Anatomical Features and Iterative Learning (2307.03007v1)
Abstract: Manual assembly workers face increasing complexity in their work. Human-centered assistance systems could help, but object recognition as an enabling technology hinders sophisticated human-centered design of these systems. At the same time, activity recognition based on hand poses suffers from poor pose estimation in complex usage scenarios, such as when workers wear gloves. This paper presents a self-supervised pipeline for adapting hand pose estimation to specific use cases with minimal human interaction, enabling cheap and robust hand pose-based activity recognition. The pipeline consists of a general machine learning model for hand pose estimation trained on a generalized dataset, spatial and temporal filtering to account for anatomical constraints of the hand, and a retraining step to improve the model. Different parameter combinations are evaluated on a publicly available, annotated dataset. The best combination of parameters and model is then applied to unlabelled videos from a manual assembly scenario. The effectiveness of the pipeline is demonstrated by training an activity recognition model as a downstream task in the manual assembly scenario.
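The filtering stage described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a MediaPipe-style 21-keypoint hand layout, a spatial check on bone-length proportions, and a temporal check on frame-to-frame displacement; the skeleton definition, reference ratios, and all thresholds are assumptions chosen for illustration.

```python
import numpy as np

# Assumed 21-keypoint hand skeleton (wrist + 4 joints per finger),
# following the common MediaPipe-style index layout.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4),        # thumb
         (0, 5), (5, 6), (6, 7), (7, 8),        # index
         (0, 9), (9, 10), (10, 11), (11, 12),   # middle
         (0, 13), (13, 14), (14, 15), (15, 16), # ring
         (0, 17), (17, 18), (18, 19), (19, 20)] # little

# Illustrative reference proportions: each bone's share of the total
# bone length. A real system would derive these from hand anatomy.
REF_RATIOS = np.full(len(BONES), 1.0 / len(BONES))

def bone_lengths(pose):
    """Euclidean length of every bone for a (21, 2) keypoint array."""
    return np.array([np.linalg.norm(pose[a] - pose[b]) for a, b in BONES])

def spatially_plausible(pose, tol=0.6):
    """Spatial filter: reject poses whose bone-length proportions
    deviate too far from the reference (collapsed/degenerate hands)."""
    lengths = bone_lengths(pose)
    total = lengths.sum()
    if total == 0:
        return False
    ratios = lengths / total
    return bool(np.all(np.abs(ratios - REF_RATIOS) < tol * REF_RATIOS))

def temporally_plausible(pose, prev_pose, max_disp=20.0):
    """Temporal filter: reject poses with implausibly large per-keypoint
    jumps between consecutive frames (jitter / identity switches)."""
    disp = np.linalg.norm(pose - prev_pose, axis=1)
    return bool(np.all(disp < max_disp))

def filter_frames(poses, max_disp=20.0):
    """Return indices of frames passing both checks; the surviving
    frames could serve as pseudo-labels for the retraining step."""
    kept, prev = [], None
    for i, pose in enumerate(poses):
        if not spatially_plausible(pose):
            continue
        if prev is not None and not temporally_plausible(pose, prev, max_disp):
            continue
        kept.append(i)
        prev = pose
    return kept
```

Frames rejected by either check are simply excluded from the pseudo-label set; only detections that are both anatomically proportioned and temporally smooth would feed the retraining step.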