In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition (2404.09308v2)
Abstract: Action recognition is essential for egocentric video understanding, allowing automatic and continuous monitoring of Activities of Daily Living (ADLs) without user effort. Existing literature focuses on 3D hand pose input, which requires computationally intensive depth estimation networks or an uncomfortable wearable depth sensor. In contrast, 2D hand pose has received little attention for egocentric action recognition, even though user-friendly smart glasses that capture a single RGB image are already on the market. Our study addresses this gap by exploring 2D hand pose estimation for egocentric action recognition and makes two contributions. First, we introduce two novel approaches for 2D hand pose estimation: EffHandNet for single-hand estimation, and EffHandEgoNet, which is tailored to the egocentric perspective and captures interactions between hands and objects. Both methods outperform state-of-the-art models on the H2O and FPHA public benchmarks. Second, we present a robust action recognition architecture that operates on 2D hand and object poses, combining EffHandEgoNet with a transformer-based action recognition method. Evaluated on the H2O and FPHA datasets, our architecture has a faster inference time and achieves accuracies of 91.32% and 94.43%, respectively, surpassing the state of the art, including 3D-based methods. Our work demonstrates that 2D skeletal data is a robust basis for egocentric action understanding. Extensive evaluation and ablation studies show the impact of the hand pose estimation approach and how each input affects overall performance.
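To make the idea of recognizing actions "from 2D hand and object poses" concrete, the following minimal sketch shows one way per-frame 2D poses could be packed into a sequence of feature vectors that a sequence model such as a transformer encoder could consume. It is not the authors' implementation. The 21-keypoint hand skeleton, the four-corner object representation, the normalization by image size, and the names `frame_to_token` and `clip_to_sequence` are all assumptions made for illustration; the abstract does not specify any of them.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Frame = Tuple[Sequence[Point], Sequence[Point], Sequence[Point]]

NUM_HAND_KEYPOINTS = 21  # assumption: a common 21-joint hand skeleton
NUM_OBJECT_POINTS = 4    # assumption: object pose given as 4 box corners


def frame_to_token(left: Sequence[Point], right: Sequence[Point],
                   obj: Sequence[Point], width: int, height: int) -> List[float]:
    """Flatten one frame's 2D poses into a single normalized feature vector."""
    if len(left) != NUM_HAND_KEYPOINTS or len(right) != NUM_HAND_KEYPOINTS:
        raise ValueError("each hand needs exactly NUM_HAND_KEYPOINTS points")
    if len(obj) != NUM_OBJECT_POINTS:
        raise ValueError("object needs exactly NUM_OBJECT_POINTS points")
    token: List[float] = []
    # Order is fixed (left hand, right hand, object) so every frame
    # produces a vector with the same layout.
    for x, y in list(left) + list(right) + list(obj):
        token.extend((x / width, y / height))  # scale pixels to [0, 1]
    return token


def clip_to_sequence(frames: Sequence[Frame], width: int,
                     height: int) -> List[List[float]]:
    """Convert a clip (list of frames) into a list of per-frame tokens."""
    return [frame_to_token(l, r, o, width, height) for l, r, o in frames]


if __name__ == "__main__":
    # Hypothetical data: every point sits at the image center.
    hand = [(640.0, 360.0)] * NUM_HAND_KEYPOINTS
    box = [(640.0, 360.0)] * NUM_OBJECT_POINTS
    sequence = clip_to_sequence([(hand, hand, box)] * 3, width=1280, height=720)
    print(len(sequence), len(sequence[0]))  # 3 frames, 92 values per frame
```

Under these assumptions each frame becomes a 92-dimensional vector ((21 + 21 + 4) points × 2 coordinates), so the downstream model sees a short sequence of small vectors rather than raw images.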
Authors: Wiktor Mucha, Martin Kampel