Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting (2402.18330v1)
Abstract: We present EgoTAP, a heatmap-to-3D pose lifting method for highly accurate stereo egocentric 3D pose estimation. Severe self-occlusion and out-of-view limbs in egocentric camera views make accurate pose estimation a challenging problem. To address the challenge, prior methods employ joint heatmaps-probabilistic 2D representations of the body pose, but heatmap-to-3D pose conversion still remains an inaccurate process. We propose a novel heatmap-to-3D lifting method composed of the Grid ViT Encoder and the Propagation Network. The Grid ViT Encoder summarizes joint heatmaps into effective feature embedding using self-attention. Then, the Propagation Network estimates the 3D pose by utilizing skeletal information to better estimate the position of obscure joints. Our method significantly outperforms the previous state-of-the-art qualitatively and quantitatively demonstrated by a 23.9\% reduction of error in an MPJPE metric. Our source code is available in GitHub.
- Abien Fred Agarap. Deep learning using rectified linear units (relu), 2018. cite arxiv:1803.08375Comment: 7 pages, 11 figures, 9 tables.
- Unrealego: A new dataset for robust egocentric 3d human motion capture. In European Conference on Computer Vision (ECCV), 2022.
- An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
- Skeleton graph-neural-network-based human action recognition: A survey. Sensors, 22(6):2091, 2022.
- Deep Residual Learning for Image Recognition. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778. IEEE, 2016.
- Epipolar transformers. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7776–7785, Los Alamitos, CA, USA, 2020. IEEE Computer Society.
- Long short-term memory. Neural computation, 9:1735–80, 1997.
- Ego3dpose: Capturing 3d cues from binocular egocentric views. In SIGGRAPH Asia 2023 Conference Papers, New York, NY, USA, 2023. Association for Computing Machinery.
- Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diega, CA, USA, 2015.
- Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
- Ego-body pose estimation via ego-head pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17142–17151, 2023a.
- Multi-hypothesis representation learning for transformer-based 3d human pose estimation. Pattern Recognition, page 109631, 2023b.
- Skeleton-based action recognition using spatio-temporal lstm network with trust gates. IEEE Trans. Pattern Anal. Mach. Intell., 40(12):3007–3021, 2018.
- Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- You2me: Inferring body pose in egocentric video via first and second person interactions. CVPR, 2020.
- Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
- Egocap: Egocentric marker-less motion capture with two fisheye cameras. ACM Transactions on Graphics, 35, 2016.
- U-net: Convolutional networks for biomedical image segmentation, 2015. cite arxiv:1505.04597Comment: conditionally accepted at MICCAI 2015.
- Diffusion-based 3d human pose estimation with multi-hypothesis aggregation. arXiv preprint arXiv:2303.11579, 2023.
- xr-egopose: Egocentric 3d human pose from an hmd camera. In Proceedings of the IEEE International Conference on Computer Vision, pages 7728–7738, 2019.
- Efficient object localization using convolutional networks. In CVPR, pages 648–656. IEEE Computer Society, 2015.
- Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
- Estimating egocentric 3d human pose in global space. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11500–11509, 2021.
- Estimating egocentric 3d human pose in the wild with external weak supervision. CVPR, 2022.
- Scene-aware egocentric 3d human pose estimation. CVPR, 2023.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, 2020. Association for Computational Linguistics.
- Mo22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPTCap22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPT : Real-time mobile 3d motion capture with a cap-mounted fisheye camera. IEEE Transactions on Visualization and Computer Graphics, pages 1–1, 2019.
- Spatial temporal graph convolutional networks for skeleton-based action recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018.
- Gla-gcn: Global-local adaptive graph convolutional network for 3d human pose estimation from monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8818–8829, 2023.
- Learning skeletal graph neural networks for hard 3d pose estimation. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11416–11425, 2021.
- Mixste: Seq2seq mixed spatio-temporal encoder for 3d human pose estimation in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13232–13242, 2022.
- Egoglass: Egocentric-view human pose estimation from an eyeglass frame. In 2021 International Conference on 3D Vision (3DV), pages 32–41, 2021.
- Poseformerv2: Exploring frequency domain for efficient and robust 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8877–8886, 2023.
- 3d human pose estimation with spatial and temporal transformers. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021.