Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting (2402.18330v1)

Published 28 Feb 2024 in cs.CV

Abstract: We present EgoTAP, a heatmap-to-3D pose lifting method for highly accurate stereo egocentric 3D pose estimation. Severe self-occlusion and out-of-view limbs in egocentric camera views make accurate pose estimation a challenging problem. To address this challenge, prior methods employ joint heatmaps, probabilistic 2D representations of the body pose, but converting heatmaps into a 3D pose remains inaccurate. We propose a novel heatmap-to-3D lifting method composed of two parts: the Grid ViT Encoder and the Propagation Network. The Grid ViT Encoder uses self-attention to summarize joint heatmaps into an effective feature embedding. The Propagation Network then estimates the 3D pose, using skeletal information to better estimate the positions of obscured joints. Our method significantly outperforms the previous state of the art both qualitatively and quantitatively, reducing error by 23.9% on the MPJPE (mean per-joint position error) metric. Our source code is available on GitHub.
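The abstract says only that the Propagation Network uses skeletal information to estimate obscured joints. One common way to realize that general idea is to pass state down the kinematic tree, so each joint's estimate is conditioned on its parent's. The Python sketch below illustrates that idea and is not EgoTAP's actual architecture: the 7-joint skeleton (`PARENTS`), the hypothetical `propagate` function, the single ReLU layer standing in for a learned cell, and the random weights are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical skeleton (an assumption, not EgoTAP's joint set):
# joint 0 is the root; joints 1-3 form one chain and joints 4-6 another.
# PARENTS[j] is the parent of joint j, and -1 marks the root.
PARENTS = [-1, 0, 1, 2, 0, 4, 5]


def propagate(features, parents, w_self, w_parent, w_out):
    """Sketch of skeleton-guided lifting: each joint's hidden state mixes its
    own feature with its parent's hidden state, then is projected to 3D.

    features: (J, D) per-joint embeddings, e.g. from a heatmap encoder.
    Returns a (J, 3) array of joint positions.
    """
    # Require parents to precede children so a single forward pass suffices.
    if any(p >= j for j, p in enumerate(parents)):
        raise ValueError("every parent index must be smaller than its child's")
    hidden_dim = w_self.shape[1]
    hidden = np.zeros((len(parents), hidden_dim))
    for j, p in enumerate(parents):
        parent_state = hidden[p] if p >= 0 else np.zeros(hidden_dim)
        # Single ReLU layer as a stand-in for whatever cell a real model uses.
        hidden[j] = np.maximum(0.0, features[j] @ w_self + parent_state @ w_parent)
    return hidden @ w_out


# Toy usage with random weights.
rng = np.random.default_rng(0)
J, D, H = len(PARENTS), 16, 32
features = rng.normal(size=(J, D))
w_self = rng.normal(size=(D, H)) * 0.1
w_parent = rng.normal(size=(H, H)) * 0.1
w_out = rng.normal(size=(H, 3)) * 0.1
pose = propagate(features, PARENTS, w_self, w_parent, w_out)
print(pose.shape)  # (7, 3)
```

Because hidden state flows only from parent to child, changing a leaf joint's features leaves its ancestors and every other branch untouched, while a distal joint that is occluded still receives context from the joints above it. The real network's cell, feature sizes, and joint set may differ.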

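MPJPE averages the Euclidean distance between each predicted joint and its ground-truth position, and the 23.9% figure in the abstract is a relative reduction of that average against the previous best method. The sketch below computes both. The function names `mpjpe` and `relative_reduction` and all numbers are hypothetical toy values, not the paper's data; it assumes predictions and ground truth share the same coordinate frame and units (commonly millimetres), and it omits any alignment step a benchmark may apply (for example, the Procrustes alignment used by PA-MPJPE).

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error: the Euclidean distance between each
    predicted joint and its ground truth, averaged over joints and frames.

    pred, gt: arrays of shape (..., J, 3) in the same units and frame.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError(f"shape mismatch: {pred.shape} vs {gt.shape}")
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def relative_reduction(baseline_err, new_err):
    """Fraction by which new_err improves on baseline_err."""
    return (baseline_err - new_err) / baseline_err


# Toy data: 2 frames x 4 joints, every prediction off by (3, 4, 0).
gt = np.zeros((2, 4, 3))
pred = gt + np.array([3.0, 4.0, 0.0])
print(mpjpe(pred, gt))  # 5.0

# Made-up errors, not the paper's: 76.1 vs. 100.0 is a 23.9% reduction.
print(round(relative_reduction(100.0, 76.1), 3))  # 0.239
```

Read the other way, a 23.9% reduction means the new method's MPJPE is 76.1% of the previous state of the art's, whatever the absolute values are.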