Multi-hop graph transformer network for 3D human pose estimation (2405.03055v1)

Published 5 May 2024 in cs.CV

Abstract: Accurate 3D human pose estimation is challenging due to occlusion and depth ambiguity. In this paper, we introduce a multi-hop graph transformer network for 2D-to-3D human pose estimation in videos. The network leverages the strengths of multi-head self-attention and of multi-hop graph convolutional networks with disentangled neighborhoods to capture spatio-temporal dependencies and handle long-range interactions. The proposed architecture consists of two parts: a graph attention block, composed of stacked layers of multi-head self-attention and graph convolution with a learnable adjacency matrix, and a multi-hop graph convolutional block, composed of multi-hop convolutional and dilated convolutional layers. The combination of multi-head self-attention and multi-hop graph convolutional layers enables the model to capture both local and global dependencies, while the dilated convolutional layers enhance its ability to capture the spatial details required to accurately localize the human body joints. Extensive experiments demonstrate the effectiveness and generalization ability of our model, which achieves competitive performance on benchmark datasets.
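The abstract does not give the layer equations, so the following is only a rough NumPy sketch of what "multi-hop graph convolution with disentangled neighborhoods" can look like, under stated assumptions. It builds one adjacency matrix per hop distance k (joints exactly k hops apart are grouped separately instead of being mixed with nearer neighbors) and sums A_k X W_k over hops. The function names, the added self-loops, the symmetric normalization, and the toy 5-joint chain skeleton are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np
from collections import deque


def hop_distances(edges, num_joints):
    """All-pairs shortest-path hop counts via BFS (unreachable pairs stay inf)."""
    neighbors = [[] for _ in range(num_joints)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    dist = np.full((num_joints, num_joints), np.inf)
    for src in range(num_joints):
        dist[src, src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if dist[src, v] == np.inf:
                    dist[src, v] = dist[src, u] + 1
                    queue.append(v)
    return dist


def disentangled_adjacency(edges, num_joints, max_hop):
    """Stack of normalized k-hop adjacency matrices for k = 0..max_hop.

    Assumption: A_k[i, j] = 1 iff joints i and j are exactly k hops apart,
    self-loops are added to every A_k, and each matrix is normalized as
    D^-1/2 A D^-1/2. With this choice A_0 is the identity.
    """
    dist = hop_distances(edges, num_joints)
    eye = np.eye(num_joints)
    mats = []
    for k in range(max_hop + 1):
        a = np.maximum((dist == k).astype(float), eye)  # k-hop ring plus self-loops
        d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))       # rows are never empty
        mats.append(a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :])
    return np.stack(mats)


def multi_hop_graph_conv(x, adj_stack, weights):
    """One multi-hop graph convolution (hypothetical form).

    x: (J, C_in) joint features; adj_stack: (K+1, J, J); weights: (K+1, C_in, C_out).
    Returns sum over k of A_k @ x @ W_k, shape (J, C_out).
    """
    return np.einsum("kij,jc,kcd->id", adj_stack, x, weights)


# Usage on a toy 5-joint chain skeleton (hypothetical, not a real dataset layout).
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
adj = disentangled_adjacency(chain, num_joints=5, max_hop=2)
feats = np.random.default_rng(0).standard_normal((5, 3))
w = np.random.default_rng(1).standard_normal((3, 3, 8))
print(multi_hop_graph_conv(feats, adj, w).shape)  # (5, 8)
```

Keeping one weight matrix per hop lets the layer weight distant joints (for example, wrist and shoulder) differently from immediate neighbors, which is the usual motivation for disentangling neighborhoods rather than using powers of a single adjacency matrix.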
