Relation-Based Associative Joint Location for Human Pose Estimation in Videos (2107.03591v3)

Published 8 Jul 2021 in cs.CV

Abstract: Video-based human pose estimation (VHPE) is a vital yet challenging task. While deep learning methods have made significant progress on VHPE, most approaches implicitly model the long-range interaction between joints by enlarging the receptive field of the convolution. Unlike prior methods, we design a lightweight and plug-and-play joint relation extractor (JRE) that models the associative relationship between joints explicitly and automatically. The JRE takes the pseudo heatmaps of joints as input and computes the similarity between them. In this way, the JRE flexibly learns the relationship between any two joints and thus captures the rich spatial configuration of human poses. Moreover, the JRE can infer invisible joints from the relationships between joints, which helps the model locate occluded joints. Combined with temporal semantic continuity modeling, we then propose a Relation-based Pose Semantics Transfer Network (RPSTN) for video-based human pose estimation. Specifically, to capture the temporal dynamics of poses, the pose semantic information of the current frame is transferred to the next frame by a joint-relation-guided pose semantics propagator (JRPSP). The proposed model can transfer pose semantic features from non-occluded frames to occluded frames, making our method robust to occlusion. Furthermore, the proposed JRE module is also suitable for image-based human pose estimation. RPSTN achieves state-of-the-art results on the video-based Penn Action, Sub-JHMDB, and PoseTrack2018 datasets, and the JRE improves the performance of backbones on the image-based COCO2017 dataset. Code is available at https://github.com/YHDang/pose-estimation.
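
The abstract describes the JRE as computing pairwise similarities between per-joint pseudo heatmaps so that each joint can borrow evidence from related joints, including occluded ones. Below is a minimal PyTorch sketch of that idea, not the authors' implementation: the module name, the dot-product similarity, the softmax normalization, and the 1x1 fusion convolution are all illustrative assumptions (see the repository linked above for the actual JRE).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointRelationSketch(nn.Module):
    """Illustrative sketch of relation-based joint refinement.

    Takes per-joint pseudo heatmaps (B, K, H, W), computes a K x K
    similarity matrix between the flattened heatmaps, and uses it to
    mix information across joints. All design details here are
    assumptions for illustration, not the paper's exact JRE.
    """

    def __init__(self, num_joints: int):
        super().__init__()
        # 1x1 conv fusing the original and relation-refined heatmaps
        # (a hypothetical fusion choice for this sketch).
        self.fuse = nn.Conv2d(2 * num_joints, num_joints, kernel_size=1)

    def forward(self, heatmaps: torch.Tensor) -> torch.Tensor:
        b, k, h, w = heatmaps.shape
        flat = heatmaps.view(b, k, h * w)             # (B, K, HW)
        # Pairwise similarity between joint heatmaps: scaled dot
        # product, softmax-normalized over the source-joint axis.
        sim = torch.bmm(flat, flat.transpose(1, 2))   # (B, K, K)
        sim = F.softmax(sim / (h * w) ** 0.5, dim=-1)
        # Each joint's refined map is a relation-weighted mixture of
        # all joints' maps, so an occluded joint can borrow evidence
        # from related visible joints.
        refined = torch.bmm(sim, flat).view(b, k, h, w)
        return self.fuse(torch.cat([heatmaps, refined], dim=1))

# Usage: refine 16-joint heatmaps of size 64x64.
if __name__ == "__main__":
    jre = JointRelationSketch(num_joints=16)
    maps = torch.randn(2, 16, 64, 64)
    print(jre(maps).shape)  # torch.Size([2, 16, 64, 64])
```

Because the relation matrix is computed from the heatmaps themselves rather than from a fixed skeleton, the mixing weights can adapt per pose, which is the property the abstract credits for inferring invisible joints.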
