Lifting by Image -- Leveraging Image Cues for Accurate 3D Human Pose Estimation (2312.15636v1)

Published 25 Dec 2023 in cs.CV

Abstract: Lifting from 2D pose has been the dominant approach to 3D Human Pose Estimation (3DHPE), owing to the strong visual analysis ability of 2D pose estimators. However, estimating from 2D pose alone suffers from a well-known depth ambiguity: a single 2D pose can correspond to multiple 3D poses. Intuitively, the rich semantic and texture information in images could make the lifting more accurate. Yet existing research faces two main challenges. First, because 3D motion capture datasets are recorded in laboratory environments, the distribution of their image data is too narrow, so methods trained with image information generalize poorly. Second, effective strategies for leveraging image information are lacking. In this paper, we offer new insight into the cause of this poor generalization and into the effectiveness of image features. Building on these insights, we propose a two-stage framework. In the first stage, the keypoints query all image patches and select the beneficial features. To reduce the keypoints' attention to inconsequential background features, we design a novel Pose-guided Transformer Layer that adaptively limits updates to unimportant image patches. An Adaptive Feature Selection Module then prunes less significant image patches from the feature map. In the second stage, the keypoints further emphasize the retained critical image features. This progressive learning approach avoids further training on insignificant image features. Experimental results show that our model achieves state-of-the-art performance on both the Human3.6M and MPI-INF-3DHP datasets.
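The two-stage idea in the abstract (keypoints query all image patches, low-importance patches are pruned, then keypoints attend only to the retained patches) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. Assumptions: raw feature vectors serve directly as queries, keys and values (no learned projections, multi-head attention, or 3D regression head); a patch's importance is taken to be the mean attention it receives from all keypoints, a hypothetical stand-in for the Adaptive Feature Selection Module; and the Pose-guided Transformer Layer's limiting of updates to unimportant patches is omitted. The function names (`attend`, `select_patches`, `two_stage_lift`) are illustrative only.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(xs: Vector) -> Vector:
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], patches: List[Vector]) -> List[Vector]:
    """Scaled dot-product weights: row j is keypoint j's distribution over patches."""
    scale = math.sqrt(len(queries[0]))
    return [softmax([dot(q, p) / scale for p in patches]) for q in queries]


def attend(queries: List[Vector], patches: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Residual update: each keypoint token adds the attention-weighted average of patch features."""
    weights = cross_attention(queries, patches)
    updated = []
    for q, w in zip(queries, weights):
        context = [sum(wn * p[i] for wn, p in zip(w, patches)) for i in range(len(q))]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated, weights


def select_patches(weights: List[Vector], keep: int) -> List[int]:
    """Hypothetical pruning rule: rank patches by the mean attention they receive
    from all keypoints and return the indices of the top `keep`, in original order."""
    num_patches = len(weights[0])
    scores = [sum(row[n] for row in weights) / len(weights) for n in range(num_patches)]
    ranked = sorted(range(num_patches), key=lambda n: scores[n], reverse=True)
    return sorted(ranked[:keep])


def two_stage_lift(
    keypoints: List[Vector], patches: List[Vector], keep: int
) -> Tuple[List[Vector], List[int]]:
    # Stage 1: keypoints attend to every image patch.
    stage1, weights = attend(keypoints, patches)
    # Prune: keep only the patches the keypoints attended to most.
    kept = select_patches(weights, keep)
    retained = [patches[n] for n in kept]
    # Stage 2: the updated keypoints attend only to the retained patches.
    stage2, _ = attend(stage1, retained)
    return stage2, kept


if __name__ == "__main__":
    keypoints = [[1.0, 0.0], [0.0, 1.0]]  # 2 toy keypoint tokens, feature dim 2
    patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [0.1, 0.1]]  # 4 toy patch tokens
    refined, kept = two_stage_lift(keypoints, patches, keep=2)
    print(kept)  # → [0, 1]
```

With these toy features, the two patches aligned with a keypoint receive the most attention and are retained, while the other two are pruned before the second stage. Averaging attention over keypoints is only one simple way to score patches; a trained model would derive the importance signal from learned parameters.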

Authors (3)
  1. Feng Zhou (195 papers)
  2. Jianqin Yin (54 papers)
  3. Peiyang Li (11 papers)
Citations (8)
