
VQ-HPS: Human Pose and Shape Estimation in a Vector-Quantized Latent Space (2312.08291v4)

Published 13 Dec 2023 in cs.CV

Abstract: Previous works on Human Pose and Shape Estimation (HPSE) from RGB images can be broadly categorized into two main groups: parametric and non-parametric approaches. Parametric techniques leverage a low-dimensional statistical body model for realistic results, whereas recent non-parametric methods achieve higher precision by directly regressing the 3D coordinates of the human body mesh. This work introduces a novel paradigm to address the HPSE problem, involving a low-dimensional discrete latent representation of the human mesh and framing HPSE as a classification task. Instead of predicting body model parameters or 3D vertex coordinates, we focus on predicting the proposed discrete latent representation, which can be decoded into a registered human mesh. This innovative paradigm offers two key advantages. Firstly, predicting a low-dimensional discrete representation confines our predictions to the space of anthropomorphic poses and shapes even when little training data is available. Secondly, by framing the problem as a classification task, we can harness the discriminative power inherent in neural networks. The proposed model, VQ-HPS, predicts the discrete latent representation of the mesh. The experimental results demonstrate that VQ-HPS outperforms the current state-of-the-art non-parametric approaches while yielding results as realistic as those produced by parametric methods when trained with little data. VQ-HPS also shows promising results when training on large-scale datasets, highlighting the significant potential of the classification approach for HPSE. See the project page at https://g-fiche.github.io/research-pages/vqhps/
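A minimal sketch of the classification framing described in the abstract, not the authors' implementation. It assumes a codebook of discrete embeddings and a per-position classifier over codebook indices. The sizes (`NUM_TOKENS`, `CODEBOOK_SIZE`, `EMBED_DIM`), the random codebook, and the random stand-ins for the mesh encoder output and the image model logits are all hypothetical, and the mesh decoder is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
NUM_TOKENS = 4      # latent positions per mesh
CODEBOOK_SIZE = 8   # number of discrete codebook entries
EMBED_DIM = 3       # dimension of each codebook embedding

# Stand-in for a learned codebook.
codebook = rng.normal(size=(CODEBOOK_SIZE, EMBED_DIM))


def quantize(latents, codebook):
    """Return, for each latent vector, the index of its nearest codebook entry."""
    # Squared Euclidean distances, shape (num_latents, codebook_size).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def cross_entropy(logits, targets):
    """Mean cross-entropy between per-position logits and target indices."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


# Targets: discrete indices obtained by quantizing a mesh encoder's output
# (random stand-in here).
encoder_out = rng.normal(size=(NUM_TOKENS, EMBED_DIM))
target_idx = quantize(encoder_out, codebook)

# Prediction: an image model would output one logit vector per latent position
# (random stand-in here), trained with a classification loss.
logits = rng.normal(size=(NUM_TOKENS, CODEBOOK_SIZE))
loss = cross_entropy(logits, target_idx)

# Inference: pick the most likely index per position and look up its embedding.
# A mesh decoder (not shown) would map these embeddings to a registered mesh.
pred_idx = logits.argmax(axis=1)
quantized = codebook[pred_idx]  # shape (NUM_TOKENS, EMBED_DIM)
```

Because every prediction is an index into a fixed codebook, the output always lies in the set of latent codes the decoder was trained on, which is the property the abstract credits for keeping predictions within the space of anthropomorphic poses and shapes.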
