VQ-HPS: Human Pose and Shape Estimation in a Vector-Quantized Latent Space (2312.08291v4)
Abstract: Previous works on Human Pose and Shape Estimation (HPSE) from RGB images can be broadly categorized into two main groups: parametric and non-parametric approaches. Parametric techniques leverage a low-dimensional statistical body model for realistic results, whereas recent non-parametric methods achieve higher precision by directly regressing the 3D coordinates of the human body mesh. This work introduces a novel paradigm to address the HPSE problem, involving a low-dimensional discrete latent representation of the human mesh and framing HPSE as a classification task. Instead of predicting body model parameters or 3D vertex coordinates, we focus on predicting the proposed discrete latent representation, which can be decoded into a registered human mesh. This innovative paradigm offers two key advantages. Firstly, predicting a low-dimensional discrete representation confines our predictions to the space of anthropomorphic poses and shapes even when little training data is available. Secondly, by framing the problem as a classification task, we can harness the discriminative power inherent in neural networks. The proposed model, VQ-HPS, predicts the discrete latent representation of the mesh. The experimental results demonstrate that VQ-HPS outperforms the current state-of-the-art non-parametric approaches while yielding results as realistic as those produced by parametric methods when trained with little data. VQ-HPS also shows promising results when training on large-scale datasets, highlighting the significant potential of the classification approach for HPSE. See the project page at https://g-fiche.github.io/research-pages/vqhps/
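The classification framing described in the abstract can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the codebook, the latent vectors, and the per-token scores below are made-up values, the function names are hypothetical, and a standard cross-entropy loss is assumed. The sketch only shows the general idea of mapping continuous latents to codebook indices, scoring a prediction as a classification over those indices, and looking up the code vectors that a mesh decoder would consume.

```python
import math

# Hypothetical toy codebook: K = 4 code vectors of dimension 2.
# In a vector-quantized model these would be learned; here they are fixed by hand.
CODEBOOK = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]


def quantize(latents, codebook):
    """Map each continuous latent vector to the index of its nearest code."""
    indices = []
    for z in latents:
        dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def lookup(indices, codebook):
    """Replace each index by its code vector (the input a mesh decoder would take)."""
    return [codebook[i] for i in indices]


def cross_entropy(logits, target):
    """Classification loss for one token: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


# A "ground-truth" mesh encoded as a short sequence of latent vectors (made-up values).
latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 1.1]]
target_indices = quantize(latents, CODEBOOK)

# Made-up per-token scores over the K codes, standing in for an image-conditioned predictor.
logits = [
    [4.0, 0.5, 0.2, 0.1],
    [0.1, 0.3, 0.2, 3.5],
    [0.2, 0.1, 2.0, 2.5],
]
predicted_indices = [row.index(max(row)) for row in logits]

# Mean cross-entropy over tokens: the regression problem becomes a classification problem.
loss = sum(cross_entropy(l, t) for l, t in zip(logits, target_indices)) / len(logits)

# Whatever the predictor outputs, the decoder input is always made of valid code vectors.
quantized = lookup(predicted_indices, CODEBOOK)
```

Even when a token is misclassified, as the third one is here, the decoder still receives a vector drawn from the codebook. This is one way to read the abstract's claim that predictions stay confined to a learned space of plausible meshes.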