TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation (2404.16752v1)
Abstract: We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust performance. With such methods, we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an approximate camera projection model. We quantify the error induced by current camera models and show that fitting 2D keypoints and p-GT accurately causes incorrect 3D poses. Our analysis defines the invalid distances within which minimizing 2D and p-GT losses is detrimental. We use this to formulate a new loss, Threshold-Adaptive Loss Scaling (TALS), that penalizes gross 2D and p-GT losses but not smaller ones. With such a loss, there are many 3D poses that could equally explain the 2D evidence. To reduce this ambiguity, we need a prior over valid human poses, but such priors can introduce unwanted bias. To address this, we exploit a tokenized representation of human pose and reformulate the problem as token prediction. This restricts the estimated poses to the space of valid poses, effectively providing a uniform prior. Extensive experiments on the EMDB and 3DPW datasets show that our reformulated keypoint loss and tokenization allow us to train on in-the-wild data while improving 3D accuracy over the state-of-the-art. Our models and code are available for research at https://tokenhmr.is.tue.mpg.de.
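The two ideas in the abstract, a loss that ignores small residuals and a pose representation restricted to a discrete set of valid entries, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `threshold` and `small_scale` parameters, and the toy codebook are all assumptions made for illustration, and the real method operates on 2D keypoint and p-GT losses and on a learned pose codebook.

```python
from typing import List, Sequence, Tuple


def threshold_scaled_loss(
    errors: Sequence[float], threshold: float, small_scale: float = 0.0
) -> float:
    """Mean per-element error, with errors below `threshold` scaled down.

    Hypothetical stand-in for a threshold-adaptive loss: gross errors keep
    full weight, small ones are multiplied by `small_scale` (assumed value).
    """
    if not errors:
        return 0.0
    total = 0.0
    for e in errors:
        weight = 1.0 if e >= threshold else small_scale
        total += weight * e
    return total / len(errors)


def quantize(
    vector: Sequence[float], codebook: Sequence[Sequence[float]]
) -> Tuple[int, List[float]]:
    """Return (index, entry) of the codebook entry nearest to `vector`.

    Nearest-neighbour lookup by squared Euclidean distance; the output is
    always a codebook entry, so it cannot leave the set the codebook spans.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, list(codebook[index])


if __name__ == "__main__":
    # Toy per-keypoint errors: only the two errors at or above 1.0 contribute.
    print(threshold_scaled_loss([0.5, 2.0, 4.0], threshold=1.0))
    # Toy 2D codebook: the query snaps to its nearest entry.
    print(quantize([0.9, 0.1], [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
```

With `small_scale` set to zero, residuals inside the threshold contribute nothing, so many poses fit the evidence equally well. That is the ambiguity the abstract says the tokenized representation is meant to resolve, since predictions are limited to codebook entries.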