
TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation (2404.16752v1)

Published 25 Apr 2024 in cs.CV

Abstract: We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust performance. With such methods, we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an approximate camera projection model. We quantify the error induced by current camera models and show that fitting 2D keypoints and p-GT accurately causes incorrect 3D poses. Our analysis defines the invalid distances within which minimizing 2D and p-GT losses is detrimental. We use this to formulate a new loss Threshold-Adaptive Loss Scaling (TALS) that penalizes gross 2D and p-GT losses but not smaller ones. With such a loss, there are many 3D poses that could equally explain the 2D evidence. To reduce this ambiguity we need a prior over valid human poses but such priors can introduce unwanted bias. To address this, we exploit a tokenized representation of human pose and reformulate the problem as token prediction. This restricts the estimated poses to the space of valid poses, effectively providing a uniform prior. Extensive experiments on the EMDB and 3DPW datasets show that our reformulated keypoint loss and tokenization allows us to train on in-the-wild data while improving 3D accuracy over the state-of-the-art. Our models and code are available for research at https://tokenhmr.is.tue.mpg.de.


Summary

  • The paper introduces TokenHMR, which uses Threshold-Adaptive Loss Scaling (TALS) to counteract camera projection biases and improve 3D pose accuracy.
  • It employs a tokenized pose representation, learned with a vector-quantized VAE (VQ-VAE), that confines predictions to a finite set of valid human poses, enhancing robustness.
  • Experimental results on benchmarks like EMDB and 3DPW show that TokenHMR significantly reduces 3D errors compared to previous state-of-the-art models.

Advancing Human Mesh Recovery with Tokenized Pose Representation

Introduction

The ongoing challenge in 3D human pose and shape (HPS) estimation from single images is to achieve high accuracy in both the estimated 3D pose and its alignment with the 2D image. Recent methods face a paradox: as 2D keypoint accuracy improves, 3D pose accuracy can degrade. Key contributors to this problem are biases inherent in the pseudo-ground-truth (p-GT) data and the discrepancies introduced by approximate camera projection models. The paper presents TokenHMR, which introduces Threshold-Adaptive Loss Scaling (TALS) and a tokenized representation of human pose to mitigate these issues, setting a new state of the art in 3D HPS estimation.

Analysis of Key Challenges

Existing 3D HPS methods typically minimize a 2D keypoint loss, which inadvertently harms 3D accuracy because the assumed camera model is only an approximation. The paper demonstrates this with the BEDLAM dataset: projecting ground-truth 3D poses through the assumed (incorrect) camera parameters yields significant 2D projection errors, so driving the 2D fit below this error level forces the network to distort the 3D pose. This analysis shows how high 2D fitting accuracy can translate into large deviations in 3D pose accuracy.
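
To make this mechanism concrete, here is a toy numpy sketch (not taken from the paper) that projects the same camera-space 3D joints through a plausible true focal length and through a fixed, assumed focal length of the kind many regressors hard-code. The resulting pixel gap is error that a 2D keypoint loss would push the network to "explain" by distorting the 3D pose. All values (joint coordinates, focal lengths, image center) are illustrative assumptions, and the sketch ignores the camera translation a regressor would also adjust.

```python
import numpy as np

def project_perspective(joints3d, focal, center):
    """Pinhole projection of (J, 3) camera-space joints to (J, 2) pixel coordinates."""
    xy = joints3d[:, :2] / joints3d[:, 2:3]
    return focal * xy + np.asarray(center)

def projection_gap(joints3d, true_focal, assumed_focal, center):
    """Mean pixel discrepancy between the true camera and the approximate one
    for the same ground-truth 3D pose."""
    kp_true = project_perspective(joints3d, true_focal, center)
    kp_assumed = project_perspective(joints3d, assumed_focal, center)
    return np.linalg.norm(kp_true - kp_assumed, axis=-1).mean()

# Toy example: a person roughly 3 m from the camera, true focal length 1500 px,
# while the regressor assumes a fixed focal length of 5000 px (illustrative values).
rng = np.random.default_rng(0)
joints = rng.normal(scale=0.3, size=(24, 3)) + np.array([0.0, 0.0, 3.0])
print(projection_gap(joints, true_focal=1500.0, assumed_focal=5000.0, center=(512.0, 512.0)))
```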

Innovations in TokenHMR

Threshold-Adaptive Loss Scaling (TALS)

TALS addresses the core issue that, because of camera biases, minimizing the 2D and p-GT losses below a certain error threshold does not improve 3D pose accuracy and can actively degrade it. The adaptive loss therefore scales the penalty according to whether the error exceeds a threshold informed by the baseline errors measured on ground-truth data: gross errors are penalized in full, while errors already inside the invalid range contribute little to the training signal.
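
A minimal PyTorch sketch of one way such a threshold-adaptive loss could look is given below; the hard thresholding, the `low_weight` down-weighting, and the tensor shapes are assumptions made for illustration and do not reproduce the paper's exact formulation.

```python
import torch

def tals_keypoint_loss(pred_kp2d, gt_kp2d, threshold, low_weight=0.0):
    """Threshold-adaptive keypoint loss sketch: per-joint 2D errors above
    `threshold` (in normalized image units) are penalized in full, while
    errors already inside the invalid range implied by the approximate
    camera receive only `low_weight` (zero by default), so the network is
    not pushed to over-fit the 2D evidence."""
    per_joint = torch.linalg.norm(pred_kp2d - gt_kp2d, dim=-1)        # (B, J)
    scale = torch.where(per_joint > threshold,
                        torch.ones_like(per_joint),
                        torch.full_like(per_joint, low_weight))
    # Detach the scale so the thresholding itself carries no gradient.
    return (scale.detach() * per_joint).mean()

# Hypothetical usage with batch size 2 and 24 joints:
pred = torch.rand(2, 24, 2, requires_grad=True)
gt = torch.rand(2, 24, 2)
loss = tals_keypoint_loss(pred, gt, threshold=0.05)
loss.backward()
```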

Tokenization of Human Pose

To reduce the ambiguity that remains when the 2D evidence alone cannot determine the 3D pose, TokenHMR introduces a token-based pose representation. Using a vector-quantized variational autoencoder (VQ-VAE) trained on extensive motion-capture data, the system discretizes human pose into a vocabulary of tokens. This restricts the model's outputs to a finite set of valid poses, effectively providing a uniform prior and improving robustness to occlusion and partial visibility.
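
The sketch below illustrates the generic vector-quantization step such a tokenizer relies on: encoder features are snapped to their nearest codebook entries and a straight-through estimator keeps the encoder trainable. The class name, codebook size, feature dimension, and number of token slots are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class PoseQuantizer(nn.Module):
    """Minimal vector-quantization layer in the spirit of VQ-VAE: continuous
    pose features are snapped to their nearest codebook entries, so any pose
    decoded from the tokens comes from a finite, data-driven vocabulary.
    Codebook size and feature dimension are illustrative, not the paper's
    configuration."""

    def __init__(self, num_tokens=2048, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_tokens, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_tokens, 1.0 / num_tokens)

    def forward(self, z):
        # z: (B, T, dim) continuous features from a pose encoder.
        book = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        dists = torch.cdist(z, book)              # (B, T, num_tokens)
        tokens = dists.argmin(dim=-1)             # discrete pose tokens, (B, T)
        z_q = self.codebook(tokens)               # quantized features, (B, T, dim)
        # Straight-through estimator so gradients still reach the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, tokens

# Hypothetical usage: one sample with 16 token slots.
quantizer = PoseQuantizer()
z = torch.randn(1, 16, 256)
z_q, tokens = quantizer(z)
print(tokens.shape)  # torch.Size([1, 16])
```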

Experimental Results

Detailed experiments highlight the strengths of TokenHMR on benchmarks such as EMDB and 3DPW, where it outperforms existing state-of-the-art models, including HMR2.0, by substantial margins on 3D error metrics. The evaluation indicates that the tokenized pose representation and the TALS loss together yield more accurate 3D human pose estimates.
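
For context, the 3D errors referred to here are typically reported as MPJPE and Procrustes-aligned MPJPE (PA-MPJPE). The numpy sketch below implements the standard definitions of these metrics; it is not evaluation code from the paper, and the toy check at the end only verifies that a rotated, scaled copy of a pose yields near-zero PA-MPJPE.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error, in the units of the input (e.g. mm)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after similarity (Procrustes) alignment of pred to gt."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation and scale from the SVD of the cross-covariance matrix.
    U, S, Vt = np.linalg.svd(p.T @ g)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)

# Toy check: a rotated, scaled, translated copy of the same joints has ~0 PA-MPJPE.
rng = np.random.default_rng(0)
gt = rng.normal(size=(24, 3))
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
pred = 1.1 * gt @ Rz.T + 0.05
print(round(pa_mpjpe(pred, gt), 6))  # ~0.0
```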

Implications and Future Directions

TokenHMR not only raises the accuracy bar for 3D HPS models but also opens avenues for future research. The tokenization of human pose offers an interesting parallel to LLMs, where a limited vocabulary of tokens can represent a vast space of information (here, valid human poses). Further exploration of more refined tokenization techniques, as well as applying the TALS approach across different model architectures, could provide deeper insights and improvements. Additionally, integrating more accurate camera models, or models that adapt to the input data, could further enhance the performance of 3D HPS systems.

In conclusion, the paper addresses a significant challenge in 3D human pose estimation and introduces methods that mitigate the biases induced by pseudo-ground-truth and approximate camera projection, moving the field toward more accurate and robust HPS prediction models.
