Papers

Topics

Authors

Recent

View all

AI Research Assistant

Well-researched responses based on relevant abstracts and paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses.

Gemini 2.5 Flash

Gemini 2.5 Flash 77 tok/s

Gemini 2.5 Pro 52 tok/s Pro

GPT-5 Medium 30 tok/s Pro

GPT-5 High 31 tok/s Pro

GPT-4o 91 tok/s Pro

Kimi K2 178 tok/s Pro

GPT OSS 120B 385 tok/s Pro

Claude Sonnet 4 38 tok/s Pro

2000 character limit reached

GenHMR: Generative Human Mesh Recovery (2412.14444v1)

Published 19 Dec 2024 in cs.CV, cs.AI, cs.GR, and cs.LG

Abstract: Human mesh recovery (HMR) is crucial in many computer vision applications; from health to arts and entertainment. HMR from monocular images has predominantly been addressed by deterministic methods that output a single prediction for a given 2D image. However, HMR from a single image is an ill-posed problem due to depth ambiguity and occlusions. Probabilistic methods have attempted to address this by generating and fusing multiple plausible 3D reconstructions, but their performance has often lagged behind deterministic approaches. In this paper, we introduce GenHMR, a novel generative framework that reformulates monocular HMR as an image-conditioned generative task, explicitly modeling and mitigating uncertainties in the 2D-to-3D mapping process. GenHMR comprises two key components: (1) a pose tokenizer to convert 3D human poses into a sequence of discrete tokens in a latent space, and (2) an image-conditional masked transformer to learn the probabilistic distributions of the pose tokens, conditioned on the input image prompt along with randomly masked token sequence. During inference, the model samples from the learned conditional distribution to iteratively decode high-confidence pose tokens, thereby reducing 3D reconstruction uncertainties. To further refine the reconstruction, a 2D pose-guided refinement technique is proposed to directly fine-tune the decoded pose tokens in the latent space, which forces the projected 3D body mesh to align with the 2D pose clues. Experiments on benchmark datasets demonstrate that GenHMR significantly outperforms state-of-the-art methods. Project website can be found at https://m-usamasaleem.github.io/publication/GenHMR/GenHMR.html

Summary

The paper introduces GenHMR, a novel generative framework that reformulates human mesh recovery as an image-conditioned task, addressing depth ambiguity and occlusion.
The model employs a pose tokenizer with VQ-VAE and an image-conditioned masked transformer to iteratively refine pose predictions using uncertainty-guided sampling and 2D pose-guided refinement.
Experimental results show significant error reduction in MPJPE and PA-MPJPE metrics, outperforming state-of-the-art methods on datasets like Human3.6M, 3DPW, and EMDB.

GenHMR: Generative Human Mesh Recovery

Introduction

The paper presents GenHMR, a novel approach to human mesh recovery from monocular images, reformulated as an image-conditioned generative task. The technique aims to address the inherent challenges of depth ambiguity and occlusion in 2D-to-3D mapping. GenHMR incorporates a pose tokenizer and an image-conditional masked transformer to model the probabilistic distributions of pose tokens, improving the accuracy of 3D human mesh reconstruction from single images.

Model Architecture

GenHMR uses two key components:

Pose Tokenizer: Utilizes Vector Quantized Variational Autoencoders (VQ-VAE) to convert continuous human pose parameters into a sequence of discrete tokens. This enables discretization of the pose space, capturing the essential features needed for accurate human mesh reconstruction.
Image-Conditioned Masked Transformer: Trains on sequences with randomly masked tokens, learning the conditional categorical distribution of pose tokens. This transformer effectively models probabilistic mappings, aiding in refining pose estimations by leveraging image cues.
Figure 1: GenHMR Training Phase. GenHMR consists of two key components: (1) a Pose Tokenizer that encodes 3D human poses into a sequence of discrete tokens within a latent space, and (2) an Image-Conditioned Masked Transformer that models the probabilistic distributions of these tokens, conditioned on the input image and a partially masked token sequence.

Inference Strategy

The inference process consists of two main stages:

Uncertainty-Guided Sampling: Iteratively decodes high-confidence pose tokens by sampling from learned distributions. Uncertain tokens are re-masked and refined over successive iterations, facilitating the generation of robust 3D reconstructions.
2D Pose-Guided Refinement: Further optimizes the pose tokens by ensuring consistency between the 3D body mesh and 2D pose estimates. This step significantly reduces reconstruction uncertainty and ensures alignment with detected 2D features.
Figure 2: Our inference strategy comprises two key stages: (1) Uncertainty-Guided Sampling, which iteratively samples high-confidence pose tokens based on their probabilistic distributions, and (2) 2D Pose-Guided Refinement, which fine-tunes the sampled pose tokens to further minimize 3D reconstruction uncertainty by ensuring consistency between the 3D body mesh and 2D pose estimates.

Experimental Results

Experiments demonstrate GenHMR's superiority in 3D reconstruction accuracy compared to state-of-the-art methods on datasets like Human3.6M, 3DPW, and EMDB. Notably, it achieves significant error reduction in terms of MPJPE and PA-MPJPE, showcasing its effectiveness in handling complex poses and ambiguous scenarios.

Figure 3: Impact of 2D Pose-Guided Refinement on 3D pose reconstruction. Red circles highlight areas of errors after each refinement iteration, which showcases how the method progressively refines these poses. By fine-tuning pose tokens to align the 3D pose with 2D detections, our method iteratively reduces uncertainties and improves accuracy. Significant improvements are seen in the early iterations, with errors largely minimized at the 10th iteration. Note that the initial mesh comes from UGS.

Ablation Studies

GenHMR's architecture was subjected to extensive ablation studies, evaluating the effects of token quantity, codebook size, and inference iterations on performance. Findings underscore the importance of generative masking and iterative refinement techniques in driving the improvements observed in human mesh reconstructions.

Conclusion

GenHMR marks a departure from traditional deterministic and probabilistic approaches to human mesh recovery, offering a sophisticated generative framework that excels in both controlled and in-the-wild conditions. The model's ability to iteratively refine pose estimations, combined with innovative inference strategies, ensures a high degree of accuracy and robustness in complex scenarios. The advancements seen with GenHMR imply promising directions for further exploration and application in AI-driven human pose analysis.