Human Mesh Recovery from Monocular Images via a Skeleton-disentangled Representation (1908.07172v2)

Published 20 Aug 2019 in cs.CV

Abstract: We describe an end-to-end method for recovering 3D human body meshes from single images and monocular videos. Unlike existing methods, which try to obtain all the complex 3D pose, shape, and camera parameters from one coupled feature, we propose a skeleton-disentangling framework that divides the task into multiple levels of spatial and temporal granularity in a decoupled manner. Spatially, we propose an effective and pluggable "disentangling the skeleton from the details" (DSD) module. It reduces complexity and decouples the skeleton, laying a good foundation for temporal modeling. Temporally, a self-attention-based temporal convolution network is proposed to efficiently exploit short- and long-term temporal cues. Furthermore, an unsupervised adversarial training strategy, temporal shuffling and order recovery, is designed to promote the learning of motion dynamics. The proposed method outperforms state-of-the-art 3D human mesh recovery methods by 15.4% in MPJPE and 23.8% in PA-MPJPE on Human3.6M. State-of-the-art results are also achieved on the 3D Poses in the Wild (3DPW) dataset without any fine-tuning. In particular, ablation studies demonstrate that the skeleton-disentangled representation is crucial for better temporal modeling and generalization.

Citations (176)

Summary

  • The paper introduces a novel DSD module that decouples skeletal structure from detailed pose features, significantly boosting 3D mesh accuracy.
  • It leverages a Self-Attention Temporal Network combining self-attention with temporal convolution to effectively capture both short and long-term motion dynamics.
  • Evaluations on Human3.6M and 3DPW benchmarks demonstrate state-of-the-art performance with reduced MPJPE and PA-MPJPE without dataset-specific tuning.

Human Mesh Recovery from Monocular Images via a Skeleton-disentangled Representation

This paper presents an innovative approach to recovering 3D human body meshes from monocular images and videos by utilizing a skeleton-disentangled representation. The framework aims to improve the accuracy and stability of human mesh recovery by addressing inherent challenges such as information loss in 2D projections and the complexity of human pose variations.

The paper introduces a novel module, termed "Disentangling the Skeleton from the Details" (DSD), which separates the skeletal structure from detailed pose information and body shape. By employing a bilinear transformation, the DSD module extracts skeleton features more accurately, reducing network complexity and improving feature decoupling. Experiments demonstrate that this disentangled representation significantly boosts the accuracy of the recovered body meshes, outperforming existing methods with notably lower MPJPE and PA-MPJPE on benchmark datasets.
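To make the idea concrete, here is a minimal PyTorch sketch of a bilinear disentangling step. The class name, layer sizes, and branch structure are illustrative assumptions, not the paper's exact architecture: the point is that a skeleton branch and a detail branch are split from a shared backbone feature and recombined with `nn.Bilinear`, so the skeleton component can be supervised separately (e.g., with 2D/3D joint targets).

```python
import torch
import torch.nn as nn

class DSDSketch(nn.Module):
    """Hypothetical skeleton/detail disentangling step (not the paper's code).

    Splits a pooled backbone feature into a skeleton branch and a detail
    branch, then recombines the two factors with a bilinear map so the
    skeleton component can be supervised independently of shape cues.
    """
    def __init__(self, feat_dim=2048, skel_dim=256, detail_dim=256, out_dim=512):
        super().__init__()
        self.to_skeleton = nn.Linear(feat_dim, skel_dim)  # skeleton branch
        self.to_detail = nn.Linear(feat_dim, detail_dim)  # shape/detail branch
        # Bilinear recombination of the two decoupled factors.
        self.bilinear = nn.Bilinear(skel_dim, detail_dim, out_dim)

    def forward(self, feat):
        skel = self.to_skeleton(feat)    # supervise with pose targets
        detail = self.to_detail(feat)    # carries shape/appearance cues
        fused = self.bilinear(skel, detail)
        return fused, skel, detail

x = torch.randn(8, 2048)               # pooled backbone features
fused, skel, detail = DSDSketch()(x)
print(fused.shape, skel.shape)         # torch.Size([8, 512]) torch.Size([8, 256])
```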

For temporal modeling, the authors propose a Self-Attention Temporal Network (SATN) that combines a self-attention mechanism with a Temporal Convolution Network (TCN). This hybrid design efficiently captures both the short- and long-term temporal cues present in video sequences, improving the modeling of motion dynamics. The paper also describes an unsupervised adversarial training strategy, temporal shuffling and order recovery, which further promotes the learning of motion dynamics and enriches the temporal feature representation.
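The sketch below illustrates, under assumed names and dimensions, how self-attention and temporal convolution might be combined in one block: the attention layer relates frames across the whole sequence (long-term cues), while the 1D convolution aggregates neighboring frames (short-term cues). It is not the paper's SATN architecture, only a plausible PyTorch rendering of the combination.

```python
import torch
import torch.nn as nn

class SATNSketch(nn.Module):
    """Illustrative self-attention + temporal-convolution block.

    Names and dimensions are assumptions, not the paper's exact design:
    self-attention captures long-range frame dependencies, while the
    1D convolution models local motion between adjacent frames.
    """
    def __init__(self, dim=512, heads=8, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.tcn = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(dim)

    def forward(self, seq):                      # seq: (batch, frames, dim)
        attn_out, _ = self.attn(seq, seq, seq)   # long-term cues
        h = self.norm(seq + attn_out)
        # Conv1d expects (batch, channels, frames), hence the transposes.
        conv_out = self.tcn(h.transpose(1, 2)).transpose(1, 2)  # short-term cues
        return self.norm(h + conv_out)

frames = torch.randn(4, 16, 512)   # a 16-frame feature sequence
out = SATNSketch()(frames)
print(out.shape)                   # torch.Size([4, 16, 512])
```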

A rigorous evaluation of the proposed method on the Human3.6M and 3DPW datasets validates its effectiveness, showing state-of-the-art results without dataset-specific fine-tuning. Ablation studies further emphasize the critical role of the skeleton-disentangled representation in enhancing temporal modeling.
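For reference, the two reported metrics are standard and easy to state in code: MPJPE averages the per-joint Euclidean error, and PA-MPJPE applies a similarity (Procrustes) alignment first, removing global rotation, translation, and scale. The NumPy sketch below follows the usual definitions; the joint count and test data are illustrative.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance
    over joints, in the same units as the input (typically mm)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after a similarity (Procrustes) alignment of the prediction
    to the ground truth, removing rotation, translation, and scale."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation via SVD of the 3x3 cross-covariance matrix.
    U, s, Vt = np.linalg.svd(p.T @ g)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:               # avoid reflections
        Vt[-1] *= -1
        s[-1] *= -1
        R = Vt.T @ U.T
    scale = s.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)

pred = np.random.randn(17, 3)              # 17 predicted 3D joints
gt = pred + 0.01 * np.random.randn(17, 3)
print(mpjpe(pred, gt), pa_mpjpe(pred, gt))
```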

The implications of this research are substantial, offering potential advancements in areas such as virtual human modeling, motion capture, and other computer vision applications requiring human mesh data. Future work could explore further integration with multi-view systems or refined unsupervised learning paradigms to expand its applicability and improve resilience to varied environmental conditions.

Overall, this paper presents a robust and efficient methodology for advancing the field of human mesh recovery, setting the stage for future developments in accurate and scalable 3D representation techniques.