- The paper provides a comprehensive review of deep learning methods for 3D human pose estimation and mesh recovery, covering both single- and multi-person scenarios.
- It details the use of explicit models, implicit representations, various sensors, and advanced networks like ResNet and Transformers for improved reconstruction.
- The study highlights key challenges and future directions, including handling occlusions, enhancing reconstruction detail, and optimizing model performance for real-world applications.
The survey paper "Deep Learning for 3D Human Pose Estimation and Mesh Recovery: A Survey" provides a comprehensive review of deep learning methods for 3D Human Pose Estimation (HPE) and Human Mesh Recovery (HMR) over the past five years. The survey covers methods for both single-person and multi-person HPE, as well as HMR techniques based on explicit models and implicit representations.
The paper highlights the importance of 3D HPE and HMR in understanding human behavior in various applications, including computer vision, autonomous driving, and virtual reality. It notes that while 3D HPE accurately predicts human body keypoint coordinates in three-dimensional space, HMR reconstructs a three-dimensional digital model of the body, capturing details such as shape, gestures, clothing, and facial expressions.
The survey emphasizes the increasing attention garnered by 3D HPE and HMR due to advances in deep learning. It notes that 3D pose estimation has evolved from single-person to multi-person settings with more varied data inputs, while HMR has progressed both in the inputs it can handle and in the level of detail it captures. The paper also points out that both tasks still face significant challenges, such as multi-person scenarios, self-occlusion, and detailed body reconstruction.
The paper categorizes sensors used for 3D HPE and HMR into active and passive sensors. Active sensors emit signals and measure reflections, including Motion Capture (MoCap) systems, Time of Flight (ToF) cameras, and Radio Frequency (RF) technologies. Passive sensors rely on signals from objects or natural sources, including Inertial Measurement Units (IMUs) and image sensors. The survey focuses on using RGB image sensors due to their widespread applicability.
The survey discusses representations of the human body, including 3D coordinates that describe joint positions and orientations, and statistics-based models such as SCAPE and SMPL. SMPL is a learned, skinned vertex-based model that represents the body as a 3D mesh controlled by pose parameters θ and shape parameters β (a minimal usage sketch follows the list below):
- θ: pose parameters that control the joint angles and global posture
- β: shape parameters that determine the body's shape
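As a rough illustration of how these parameters drive the model, the sketch below poses an SMPL body with the open-source smplx package; the model-file path, neutral gender choice, and zero-valued pose/shape inputs are placeholder assumptions, not details from the survey.

```python
# Minimal sketch: posing an SMPL-style body model with the `smplx` package.
# Assumes SMPL model files were downloaded separately; the path is hypothetical.
import torch
import smplx

model = smplx.create("models/", model_type="smpl", gender="neutral")

betas = torch.zeros(1, 10)         # shape parameters beta (first 10 PCA coefficients)
body_pose = torch.zeros(1, 69)     # pose parameters theta: 23 body joints x 3 axis-angle values
global_orient = torch.zeros(1, 3)  # root orientation in axis-angle form

output = model(betas=betas, body_pose=body_pose, global_orient=global_orient)
vertices = output.vertices         # (1, 6890, 3) posed mesh vertices
joints = output.joints             # 3D joint locations regressed from the mesh
```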
Extensions of SMPL are also mentioned, including SMPL+H (which adds the MANO hand model) and FLAME for the head and face, as well as SMPL-X, which captures the body, face, and hands simultaneously. Recent developments in implicit models, such as Parametric Model-Conditioned Implicit Representation (PaMIR), are also noted for their flexible body representations.
The overview of deep learning for 3D HPE and HMR includes four components: data collection, deep learning model (encoder and decoder), learning methods, and output results. The deep learning model typically consists of an encoder (e.g., ResNet, HRNet) and a decoder (e.g., MLP, Transformer). Learning methods such as weakly supervised learning, unsupervised learning, and few-shot learning are employed to alleviate data dependency. Techniques like knowledge distillation, model pruning, and parameter quantization can be applied to reduce the model size. The results of 3D human pose estimation and mesh recovery can be represented in various forms, including keypoints, mesh, and voxels.
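As a hypothetical instance of this encoder-decoder pattern, the sketch below couples a ResNet-50 backbone with a small MLP head that regresses 3D coordinates for 17 joints; the joint count, layer sizes, and class name are illustrative assumptions rather than an architecture taken from the survey.

```python
# Minimal encoder-decoder sketch: ResNet-50 features -> MLP regressor for 17 3D joints.
# Dimensions and the joint count are illustrative assumptions, not from the survey.
import torch
import torch.nn as nn
import torchvision.models as models

class Simple3DPoseNet(nn.Module):
    def __init__(self, num_joints: int = 17):
        super().__init__()
        backbone = models.resnet50(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
        self.decoder = nn.Sequential(                                  # MLP decoder
            nn.Linear(2048, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, num_joints * 3),
        )
        self.num_joints = num_joints

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(images).flatten(1)                   # (B, 2048) global feature
        return self.decoder(feats).view(-1, self.num_joints, 3)   # (B, 17, 3) 3D keypoints

poses = Simple3DPoseNet()(torch.randn(2, 3, 224, 224))  # -> (2, 17, 3)
```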
The survey classifies 3D human pose estimation into single-person and multi-person estimation. Single-person 3D pose estimation methods for images address depth ambiguity, body structure understanding, occlusion, and data scarcity. Techniques include optical-aware methods, appropriate feature representations, joint-aware networks, limb-aware networks, graph-based methods using Graph Neural Networks (GNNs), and learnable triangulation. For single-person 3D pose estimation in videos, methods address the limitations of single-frame inference, real-time constraints, body structure understanding, occlusion, and data scarcity. Approaches include VideoPose3D, PoseFormer, MHFormer, and methods incorporating motion loss and human-joint affinity.
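To make the video-based lifting idea concrete, here is a minimal temporal-convolution network that lifts a sequence of detected 2D keypoints to per-frame 3D joints, loosely in the spirit of VideoPose3D; the layer widths, kernel sizes, and 17-joint skeleton are illustrative assumptions.

```python
# Sketch of temporal 2D-to-3D lifting with 1D convolutions over the time axis.
# All sizes here are illustrative assumptions, not the published architecture.
import torch
import torch.nn as nn

class TemporalLifter(nn.Module):
    def __init__(self, num_joints: int = 17, channels: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(num_joints * 2, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, num_joints * 3, kernel_size=1),
        )
        self.num_joints = num_joints

    def forward(self, kpts_2d: torch.Tensor) -> torch.Tensor:
        # kpts_2d: (B, T, J, 2) sequence of 2D keypoints
        B, T, J, _ = kpts_2d.shape
        x = kpts_2d.reshape(B, T, J * 2).transpose(1, 2)   # (B, J*2, T)
        out = self.net(x).transpose(1, 2)                   # (B, T, J*3)
        return out.reshape(B, T, J, 3)                      # per-frame 3D joints

poses_3d = TemporalLifter()(torch.randn(2, 27, 17, 2))      # -> (2, 27, 17, 3)
```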
Multi-person 3D pose estimation is divided into top-down and bottom-up methods. Top-down methods address real-time constraints, representation limitations, occlusion, and data scarcity; bottom-up methods address real-time constraints, supervision limitations, data scarcity, and occlusion.
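The top-down strategy can be summarized by the following pipeline sketch, where the person detector and the single-person estimator are placeholders for whatever concrete networks a given method uses.

```python
# Sketch of a top-down multi-person pipeline: detect people, crop each box,
# then run a single-person 3D pose estimator per crop. Both components are
# placeholders, not a specific method from the survey.
from typing import Callable, List, Sequence, Tuple

import numpy as np

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) person bounding box

def top_down_3d_poses(
    image: np.ndarray,
    detect_people: Callable[[np.ndarray], Sequence[Box]],
    estimate_single_pose: Callable[[np.ndarray], np.ndarray],
) -> List[np.ndarray]:
    poses = []
    for x1, y1, x2, y2 in detect_people(image):     # 1) detect all people in the image
        crop = image[y1:y2, x1:x2]                  # 2) crop each detected person
        poses.append(estimate_single_pose(crop))    # 3) single-person 3D pose per crop
    return poses                                     # one (J, 3) pose per detected person
```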
The survey categorizes HMR methods into template-based (parametric) and template-free (non-parametric) approaches. Template-based HMR reconstructs the body by estimating the parameters of a predefined model (e.g., SCAPE, SMPL), whereas template-free HMR predicts the 3D body directly from the input without relying on a predefined model.
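A minimal sketch of the template-based idea is a network head that regresses SMPL pose, shape, and weak-perspective camera parameters from a global image feature; the feature dimension and output split below are illustrative assumptions, and the iterative refinement used by methods such as HMR is omitted.

```python
# Sketch of template-based recovery: regress SMPL pose/shape (plus a weak-perspective
# camera) from a global image feature. Sizes and the parameter split are assumptions.
import torch
import torch.nn as nn

class SMPLParamRegressor(nn.Module):
    def __init__(self, feat_dim: int = 2048):
        super().__init__()
        # 72 = 24 joints x 3 axis-angle pose values, 10 = shape coefficients,
        # 3 = weak-perspective camera (scale, tx, ty)
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, 72 + 10 + 3),
        )

    def forward(self, feats: torch.Tensor):
        out = self.head(feats)
        pose, shape, cam = out[:, :72], out[:, 72:82], out[:, 82:]
        return pose, shape, cam  # feed pose/shape into an SMPL layer to obtain the mesh

pose, shape, cam = SMPLParamRegressor()(torch.randn(2, 2048))
```

The predicted pose and shape would then be passed through an SMPL layer (as in the earlier parameter sketch) to produce the final mesh.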
Template-based HMR methods for naked human body recovery include multimodal methods, attention mechanisms, exploitation of temporal information, multi-view methods, efficiency improvements, varied intermediate representations (e.g., texture maps, UV maps, heat maps), use of structural information, and appropriate learning strategies. Detailed human body recovery methods are categorized by whether they additionally model clothing, hands, or the whole body including face and hands.
Template-free HMR methods include regression-based approaches, optimization-based approaches using differentiable rendering, implicit representations (e.g., PIFu, ARCH), Neural Radiance Fields (NeRF), and diffusion models.
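To illustrate the implicit, template-free route, the sketch below shows a PIFu-style occupancy decoder: an MLP that maps a pixel-aligned image feature and a query point's depth to an inside/outside probability. The feature dimension and layer widths are assumptions, and the image encoder, camera projection, and marching-cubes mesh extraction are assumed to exist elsewhere.

```python
# Sketch of a PIFu-style implicit function: given a pixel-aligned image feature
# and the query point's depth along the camera ray, predict occupancy in [0, 1].
# Feature dimension and layer widths are illustrative assumptions.
import torch
import torch.nn as nn

class ImplicitOccupancyMLP(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1, 512),   # pixel-aligned feature + depth z
            nn.ReLU(inplace=True),
            nn.Linear(512, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 1),
            nn.Sigmoid(),                   # occupancy probability
        )

    def forward(self, pixel_feats: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # pixel_feats: (N, feat_dim), depth: (N, 1) -> occupancy: (N, 1)
        return self.mlp(torch.cat([pixel_feats, depth], dim=-1))

occ = ImplicitOccupancyMLP()(torch.randn(4096, 256), torch.randn(4096, 1))
```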
The survey discusses evaluation metrics such as Mean Per Joint Position Error (MPJPE), Mean Per Joint Angle Error (MPJAE), Mean Per Joint Localization Error (MPJLE), and Mean Per Vertex Position Error (MPVPE). It also lists several datasets used for training and evaluating 3D HPE and HMR models, including Human3.6M, 3DPW, MPI-INF-3DHP, HumanEva, CMU-Panoptic, MuCo-3DHP, SURREAL, 3DOH50K, AMASS, DensePose, and THuman.
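As an example of how such metrics are computed, MPJPE is the mean Euclidean distance between predicted and ground-truth joints, typically after aligning both skeletons at the root joint; a minimal NumPy version follows (the root index and millimetre units are skeleton- and dataset-dependent assumptions).

```python
# Minimal MPJPE computation: mean Euclidean distance between predicted and
# ground-truth 3D joints after root-joint alignment (root index is skeleton-dependent).
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray, root: int = 0) -> float:
    """pred, gt: (J, 3) arrays of 3D joint positions, typically in millimetres."""
    pred = pred - pred[root:root + 1]      # align both skeletons at the root joint
    gt = gt - gt[root:root + 1]
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

print(mpjpe(np.random.rand(17, 3) * 1000, np.random.rand(17, 3) * 1000))
```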
The paper concludes by discussing applications of 3D HPE and HMR, including motion retargeting, action recognition, security monitoring, SLAM, autonomous driving, and human-computer interaction. It identifies challenges and future research directions, such as leveraging large models, improving reconstruction detail, handling crowding and occlusion, and optimizing inference speed.