Overview of Monocular Human Pose Estimation: A Survey of Deep Learning-based Methods
The paper "Monocular Human Pose Estimation: A Survey of Deep Learning-based Methods" presents a comprehensive review of deep learning approaches to human pose estimation (HPE) from monocular imagery. HPE is a core computer vision problem: inferring the configuration of the human body from images or video sequences. The survey covers both 2D and 3D pose estimation methods developed since 2014, a period of rapid progress driven by deep learning.
Core Contributions and Methodologies
The survey organizes HPE methods into four primary categories according to output dimensionality (2D vs. 3D) and the number of people in the scene:
- 2D Single Person Pose Estimation: This category covers detecting body joints in images containing a single person. The authors divide these methods into:
- Regression-based approaches which directly predict joint coordinates from images.
- Detection-based approaches that generate intermediate representations such as heatmaps, facilitating the localization of joints.
- 2D Multi-Person Pose Estimation: The complexity increases with multiple individuals in the frame, necessitating strategies like:
- Top-Down Approaches: These utilize person detectors to localize individual instances before applying single-person estimators.
- Bottom-Up Approaches: These methods directly infer all joint candidates, followed by grouping processes to associate joints with individual subjects.
- 3D Single Person Pose Estimation: The task becomes harder here, since depth must be recovered in addition to image-plane position from purely 2D input:
- Model-Free Strategies: These do not depend on a predefined body model, often starting with 2D detections extended into 3D.
- Model-Based Methods: These incorporate parametric human body models to directly infer 3D configurations.
- 3D Multi-Person Pose Estimation: The hardest setting combines both challenges: estimating 3D poses for several people in crowded scenes, often exploiting inter-person and scene constraints to resolve depth ambiguity and occlusion.
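The detection-based (heatmap) representation mentioned above is simple to illustrate: each joint is encoded as a 2D Gaussian peaked at its ground-truth location, the network regresses these maps, and coordinates are recovered at inference time via an argmax. A minimal NumPy sketch of the encode/decode step (the heatmap size and Gaussian sigma here are illustrative choices, not values prescribed by the survey):

```python
import numpy as np

def encode_heatmap(joint_xy, size=64, sigma=2.0):
    """Render a 2D Gaussian heatmap peaked at the joint location (x, y)."""
    xs = np.arange(size)
    gx = np.exp(-((xs - joint_xy[0]) ** 2) / (2 * sigma ** 2))
    gy = np.exp(-((xs - joint_xy[1]) ** 2) / (2 * sigma ** 2))
    return np.outer(gy, gx)  # indexed as heatmap[y, x]

def decode_heatmap(heatmap):
    """Recover joint coordinates as the argmax of the heatmap."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return np.array([x, y])

hm = encode_heatmap(np.array([40, 20]))
print(decode_heatmap(hm))  # -> [40 20]
```

The dense heatmap target is what makes detection-based methods easier to train than direct coordinate regression: every output pixel receives a supervision signal, rather than only two scalar coordinates per joint.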
Evaluation Metrics and Datasets
For benchmarking progress within these tasks, the paper discusses the evaluation metrics and datasets critical to both 2D and 3D pose estimation. Datasets such as MPII Human Pose, COCO, and Human3.6M offer diverse environments and scales, posing challenges in pose variation, occlusion, and scale. Metrics like Percentage of Correct Keypoints (PCK) and Mean Per Joint Position Error (MPJPE) are highlighted as the standard measures of model performance.
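Both metrics are straightforward to compute: PCK counts a 2D prediction as correct when it lies within a threshold fraction of a reference scale (e.g. head or torso size) of the ground truth, while MPJPE averages the per-joint Euclidean distance in 3D. A small NumPy sketch (the reference scale, threshold fraction, and example coordinates are illustrative assumptions):

```python
import numpy as np

def pck(pred, gt, ref_scale, alpha=0.5):
    """Fraction of joints within alpha * ref_scale of ground truth (2D)."""
    dists = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(dists <= alpha * ref_scale))

def mpjpe(pred, gt):
    """Mean per-joint Euclidean error (3D), typically reported in mm."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

gt2d = np.array([[10.0, 10.0], [50.0, 50.0]])
pred2d = np.array([[12.0, 10.0], [80.0, 50.0]])
print(pck(pred2d, gt2d, ref_scale=10.0))  # one joint within 5 px of two -> 0.5
```

Note the choice of reference scale matters: PCKh (used with MPII) normalizes by head segment length, while earlier PCK variants normalize by torso size, so scores are not directly comparable across conventions.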
Implications and Future Directions
The authors assess the current state of the field, crediting deep learning for much of its recent progress. The surveyed methods demonstrate how effectively neural networks can infer human posture from images. Challenges persist, however, chiefly in handling large scale variation, occluded limbs, and complex multi-person interactions.
Future trajectories in HPE propose advancements in:
- Developing efficient models that can operate in real-time scenarios, necessary for applications like surveillance and interactive systems.
- Leveraging synthetic data and domain adaptation techniques to bridge the gap between real-world applications and training datasets, ensuring robustness across diverse conditions.
- Improving model interpretability and reliability so that errors can be systematically analyzed and corrected, particularly in 3D pose estimation.
In summary, this survey provides a thorough examination of the advancements in deep learning-based HPE methodologies, presenting a detailed landscape of current techniques, datasets, and challenges. It serves as a crucial resource for furthering progress in accurate and efficient pose estimation from monocular vision.