Deep Learning-Based Human Pose Estimation: A Survey (2012.13392v5)

Published 24 Dec 2020 in cs.CV, cs.GR, and cs.MM

Abstract: Human pose estimation aims to locate the human body parts and build human body representation (e.g., body skeleton) from input data such as images and videos. It has drawn increasing attention during the past decade and has been utilized in a wide range of applications including human-computer interaction, motion analysis, augmented reality, and virtual reality. Although the recently developed deep learning-based solutions have achieved high performance in human pose estimation, there still remain challenges due to insufficient training data, depth ambiguities, and occlusion. The goal of this survey paper is to provide a comprehensive review of recent deep learning-based solutions for both 2D and 3D pose estimation via a systematic analysis and comparison of these solutions based on their input data and inference procedures. More than 250 research papers since 2014 are covered in this survey. Furthermore, 2D and 3D human pose estimation datasets and evaluation metrics are included. Quantitative performance comparisons of the reviewed methods on popular datasets are summarized and discussed. Finally, the challenges involved, applications, and future research directions are concluded. A regularly updated project page is provided: https://github.com/zczcwh/DL-HPE

References (316)

Authors (8)

Ce Zheng (45 papers)
Wenhan Wu (9 papers)
Chen Chen (753 papers)
Taojiannan Yang (26 papers)
Sijie Zhu (27 papers)
Ju Shen (9 papers)
Nasser Kehtarnavaz (15 papers)
Mubarak Shah (208 papers)

Summary

Overview of Deep Learning-Based Human Pose Estimation: A Survey

The survey paper "Deep Learning-Based Human Pose Estimation: A Survey" reviews deep learning methods for human pose estimation (HPE). It covers both 2D and 3D pose estimation and addresses challenges such as occlusion, depth ambiguities, and insufficient training data. Covering more than 250 papers published since 2014, the survey is a useful reference for researchers who want to understand how HPE solutions have evolved.

Key Contributions

The survey organizes HPE approaches by input data: monocular images, videos, and other sensors such as depth cameras and inertial measurement units (IMUs). It discusses the main datasets and evaluation metrics and compares methods on standard benchmarks. It also reviews applications in fields including augmented reality and healthcare, highlighting both the achievements and the persistent challenges in HPE.

Methodological Insights

2D Pose Estimation

The paper categorizes 2D pose estimation into single-person and multi-person scenarios:

Single-Person Pose Estimation: The survey discusses regression-based methods, which map images directly to joint coordinates, and heatmap-based methods, which predict a probability map for each joint. Heatmap-based approaches such as HRNet have shown strong performance because they preserve spatial locality.
Multi-Person Pose Estimation: This is further divided into top-down methods, which detect each person first and then estimate that person's pose, and bottom-up methods, which locate body parts first and then assemble them into full poses. Top-down techniques usually exhibit higher accuracy, whereas bottom-up techniques offer computational advantages, especially in crowded scenes.
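
To make the heatmap-based idea concrete, the following minimal sketch decodes joint locations by taking the highest-scoring cell of each per-joint heatmap. It is an illustration only, not the method of HRNet or of any specific paper in the survey: the function names `decode_heatmap` and `decode_pose`, the nested-list heatmap layout, and the toy values are all assumptions made for this example.

```python
from typing import List, Tuple

# Assumed layout: one heatmap per joint, stored as rows of per-pixel scores.
Heatmap = List[List[float]]


def decode_heatmap(heatmap: Heatmap) -> Tuple[int, int, float]:
    """Return (x, y, score) of the highest-scoring cell in one joint heatmap."""
    best_x, best_y, best_score = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best_score:
                best_x, best_y, best_score = x, y, score
    return best_x, best_y, best_score


def decode_pose(heatmaps: List[Heatmap]) -> List[Tuple[int, int, float]]:
    """Decode one (x, y, score) triple per joint heatmap."""
    return [decode_heatmap(h) for h in heatmaps]


# Toy 3x3 heatmaps for two hypothetical joints.
toy_heatmaps = [
    [[0.0, 0.1, 0.0],
     [0.2, 0.9, 0.1],
     [0.0, 0.3, 0.0]],
    [[0.7, 0.1, 0.0],
     [0.2, 0.1, 0.1],
     [0.0, 0.3, 0.0]],
]
toy_pose = decode_pose(toy_heatmaps)
```

A regression-based method would instead output the coordinates directly, with no intermediate map. In a top-down multi-person pipeline, a decoding step like this would run once per detected person crop.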

3D Pose Estimation

3D pose estimation is described under single-view and multi-view categorizations:

Single-View: Methods are further split into direct approaches, which estimate 3D poses straight from images, and 2D-to-3D lifting methods, which first compute 2D poses and then lift them to 3D. The latter build on strong 2D detectors and generally outperform direct estimation.
Multi-View and Additional Sensors: Multi-view setups are covered as a way to overcome occlusion, while other sensors such as depth cameras and IMUs offer alternative routes to robust 3D pose estimation.
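
The depth ambiguity that makes single-view 3D estimation hard can be shown with a small pinhole-camera sketch: two different 3D points on the same camera ray project to the same pixel, so a 2D pose alone does not determine depth. The camera model, the function name `project`, and the intrinsics (`focal`, `cx`, `cy`) are assumptions chosen for illustration and are not taken from the survey.

```python
from typing import Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project(point: Point3D, focal: float = 1000.0,
            cx: float = 640.0, cy: float = 360.0) -> Point2D:
    """Project a camera-space 3D point to pixel coordinates (ideal pinhole model)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return (focal * x / z + cx, focal * y / z + cy)


# Two hypothetical joint positions on the same ray from the camera center:
# the second is twice as far away and twice as large in x and y.
near_joint = (1.0, 2.0, 4.0)
far_joint = (2.0, 4.0, 8.0)

near_pixel = project(near_joint)
far_pixel = project(far_joint)
```

Both points land on the same pixel, which is why lifting methods must rely on learned priors about body structure, and why extra views or depth sensors help resolve the ambiguity.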

Quantitative and Comparative Analysis

The survey includes an extensive quantitative analysis that compares HPE methods on benchmark datasets such as MPII, COCO, and Human3.6M. It also notes that transformer architectures have recently been explored for their ability to capture complex dependencies, achieving competitive results while improving computational efficiency.
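
The summary above does not name the metrics behind these comparisons. As a hedged illustration, the sketch below implements two metrics that are widely used in the HPE literature: Mean Per Joint Position Error (MPJPE), common for 3D benchmarks such as Human3.6M, and Percentage of Correct Keypoints (PCK), common for 2D benchmarks. The function names, the fixed-threshold form of PCK, and the toy data are assumptions for this example; benchmark-specific variants (for instance, thresholds normalized by head size) are not modeled.

```python
import math
from typing import Sequence

Point = Sequence[float]


def mpjpe(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Mean Euclidean distance between predicted and ground-truth joints."""
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def pck(pred: Sequence[Point], gt: Sequence[Point], threshold: float) -> float:
    """Fraction of joints whose error is within a fixed distance threshold."""
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    hits = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return hits / len(pred)


# Toy data: the first joint is exact, the second is off by 5 units.
toy_pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
toy_gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]

toy_mpjpe = mpjpe(toy_pred, toy_gt)
toy_pck = pck(toy_pred, toy_gt, threshold=1.0)
```

Lower MPJPE is better and higher PCK is better, so results reported under the two metrics are not directly comparable.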

Implications and Future Directions

The research presents significant implications for real-world applications, including action recognition, healthcare, and virtual reality. Nevertheless, challenges persist, such as improving occlusion handling and achieving efficient solutions deployable in resource-constrained environments. Future directions include domain adaptation, leveraging multi-objective neural architecture search, and enhancing robustness against adversarial attacks.

Conclusion

This survey offers an extensive review of the state of the art in deep learning-based human pose estimation, providing a valuable resource for researchers and practitioners alike. Through its systematic classification and detailed analysis, it lays a foundation for future advances in HPE and encourages further exploration of diverse and complex application scenarios.

GitHub

GitHub - zczcwh/DL-HPE (503 stars)