CrowdPose: Efficient Crowded Scenes Pose Estimation and A New Benchmark (1812.00324v2)

Published 2 Dec 2018 in cs.CV

Abstract: Multi-person pose estimation is fundamental to many computer vision tasks and has made significant progress in recent years. However, few previous methods explored the problem of pose estimation in crowded scenes while it remains challenging and inevitable in many scenarios. Moreover, current benchmarks cannot provide an appropriate evaluation for such cases. In this paper, we propose a novel and efficient method to tackle the problem of pose estimation in the crowd and a new dataset to better evaluate algorithms. Our model consists of two key components: joint-candidate single person pose estimation (SPPE) and global maximum joints association. With multi-peak prediction for each joint and global association using graph model, our method is robust to inevitable interference in crowded scenes and very efficient in inference. The proposed method surpasses the state-of-the-art methods on CrowdPose dataset by 5.2 mAP and results on MSCOCO dataset demonstrate the generalization ability of our method. Source code and dataset will be made publicly available.

Summary

The paper presents CrowdPose, a novel top-down framework that uses joint-candidate SPPE and global maximum joints association to robustly handle occlusions in crowded scenes.
The new CrowdPose benchmark offers a near-uniform distribution of crowd densities, enabling more consistent evaluation of multi-person pose estimation methods.
The approach outperforms state-of-the-art methods by achieving a 5.2 mAP gain on the CrowdPose dataset and a 0.8 mAP improvement on the MSCOCO benchmark.

Overview of CrowdPose: Efficient Crowded Scenes Pose Estimation and A New Benchmark

The paper "CrowdPose: Efficient Crowded Scenes Pose Estimation and A New Benchmark" addresses multi-person pose estimation in crowded scenes. Few earlier methods target this setting, where occlusion and interference between overlapping people make poses hard to recover, and existing benchmarks do not evaluate such cases adequately. The authors propose a new method together with a new dataset, aiming to improve both the robustness and the efficiency of pose estimation under these conditions.

Methodology

The proposed framework is a top-down approach with two key components: a joint-candidate single-person pose estimator (SPPE) and a global maximum joints association step. Both are designed to handle the interference between people that is prevalent in crowded scenes.

Joint-Candidate SPPE: This component outputs multiple candidate locations for each joint, encompassing both target and interference joints. By doing so, it improves resilience against the noise introduced by overlapping human figures.
Global Maximum Joints Association: Utilizing a graph model, the method associates joints globally, reducing errors in joint assembly that are common in densely populated scenes.
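The two components above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `find_peaks` and `associate`, the threshold value, and the score-matrix layout are all assumptions made for illustration, and the exhaustive search stands in for whatever graph-matching solver the paper actually uses. The sketch only shows the idea of keeping several peaks per joint heatmap and then choosing a one-to-one assignment of candidates to people that maximizes the total response.

```python
from itertools import permutations


def find_peaks(heatmap, threshold=0.3):
    """Return (row, col, score) for each strict local maximum above threshold.

    Uses a 4-neighbourhood; the threshold is an illustrative assumption.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            v = heatmap[r][c]
            if v < threshold:
                continue
            neighbours = [
                heatmap[rr][cc]
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < rows and 0 <= cc < cols
            ]
            if all(v > n for n in neighbours):
                peaks.append((r, c, v))
    return peaks


def associate(scores):
    """Assign joint candidates to people so the summed response is maximal.

    scores[p][j] is the response of candidate j inside person p's box
    (0 means the candidate does not appear there). Each candidate is used
    at most once; a person may stay unmatched (None). Exhaustive search,
    so only suitable for tiny examples.
    """
    n_people = len(scores)
    n_cands = len(scores[0]) if scores else 0
    best_total, best_assign = -1.0, [None] * n_people
    # Slots >= n_cands are dummy "no candidate" slots, one per person.
    for slots in permutations(range(n_cands + n_people), n_people):
        total = sum(scores[p][s] for p, s in enumerate(slots) if s < n_cands)
        if total > best_total:
            best_total = total
            best_assign = [
                s if s < n_cands and scores[p][s] > 0 else None
                for p, s in enumerate(slots)
            ]
    return best_assign


# One joint type, two overlapping people, two candidates.
heatmap = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.1, 0.6, 0.1],
    [0.0, 0.1, 0.0, 0.1, 0.0],
]
candidates = find_peaks(heatmap)
scores = [[0.9, 0.8], [0.85, 0.1]]
assignment = associate(scores)
```

In this toy example, letting each person pick its own strongest candidate would give candidate 0 to both people. The global assignment instead gives candidate 1 to the first person and candidate 0 to the second, because that pairing has the highest total response.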

The authors also introduce a new dataset, CrowdPose, to better evaluate algorithms' performance in crowded environments. The CrowdPose dataset presents a near-uniform distribution of crowding levels, which ensures that pose estimation models must perform well across varying levels of crowd density.
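The summary above does not say how crowding is measured. One measure of this kind, assumed here for illustration, is the average over people of the ratio between the number of other people's joints that fall inside a person's bounding box and the number of that person's own joints. The sketch below computes it; the function name `crowd_index` and the input layout are hypothetical.

```python
def crowd_index(people):
    """Mean ratio of intruding joints to own joints across all people.

    people: list of dicts with 'box' = (x0, y0, x1, y1) and
    'joints' = [(x, y), ...]. This layout is an assumption of the sketch.
    """
    ratios = []
    for i, person in enumerate(people):
        x0, y0, x1, y1 = person["box"]
        own = len(person["joints"])
        if own == 0:
            continue  # skip people without annotated joints
        # Count joints of other people that land inside this person's box.
        intruding = sum(
            1
            for k, other in enumerate(people)
            if k != i
            for (x, y) in other["joints"]
            if x0 <= x <= x1 and y0 <= y <= y1
        )
        ratios.append(intruding / own)
    return sum(ratios) / len(ratios) if ratios else 0.0


people = [
    {"box": (0, 0, 10, 10), "joints": [(1, 1), (2, 2)]},
    {"box": (8, 0, 20, 10), "joints": [(9, 5), (15, 5)]},
]
index = crowd_index(people)
```

Under this assumed measure, a value of 0 means no bounding box contains another person's joints, and larger values mean heavier overlap. A dataset with a near-uniform distribution would then contain roughly as many images at high values as at low ones.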

Numerical Results

The authors report that their method surpasses state-of-the-art methods by 5.2 mAP on the CrowdPose dataset. When integrated into an existing method's framework, it also improves performance on the MSCOCO dataset by 0.8 mAP. These results suggest that the method is effective in crowded conditions and also generalizes to other datasets.

Practical and Theoretical Implications

Practically, more reliable pose estimates in crowds could benefit downstream applications such as activity recognition and human-robot interaction in settings where crowd density is a significant obstacle. Theoretically, the work shifts attention towards global optimization in pose estimation, rather than relying solely on local detection.

The introduction of the CrowdPose benchmark is pivotal, as it addresses the gap in the evaluation landscape for pose estimation under crowded conditions. This dataset can catalyze further research into more sophisticated methods tailored to difficult-to-handle situations with substantial inter-person occlusions.

Future Directions

Future research may explore integrating these techniques into real-time systems, since the computational complexity of the association step matches that of conventional NMS algorithms. Extending the approach to video could also help: motion information across frames may resolve ambiguities that cannot be settled from a single frame.

Overall, the paper makes substantial contributions to advancing multi-person pose estimation in complex environments, with a robust methodological approach and a valuable new benchmark for the community.
