DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation (1511.06645v2)

Published 20 Nov 2015 in cs.CV

Abstract: This paper considers the task of articulated human pose estimation of multiple people in real world images. We propose an approach that jointly solves the tasks of detection and pose estimation: it infers the number of persons in a scene, identifies occluded body parts, and disambiguates body parts between people in close proximity of each other. This joint formulation is in contrast to previous strategies, that address the problem by first detecting people and subsequently estimating their body pose. We propose a partitioning and labeling formulation of a set of body-part hypotheses generated with CNN-based part detectors. Our formulation, an instance of an integer linear program, implicitly performs non-maximum suppression on the set of part candidates and groups them to form configurations of body parts respecting geometric and appearance constraints. Experiments on four different datasets demonstrate state-of-the-art results for both single person and multi person pose estimation. Models and code available at http://pose.mpi-inf.mpg.de.

Summary

The paper introduces a unified ILP framework that jointly partitions and labels human body-part hypotheses for multi-person pose estimation.
It integrates CNN-based detections with pairwise constraints to assign parts to individuals, effectively handling occlusions and overlapping poses.
Experimental results demonstrate state-of-the-art performance, achieving 87.1% PCK on the LSP dataset and notable gains on MPII and WAF datasets.

Joint Subset Partitioning and Labeling for Multi-Person Pose Estimation

The paper "DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation" by Pishchulin et al. introduces a novel framework for multi-person pose estimation in real-world images. This framework departs from traditional methods that sequentially address detection and pose estimation tasks, integrating them into a unified process. The proposed approach uses a joint formulation to estimate the number of persons in an image, assign body parts to individuals, and handle occlusions effectively.

Problem Definition

The central problem addressed by this work is the estimation of articulated human poses in images containing multiple individuals. Traditional approaches first detect people and then estimate each person's pose, and they tend to fail when individuals are in close proximity. This paper instead proposes a unified partitioning and labeling formulation, cast as an instance of an integer linear program (ILP). CNN-based part detectors generate a set of body-part hypotheses, which the ILP then partitions and labels to form coherent body-part configurations.
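As a schematic illustration, a partitioning and labeling program of this kind can be written as below. The notation is an assumption introduced here, since the summary itself gives no symbols: $D$ is the set of part hypotheses, $C$ the set of part classes, $x_{dc}$ indicates that hypothesis $d$ is labeled with class $c$, $y_{dd'}$ indicates that $d$ and $d'$ belong to the same person, and $z_{dd'cc'}$ stands for the product $x_{dc}\,x_{d'c'}\,y_{dd'}$. The costs $\alpha$ and $\beta$ are placeholders for unary and pairwise terms derived from the detectors.

```latex
\[
\begin{aligned}
\min_{x,\,y,\,z}\quad
  & \sum_{d \in D}\sum_{c \in C} \alpha_{dc}\, x_{dc}
    \;+\; \sum_{\{d,d'\} \subseteq D}\;\sum_{c,c' \in C} \beta_{dd'cc'}\, z_{dd'cc'} \\
\text{s.t.}\quad
  & \sum_{c \in C} x_{dc} \le 1
    && \text{(each hypothesis gets at most one label)} \\
  & y_{dd'} \le \sum_{c \in C} x_{dc}, \qquad y_{dd'} \le \sum_{c \in C} x_{d'c}
    && \text{(suppressed hypotheses join no cluster)} \\
  & y_{dd'} + y_{d'd''} - 1 \le y_{dd''}
    && \text{(transitivity, so $y$ encodes a partition)} \\
  & z_{dd'cc'} \le x_{dc}, \quad z_{dd'cc'} \le x_{d'c'}, \quad z_{dd'cc'} \le y_{dd'} \\
  & x_{dc} + x_{d'c'} + y_{dd'} - 2 \le z_{dd'cc'}
    && \text{(linearization of the product)} \\
  & x_{dc},\; y_{dd'},\; z_{dd'cc'} \in \{0, 1\}
\end{aligned}
\]
```

A hypothesis with $x_{dc} = 0$ for every class is suppressed, and the number of people is the number of clusters induced by $y$ among the remaining hypotheses.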

Methodology

The core of the proposed methodology is an ILP that integrates detection and pose estimation. The formulation allows for:

Determination of the number of people in an image,
Assignment of parts to person instances while respecting geometric and appearance constraints,
Implicit non-maximum suppression to deactivate or merge part hypotheses.

The pairwise terms favor plausible configurations of the body parts assigned to each individual. Because these terms are part of the ILP itself, the problem can be solved with bounds on the objective and optimality guarantees.
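The three capabilities listed above can be illustrated with a minimal Python sketch on a toy instance. It is not the authors' implementation. It replaces the ILP solver with exhaustive search, which is feasible only for a handful of hypotheses. The detections, the two part classes, the log-odds cost `log((1 - p) / p)`, and the distance-based `pairwise_prob` are all assumptions made for illustration.

```python
import itertools
import math

CLASSES = ["head", "neck"]  # hypothetical part classes

# Hypothetical detections: (x, y) positions and per-class detector probabilities.
POSITIONS = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (0.1, 0.0), (10.0, 10.0)]
UNARY = [
    {"head": 0.9, "neck": 0.1},
    {"head": 0.1, "neck": 0.9},
    {"head": 0.9, "neck": 0.1},
    {"head": 0.1, "neck": 0.9},
    {"head": 0.8, "neck": 0.2},  # near-duplicate of detection 0
    {"head": 0.2, "neck": 0.1},  # weak background response
]


def cost(p):
    """Assumed log-odds cost: negative when p > 0.5, positive when p < 0.5."""
    return math.log((1.0 - p) / p)


def pairwise_prob(d, e, c_d, c_e):
    """Toy probability that detections d and e (with labels c_d, c_e) share a person."""
    (x1, y1), (x2, y2) = POSITIONS[d], POSITIONS[e]
    dist = math.hypot(x1 - x2, y1 - y2)
    # Same-class detections must nearly coincide; different parts may be further apart.
    limit = 0.5 if c_d == c_e else 1.5
    return 0.9 if dist < limit else 0.05


def partitions(items):
    """Yield every set partition of a list as a list of clusters."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part  # `first` starts its own cluster
        for i in range(len(part)):  # or joins an existing cluster
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def solve_brute_force(unary, pairwise, classes):
    """Search all labelings (None = suppressed) and all partitions of active detections."""
    n = len(unary)
    # Suppressing everything costs 0 and serves as the starting solution.
    best_cost, best_labels, best_people = 0.0, [None] * n, []
    for labels in itertools.product([None] + list(classes), repeat=n):
        active = [d for d in range(n) if labels[d] is not None]
        unary_cost = sum(cost(unary[d][labels[d]]) for d in active)
        for clusters in partitions(active):
            total = unary_cost
            for cluster in clusters:
                # Pairwise costs apply only within a cluster (same person).
                for d, e in itertools.combinations(cluster, 2):
                    total += cost(pairwise(d, e, labels[d], labels[e]))
            if total < best_cost:
                best_cost = total
                best_labels = list(labels)
                best_people = sorted(sorted(c) for c in clusters)
    return best_labels, best_people
```

Calling `solve_brute_force(UNARY, pairwise_prob, CLASSES)` on this toy data returns two person clusters, `[[0, 1, 4], [2, 3]]`. The near-duplicate head (detection 4) is merged into the first person, and the weak background response (detection 5) is left unlabeled. The number of people is never given as input; it falls out of the partition.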

Numerical Results and Experimental Validation

The proposed model is experimentally validated on four datasets, demonstrating state-of-the-art performance for both single-person and multi-person pose estimation tasks. Key results include:

On the LSP dataset, the Dense-CNN variant of the approach significantly outperforms existing methods, achieving 87.1% PCK.
On the MPII Single Person dataset, the Dense-CNN variant achieves 82.4% PCK, again surpassing previous methods.
The approach shows a notable improvement on the WAF dataset for multi-person pose estimation, with a marked increase in mPCP and AOP metrics.
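For readers unfamiliar with the PCK numbers quoted above: PCK (percentage of correct keypoints) counts a predicted keypoint as correct when it lies within a threshold distance of the ground truth, where the threshold is a fraction of a reference length such as the torso size. A minimal sketch follows. The function name, the `alpha=0.2` default, and the choice of reference length are assumptions, since the summary does not state the evaluation settings used.

```python
import math


def pck(predicted, ground_truth, reference_length, alpha=0.2):
    """Percentage of keypoints within alpha * reference_length of the ground truth.

    predicted, ground_truth: equal-length lists of (x, y) tuples.
    reference_length: normalizing size, e.g. torso length (assumed here).
    """
    if len(predicted) != len(ground_truth) or not ground_truth:
        raise ValueError("keypoint lists must be non-empty and of equal length")
    threshold = alpha * reference_length
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return 100.0 * correct / len(ground_truth)
```

With a reference length of 10 and `alpha=0.2`, a keypoint counts as correct if it is at most 2 pixels from its ground-truth location.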

Implications

Practically, this unified approach allows for more robust handling of crowded scenes, where traditional methods falter due to overlapping detections and ambiguities. The ability to deactivate hypotheses based on the overall configuration of parts results in fewer false positives. Theoretically, this work extends the application of ILP in pose estimation, demonstrating that complex inference tasks involving multiple interconnected components can be effectively solved within this framework.

Future Directions

Further advances could come from richer terms that capture higher-order relationships between body parts, beyond the pairwise terms used here, and from stronger neural architectures for part detection. Embedding the entire process in a single end-to-end trainable network might also yield improvements through joint optimization of the part detection and configuration steps.

In sum, the "DeepCut" framework sets a new benchmark for multi-person pose estimation tasks, demonstrating the advantages of integrating detection and pose estimation into a single coherent process. The resulting model is robust, scalable, and capable of handling the complexities associated with articulated pose estimation in real-world multi-person scenarios. Future research may continue to refine and extend this approach, improving its efficiency and accuracy further.
