Instance-level Human Parsing via Part Grouping Network (1808.00157v1)

Published 1 Aug 2018 in cs.CV

Abstract: Instance-level human parsing towards real-world human analysis scenarios is still under-explored due to the absence of sufficient data resources and technical difficulty in parsing multiple instances in a single pass. Several related works all follow the "parsing-by-detection" pipeline that heavily relies on separately trained detection models to localize instances and then performs human parsing for each instance sequentially. Nonetheless, two discrepant optimization targets of detection and parsing lead to suboptimal representation learning and error accumulation for final results. In this work, we make the first attempt to explore a detection-free Part Grouping Network (PGN) for efficiently parsing multiple people in an image in a single pass. Our PGN reformulates instance-level human parsing as two twinned sub-tasks that can be jointly learned and mutually refined via a unified network: 1) semantic part segmentation for assigning each pixel as a human part (e.g., face, arms); 2) instance-aware edge detection to group semantic parts into distinct person instances. Thus the shared intermediate representation would be endowed with capabilities in both characterizing fine-grained parts and inferring instance belongings of each part. Finally, a simple instance partition process is employed to get final results during inference. We conducted experiments on PASCAL-Person-Part dataset and our PGN outperforms all state-of-the-art methods. Furthermore, we show its superiority on a newly collected multi-person parsing dataset (CIHP) including 38,280 diverse images, which is the largest dataset so far and can facilitate more advanced human analysis. The CIHP benchmark and our source code are available at http://sysu-hcp.net/lip/.



Summary

The paper introduces a detection-free Part Grouping Network (PGN) that unifies semantic part segmentation and instance-aware edge detection in one network, avoiding the error accumulation of parsing-by-detection pipelines.
Key methodology includes a refinement branch that leverages shared contextual information to improve the integration between segmentation and edge detection tasks.
Experimental results on the CIHP and PASCAL-Person-Part datasets show that PGN outperforms conventional parsing-by-detection methods.

Instance-level Human Parsing via Part Grouping Network

The paper "Instance-level Human Parsing via Part Grouping Network" addresses instance-level human parsing by proposing the Part Grouping Network (PGN). It extends human parsing from the traditional single-person setting to real-world scenes that contain multiple people in a single image. The paper contributes both a parsing method and new training data, the Crowd Instance-level Human Parsing (CIHP) dataset.

Overview of Proposed Method

The Part Grouping Network takes a detection-free approach to human parsing, bypassing the conventional "parsing-by-detection" methods that rely heavily on separately trained detection models. Those methods suffer from error accumulation and suboptimal representation learning because detection and parsing are optimized toward different objectives. PGN instead learns both objectives in a single network through two twinned sub-tasks: semantic part segmentation and instance-aware edge detection.

Architecture and Functionality

Semantic Part Segmentation: This task assigns each pixel in the image to a respective human body part category. PGN utilizes a network based on Fully Convolutional Networks (FCNs) with ResNet-101 as the backbone, augmented with interpretation layers to produce a pixel-wise segmentation map.
Instance-aware Edge Detection: This task predicts the boundaries between person instances. A multi-scale scheme is used to improve edge prediction, and the predicted edges are then used to demarcate different individuals in the image.
Refinement Branch: To better exploit the synergies between part segmentation and edge detection, the authors introduce a refinement branch that allows these two tasks to complement each other through shared contextual information, enhancing the network's robustness and accuracy.
Instance Partition Process: A simple post-processing step, applied at inference, combines the semantic part results and the instance edges into coherent instance-level human parsing results. It groups line segments using both the segmentation and the edge detection outputs.
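
The idea behind the instance partition step can be illustrated with a minimal sketch. This is not the authors' procedure: it swaps the line-segment grouping described above for a plain pixel-wise flood fill, and the function name `partition_instances` and the input layout (a part-label grid with 0 as background, plus a boolean edge grid) are assumptions made for illustration only.

```python
from collections import deque


def partition_instances(parts, edges):
    """Assign an instance id to every human pixel that is not on an edge.

    parts: 2D list of ints, 0 = background, >0 = some body-part label.
    edges: 2D list of bools, True = predicted instance boundary.
    Returns (instance_map, instance_count); id 0 means "no instance".
    Simplified stand-in (assumption): 4-connected flood fill, not the
    line-segment grouping used in the paper.
    """
    h, w = len(parts), len(parts[0])
    inst = [[0] * w for _ in range(h)]
    next_id = 0
    for y in range(h):
        for x in range(w):
            # Skip background, edge pixels, and pixels already labeled.
            if parts[y][x] == 0 or edges[y][x] or inst[y][x]:
                continue
            next_id += 1
            inst[y][x] = next_id
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and parts[ny][nx] != 0
                            and not edges[ny][nx]
                            and inst[ny][nx] == 0):
                        inst[ny][nx] = next_id
                        queue.append((ny, nx))
    return inst, next_id


# Two people side by side, separated by a one-pixel-wide predicted edge.
parts = [[1, 1, 1, 1, 1],
         [2, 2, 2, 2, 2]]
edges = [[False, False, True, False, False],
         [False, False, True, False, False]]
instance_map, count = partition_instances(parts, edges)
```

In this toy input the edge column splits the human region into two groups, so pixels with different part labels on the same side of the edge end up in the same instance. A practical version would also need to assign the edge pixels themselves to a neighboring instance.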

Experimental Validation

The paper evaluates PGN on two datasets. On the PASCAL-Person-Part dataset, PGN outperforms state-of-the-art methods in both semantic part segmentation and instance-aware edge detection. Specifically, it achieves a higher mean intersection-over-union (IoU) for semantic parts and better edge detection scores, measured by the F-measure at the optimal dataset scale (ODS) and the optimal image scale (OIS).
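
As a reminder of what the mean IoU figure measures, here is a minimal sketch in plain Python. The function name `mean_iou` and the choice to skip classes that appear in neither map are assumptions for illustration; the paper's evaluation code may handle such cases differently.

```python
def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union between two label maps.

    pred, gt: 2D lists of class ids in [0, num_classes).
    Classes absent from both maps are skipped (an assumption).
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p == g:
                # Pixel is in both the predicted and true region of class p.
                inter[p] += 1
                union[p] += 1
            else:
                # Pixel is in the union of class p (predicted) and class g (true).
                union[p] += 1
                union[g] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


pred = [[0, 1],
        [1, 1]]
gt = [[0, 1],
      [0, 1]]
score = mean_iou(pred, gt, num_classes=2)  # IoU is 1/2 for class 0 and 2/3 for class 1
```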

Moreover, the authors introduce the CIHP dataset, a collection of 38,280 diverse multi-person images, to further test PGN in complex real-world scenarios. Despite the added difficulty of CIHP, with its varied poses, appearances, and higher resolutions, PGN outperforms prior methods on the instance-level human parsing metrics ($AP^r$ and $AP^r_{\text{vol}}$).
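
The two instance-level metrics can be sketched as follows, assuming the usual definitions: $AP^r$ is average precision where a predicted region counts as correct if its mask IoU with a ground-truth region reaches a threshold, and $AP^r_{\text{vol}}$ is the mean of $AP^r$ over thresholds 0.1 to 0.9 in steps of 0.1. The sketch is deliberately simplified (single class, masks as sets of pixel coordinates, greedy matching, non-interpolated precision), and all function names are hypothetical rather than taken from the paper's evaluation code.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def average_precision(preds, gts, thr):
    """Non-interpolated AP at one IoU threshold.

    preds: list of (score, mask) pairs; gts: list of masks.
    Each ground-truth mask can be matched at most once (greedy, by score).
    """
    matched = set()
    true_positives = 0
    precisions = []
    ranked = sorted(preds, key=lambda p: -p[0])
    for rank, (_, mask) in enumerate(ranked, start=1):
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            iou = mask_iou(mask, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou >= thr:
            matched.add(best_j)
            true_positives += 1
            precisions.append(true_positives / rank)
    return sum(precisions) / len(gts) if gts else 0.0


def ap_r_vol(preds, gts):
    """Mean AP over IoU thresholds 0.1, 0.2, ..., 0.9."""
    thresholds = [i / 10 for i in range(1, 10)]
    return sum(average_precision(preds, gts, t) for t in thresholds) / len(thresholds)


# One ground-truth person of 4 pixels; the prediction covers half of it (IoU = 0.5).
gts = [{(0, 0), (0, 1), (0, 2), (0, 3)}]
preds = [(0.9, {(0, 0), (0, 1)})]
ap_at_half = average_precision(preds, gts, 0.5)
ap_vol = ap_r_vol(preds, gts)
```

Averaging over thresholds rewards predictions that overlap the ground truth closely, not only those that clear a single fixed threshold such as 0.5.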

Implications and Future Directions

PGN and the CIHP dataset are also relevant beyond human parsing itself, offering a framework and a resource that related domains such as video surveillance, group behavior prediction, and interactive multimedia applications can build on. The ability to parse multiple human instances in challenging scenes supports the development of more robust computer vision models.

Future work could explore further enhancements to the PGN framework such as incorporating temporal coherence to adapt it to video sequences or integrating domain adaptation techniques to improve its deployment on unseen datasets. Additionally, expanding the CIHP dataset with more annotated categories could enrich model training, facilitating advances in human analysis tasks within diverse application areas.

Without the reliance on detection models, PGN sets a foundation for efficient human parsing pipelines that handle multiple instances directly, offering a more streamlined and realistic approach to segmenting human instances in complex scenes. This work represents a significant step towards more generalized solutions in the domain of human parsing.
