- The paper introduces a novel pose-based framework that bypasses traditional detection by leveraging human skeleton features for precise instance segmentation.
- It employs an Affine-Align module to standardize each instance’s scale and orientation, significantly improving segmentation in occluded environments.
- Experiments on the OCHuman and COCO datasets show roughly 50% higher segmentation AP than Mask R-CNN on heavily occluded instances, highlighting its robust performance.
Pose2Seg: Detection-Free Human Instance Segmentation
The paper "Pose2Seg: Detection Free Human Instance Segmentation" introduces a novel framework for human instance segmentation that diverges from the conventional methods relying heavily on detection techniques, such as bounding-box proposals. Instead, the Pose2Seg framework capitalizes on the unique structural features of the human body delineated by pose skeletons, enabling enhanced distinction between instances, especially in occlusion scenarios.
Core Methodology
- Pose-based Framework: The primary innovation is using human pose skeletons as the central cue for instance segmentation. This sidesteps a known weakness of detection-based models such as Mask R-CNN, which struggle under occlusion because the bounding boxes of overlapping people become ambiguous.
- Affine-Align Module: Central to the framework is the Affine-Align operation, which warps each instance to a standardized scale and orientation based on its pose. Unlike axis-aligned bounding boxes, this handles instances with unusual or occluded orientations far better (a minimal sketch of the alignment step follows this list).
- Skeleton Features: The segmentation module is additionally fed skeleton features that tell the network where distinct body parts lie, improving its ability to separate intertwined human instances. These features are concatenated with the image features extracted by the base network (a simplified rendering sketch also follows this list).
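In the paper, Affine-Align estimates a per-instance affine transform by matching the detected pose against a small set of pose templates learned offline (by clustering training-set poses) and choosing the best-scoring fit. The snippet below is a minimal sketch of only the alignment step, assuming keypoints already matched to a single hypothetical template; `TEMPLATE` and its four-point layout are illustrative inventions, not the paper's learned templates.

```python
import numpy as np
import cv2

# Hypothetical 4-point upright template in a 256x256 canonical frame; the
# paper clusters training poses to obtain its templates, which is not shown.
TEMPLATE = np.array([[128, 40], [100, 96], [156, 96], [128, 200]], np.float32)

def affine_align(image, kpts, size=(256, 256)):
    """Warp the region around one instance into the canonical frame.

    kpts: (4, 2) keypoints matched row-for-row to TEMPLATE (an assumption
    of this sketch; the paper scores transforms against several templates).
    Returns the aligned crop and the 2x3 affine matrix.
    """
    # Scale + rotation + translation fit, echoing the "standardized scale
    # and orientation" idea behind the Affine-Align module.
    H, _ = cv2.estimateAffinePartial2D(kpts.astype(np.float32), TEMPLATE)
    aligned = cv2.warpAffine(image, H, size)
    return aligned, H
```

The same matrix `H` can later be inverted to paste the predicted mask back into the original image frame, which is how an align-then-segment pipeline typically closes the loop.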
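The paper's skeleton features follow the part-affinity-field style of OpenPose (limb direction fields plus keypoint confidence maps). The sketch below renders only the simpler keypoint-confidence half of that idea; the Gaussian `sigma` and the plain-NumPy implementation are assumptions for illustration.

```python
import numpy as np

def keypoint_heatmaps(kpts, hw, sigma=3.0):
    """Render one Gaussian confidence map per keypoint in the aligned frame.

    kpts: (K, 2) array of (x, y) coordinates; hw: (height, width).
    Returns a (K, H, W) float32 array.
    """
    H, W = hw
    ys, xs = np.mgrid[0:H, 0:W]
    maps = np.empty((len(kpts), H, W), np.float32)
    for i, (x, y) in enumerate(kpts):
        maps[i] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return maps

# The skeleton maps are concatenated with the base-network features along
# the channel axis before entering the segmentation module, e.g.:
#   fused = np.concatenate([image_feats, skeleton_feats], axis=0)
```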
Dataset Contribution
Recognizing that existing datasets contain few annotated occlusion cases, the authors release the "Occluded Human" (OCHuman) benchmark, a dataset focused on heavily occluded people. Each instance carries comprehensive annotations: bounding boxes, pose keypoints, and instance masks.
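Since OCHuman ships COCO-style annotation exports alongside its own toolkit, one plausible way to inspect all three annotation types is via pycocotools; the filename below is an assumption and depends on the release you download.

```python
from pycocotools.coco import COCO

# Hypothetical filename: OCHuman provides COCO-format JSON exports,
# but the exact path varies by release.
coco = COCO("ochuman_coco_format_val.json")

img_id = coco.getImgIds()[0]
for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
    mask = coco.annToMask(ann)     # (H, W) binary instance mask
    kpts = ann.get("keypoints")    # flat [x1, y1, v1, x2, y2, v2, ...]
    bbox = ann.get("bbox")         # [x, y, width, height]
```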
Experimental Findings
Pose2Seg is evaluated against established frameworks such as Mask R-CNN. On the OCHuman dataset it achieves approximately 50% higher segmentation AP, underscoring the value of pose cues when instances heavily overlap. Trained and tested on the person category of COCO, Pose2Seg again outperforms Mask R-CNN, showing that the approach helps in general settings as well as under occlusion.
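The AP figures cited here are standard COCO-style mask AP, which can be reproduced with pycocotools once predictions are exported in the COCO results format; the filenames below are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder filenames; any COCO-format ground truth and results work.
gt = COCO("ochuman_coco_format_val.json")
dt = gt.loadRes("pose2seg_results.json")

ev = COCOeval(gt, dt, iouType="segm")  # "segm" selects instance-mask AP
ev.evaluate()
ev.accumulate()
ev.summarize()  # prints AP, AP50, AP75, AP_small/medium/large
```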
Implications and Future Directions
The implications of Pose2Seg extend beyond segmentation accuracy. By putting the pose skeleton ahead of detection, the paper argues for a shift in how human-centric segmentation is approached. The method addresses the occlusion challenge and points toward real-world applications such as video surveillance, sports analytics, and AR/VR interfaces, where reliable human segmentation and tracking are paramount.
Looking ahead, further research could test how well pose-like alignment generalizes to other object categories. Improving the real-time efficiency and scalability of pose-based methods would also help bridge the gap between research and practical deployment.
In conclusion, Pose2Seg makes a compelling case for putting human pose at the center of instance segmentation. Its approach, backed by a dedicated occlusion-focused benchmark, establishes pose as a powerful cue for accurate segmentation in complex, crowded scenes.