Panoptic segmentation is a computer vision task introduced in the paper "Panoptic Segmentation" (Kirillov et al., 2018) that aims to unify the traditionally separate tasks of semantic segmentation and instance segmentation. While semantic segmentation assigns a class label (like 'sky', 'road', 'car') to every pixel in an image, and instance segmentation detects and segments each individual object instance (like 'car_1', 'car_2'), panoptic segmentation requires a system to produce a single output that assigns both a semantic label and an instance ID to every pixel. This means the output is a coherent scene segmentation where amorphous background regions (stuff) get a semantic label, and countable objects (things) get both a semantic label and a unique instance ID, with no overlapping segments.
The motivation behind defining this task is the observation that real-world vision systems, such as those used in autonomous driving or augmented reality, need a complete and coherent understanding of a scene, recognizing both the background areas and individual objects simultaneously. Existing methods and metrics were specialized for either stuff (semantic segmentation) or things (instance segmentation), hindering the development of unified systems.
The output format for panoptic segmentation is straightforward: for each pixel $i$, provide a pair $(l_i, z_i)$, where $l_i$ is the semantic class label and $z_i$ is the instance ID. For stuff classes, the instance ID is effectively ignored or set to a standard value (e.g., 0), as stuff regions are amorphous and uncountable. For thing classes, pixels with the same pair $(l_i, z_i)$ belong to the same object instance. A key constraint is that all segments (both stuff and thing) must be non-overlapping.
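To make this concrete, the snippet below sketches one way to represent such an output as NumPy arrays. Note that the paper only specifies the per-pixel pair $(l_i, z_i)$, not a storage format; the `label * 1000 + instance` packing is a Cityscapes-style convention, and the class IDs here are illustrative.

```python
import numpy as np

# A panoptic output assigns every pixel a pair (semantic label, instance id).
H, W = 4, 6
semantic_map = np.zeros((H, W), dtype=np.int64)   # l_i per pixel
instance_map = np.zeros((H, W), dtype=np.int64)   # z_i per pixel (0 for stuff)

SKY, ROAD, CAR = 23, 7, 26        # illustrative class ids
semantic_map[:2, :] = SKY         # stuff: instance id stays 0
semantic_map[2:, :] = ROAD        # stuff
semantic_map[3, 0:2] = CAR        # thing: car_1
instance_map[3, 0:2] = 1
semantic_map[3, 4:6] = CAR        # thing: car_2
instance_map[3, 4:6] = 2

# One common flat encoding (Cityscapes-style) packs the pair into one id,
# giving a single map with no overlapping segments by construction.
panoptic_map = semantic_map * 1000 + instance_map
print(np.unique(panoptic_map))    # [ 7000 23000 26001 26002]
```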
A significant contribution of the paper is the introduction of the Panoptic Quality (PQ) metric, designed to evaluate performance on this unified task across both stuff and thing classes in a uniform manner. Unlike existing metrics such as mean IoU (for semantic segmentation) or Average Precision (AP) (for instance segmentation, which requires confidence scores and allows overlapping segments), PQ provides a single, interpretable measure.
The PQ metric is calculated in two steps:
- Segment Matching: Predicted segments and ground truth segments are matched if their Intersection over Union (IoU) is strictly greater than 0.5. The authors prove that this threshold guarantees a unique matching between predicted and ground truth segments due to the non-overlapping nature of panoptic segmentations.
- PQ Computation: Based on the unique matching, segments are categorized as True Positives (TP - matched pairs), False Positives (FP - unmatched predicted segments), and False Negatives (FN - unmatched ground truth segments) for each class. PQ is then computed per class and averaged:

$$\mathrm{PQ} = \frac{\sum_{(p,g) \in TP} \mathrm{IoU}(p,g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}$$
This formula can be decomposed into two components:
- Segmentation Quality (SQ): $\mathrm{SQ} = \frac{\sum_{(p,g) \in TP} \mathrm{IoU}(p,g)}{|TP|}$ (the average IoU of matched segments)
- Recognition Quality (RQ): $\mathrm{RQ} = \frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}$ (equivalent to the F1 score, measuring detection/recognition performance). Thus, $\mathrm{PQ} = \mathrm{SQ} \times \mathrm{RQ}$. This decomposition helps in analyzing the types of errors made by a system. The metric also handles special cases such as void pixels and annotated groups of instances.
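The following is a minimal Python sketch of both steps for a single class, assuming segments are given as boolean masks that are mutually non-overlapping on each side (the panoptic constraint). The function name and data layout are illustrative, not the paper's reference implementation, and the void-pixel and group special cases are omitted.

```python
import numpy as np

def panoptic_quality(pred_segments, gt_segments):
    """PQ, SQ, RQ for a single class. Each argument is a list of boolean
    H x W masks, assumed mutually non-overlapping on each side."""
    matched_ious = []
    matched_pred, matched_gt = set(), set()
    for i, p in enumerate(pred_segments):
        for j, g in enumerate(gt_segments):
            intersection = np.logical_and(p, g).sum()
            union = np.logical_or(p, g).sum()
            iou = intersection / union if union else 0.0
            # IoU > 0.5 yields at most one match per segment: because
            # segments on each side are disjoint, no two ground truth
            # segments can each cover more than half of one prediction.
            if iou > 0.5:
                matched_ious.append(iou)
                matched_pred.add(i)
                matched_gt.add(j)
    tp = len(matched_ious)
    fp = len(pred_segments) - len(matched_pred)
    fn = len(gt_segments) - len(matched_gt)
    denom = tp + 0.5 * fp + 0.5 * fn
    if denom == 0:          # class absent from prediction and ground truth
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0  # average IoU of matched pairs
    rq = tp / denom                             # F1-style recognition term
    return sq * rq, sq, rq                      # PQ = SQ * RQ
```

Dataset-level PQ is then the mean of these per-class values.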
The paper studies panoptic segmentation on three datasets with both semantic and instance annotations: Cityscapes [Cordts2016Cityscapes], ADE20k [zhou2017ade20k], and Mapillary Vistas [neuhold2017mapillary].
A human consistency study was conducted using images with double annotations on these datasets. The results showed that human annotators are not perfectly consistent, and consistency varies across datasets. Analysis using the PQ metric revealed several insights:
- Consistency for stuff and thing classes is relatively similar, although things are slightly harder. PQ successfully balances performance across both.
- Consistency varies significantly with object scale. For small objects, Recognition Quality (RQ) drops dramatically, indicating humans have difficulty finding small objects, but Segmentation Quality (SQ) remains relatively high if they are found. This suggests finding/detecting small objects is a major challenge.
- The chosen IoU matching threshold of 0.5 is reasonable; most matches under an optimal matching strategy already exceed this threshold.
To establish machine performance baselines, the authors combined the outputs of existing state-of-the-art semantic segmentation and instance segmentation models using simple heuristics (sketched in code after this list). This involved:
- Applying Non-Maximum Suppression (NMS)-like logic to the instance segmentation output to resolve overlaps among object instances based on confidence scores.
- Combining the non-overlapping instance segments with the semantic segmentation output. Overlaps between thing instances and stuff regions were resolved by favoring the thing instance.
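Below is a minimal sketch of this merging heuristic. The function name, the `keep_frac` visibility threshold, and the data layout are assumptions for illustration; the paper describes the heuristic at a high level rather than prescribing an implementation.

```python
import numpy as np

def combine_outputs(instances, semantic_map, stuff_classes, keep_frac=0.5):
    """Merge instance and semantic predictions into one non-overlapping
    panoptic output, in the spirit of the paper's baseline heuristic.
    `instances` is a list of (score, class_id, boolean mask) triples."""
    pan_semantic = np.zeros_like(semantic_map)
    pan_instance = np.zeros_like(semantic_map)
    occupied = np.zeros(semantic_map.shape, dtype=bool)

    # NMS-like pass: visit instances in order of decreasing confidence;
    # each keeps only the pixels no higher-scoring instance has claimed.
    next_id = 1
    for score, class_id, mask in sorted(instances, key=lambda t: -t[0]):
        visible = mask & ~occupied
        # Discard instances that are mostly hidden by earlier ones.
        if visible.sum() < keep_frac * max(mask.sum(), 1):
            continue
        pan_semantic[visible] = class_id
        pan_instance[visible] = next_id
        occupied |= visible
        next_id += 1

    # Fill the remaining pixels with stuff predictions; thing instances
    # win any overlap since their pixels are already occupied.
    for class_id in stuff_classes:
        region = (semantic_map == class_id) & ~occupied
        pan_semantic[region] = class_id

    return pan_semantic, pan_instance
```

Processing instances greedily by confidence mirrors classical NMS, which is what makes the thing segments non-overlapping before the stuff fill-in step.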
The results showed:
- Combining separate systems provides a basic panoptic segmentation.
- PQ correlates well with existing metrics (AP for instances, mIoU for semantics) when evaluated on the respective subsets, but provides a unified view.
- Compared to human consistency, the machine baselines showed a significant performance gap. The gap was particularly large in Recognition Quality (RQ) and much smaller in Segmentation Quality (SQ), suggesting that finding and correctly classifying segments (especially instances) is a bigger challenge for current machine models than accurately segmenting them once found.
Practical Implications and Future Directions:
Implementing panoptic segmentation requires models capable of both dense pixel classification (for stuff) and instance-aware segmentation (for things), and critically, resolving overlaps to produce a single coherent output. The baseline approach demonstrates one way to implement this by combining existing systems, but this requires running multiple models and applying post-processing logic, which can be computationally expensive and prone to inconsistencies.
The paper posits that the PS task will drive innovation in:
- Deeply Integrated End-to-End Models: Developing single network architectures that jointly perform stuff and thing segmentation and inherently produce non-overlapping outputs. This would be more efficient and potentially more consistent than combining separate models. Several subsequent works have explored this direction, such as PanopticFPN [kirillov2019panopticfpn], UPSNet [xiong2019upsnet], and DeeperLab [yang2019deeperlab].
- Higher-Level Reasoning: Implementing more sophisticated methods, possibly learned, for resolving conflicts and overlaps between potential segments from different categories.
The definition of the task and the PQ metric have provided a clear target and evaluation method for unified scene understanding. The task's inclusion in major challenges like COCO and Mapillary Vistas highlights its practical relevance and encourages the development of more advanced, integrated systems for comprehensive image segmentation. The identified human-machine gap, particularly in recognition (RQ), points towards key areas for future algorithmic improvement.