SoftPQ: Robust Instance Segmentation Evaluation via Soft Matching and Tunable Thresholds (2505.12155v2)

Published 17 May 2025 in cs.CV, cs.AI, and cs.LG

Abstract: Segmentation evaluation metrics traditionally rely on binary decision logic: predictions are either correct or incorrect, based on rigid IoU thresholds. Detection-based metrics such as F1 and mAP determine correctness at the object level using fixed overlap cutoffs, while overlap-based metrics like Intersection over Union (IoU) and Dice operate at the pixel level, often overlooking instance-level structure. Panoptic Quality (PQ) attempts to unify detection and segmentation assessment, but it remains dependent on hard-threshold matching, treating predictions below the threshold as entirely incorrect. This binary framing obscures important distinctions between qualitatively different errors and fails to reward gradual model improvements. We propose SoftPQ, a flexible and interpretable instance segmentation metric that redefines evaluation as a graded continuum rather than a binary classification. SoftPQ introduces tunable upper and lower IoU thresholds to define a partial matching region and applies a sublinear penalty function to ambiguous or fragmented predictions. These extensions allow SoftPQ to exhibit smoother score behavior, greater robustness to structural segmentation errors, and more informative feedback for model development and evaluation. Through controlled perturbation experiments, we show that SoftPQ captures meaningful differences in segmentation quality that existing metrics overlook, making it a practical and principled alternative for both benchmarking and iterative model refinement.

Summary

The paper introduces SoftPQ, a metric for evaluating instance segmentation in computer vision. Existing metrics fall short in two ways. Detection-based metrics such as F1 and mean Average Precision (mAP), and likewise Panoptic Quality (PQ), rely on binary decision logic: a prediction is either correct or incorrect according to a fixed IoU threshold. Pixel-level overlap metrics such as Intersection over Union (IoU) and Dice avoid the threshold but often overlook instance-level structure. As a result, these metrics can fail to distinguish qualitatively different segmentation errors and do not reward gradual improvements, which limits their usefulness during iterative model development. SoftPQ instead treats evaluation as a graded continuum: tunable upper and lower IoU thresholds define a range in which partial matches still count, and a sublinear penalty function is applied to ambiguous or fragmented predictions.
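
To make the hard-threshold behavior concrete, the following minimal sketch computes standard PQ for a single class from a list of candidate IoU values. The function name and input layout are illustrative assumptions, not code from the paper; it assumes each object appears in at most one candidate pair.

```python
def panoptic_quality(ious, n_pred, n_gt, threshold=0.5):
    """Standard PQ for one class (illustrative sketch, not the paper's code).

    ious:   IoU of each candidate (prediction, ground-truth) pair; assumes
            every object appears in at most one pair.
    n_pred: total number of predicted instances.
    n_gt:   total number of ground-truth instances.
    """
    # Only pairs strictly above the threshold count as true positives.
    matched = [iou for iou in ious if iou > threshold]
    tp = len(matched)
    fp = n_pred - tp  # predictions left without a match
    fn = n_gt - tp    # ground-truth objects left without a match
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched) / denom if denom > 0 else 0.0


# One object, one prediction: a tiny IoU change flips the score to zero.
just_above = panoptic_quality([0.51], n_pred=1, n_gt=1)
just_below = panoptic_quality([0.49], n_pred=1, n_gt=1)
print(just_above, just_below)
```

With a single object, an IoU of 0.51 scores 0.51 while an IoU of 0.49 scores 0, because the prediction becomes both a false positive and a false negative. This is the discontinuity that the partial matching region is meant to smooth.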

Methodological Framework

SoftPQ extends the PQ metric. It uses two adjustable IoU thresholds, an upper and a lower one, to define a partial matching region. Predictions whose IoU exceeds the upper threshold are strong matches, as in the original PQ design. Predictions whose IoU falls between the lower and upper thresholds are soft matches: they still contribute to the score in proportion to their partial overlap, which is particularly useful in over- and under-segmentation scenarios. Predictions below the lower threshold are not matched. The metric remains backward compatible with PQ: when both thresholds are fixed at 0.5, the partial matching region is empty and only strong matches remain.
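
The two-threshold scheme can be sketched as a small classification function. The function name, the default threshold values, and the choice of strict inequalities at the boundaries are assumptions made for illustration; the summary does not specify them.

```python
def classify_match(iou, lower=0.3, upper=0.5):
    """Assign an IoU value to a matching region (illustrative sketch).

    The defaults and the strict '>' comparisons are assumptions,
    not values taken from the paper.
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower <= upper <= 1")
    if iou > upper:
        return "strong"     # treated like a true positive in standard PQ
    if iou > lower:
        return "soft"       # falls inside the partial matching region
    return "unmatched"      # too little overlap to count at all


for value in (0.6, 0.4, 0.2):
    print(value, classify_match(value))

# With both thresholds at 0.5 the soft region is empty, as in standard PQ.
print(classify_match(0.4, lower=0.5, upper=0.5))
```

Setting `lower` equal to `upper` leaves no IoU value that can be classified as soft, which is consistent with the backward-compatibility property described above.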

A distinctive component of SoftPQ is the sublinear penalty function used to compute the IoU contribution of soft matches. This weighted aggregation keeps low-quality predictions from having an excessive impact on the score while remaining sensitive to progressive improvements in segmentation accuracy. As a result, SoftPQ gives finer-grained feedback on segmentation models, which helps during model tuning and debugging.
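
The summary does not give the exact form of the penalty, so the sketch below uses one plausible choice as an assumption: the IoUs of all soft-matched fragments of one object are summed and divided by the square root of the fragment count. The function name is hypothetical, and this should be read as an illustration of what "sublinear" means here rather than as the paper's definition.

```python
import math


def soft_iou_contribution(soft_ious):
    """Aggregate the soft-match IoUs for one ground-truth object.

    Assumption for illustration only: the penalty divides by sqrt(k),
    where k is the number of fragments, so it grows sublinearly in k.
    """
    k = len(soft_ious)
    if k == 0:
        return 0.0
    return sum(soft_ious) / math.sqrt(k)


# Same total overlap (0.8), split across more and more fragments.
two_fragments = soft_iou_contribution([0.4, 0.4])
four_fragments = soft_iou_contribution([0.2, 0.2, 0.2, 0.2])
print(two_fragments, four_fragments)
```

Under this assumed form, heavier fragmentation of the same total overlap yields a lower contribution, but the score degrades gradually instead of dropping to zero as it would under a hard threshold.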

Experimental Insight

The paper evaluates SoftPQ through controlled synthetic experiments that reproduce common segmentation failure modes. These experiments compare SoftPQ with conventional metrics as the severity of perturbations such as progressive erosion and over-segmentation increases. Across the experiments, SoftPQ behaves robustly and interpretably, capturing variations in segmentation quality that other metrics often overlook. Because its thresholds and penalty weighting are tunable, SoftPQ can also be adapted to task-specific requirements, which makes it a practical and principled alternative for benchmarking models in diverse real-world applications.
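
A toy version of an erosion perturbation shows why hard thresholds react abruptly. The sketch below is not the paper's experimental setup: it shrinks a square mask one pixel at a time and records the IoU against the unshrunk square, together with whether a hard 0.5 threshold would still accept the match. All names and sizes are illustrative assumptions.

```python
def square_mask(side, margin=0):
    """Pixel set of a square, shrunk by `margin` pixels on every side."""
    span = range(margin, side - margin)
    return {(row, col) for row in span for col in span}


def iou(mask_a, mask_b):
    """Intersection over Union of two pixel sets."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def erosion_sweep(side=20, max_margin=5, threshold=0.5):
    """IoU and hard-threshold decision for increasingly eroded predictions."""
    ground_truth = square_mask(side)
    rows = []
    for margin in range(max_margin + 1):
        score = iou(ground_truth, square_mask(side, margin))
        rows.append((margin, score, score > threshold))
    return rows


for margin, score, matched in erosion_sweep():
    print(f"margin={margin} IoU={score:.2f} hard_match={matched}")
```

For a 20-pixel square, the IoU falls smoothly from 0.64 at a margin of 2 to 0.49 at a margin of 3, yet the hard-threshold decision flips from matched to unmatched at that step. A metric with a partial matching region would continue to give credit across this transition.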

Implications and Future Directions

The flexibility of SoftPQ has both theoretical and practical implications. Unlike traditional metrics, SoftPQ provides a framework for interpreting partial segmentations, which are commonplace in real-world settings. This can particularly aid the development of segmentation algorithms that must contend with structural segmentation errors in clinical imaging, autonomous systems, and industrial applications.

Future studies may focus on integrating SoftPQ with other state-of-the-art approaches, potentially exploring hybrid methods that apply soft matching principles to a broader range of structured prediction tasks. As AI continues to evolve, robust evaluation metrics like SoftPQ that can guide model improvement remain vital.

The implementation of SoftPQ signifies a productive step towards more responsive segmentation evaluation practices, ultimately fostering model advancements and improved performance in challenging evaluation scenarios.