What's the Point: Semantic Segmentation with Point Supervision (1506.02106v5)

Published 6 Jun 2015 in cs.CV

Abstract: The semantic image segmentation task presents a trade-off between test-time accuracy and training-time annotation cost. Detailed per-pixel annotations enable training accurate models but are very time-consuming to obtain; image-level class labels are an order of magnitude cheaper but result in less accurate models. We take a natural step from image-level annotation towards stronger supervision: we ask annotators to point to an object if one exists. We incorporate this point supervision along with a novel objectness potential in the training loss function of a CNN model. Experimental results on the PASCAL VOC 2012 benchmark reveal that the combined effect of point-level supervision and objectness potential yields an improvement of 12.9% mIOU over image-level supervision. Further, we demonstrate that models trained with point-level supervision are more accurate than models trained with image-level, squiggle-level or full supervision given a fixed annotation budget.

Citations (938)

Summary

  • The paper presents a point supervision method that integrates an objectness prior, reaching 42.7% mIOU, a 12.9% improvement over image-level supervision.
  • It drastically cuts annotation time, from 239.7 seconds to 22.1 seconds per image, by collecting a single point per object class instead of per-pixel masks.
  • The approach outperforms traditional weak supervision methods, offering a cost-effective balance between labeling effort and model accuracy.

Overview of "What's the Point: Semantic Segmentation with Point Supervision"

This paper addresses the challenge of semantic image segmentation, focusing on the trade-off between segmentation accuracy at test time and the cost of data annotation during training. High-accuracy models traditionally require per-pixel annotations, which are prohibitively time-consuming to collect. Conversely, relying solely on image-level class labels significantly reduces accuracy. The authors present a novel intermediate supervision strategy, in which annotators simply point to objects, and demonstrate its efficacy through experimental results on the PASCAL VOC 2012 dataset.

Methodology

The core methodology of this paper includes two primary innovations:

  1. Point Supervision: Annotators are asked to point to objects in the image, yielding a single annotated point per object class. Points are fast to collect while still providing a meaningful supervisory signal for training semantic segmentation models.
  2. Objectness Potential: Integrated into the training loss of a Convolutional Neural Network (CNN), the objectness potential helps recover object extent and boundaries by incorporating a generic, class-agnostic measure of whether a pixel belongs to any object.

The proposed CNN model extends the Fully Convolutional Network (FCN) architecture, with modifications to leverage both point-level supervision and objectness priors. The model’s loss function is designed to include terms for point supervision and objectness potential in addition to the conventional terms for image-level and full supervision.
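To make the loss composition concrete, below is a minimal sketch of how such terms could be combined for a single image. This is an illustrative reconstruction, not the authors' code: the tensor names, the point_weight parameter, and the exact treatment of the objectness prior are assumptions, and the paper's formulation differs in details such as per-point weighting and normalization.

```python
# Illustrative PyTorch sketch of a loss combining image-level, point-level,
# and objectness terms for one image. Names and shapes are assumptions:
#   scores:    (P, C) softmax class probabilities for P pixels, C classes
#   present:   1-D LongTensor of class indices present in the image
#   absent:    1-D LongTensor of class indices absent from the image
#   point_idx: 1-D LongTensor of pixel indices annotated with a point
#   point_cls: 1-D LongTensor with the class of each annotated point
#   objness:   (P,) per-pixel objectness prior in [0, 1]
# Assumes at least one present class, one absent class, and one point.
import torch

def combined_loss(scores, present, absent, point_idx, point_cls,
                  objness, bg_class=0, point_weight=1.0):
    eps = 1e-8

    # Image-level term: the most confident pixel for each present class
    # should score high; for each absent class it should score low.
    max_per_class = scores.max(dim=0).values                     # (C,)
    l_img = (-torch.log(max_per_class[present] + eps).mean()
             - torch.log(1.0 - max_per_class[absent] + eps).mean())

    # Point-level term: cross-entropy at the annotated pixels only.
    l_point = -point_weight * torch.log(
        scores[point_idx, point_cls] + eps).mean()

    # Objectness term: pixels the prior deems "object" should put their
    # probability mass on non-background classes, and vice versa.
    p_object = 1.0 - scores[:, bg_class]
    l_obj = -(objness * torch.log(p_object + eps)
              + (1.0 - objness) * torch.log(scores[:, bg_class] + eps)).mean()

    return l_img + l_point + l_obj
```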

Experimental Results

The experiments conducted on the PASCAL VOC 2012 benchmark demonstrate several key outcomes:

  • Combining point-level supervision with the objectness prior improves mean intersection over union (mIOU) by 12.9% over models trained with image-level labels alone.
  • The combined model reaches 42.7% mIOU and, given the same annotation budget, outperforms models trained with full per-pixel supervision.

Analysis of Annotations

The paper provides an in-depth analysis of the annotation time and quality:

  • Annotation Time: Point annotations take 22.1 seconds per image on average, compared to 239.7 seconds per image for full per-pixel supervision (see the quick calculation after this list).
  • Error Rates: The error rate for point annotations is low, with annotators incorrectly identifying object classes in only 1% of cases, indicating that point annotations are a reliable form of supervision.
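A quick back-of-the-envelope calculation makes the budget implications concrete. The 10-hour budget below is a hypothetical figure chosen for illustration, not a number from the paper:

```python
# Hypothetical fixed budget, using the per-image timings reported above.
budget_s = 10 * 3600                  # 10 hours of annotation time
print(round(budget_s / 22.1))         # ~1629 point-annotated images
print(round(budget_s / 239.7))        # ~150 fully annotated images
```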

Comparative Analysis

The authors compare their method against other forms of supervision: image-level labels, squiggles, and full per-pixel annotation. Given a fixed annotation budget, point-level supervision yields the most accurate models among the tested regimes. A hybrid strategy that combines point-level supervision with a limited amount of full supervision yields a further boost in performance, suggesting that even a small number of fully annotated images can substantially aid learning.

Implications and Future Directions

This research has significant implications for both practical and theoretical developments in AI:

  • Practical Implications: The proposed point supervision method dramatically reduces the cost of data annotation, making it feasible to train high-performance segmentation models without exhaustive per-pixel labels.
  • Theoretical Implications: The integration of objectness priors into the loss function suggests new ways to leverage unsupervised or weakly supervised signals in training deep learning models.

Future developments could explore the synergistic use of point-level supervision with other forms of weak supervision and further refine the objectness potential to improve segmentation accuracy. Additionally, the methods could be extended to more complex datasets and tasks beyond semantic segmentation to validate their generality.

Conclusion

In conclusion, "What's the Point: Semantic Segmentation with Point Supervision" makes a compelling case for the use of point-level supervision to balance annotation cost and model accuracy effectively. By introducing a novel training regime that incorporates both point supervision and objectness priors, the authors advance the field of semantic segmentation and open new avenues for efficient, high-accuracy model training.