- The paper demonstrates that extreme clicking reduces annotation time from 35 to 7 seconds per bounding box, achieving a mean IoU of 88% on VOC 2007.
- The method simplifies object annotation by replacing traditional boundary drawing with four extreme point clicks, lowering cognitive load and speeding up the process.
- Integrating extreme clicking with segmentation tools like GrabCut yields a 2%-4% mIoU improvement, enhancing overall segmentation precision.
Overview of "Extreme clicking for efficient object annotation"
The authors of the paper propose a method called "extreme clicking" to improve the efficiency of annotating object bounding boxes, which are foundational for constructing computer vision datasets. The traditional method of box annotation, as exemplified by the ILSVRC dataset, is labor-intensive, requiring an average of 35 seconds per high-quality bounding box. This process involves selecting corners of an imaginary bounding box surrounding the object. The extreme clicking technique simplifies this task by having annotators click on four physically defined points on the object: the top, bottom, left-most, and right-most points.
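The conversion from four extreme clicks to a bounding box is direct, since each click pins one side of the box. A minimal sketch (function name and tuple layout are illustrative, not from the paper):

```python
def box_from_extreme_points(top, bottom, left, right):
    """Each argument is an (x, y) pixel coordinate of one extreme click.

    The tight bounding box follows immediately: the left-most and
    right-most clicks fix the x-extent, while the top and bottom
    clicks fix the y-extent.
    """
    x_min, x_max = left[0], right[0]
    y_min, y_max = top[1], bottom[1]
    return (x_min, y_min, x_max, y_max)

# Four clicks on an object's physical extremes:
box = box_from_extreme_points(top=(40, 10), bottom=(55, 90),
                              left=(12, 50), right=(88, 47))
print(box)  # -> (12, 10, 88, 90)
```

Unlike corner clicks on an imaginary box, every one of these points lies on the object itself, which is what makes the task faster to perform accurately.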
The paper provides evidence that extreme clicking offers significant efficiency gains without compromising annotation quality. On PASCAL VOC 2007 and 2012 datasets, annotation time dropped to just 7 seconds per bounding box—a fivefold improvement—while maintaining comparable quality to traditional annotation methods. The bounding boxes achieved via extreme clicking exhibit a mean Intersection over Union (IoU) comparable to the ground-truth annotations (88% on VOC 2007), indicating similar accuracy. Moreover, object detectors trained on extreme clicking annotations perform equivalently to those trained on manually drawn ground-truth boxes.
A notable advantage of this approach is that it provides more than just bounding box coordinates. The methodology yields additional information—specifically, four points known to lie on the object boundary—that can be used to initialize segmentation algorithms like GrabCut. This integration produces more precise segmentations, with a mean Intersection over Union (mIoU) increase of 2%-4% compared to segmentations initialized from traditional bounding boxes alone. In practice, models trained on the more accurate segmentations derived from extreme clicking outperform those trained on segmentations from traditional bounding boxes, with an observed mIoU improvement of 2.6%.
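One way to exploit the boundary points is when constructing the initialization mask for GrabCut: the clicked pixels can be marked as definite foreground rather than merely probable. The sketch below builds such a mask with NumPy; the label values mirror OpenCV's `cv2.GC_*` constants, and the actual `cv2.grabCut` call is shown only as a comment to keep the snippet dependency-light. This is an assumed encoding for illustration, not the paper's exact pipeline.

```python
import numpy as np

# Label values matching OpenCV's GrabCut mask convention.
GC_BGD, GC_FGD, GC_PR_BGD, GC_PR_FGD = 0, 1, 2, 3

def grabcut_mask_from_clicks(shape, clicks):
    """shape: (height, width); clicks: four (x, y) extreme points.

    Pixels outside the box implied by the clicks become definite
    background, pixels inside become probable foreground, and the
    clicked pixels themselves become definite foreground -- the
    extra boundary information extreme clicking supplies beyond a
    plain bounding box.
    """
    xs = [p[0] for p in clicks]
    ys = [p[1] for p in clicks]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    mask = np.full(shape, GC_BGD, dtype=np.uint8)
    mask[y_min:y_max + 1, x_min:x_max + 1] = GC_PR_FGD
    for x, y in clicks:
        mask[y, x] = GC_FGD  # each click lies on the object boundary
    return mask

mask = grabcut_mask_from_clicks((100, 100),
                                [(40, 10), (55, 90), (12, 50), (88, 47)])
# The mask could then seed segmentation, e.g.:
# cv2.grabCut(img, mask, None, bgd_model, fgd_model, 5,
#             cv2.GC_INIT_WITH_MASK)
```

Seeding definite-foreground pixels at the extremes constrains the segmentation more tightly than a box alone, which is consistent with the 2%-4% mIoU gain the paper reports.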
Implications and Future Directions
Practically, extreme clicking can significantly reduce the cost and time required for annotating large datasets, which is a bottleneck in the development of computer vision systems. The method's efficiency could facilitate the creation of larger, more diverse datasets, enhancing model training and performance.
From a theoretical standpoint, this work exemplifies how task simplification through human-computer interaction design can yield substantial improvements. By focusing on well-defined physical features rather than abstract boundaries, the method reduces cognitive load and task-switching, elements that traditionally hinder annotation speed and accuracy.
Future work could explore the generalization of this method to other types of annotations beyond object detection, such as scene understanding or relationship annotations. Another possible direction is the integration of this approach within automated annotation systems, leveraging the speed of extreme clicking for rapid semi-automatic dataset annotations.
Overall, the extreme clicking technique presents a compelling alternative to conventional annotation processes, offering promising avenues for enhancing the efficiency and scale of computer vision dataset curation.