High-Performance Instance Segmentation with BoxInst
In the field of computer vision, instance segmentation—the process of detecting objects and delineating their precise pixel-wise boundaries—is a challenging yet fundamental task. Traditional instance segmentation methods require pixel-level mask annotations, which are expensive and labor-intensive to acquire. BoxInst is a method that achieves high-fidelity instance segmentation using only bounding-box annotations, effectively bridging the gap between weakly and fully supervised segmentation.
Methodology
BoxInst proposes a novel approach by redesigning the mask learning process to rely solely on bounding-box annotations. This is achieved through two key loss functions that do not require pixel-level annotations:
- Projection Loss: This loss term ensures that the horizontal and vertical projections of the predicted mask align with those of the bounding box. This concept effectively encodes the condition that the tightest bounding box encompassing the predicted mask should coincide with the ground-truth box.
- Pairwise Affinity Loss: Drawing on spatial and visual consistency, this term supervises the label similarity between proximate pixels. By leveraging the natural coherence in pixel color to infer similar labels, this loss term bypasses the necessity of explicit mask annotations.
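The projection idea can be made concrete with a small sketch. The following is a minimal NumPy illustration, not the authors' implementation: the predicted soft mask is collapsed onto the x- and y-axes by a max over the other axis, and each projection is compared to the corresponding projection of the binary box mask with a Dice loss. The function names and the exact Dice formulation here are illustrative assumptions.

```python
import numpy as np

def dice_loss(p, q, eps=1e-6):
    # Dice loss between two 1-D soft label vectors.
    inter = 2.0 * np.sum(p * q)
    union = np.sum(p * p) + np.sum(q * q)
    return 1.0 - (inter + eps) / (union + eps)

def projection_loss(pred_mask, box_mask):
    # Project each mask onto the x-axis (max over rows) and the
    # y-axis (max over columns), then compare the projections.
    loss_x = dice_loss(pred_mask.max(axis=0), box_mask.max(axis=0))
    loss_y = dice_loss(pred_mask.max(axis=1), box_mask.max(axis=1))
    return loss_x + loss_y

# A mask that exactly fills its box has near-zero projection loss,
# while an empty prediction is heavily penalized.
box = np.zeros((6, 6))
box[1:5, 2:6] = 1.0
```

Intuitively, the loss is zero whenever the tightest box around the predicted mask coincides with the ground-truth box, which is exactly the condition the bullet above describes.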
Collectively, these innovations enable BoxInst, which builds on the CondInst framework, to yield high-quality instance segmentations without the need for full mask annotations.
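The pairwise affinity term can likewise be sketched in a few lines. This is a hedged simplification rather than the paper's exact formulation: it uses 4-connectivity instead of the paper's dilated 8-neighborhood, and an assumed exponential color-affinity function with illustrative threshold and bandwidth parameters. Neighboring pixels whose colors are sufficiently similar are supervised to receive the same mask label.

```python
import numpy as np

def pairwise_loss(pred_mask, image, sim_thresh=0.3, sigma=2.0):
    # pred_mask: (H, W) soft mask in [0, 1]; image: (H, W, 3) colors.
    loss, count = 0.0, 0
    H, W = pred_mask.shape
    for dy, dx in [(0, 1), (1, 0)]:           # right and down neighbors
        p = pred_mask[:H - dy, :W - dx]
        q = pred_mask[dy:, dx:]
        c1 = image[:H - dy, :W - dx]
        c2 = image[dy:, dx:]
        # Color affinity: similar colors -> affinity near 1.
        sim = np.exp(-np.linalg.norm(c1 - c2, axis=-1) / sigma)
        # Probability that the two pixels share the same label.
        same = p * q + (1.0 - p) * (1.0 - q)
        edge = sim >= sim_thresh               # supervise similar pairs only
        loss += -np.log(np.clip(same[edge], 1e-6, None)).sum()
        count += edge.sum()
    return loss / max(count, 1)
```

On a uniformly colored image, a constant prediction incurs no loss, while a prediction that flips labels between neighboring pixels is penalized, which is how color coherence substitutes for explicit mask supervision.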
Experimental Results
Evaluations of BoxInst demonstrate its efficacy on standard benchmarks such as COCO and Pascal VOC. For instance, the method achieves a mask Average Precision (AP) of 33.2% on the COCO dataset with a ResNet-101 backbone, which is remarkably competitive given the absence of mask annotations during training. This performance is within striking distance of fully supervised methods and highlights BoxInst's capacity to narrow the gap between box-supervised and mask-supervised segmentation. Additionally, BoxInst outperforms many recent methods, both weakly and fully supervised, on these benchmarks.
Theoretical and Practical Implications
The BoxInst method fundamentally shifts the paradigm in instance segmentation by reducing the dependency on exhaustive annotations, thereby democratizing access to effective segmentation techniques for a broader range of applications. The technique's reduced annotation cost promises to enable applications where resources are limited or rapid annotation is necessary.
Moreover, BoxInst extends naturally to semi-supervised learning settings, aiding generalization to unseen categories by leveraging partial mask annotations where available. This positions BoxInst as a robust framework adaptable to varied annotation scenarios, enhancing its utility across different domains.
Future Directions
The introduction of BoxInst opens a potential new avenue for research in instance segmentation without detailed annotations. Future work might extend BoxInst's principles to broader contexts, such as dynamic or video object segmentation. Additionally, refinements to the pairwise affinity strategy and integration with large language models might yield gains in both interpretability and performance.
In conclusion, BoxInst contributes a significant advancement in overcoming the annotation bottleneck in instance segmentation, providing a scalable and flexible solution that aligns well with the pressing needs of both academic and applied sectors in computer vision.