DiscoBox: Weakly Supervised Instance Segmentation and Semantic Correspondence from Box Supervision (2105.06464v2)

Published 13 May 2021 in cs.CV and cs.LG

Abstract: We introduce DiscoBox, a novel framework that jointly learns instance segmentation and semantic correspondence using bounding box supervision. Specifically, we propose a self-ensembling framework where instance segmentation and semantic correspondence are jointly guided by a structured teacher in addition to the bounding box supervision. The teacher is a structured energy model incorporating a pairwise potential and a cross-image potential to model the pairwise pixel relationships both within and across the boxes. Minimizing the teacher energy simultaneously yields refined object masks and dense correspondences between intra-class objects, which are taken as pseudo-labels to supervise the task network and provide positive/negative correspondence pairs for dense contrastive learning. We show a symbiotic relationship where the two tasks mutually benefit from each other. Our best model achieves 37.9% AP on COCO instance segmentation, surpassing prior weakly supervised methods and remaining competitive with supervised methods. We also obtain state-of-the-art weakly supervised results on PASCAL VOC12 and PF-PASCAL with real-time inference.
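To make the teacher's pairwise potential concrete, here is a minimal pure-Python sketch of refining one object's mask inside its box. It assumes a binary Potts energy E(y) = sum_i -log p_i(y_i) + lam * sum_(i,j) w_ij [y_i != y_j] over 4-connected pixels, with a hypothetical grayscale affinity w_ij = exp(-(I_i - I_j)^2 / (2 sigma^2)), and approximately minimizes it with mean-field updates q_i = sigmoid(logit_i + lam * sum_j w_ij (2 q_j - 1)). The name `refine_mask`, the parameters `lam`, `sigma` and `iters`, and treating the box as a hard "outside means background" prior are illustrative assumptions. The cross-image potential is omitted, and this is not the paper's exact formulation.

```python
import math


def stable_sigmoid(x):
    # Numerically stable logistic function (avoids overflow in math.exp).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def refine_mask(probs, image, box, lam=2.0, sigma=0.1, iters=10):
    """Mean-field refinement of a soft foreground mask under a box prior.

    Hypothetical helper for illustration only.
    probs: 2D list of foreground probabilities from the task network.
    image: 2D list of grayscale intensities in [0, 1] (assumed input format).
    box:   (x0, y0, x1, y1), inclusive pixel coordinates of the box label.
    """
    h, w = len(probs), len(probs[0])
    x0, y0, x1, y1 = box
    eps = 1e-6

    def in_box(r, c):
        return x0 <= c <= x1 and y0 <= r <= y1

    # Unary potential expressed as log-odds of the network prediction.
    logits = [[math.log(max(p, eps) / max(1.0 - p, eps)) for p in row] for row in probs]
    # Assumed box prior: pixels outside the box are fixed to background (q = 0).
    q = [[probs[r][c] if in_box(r, c) else 0.0 for c in range(w)] for r in range(h)]

    for _ in range(iters):
        new_q = [[0.0] * w for _ in range(h)]
        for r in range(h):
            for c in range(w):
                if not in_box(r, c):
                    continue
                msg = 0.0
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < h and 0 <= nc < w:
                        # Contrast-sensitive affinity: similar intensities couple strongly.
                        diff = image[r][c] - image[nr][nc]
                        aff = math.exp(-(diff * diff) / (2.0 * sigma * sigma))
                        # Neighbor's expected vote: +1 if surely foreground, -1 if background.
                        msg += aff * (2.0 * q[nr][nc] - 1.0)
                new_q[r][c] = stable_sigmoid(logits[r][c] + lam * msg)
        q = new_q
    return q
```

For example, with a bright 3x3 object in a dark 5x5 image, a box that covers the object plus some background, and a noisy low prediction (0.3) at the object's center, the refinement pushes the center pixel above 0.5 because its similarly colored neighbors vote foreground, while the dark pixels inside the box stay below 0.5 and every pixel outside the box stays at 0.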


Authors (8)

Shiyi Lan
Zhiding Yu
Christopher Choy
Subhashree Radhakrishnan
Guilin Liu
Yuke Zhu
Larry S. Davis
Anima Anandkumar

Citations (65)


Summary


"DiscoBox: Weakly Supervised Instance Segmentation and Semantic Correspondence from Box Supervision" introduces a framework that jointly learns instance segmentation and semantic correspondence from bounding box supervision alone. It targets the high cost of annotating both tasks: instead of pixel-level masks or dense correspondence labels, the weakly supervised approach needs only box labels, which are much cheaper to collect and therefore easier to scale.

At the core of DiscoBox is a structured teacher: an energy model combining a pairwise potential, which relates pixels within a box, and a cross-image potential, which relates pixels across boxes. Minimizing this energy refines the object masks and establishes dense correspondences between objects of the same class. In a self-ensembling setup, these refined outputs serve as pseudo-labels that supervise the task network and supply positive/negative correspondence pairs for dense contrastive learning, so that instance segmentation and semantic correspondence benefit from each other.
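The step from correspondence pseudo-labels to a dense contrastive loss can be sketched in a few lines. In DiscoBox the correspondences come from minimizing the teacher energy; this sketch substitutes a simpler stand-in, mutual nearest neighbors under cosine similarity between the pixel embeddings of two same-class objects, and then applies an InfoNCE-style loss in which each matched pixel is the positive and all other pixels of the second object are negatives. The helper names, the temperature `tau`, and the choice of InfoNCE are assumptions for illustration, not the paper's exact loss.

```python
import math


def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def similarity_matrix(feats_a, feats_b):
    # sim[i][j]: similarity between pixel i of object A and pixel j of object B.
    return [[cosine(fa, fb) for fb in feats_b] for fa in feats_a]


def mutual_nn_pairs(sim):
    """Hypothetical stand-in for the teacher's correspondences:
    keep (i, j) only if i and j are each other's nearest neighbor."""
    n_a, n_b = len(sim), len(sim[0])
    best_b = [max(range(n_b), key=lambda j: sim[i][j]) for i in range(n_a)]
    best_a = [max(range(n_a), key=lambda i: sim[i][j]) for j in range(n_b)]
    return [(i, j) for i, j in enumerate(best_b) if best_a[j] == i]


def dense_infonce(pairs, sim, tau=0.1):
    """InfoNCE-style loss averaged over matched pairs: for anchor i in A,
    pixel j of B is the positive and all other pixels of B are negatives."""
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        logits = [s / tau for s in sim[i]]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[j]  # equals -log softmax(logits)[j]
    return total / len(pairs)
```

Mutual nearest-neighbor filtering keeps only pairs that agree in both directions, which discards ambiguous matches such as a pixel whose embedding lies between two parts; the loss is lower when the positive pairs really are the most similar ones.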

Among the reported results, DiscoBox's best model achieves 37.9% average precision (AP) on COCO instance segmentation, surpassing prior weakly supervised methods and remaining competitive with supervised approaches such as Mask R-CNN. DiscoBox also obtains state-of-the-art weakly supervised results on PASCAL VOC12 and PF-PASCAL, with real-time inference.

The paper's contributions include a unified framework that performs instance segmentation and semantic correspondence simultaneously, and a self-ensembling model that introduces a structured inductive bias through the teacher and exploits both intra-image and cross-image self-supervision. The two tasks are symbiotic: good masks localize the objects to be matched, while correspondences between semantic parts improve the understanding of each object.

The results suggest that mask labels may become unnecessary for future instance segmentation problems, reducing reliance on costly pixel-level annotation. DiscoBox's efficient design also makes it attractive for downstream applications, particularly 3D geometric tasks.

Future work may extend DiscoBox's principles to more complex computer vision settings, for example by integrating other forms of weak supervision or moving toward unsupervised learning, broadening its applicability across datasets and environments. More broadly, the approach shows that tasks traditionally reliant on extensive annotation can be learned with minimal supervision, opening research directions focused on annotation efficiency and scalability.
