Weakly-Supervised Semantic Segmentation by Iteratively Mining Common Object Features
The paper presents an approach to weakly-supervised semantic segmentation that aims to produce pixel-level object segmentation using only image-level tags. The core difficulty is associating high-level semantic concepts with low-level visual appearance in the absence of pixel-wise annotations. The proposed method, termed Mining Common Object Features (MCOF), adopts an iterative framework that progressively refines object regions and improves segmentation accuracy.
Methodology
The method is organized as an iterative framework that alternates bottom-up and top-down steps, built around the idea of mining common object features; illustrative sketches of the main steps are given after the list:
- Bottom-Up Step: Initial object localization is obtained from Class Activation Maps (CAMs) of a trained classification network, yielding coarse estimates of object regions. Although coarse, these regions contain the discriminative parts of the objects. The authors iteratively mine common object features from these initial seeds and use them to expand the object regions. The expanded regions are further refined with a Bayesian framework that incorporates saliency maps, producing more complete object regions.
- Top-Down Step: The refined object regions serve as pseudo pixel-level supervision for training a semantic segmentation network. The predicted object masks offer improved localization and become the seeds for the next iteration. This cycle of refining regions and retraining the network is repeated to progressively improve the precision of the object masks.
- Saliency-Guided Refinement: This step targets non-discriminative object regions that the initial seeds miss. The mined regions are supplemented with saliency-map information to obtain more complete object coverage. The refinement is applied only in the first iteration, so that inaccuracies in the saliency maps do not propagate through later iterations.
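The bottom-up step and the saliency-guided refinement can be pictured with a short sketch. This is a minimal illustration rather than the paper's implementation: the CAM threshold, the histogram-based likelihoods, and the seed-coverage prior are assumptions made here for concreteness, since the paper describes the saliency combination only as a Bayesian framework.

```python
import numpy as np
import torch
import torch.nn.functional as F

def cam_seed(features, fc_weights, target_class, threshold=0.3):
    """Coarse object seed from a class activation map.

    features:   (C, H, W) feature maps from the last convolutional layer.
    fc_weights: (num_classes, C) weights of the classifier that follows
                global average pooling.
    Returns a binary mask over the most discriminative region.
    """
    cam = torch.einsum('c,chw->hw', fc_weights[target_class], features)
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return (cam > threshold).float()

def bayesian_saliency_refine(seed, saliency, n_bins=32):
    """Expand a coarse seed with a saliency map via a simple Bayes rule.

    P(v | object) and P(v | background) are estimated from histograms of
    saliency values inside/outside the seed, and the seed's coverage ratio
    serves as the prior P(object). Assumes the seed is non-empty; this is
    an illustrative approximation, not the paper's exact formulation.
    """
    seed_np, sal = seed.numpy().astype(bool), saliency.numpy()
    p_obj = seed_np.mean()                              # prior P(object)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    h_obj, _ = np.histogram(sal[seed_np], bins=bins, density=True)
    h_bg, _ = np.histogram(sal[~seed_np], bins=bins, density=True)
    idx = np.clip(np.digitize(sal, bins) - 1, 0, n_bins - 1)
    like_obj, like_bg = h_obj[idx], h_bg[idx]
    posterior = (p_obj * like_obj) / (
        p_obj * like_obj + (1 - p_obj) * like_bg + 1e-8)
    return torch.from_numpy((posterior > 0.5).astype(np.float32))
```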
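The alternation of the two steps can be summarized as a control-flow skeleton. The callables passed in are hypothetical stand-ins for the region-mining and segmentation stages; only the ordering of the bottom-up and top-down phases is taken from the description above.

```python
def mcof_iterations(images, seeds, train_region_net, expand_regions,
                    train_seg_net, predict_masks, num_rounds=3):
    """Skeleton of the iterative bottom-up / top-down loop.

    The four callables are placeholders used only to make the iteration
    order explicit; they do not correspond to named functions in the paper.
    """
    masks = seeds                                     # saliency-refined seeds (first round only)
    seg_net = None
    for _ in range(num_rounds):
        region_net = train_region_net(images, masks)  # bottom-up: mine common object features
        masks = expand_regions(region_net, images)    # expand to more complete object regions
        seg_net = train_seg_net(images, masks)        # top-down: train the segmentation network
        masks = predict_masks(seg_net, images)        # predictions become the next seeds
    return seg_net, masks
```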
Experimental Results
The model was evaluated on the PASCAL VOC 2012 segmentation benchmark, where it shows significant improvements over existing state-of-the-art weakly-supervised methods. With VGG16 as the backbone of the segmentation network, MCOF reaches 56.2% mIoU on the validation set; with ResNet101 it reaches 60.3%. These results indicate that iteratively mining common object features substantially narrows the gap between high-level semantic knowledge and low-level visual detail under weak supervision.
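For reference, mIoU is the mean over classes of the intersection-over-union between predicted and ground-truth masks. A minimal sketch of the computation, using PASCAL VOC's standard convention of 21 classes (including background) and the 255 ignore label:

```python
import numpy as np

def mean_iou(pred, gt, num_classes=21, ignore_index=255):
    """Mean intersection-over-union over classes for a pair of label maps."""
    valid = gt != ignore_index          # drop ignored (void) pixels
    pred, gt = pred[valid], gt[valid]
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                   # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```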
Implications and Future Directions
The method's reliance on an iterative scheme allows systematic correction and enhancement of object localization, even when starting from minimal supervision. This enables the model to exploit large-scale but weakly-labeled datasets efficiently. Because the MCOF framework builds on standard fully-supervised segmentation architectures, it could be integrated into existing segmentation pipelines, with promising downstream applications in domains requiring rapid annotation at scale, such as autonomous driving or medical imaging.
Exploring more advanced saliency models and extending the framework to dynamic environments or video sequences are promising directions. Incorporating a small set of fully-annotated images in a semi-supervised setting could further improve performance and presents another avenue for future work. Overall, this work lays a solid foundation for continued progress in weakly-supervised learning methods.