STC: A Simple to Complex Framework for Weakly-supervised Semantic Segmentation (1509.03150v2)

Published 10 Sep 2015 in cs.CV

Abstract: Recently, significant improvement has been made on semantic object segmentation due to the development of deep convolutional neural networks (DCNNs). Training such a DCNN usually relies on a large number of images with pixel-level segmentation masks, and annotating these images is very costly in terms of both finance and human effort. In this paper, we propose a simple to complex (STC) framework in which only image-level annotations are utilized to learn DCNNs for semantic segmentation. Specifically, we first train an initial segmentation network called Initial-DCNN with the saliency maps of simple images (i.e., those with a single category of major object(s) and clean background). These saliency maps can be automatically obtained by existing bottom-up salient object detection techniques, where no supervision information is needed. Then, a better network called Enhanced-DCNN is learned with supervision from the predicted segmentation masks of simple images based on the Initial-DCNN as well as the image-level annotations. Finally, more pixel-level segmentation masks of complex images (two or more categories of objects with cluttered background), which are inferred by using Enhanced-DCNN and image-level annotations, are utilized as the supervision information to learn the Powerful-DCNN for semantic segmentation. Our method utilizes $40$K simple images from Flickr.com and 10K complex images from PASCAL VOC for step-wisely boosting the segmentation network. Extensive experimental results on PASCAL VOC 2012 segmentation benchmark well demonstrate the superiority of the proposed STC framework compared with other state-of-the-arts.

Citations (539)

View on Semantic Scholar

Summary

The paper introduces a novel three-phase STC framework that progressively refines DCNN performance for semantic segmentation using weak supervision from image-level labels.
The methodology leverages an initial training with saliency maps, followed by enhanced and powerful refinement stages to correct and improve segmentation predictions.
Empirical results on the PASCAL VOC benchmark demonstrate that STC outperforms previous weakly-supervised methods, offering a cost-effective alternative to pixel-level supervision.

A Weakly-Supervised Framework for Semantic Segmentation

The paper "Simple to Complex: A Weakly-supervised Framework for Semantic Segmentation" introduces a framework known as STC, designed to address challenges in semantic segmentation using weak supervision. The framework capitalizes on the functionality of deep convolutional neural networks (DCNNs) while mitigating the extensive costs associated with acquiring pixel-level segmentation annotations.

The STC framework operates by leveraging image-level annotations to train DCNNs for semantic segmentation. It adopts a step-wise strategy that progressively learns from simple to complex scenarios. The approach is executed in three major phases: Initial-DCNN, Enhanced-DCNN, and Powerful-DCNN, each building upon the previous one to refine segmentation capabilities.

Methodology

Initial-DCNN Training:
- The process begins with generating saliency maps for simple images, those with a singular dominant object and clear background, using existing bottom-up saliency detection techniques.
- Initial-DCNN is trained using these saliency maps as supervision, incorporating a novel multi-label cross-entropy loss function. This allows for understanding which foreground pixels correspond with specific semantic labels.
Enhanced-DCNN Learning:
- The second phase uses the segmentation predictions from Initial-DCNN to further refine the model.
- These predictions are paired with the original image labels to correct potentially false predictions, ensuring more accurate segmentation outcomes.
- Enhanced-DCNN is then trained with these refined masks to better handle complex visual patterns.
Powerful-DCNN Training:
- Finally, the Enhanced-DCNN is used to infer pixel-level masks for complex images—those containing multiple objects and complex backgrounds.
- These inferred masks, once combined with image-level labels, serve as the training base for Powerful-DCNN, culminating in a model capable of sophisticated segmentation tasks.

The paper reports that the framework was trained using 40,000 simple images sourced from Flickr, alongside 10,000 complex images from the PASCAL VOC dataset. The performance of the framework was validated on the PASCAL VOC 2012 segmentation benchmark, showcasing a notable increase in mIoU scores compared to prior methods leveraging similar levels of supervision.

Results and Implications

The authors presented empirical results indicating that the STC framework outperformed existing weakly-supervised approaches. The systematic STC method manages to bridge the gap between weakly and fully supervised segmentation methodologies through iterative enhancement of its segmentation networks.

The implications of this work are twofold:

Practical: The STC framework offers a cost-effective means to train semantic segmentation models without requiring extensive pixel-level annotations.
Theoretical: This work provides insights into leveraging hierarchical learning, progressing from simpler to more complex data structures, which can inspire further research into scalable weakly-supervised learning paradigms.

Future Directions

Looking forward in AI and computer vision, the principles set forth by the STC framework could be expanded. Potential developments could integrate more sophisticated saliency detection methods, or explore iterative training processes on even larger-scale datasets with minimal supervision. Furthermore, adaptive strategies for handling various levels of image complexity could also be investigated to enhance generalization capabilities across diverse domains.

The STC framework exemplifies a forward step in semantic segmentation, illustrating the value of weak supervision and laying the groundwork for more efficient and effective learning strategies in the broader field of computer vision.

PDF Markdown