Multi-Evidence Filtering and Fusion for Multi-Label Classification, Object Detection and Semantic Segmentation Based on Weakly Supervised Learning (1802.09129v1)

Published 26 Feb 2018 in cs.CV, cs.AI, cs.LG, and stat.ML

Abstract: Supervised object detection and semantic segmentation require object or even pixel level annotations. When there exist image level labels only, it is challenging for weakly supervised algorithms to achieve accurate predictions. The accuracy achieved by top weakly supervised algorithms is still significantly lower than their fully supervised counterparts. In this paper, we propose a novel weakly supervised curriculum learning pipeline for multi-label object recognition, detection and semantic segmentation. In this pipeline, we first obtain intermediate object localization and pixel labeling results for the training images, and then use such results to train task-specific deep networks in a fully supervised manner. The entire process consists of four stages, including object localization in the training images, filtering and fusing object instances, pixel labeling for the training images, and task-specific network training. To obtain clean object instances in the training images, we propose a novel algorithm for filtering, fusing and classifying object instances collected from multiple solution mechanisms. In this algorithm, we incorporate both metric learning and density-based clustering to filter detected object instances. Experiments show that our weakly supervised pipeline achieves state-of-the-art results in multi-label image classification as well as weakly supervised object detection and very competitive results in weakly supervised semantic segmentation on MS-COCO, PASCAL VOC 2007 and PASCAL VOC 2012.

Authors (3)

Weifeng Ge
Sibei Yang
Yizhou Yu

Citations (188)

Summary

The paper introduces a four-stage curriculum pipeline that leverages image-level labels for refined object localization and pixel-wise predictions.
It employs multi-evidence filtering and fusion using metric learning and density-based clustering to clean and combine noisy detections.
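Experiments report state-of-the-art results in multi-label image classification and weakly supervised object detection, and very competitive results in weakly supervised semantic segmentation, on MS-COCO, PASCAL VOC 2007 and PASCAL VOC 2012.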

The paper presents a weakly supervised learning approach to multi-label image classification, object detection, and semantic segmentation. Fully supervised methods for these tasks need object-level or pixel-level annotations; the proposed pipeline instead starts from image-level labels only and uses them to produce intermediate object localization and pixel-wise labeling results for the training images.

Key Contributions and Methodology

The authors introduce a four-stage curriculum learning pipeline. Rather than solving the hardest task directly from image-level labels, the pipeline reduces the difficulty by working through the following stages in order, each building on the output of the previous one:

Object Localization: This initial stage generates intermediate localization results for the training images, using both bottom-up and top-down algorithms to obtain object proposals. Several kinds of evidence, such as object heatmaps and attention maps, are combined to maximize the recall of object instances.
Multi-Evidence Filtering and Fusion: Metric learning and density-based clustering are used together to filter out noisy detections, leaving a cleaner set of object instances. Instances collected from multiple solution mechanisms are complementary, so they are fused to give more complete coverage of the objects in each image.
Pixel Labeling: This stage integrates the intermediate results into pixel-wise probability maps, which are used to train a fully convolutional network. That network is then applied to the training images to derive pixel-level labels.
Task-specific Deep Network Training: Finally, the filtered and fused object instances and the pixel labels from the earlier stages serve as supervision for training task-specific deep networks in a fully supervised manner.
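The filtering idea in the second stage can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes each candidate instance already has an embedding vector (for example, from a learned metric) and keeps only candidates that lie in a dense region of that embedding space, in the spirit of the core-point test used by density-based clustering. The function name density_filter and the parameters eps and min_neighbors are hypothetical, chosen only for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def density_filter(embeddings, eps, min_neighbors):
    """Return indices of embeddings that have at least `min_neighbors`
    other embeddings within distance `eps`.

    Hypothetical helper: a simplified density test, not the paper's method.
    """
    keep = []
    for i, e in enumerate(embeddings):
        neighbors = sum(
            1
            for j, other in enumerate(embeddings)
            if j != i and euclidean(e, other) <= eps
        )
        if neighbors >= min_neighbors:
            keep.append(i)
    return keep


# Toy example: four candidates of one class form a tight group,
# and a fifth candidate is an isolated (likely noisy) detection.
candidates = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
kept = density_filter(candidates, eps=0.5, min_neighbors=2)
print(kept)  # [0, 1, 2, 3]
```

In this toy setting the isolated candidate is discarded because no other embedding of the same class lies near it. The quadratic neighbor search is acceptable for a sketch; a real system would more likely use a spatial index or an existing clustering library.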

Experimental Results

The experiments cover MS-COCO, PASCAL VOC 2007, and PASCAL VOC 2012. According to the abstract, the pipeline achieves state-of-the-art results in multi-label image classification and weakly supervised object detection, and very competitive results in weakly supervised semantic segmentation. These tasks are commonly scored with mean Average Precision (mAP) for classification and detection and mean Intersection-over-Union (mIoU) for segmentation.
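For readers unfamiliar with the segmentation metric, the following is a minimal sketch of how mIoU can be computed from flattened per-pixel class labels. It assumes the standard definition (per-class intersection over union, averaged over the classes that appear in the prediction or the ground truth); the function name mean_iou is hypothetical, and benchmark toolkits typically accumulate counts over a whole dataset rather than a single image.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union over flattened per-pixel class labels.

    Classes absent from both `pred` and `target` are skipped so they do
    not distort the average.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with six pixels and three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, target, num_classes=3)
print(round(miou, 3))  # 0.5
```

Here the per-class IoU values are 1/3, 2/3 and 1/2, which average to 0.5.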

Implications and Future Directions

The multi-evidence filtering and fusion strategy is a promising direction for weakly supervised learning. By converting image-level annotations into object-level and pixel-level supervision that fully supervised networks can train on, the approach reduces the need for costly detailed annotation. This points towards more practical and economical use of deep learning in tasks that would otherwise demand exhaustive labeling.

Future work could streamline the pipeline's multiple stages, with the aim of reducing its complexity and moving closer to real-time use. Evaluating on a wider range of datasets could also test and improve the adaptability and robustness of the pipeline.

In conclusion, this research contributes a staged pipeline for weakly supervised learning that narrows the gap between weakly and fully supervised methods in computer vision tasks.
