Weakly Supervised Object Localization with Multi-fold Multiple Instance Learning (1503.00949v3)

Published 3 Mar 2015 in cs.CV

Abstract: Object category localization is a challenging problem in computer vision. Standard supervised training requires bounding box annotations of object instances. This time-consuming annotation process is sidestepped in weakly supervised learning. In this case, the supervised information is restricted to binary labels that indicate the absence/presence of object instances in the image, without their locations. We follow a multiple-instance learning approach that iteratively trains the detector and infers the object locations in the positive training images. Our main contribution is a multi-fold multiple instance learning procedure, which prevents training from prematurely locking onto erroneous object locations. This procedure is particularly important when using high-dimensional representations, such as Fisher vectors and convolutional neural network features. We also propose a window refinement method, which improves the localization accuracy by incorporating an objectness prior. We present a detailed experimental evaluation using the PASCAL VOC 2007 dataset, which verifies the effectiveness of our approach.



Summary

The paper's main contribution is a multi-fold multiple instance learning (MIL) procedure that keeps training from prematurely locking onto erroneous object locations, a problem that is most severe with high-dimensional features such as Fisher vectors and CNN features.
It adds a window refinement method that uses an objectness prior to adjust candidate windows so they better capture true object boundaries.
The approach needs only image-level presence/absence labels instead of bounding boxes, and it achieves competitive localization performance on the PASCAL VOC 2007 dataset.

Weakly Supervised Object Localization: Insights from Multi-fold Multiple Instance Learning

The paper under review addresses object category localization without the costly, labor-intensive bounding box annotations that standard supervised training requires. The authors propose a weakly supervised learning (WSL) framework built on multi-fold multiple instance learning (MIL). Supervision is limited to binary labels that indicate whether a category is present in an image, with no information about where the object instances are.

Methodology Overview

The core contributions are two components: multi-fold MIL and a window refinement method based on an objectness prior. Both matter for achieving meaningful localization when only weak supervision is available.

Multi-fold MIL: The positive training images are partitioned into multiple folds, in a manner similar to cross-validation. Training then alternates between two steps: the detector is trained, and the object locations in each fold are re-inferred using a detector trained only on the other folds. Because a detector never re-localizes the images it was trained on, training is less likely to lock onto erroneous locations early. This matters most with high-dimensional descriptors such as Fisher vectors (FVs) and convolutional neural network (CNN) features.
Window Refinement Method: A refinement step incorporates a category-independent objectness prior, which improves localization accuracy by pushing the selected windows to align more closely with object boundaries.
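
The multi-fold re-localization loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the detector is a ridge-regression stand-in, each image is a hypothetical "bag" array of window features with window 0 as the initial guess, the folds are redrawn at every iteration, and hard-negative mining and the objectness-based refinement are omitted. All function and parameter names are invented for this example.

```python
import numpy as np

def train_linear_detector(pos, neg, reg=1.0):
    """Fit a linear scorer with ridge regression (a stand-in, not the paper's classifier)."""
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
    X1 = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
    A = X1.T @ X1 + reg * np.eye(X1.shape[1])      # regularized normal equations
    return np.linalg.solve(A, X1.T @ y)            # weights, with the bias last

def score(w, feats):
    """Score every window (row) of feats with the linear detector w."""
    return feats @ w[:-1] + w[-1]

def multifold_mil(pos_bags, neg_feats, n_folds=3, n_iters=5, seed=0):
    """pos_bags: list of (n_windows, dim) arrays, one per positive image.
    neg_feats: (n_neg, dim) array of windows from negative images.
    Returns the final detector and the selected window index per positive image."""
    n = len(pos_bags)
    if not 2 <= n_folds <= n:
        raise ValueError("n_folds must be between 2 and the number of positive images")
    rng = np.random.default_rng(seed)
    selected = np.zeros(n, dtype=int)              # assumption: window 0 is the initial guess
    for _ in range(n_iters):
        folds = np.array_split(rng.permutation(n), n_folds)  # assumption: redraw folds each time
        new_selected = selected.copy()
        for held_out in folds:
            held = set(held_out.tolist())
            train_idx = [i for i in range(n) if i not in held]
            # Train only on the current positives from the OTHER folds.
            pos = np.stack([pos_bags[i][selected[i]] for i in train_idx])
            w = train_linear_detector(pos, neg_feats)
            # Re-localize the held-out images with a detector that never saw them.
            for i in held_out:
                new_selected[i] = int(np.argmax(score(w, pos_bags[i])))
        selected = new_selected
    final_pos = np.stack([pos_bags[i][selected[i]] for i in range(n)])
    return train_linear_detector(final_pos, neg_feats), selected
```

Standard MIL corresponds to re-localizing every image with a detector trained on that same image's current window. With high-dimensional features such a detector can score its own training window highest regardless of whether it is correct, which is the lock-in that holding out folds is meant to avoid.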

Experimental Evaluation

The proposed framework is evaluated in detail on the PASCAL VOC 2007 dataset. The results show that multi-fold MIL outperforms standard MIL at localizing objects, particularly when high-dimensional features are used, as measured by CorLoc (correct localization on the positive training images) and by detection Average Precision (AP). Combining FV and CNN features yields further gains, which indicates that the two descriptors are complementary.
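
For readers unfamiliar with CorLoc, the sketch below shows how such a score is commonly computed: the fraction of positive images whose selected window overlaps a ground-truth box of the target class with intersection-over-union (IoU) of at least 0.5. The box format `(x1, y1, x2, y2)` with continuous coordinates, the inclusive comparison, and the function names are assumptions of this example, not details taken from the paper.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def corloc(predicted, ground_truth, thresh=0.5):
    """predicted: one selected box per positive image.
    ground_truth: for each image, a list of ground-truth boxes of the target class.
    Returns the fraction of images where the selected box matches any ground-truth box."""
    hits = sum(
        any(iou(p, g) >= thresh for g in gts)
        for p, gts in zip(predicted, ground_truth)
    )
    return hits / len(predicted)
```

CorLoc is measured on the weakly labeled training images themselves, so it reflects how well training recovered the object locations, while AP measures how well the resulting detector performs on held-out test images.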

Implications and Comparisons

The practical implication is lower annotation cost: detectors can be trained from image-level labels alone. Multi-fold MIL is therefore a viable option where bounding box labels are unavailable, or where a large corpus of tagged internet images could be used without precise annotations.

Compared with state-of-the-art methods, the proposed approach achieves competitive performance, especially when CNN-based representations are included. The results suggest that features from pre-trained CNNs carry high-level image structure that weakly supervised methods can exploit.

Future Directions

The work leaves several avenues open, such as integrating transfer learning mechanisms that might further improve weakly supervised learning. In addition, the cost of multi-fold training, which fits a separate detector for each fold, could be reduced through parallelization or more efficient data partitioning, making the method more scalable to large datasets.

In conclusion, the paper advances weakly supervised object localization by showing that the training procedure itself matters when location labels are missing. Its results indicate that high-dimensional feature representations can offset much of the cost of weak supervision, provided that training is kept from locking onto erroneous locations.
